Different ways to create multicategory variables

Multicategory variables can be coded in several ways. The simplest is to use the commands generate and replace to code one category at a time: you create the variable with an initial value using generate, and then use one replace command per remaining category to change the value where a condition holds. This works well for a few categories. The disadvantage is that with many categories you end up with many command lines and long scripts that consume resources and take a long time to run.

If you want to code many categories, possibly with complicated conditions, the command recode() is recommended. It lets you set up all the coding rules in a single command, making scripts more compact and faster to run. Among other things, recode() accepts value intervals and can create the associated value labels, so that you do not have to add them afterwards with the commands define-labels and assign-labels.

A third set of tools for coding multicategory variables is the functions inlist() and inrange(). They are well suited to extensive conditions, usually in combination with generate and replace, e.g. for a rough grouping of municipalities where you need to list large sets of municipality codes.

For those who need to set up coding expressions for many categories, there is a fourth option: automatic generation of recoding expressions by uploading a delimiter-separated recoding file.

You will find more information about generate, replace, recode(), inlist(), inrange() and automatic recoding in the User guide, chapters 3.1 – 3.2.

This script demonstrates the different ways to code multicategory variables:

require no.ssb.fdb:19 as ds

create-dataset demo

//Create multiple categories through generate and replace
import ds/INNTEKT_BRUTTOFORM 2019-12-31 as wealth

generate wealthint = 1
replace wealthint = 2 if wealth > 500000
replace wealthint = 3 if wealth > 1000000
replace wealthint = 4 if wealth > 1500000

tabulate wealthint
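
//The replace lines above depend on their order: each later line overwrites the earlier value for higher amounts.
//A minimal sketch of an equivalent coding with explicit, non-overlapping intervals, so the order of the
//replace lines no longer matters. The name wealthint2 is hypothetical and only used for illustration.
generate wealthint2 = 1
replace wealthint2 = 2 if wealth > 500000 & wealth <= 1000000
replace wealthint2 = 3 if wealth > 1000000 & wealth <= 1500000
replace wealthint2 = 4 if wealth > 1500000

//Cross-tabulate the two codings to check that they agree
tabulate wealthint wealthint2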



//Create multiple categories (world regions) through recode()
create-dataset population
import ds/BEFOLKNING_STATUSKODE 2020-01-01 as statuscode
keep if statuscode == '1'

import ds/BEFOLKNING_FODELAND as birth_country
tabulate birth_country

destring birth_country
recode birth_country (111 120 138 139 140 148 155 156 159/164 = 2 'European countries outside EU') (101/141 144/158 = 1 'EU/EEA') (203/393 = 3 'Africa') (143 404/578 = 4 'Asia incl. Turkey') (612 684 = 5 'North America') (601/775 = 6 'South and Central America') (802/840 = 7 'Oceania') (980 = 8 'Stateless') (990 = 9 'Unknown')
tabulate birth_country



//Create a dummy variable for big cities based on the municipality codes of the four largest municipalities in Norway, using inlist()
import ds/BOSATTEFDT_BOSTED 2021-12-31 as municipality
generate big_city = 0
replace big_city = 1 if inlist(municipality,'0301','4601','1103','5001')
tabulate municipality if big_city, rowsort()
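
//The same approach extends from a dummy to a variable with several categories.
//A minimal sketch that separates Oslo from the three other big cities, using the municipality codes listed above.
//The names city_gr and city_lbl and the label texts are hypothetical and only used for illustration.
generate city_gr = 0
replace city_gr = 1 if municipality == '0301'
replace city_gr = 2 if inlist(municipality,'4601','1103','5001')

define-labels city_lbl 0 'Rest of the country' 1 'Oslo' 2 'Bergen, Stavanger or Trondheim'
assign-labels city_gr city_lbl
tabulate city_gr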



//Group yearly wage into six intervals using inrange()
import ds/INNTEKT_LONN 2020-12-31 as wage

generate wage_gr = 0
replace wage_gr = 1 if inrange(wage,1,200000)
replace wage_gr = 2 if inrange(wage,200001,400000)
replace wage_gr = 3 if inrange(wage,400001,600000)
replace wage_gr = 4 if inrange(wage,600001,800000)
replace wage_gr = 5 if wage > 800000

define-labels wage_int 0 '0 kr' 1 '1 - 200 000 kr' 2 '200 001 - 400 000 kr' 3 '400 001 - 600 000 kr' 4 '600 001 - 800 000 kr' 5 'Over 800 000 kr'
assign-labels wage_gr wage_int
tabulate wage_gr

//Alternative way of coding wage intervals by using recode()
replace wage = 0 if sysmiss(wage)
recode wage (1/200000 = 1)(200001/400000 = 2)(400001/600000 = 3)(600001/800000 = 4)(800001/max = 5)
assign-labels wage wage_int
tabulate wage