Advanced logistic regression analysis
This example shows how to analyse the probability of having a job and earning more than NOK 500,000 one year after being without a job. The age group studied is 16-60.
The analysis controls for various demographic characteristics as well as labour market status (unemployed, on ordinary labour market measures, vocationally disabled, other jobseeker conditions, and work disability).
Some descriptive statistics are created first, and finally a logit analysis including marginal effects is run (the option mfx(dydx) is used for this).
As the output shows, all explanatory variables have statistically significant estimated coefficients. The Pseudo R2 value indicates that the model explains approximately 18% of the total variation in the dependent variable. Such values are not unusual in socio-economic analyses.
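To make the Pseudo R2 figure more concrete, the following Python sketch computes a pseudo R2 by hand. It assumes the McFadden definition (one minus the ratio of the model log-likelihood to the log-likelihood of an intercept-only model); the text does not state which definition the platform reports, so this is an assumption. The outcomes and fitted probabilities are made up for illustration and are not taken from the analysis. The function names are hypothetical.

```python
import math


def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y given predicted probabilities p."""
    return sum(
        math.log(pi) if yi == 1 else math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )


def mcfadden_r2(y, p_model):
    """McFadden pseudo R2: 1 - LL(model) / LL(intercept-only model)."""
    # An intercept-only logit predicts the sample share of ones for everyone.
    base_rate = sum(y) / len(y)
    ll_null = log_likelihood(y, [base_rate] * len(y))
    ll_model = log_likelihood(y, p_model)
    return 1.0 - ll_model / ll_null


# Made-up outcomes (1 = high income) and fitted probabilities, illustration only.
y = [0, 0, 0, 1, 0, 1, 1, 0]
p_hat = [0.1, 0.2, 0.3, 0.7, 0.4, 0.6, 0.8, 0.2]

r2 = mcfadden_r2(y, p_hat)
print(round(r2, 3))
```

Under this definition the measure describes the relative improvement in log-likelihood over a model without explanatory variables, so it is only loosely comparable to the R2 of a linear regression.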
//Connect to database
require no.ssb.fdb:23 as db
//Create population of persons aged 16-60 without a job in November 2018 and resident in Norway as of 1 January 2019
create-dataset demographydata
import db/BEFOLKNING_FOEDSELS_AAR_MND as birth_year_month
import db/BEFOLKNING_STATUSKODE 2019-01-01 as regstat
import db/REGSYS_ARB_ARBMARK_STATUS 2018-11-16 as labourstat
generate age = 2018 - int(birth_year_month / 100)
generate job = 0
replace job = 1 if labourstat == '1' | labourstat == '2'
keep if age >= 16 & age <= 60 & regstat == '1' & job == 0
histogram age, discrete
//Import relevant variables (demographic data are mostly measured as of 1 January each year)
import db/BEFOLKNING_KJOENN as gender
import db/BEFOLKNING_INVKAT as imm_cat
import db/SIVSTANDFDT_SIVSTAND 2018-11-16 as civstat
import db/BEFOLKNING_BARN_I_REGSTAT_FAMNR 2019-01-01 as children
import db/NUDB_BU 2018-11-16 as edu
import db/NUDB_SOSBAK as social_background
import db/BEFOLKNING_KOMMNR_FAKTISK 2019-01-01 as municipality
import db/ARBSOEK2001FDT_HOVED 2018-11-16 as work_seeker_stat
import db/UFOERP2011FDT_GRAD 2018-11-16 as disability_level
import db/INNTEKT_BRUTTOFORM 2018-12-31 as wealth
import db/INNTEKT_WYRKINNT 2019-12-31 as work_income19
//Create a dependent variable with two outcomes (dummy variable): High work income vs. low work income
histogram work_income19, width(100000) freq
summarize work_income19
generate high_income = 0
replace high_income = 1 if work_income19 > 500000
piechart high_income
//Adapt the independent variables so that they suit the statistical model (most of them need to be transformed into dummy variables)
generate male = 0
replace male = 1 if gender == '1'
piechart male
destring civstat
generate married = 0
replace married = 1 if civstat == 2
replace married = civstat if sysmiss(civstat)
piechart married
generate immigrant = 0
replace immigrant = 1 if imm_cat == 'B'
piechart immigrant
tabulate children, missing
generate child = 0
replace child = 1 if children == 1
generate more_children = 0
replace more_children = 1 if children > 1
destring edu
generate high_edu = 0
replace high_edu = 1 if edu >= 700000 & edu < 900000
replace high_edu = edu if sysmiss(edu)
piechart high_edu
generate high_edu_parents = 0
replace high_edu_parents = 1 if social_background == '1'
piechart high_edu_parents
generate oslo = 0
replace oslo = 1 if municipality == '0301'
generate bergen = 0
replace bergen = 1 if municipality == '1201'
generate stavanger = 0
replace stavanger = 1 if municipality == '1103'
generate trondheim = 0
replace trondheim = 1 if municipality == '5001'
barchart(sum) oslo bergen stavanger trondheim
destring work_seeker_stat
tabulate work_seeker_stat, missing
generate unempl = 0
replace unempl = 1 if work_seeker_stat == 1
generate measure = 0
replace measure = 1 if work_seeker_stat == 3
generate voc_disabled = 0
replace voc_disabled = 1 if work_seeker_stat == 5 | work_seeker_stat >= 10
generate other_workseekers = 0
replace other_workseekers = 1 if work_seeker_stat == 2 | work_seeker_stat == 4 | work_seeker_stat == 7
generate disabled = 1
replace disabled = 0 if sysmiss(disability_level)
barchart(sum) unempl measure voc_disabled other_workseekers disabled
histogram wealth, width(100000) freq
summarize wealth
generate wealth_high = 0
replace wealth_high = 1 if wealth > 1000000
replace wealth_high = wealth if sysmiss(wealth)
piechart wealth_high
//Use Sankey diagrams to show transitions between states
sankey work_seeker_stat high_income
sankey high_edu high_income
//Run logit analysis where the dependent variable is always listed first (it needs to be a dummy variable)
logit high_income male married age immigrant child more_children high_edu high_edu_parents oslo bergen stavanger trondheim unempl measure voc_disabled other_workseekers disabled wealth_high, mfx(dydx)
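As a rough illustration of what a marginal effect (dy/dx) in a logit model means, the Python sketch below evaluates the derivative of the predicted probability with respect to each explanatory variable at one chosen point, using dP/dx_k = beta_k * p * (1 - p). The coefficients, the intercept and the evaluation point are hypothetical and are not output from the analysis above. Whether the platform evaluates the effects at the sample means or averages them over observations is not stated here, so the choice of a single evaluation point is an assumption.

```python
import math


def logistic(z):
    """Logistic function mapping a linear index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def marginal_effects(intercept, coefs, point):
    """Return the predicted probability and dP/dx_k for each variable at `point`."""
    z = intercept + sum(coefs[name] * point[name] for name in coefs)
    p = logistic(z)
    # The derivative of the logistic function gives beta_k * p * (1 - p).
    effects = {name: beta * p * (1.0 - p) for name, beta in coefs.items()}
    return p, effects


# Hypothetical coefficients and evaluation point (for example sample means).
intercept = -1.0
coefs = {"male": 0.5, "age": -0.02, "high_edu": 0.9}
point = {"male": 0.5, "age": 35.0, "high_edu": 0.3}

p, effects = marginal_effects(intercept, coefs, point)
for name, effect in effects.items():
    print(name, round(effect, 4))
```

Each effect has the same sign as its coefficient and can never exceed a quarter of the coefficient in absolute value, since p * (1 - p) is at most 0.25. For dummy variables the derivative is only an approximation of the change in probability when the variable switches from 0 to 1.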