How to prepare data for survival analysis

There are several ways to calculate “time” (the number of time units from the start of the measurement period until a specific event occurs) in survival analysis. We will demonstrate two methods here:

a) Event-based import and use of start date variables

b) Ready-to-use date variables with fixed values per unit/individual

The script below shows how to prepare the data for each of the two options. Some steps are shared, but the options differ in how the event date is obtained:

require no.ssb.fdb:23 as ds

A) Use of event-based variables and collapse(min)

//Create a dataset with the relevant event-based variable and define the measurement period
create-dataset unemployed
import-event ds/ARBSOEK2001FDT_HOVED 2010-01-01 to 2019-12-15 as workseeker_status

//Keep all events where workseeker status = fully unemployed and start date is on or after 2010-01-01
keep if workseeker_status == '1' & START@workseeker_status >= date(2010,01,01)

//Retrieve the first event and aggregate to individual level data
collapse (min) START@workseeker_status, by(PERSONID_1)

//Run the analysis on a small random sample (optional)
sample 10000 3245

//Calculate the number of days from the start of the measurement period to the first occurrence of the event
generate days = START@workseeker_status - date(2010,01,01)
replace days = 0 if days < 0
summarize days
histogram days

Create the variable event, which takes the value 1 for everyone who has a value for the number of days.
Those who have no value for the number of days, or whose event date falls after the end of the measurement period,
get the value 0 (such individuals are referred to as censored cases).

generate event = 1 if sysmiss(days) == 0
replace event = 0 if sysmiss(days) | START@workseeker_status > date(2019,12,15)

Set the number of days to the maximum value for people for whom the event has not occurred during
the measurement period. These are people who have gone through the entire measurement period without
experiencing the event. They have already been given event = 0 in the step above.

replace days = date(2019,12,15) - date(2010,01,01) if sysmiss(days) | START@workseeker_status > date(2019,12,15)

tabulate event, summarize(days) mean freq
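
The steps above (day count, event indicator, censoring) can be illustrated outside microdata.no with a small Python sketch. It is not part of the script and makes no claim about how microdata.no works internally; the function name days_and_event is hypothetical, and a missing event date is represented as None. An event dated after the end of the period is treated as censored at the period end, which is an assumption of this sketch.

```python
from datetime import date
from typing import Optional, Tuple

PERIOD_START = date(2010, 1, 1)   # start of the measurement period (from the script)
PERIOD_END = date(2019, 12, 15)   # end of the measurement period (from the script)


def days_and_event(first_event: Optional[date]) -> Tuple[int, int]:
    """Return (days, event) for one person.

    first_event is the date of the person's first event, or None if no
    event was registered (hypothetical representation of a missing value).
    """
    max_days = (PERIOD_END - PERIOD_START).days
    if first_event is None or first_event > PERIOD_END:
        # Censored: no event observed within the measurement period.
        return max_days, 0
    # Event observed: count days from the period start, never below zero.
    return max((first_event - PERIOD_START).days, 0), 1


if __name__ == "__main__":
    for d in (date(2010, 1, 31), None, date(2020, 1, 1)):
        print(d, days_and_event(d))
```

Censored persons contribute the full length of the measurement period as their observed time, with event = 0 marking that the event was not seen.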

//Create a year variable to measure time in whole years instead of days
generate year = int(days/365.24)
tabulate year, missing
histogram year, discrete
summarize year event
histogram days
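
As a rough illustration of the year conversion, the Python sketch below applies the same divisor and truncation as int(days/365.24). The function name whole_years is hypothetical. The point is that year counts completed years, so everyone with less than one full year of observed time ends up in year 0.

```python
DAYS_PER_YEAR = 365.24  # same divisor as in the script


def whole_years(days: int) -> int:
    """Convert a day count to completed years by truncating the fraction."""
    return int(days / DAYS_PER_YEAR)


if __name__ == "__main__":
    for d in (0, 365, 366, 3635):
        print(d, whole_years(d))
```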

Import relevant variables in order to compare survival rates between groups of the population

import ds/BEFOLKNING_KJOENN as gender
import ds/BEFOLKNING_INVKAT as imm_cat
import ds/BEFOLKNING_FOEDSELS_AAR_MND as birth_year_month

generate age2010 = 2010 - int(birth_year_month/100)
generate agegroup = 1
replace agegroup = 2 if age2010 > 30
replace agegroup = 3 if age2010 > 50
define-labels agelabel 1 "age 0-30" 2 "age 31-50" 3 "age 51 ->"
assign-labels agegroup agelabel
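
The age grouping relies on the order of the replace statements: each later statement overwrites the earlier value for the older ages. A minimal Python sketch of the same cut-offs, with a hypothetical function name age_group, makes the resulting intervals explicit.

```python
def age_group(age_2010: int) -> int:
    """Map age in 2010 to the group codes used in the script."""
    if age_2010 > 50:
        return 3  # "age 51 ->"
    if age_2010 > 30:
        return 2  # "age 31-50"
    return 1      # "age 0-30"


if __name__ == "__main__":
    for age in (30, 31, 50, 51):
        print(age, age_group(age))
```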

generate norwegian = 0
replace norwegian = 1 if imm_cat == 'A'

kaplan-meier event year
kaplan-meier event days

kaplan-meier event year, by(gender)
kaplan-meier event days, by(gender)

kaplan-meier event year, by(agegroup)
kaplan-meier event days, by(agegroup)

summarize norwegian
tabulate event norwegian
tabulate year norwegian

define-labels norwegianlabel 0 "Foreign origin" 1 "Norwegian origin"
assign-labels norwegian norwegianlabel

kaplan-meier event year, by(norwegian)
kaplan-meier event days, by(norwegian)

kaplan-meier event year, by(imm_cat)
kaplan-meier event days, by(imm_cat)
tabulate year imm_cat

cox event year norwegian age2010 i.gender
cox event year norwegian age2010 i.gender, hazard
cox event days norwegian age2010 i.gender
cox event days norwegian age2010 i.gender, hazard
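
To show what a Kaplan-Meier analysis does with the two prepared variables (time and event), here is a minimal Python sketch of the textbook product-limit estimator. It is an illustration only and not a description of how the kaplan-meier command is implemented; the function name and the toy data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[int], events: Sequence[int]) -> List[Tuple[int, float]]:
    """Return [(time, survival probability)] for each time with at least one event.

    times  : observed time per person (for example days or years)
    events : 1 if the event occurred at that time, 0 if the person is censored
    """
    event_counts = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)   # everyone leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = event_counts.get(t, 0)
        if d > 0:
            # Multiply by the share of those at risk who did not have the event at t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Persons with the event and censored persons both leave the risk set.
        at_risk -= leaving[t]
    return curve


if __name__ == "__main__":
    toy_times = [1, 2, 2, 3, 4]    # hypothetical data
    toy_events = [1, 1, 0, 1, 0]
    for t, s in kaplan_meier(toy_times, toy_events):
        print(t, round(s, 3))
```

Censored persons never lower the curve directly, but they reduce the number at risk at later times, which is why the event variable must be prepared alongside the time variable.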

B) Cross-sectional dataset with dates collected from fixed variables

//Create a dataset of persons over 70 years of age who were resident in Norway as of 2010-01-01
create-dataset elder
import ds/BEFOLKNING_FOEDSELS_AAR_MND as birth_year_month
import ds/BEFOLKNING_STATUSKODE 2010-01-01 as regstat
generate age = 2010 - int(birth_year_month/100)
keep if age > 70 & regstat == '1'

Import a ready-to-use date variable (fixed information): date of death. The value must be split into
year, month and day so that the date() function can convert it to the standard date format (based on Unix time).

import ds/BEFOLKNING_DOEDS_DATO as death_date
summarize death_date
replace death_date = string(death_date)
generate yyyy = substr(death_date,1,4)
generate mm = substr(death_date,5,2)
generate dd = substr(death_date,7,2)
destring yyyy
destring mm
destring dd
generate death_date2 = date(yyyy,mm,dd)
summarize death_date2
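
The conversion above can be sketched in Python as follows. The sketch assumes that the death date is stored as a number of the form YYYYMMDD, as the substr() positions suggest, and that date values count days since 1970-01-01; the latter is an assumption about the platform, not something the script states. The function names are hypothetical.

```python
from datetime import date
from typing import Optional

EPOCH = date(1970, 1, 1)  # assumed reference date for day counts


def parse_yyyymmdd(value: Optional[int]) -> Optional[date]:
    """Split a YYYYMMDD number into year, month and day and build a date."""
    if value is None:
        return None  # no death date registered
    text = str(value)
    # Same pieces as substr(death_date,1,4), substr(death_date,5,2), substr(death_date,7,2)
    return date(int(text[0:4]), int(text[4:6]), int(text[6:8]))


def days_since_epoch(d: date) -> int:
    """Number of days from the assumed reference date to d."""
    return (d - EPOCH).days


if __name__ == "__main__":
    death = parse_yyyymmdd(20150317)  # hypothetical value
    print(death, days_since_epoch(death))
```

Once both dates are expressed as day counts, the time variable is a simple subtraction, as in the next step of the script.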

//Calculate number of days measured from 2010-01-01 until death date
generate days = death_date2 - date(2010,01,01)
replace days = 0 if days < 0

Set event = 1 if the death date falls on or after 2010-01-01 and no later than 2023-01-01,
which is used as the last measurement date. Everyone else gets event = 0.

generate event = 0
replace event = 1 if sysmiss(death_date) == 0 & death_date2 >= date(2010,01,01) & death_date2 <= date(2023,01,01)

Set the number of days to the maximum value if there is no death date or if the death occurs after the last measurement date

replace days = date(2023,01,01) - date(2010,01,01) if sysmiss(days) | death_date2 > date(2023,01,01)

tabulate event, summarize(days) mean freq

//Generate a year variable for measurement in number of years
generate year = int(days/365.24)
tabulate year

kaplan-meier event year
kaplan-meier event days

//Import gender to compare survival rates between genders
import ds/BEFOLKNING_KJOENN as gender

kaplan-meier event year, by(gender)
kaplan-meier event days, by(gender)

cox event year age i.gender
cox event year age i.gender, hazard
cox event days age i.gender
cox event days age i.gender, hazard
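
When reading Cox output, it helps to know how a coefficient relates to a hazard ratio: the hazard ratio is the exponential of the coefficient. The Python sketch below shows only this arithmetic. That the hazard option reports hazard ratios instead of coefficients is an assumption here, and the function name and the coefficient value are hypothetical, not results from the analysis.

```python
import math


def hazard_ratio(coefficient: float) -> float:
    """Convert a Cox regression coefficient to a hazard ratio."""
    return math.exp(coefficient)


if __name__ == "__main__":
    example_coefficient = 0.05  # hypothetical value, not taken from any output
    print(round(hazard_ratio(example_coefficient), 4))
```

A hazard ratio above 1 means a higher event rate per unit increase in the variable, below 1 a lower rate, and exactly 1 (coefficient 0) no difference.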