Biostat III exercises in R

Laboratory exercise 8

Suggested solutions by

Author: Annika Tillander, 2014-01-30
Edited: Andreas Karlsson, 2015-03-01, 2016-03-09
Benedicte Delcoigne, 2015-03-03
Johan Zetterqvist, 2015-03-04

Diet data: Using Poisson regression to study the effect of energy intake adjusting for confounders on two different timescales

Use Poisson regression to study the association between energy intake (hieng) and CHD adjusted for potential confounders (job, BMI). We know that people who expend a lot of energy (i.e., are physically active) require a higher energy intake. We do not have data on physical activity but we are hoping that occupation (job) will serve as a surrogate measure of work-time physical activity (conductors on London double-decker busses expend energy walking up and down the stairs all day).

Fit models both without adjusting for ‘time’ and by adjusting for attained age (you will need to split the data) and time-since-entry and compare the results.


Load the diet data using time-on-study as the timescale.

You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest") for each package you require.

require(survival) # for Surv and survfit
require(foreign)  # for reading data set from Stata
require(ggplot2)
require(dplyr)    # for data manipulation
require(muhaz)    # for hazard estimates


###########################################
### A help function to calculate ###
### and print incidence (hazard) ratios
### from a fitted poisson regression
### from glm
###########################################

IRR <- function(fit){
    summfit <- summary(fit )$coefficients
    IRfit <- exp( cbind( summfit[, 1:2], summfit[, 1] - 1.96*summfit[, 2], summfit[, 1] +
                        1.96*summfit[, 2] ) )
    colnames(IRfit) <- c("IRR", "Std. err", "CI_lower", "CI_upper")
    print(IRfit)
}

Load diet data and explore it.

diet <- read.dta("http://biostat3.net/download/diet.dta")
head(diet)
summary(diet)

Rates can be modelled on different timescales, e.g., attained age, time-since-entry, calendar time. Plot the CHD incidence rates both by attained age and by time-since-entry. Is there a difference? Do the same for CHD hazard by different energy intakes (hieng).

(a1) Plot the CHD incidence rates by attained age.Do the same for CHD hazard by different energy intakes (hieng). The first step will be to create the variable “att_age” i.e. attained age.

## Creating a variable for attained age
diet <- diet %>% mutate(att_age = as.numeric(dox - dob) / 365.24)
summary(diet$att_age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   42.57   58.97   64.32   63.13   69.05   70.00

Plot the CHD incidence rates by attained age.

(a2) Plot the CHD incidence rates by time-since-entry. Do the same for CHD hazard by different energy intakes (hieng).

diet <- diet %>% mutate(t_entry = as.numeric(dox - doe) / 365.24)
summary(diet$t_entry)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2875 10.7800 15.4600 13.6600 17.0400 20.0400
dietHiengEntry <- diet %>%  group_by(hieng) %>%
    do( h = muhaz(times = .$t_entry, delta = .$chd,
            min.time = min(.$t_entry[as.logical(.$chd)]), max.time = max(.$t_entry[as.logical(.$chd)]))) %>%
    do( data.frame(Hazard = .$h$haz.est, Time = .$h$est.grid, Strata = .$hieng))
ggplot(dietHiengEntry, aes(x=Time, y=Hazard, colour= Strata)) +
    geom_line() + ggtitle("Smoothed Hazard estimates") + theme(legend.position="bottom")

(b) Fit a poisson model to find the incidence rate ratio for the high energy group compared to the low energy group without adjusting for any time scale.

diet <- mutate(diet, y1k = y / 1000)
poisson8b <- glm( chd ~ hieng + offset( log( y1k ) ), family=poisson, data=diet)
summary(poisson8b)
## 
## Call:
## glm(formula = chd ~ hieng + offset(log(y1k)), family = poisson, 
##     data = diet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7382  -0.6337  -0.4899  -0.3891   3.0161  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.6098     0.1890  13.811   <2e-16 ***
## hienghigh    -0.6532     0.3021  -2.162   0.0306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 258.00  on 335  degrees of freedom
## AIC: 354
## 
## Number of Fisher Scoring iterations: 6
exp(cbind(coef(poisson8b),confint(poisson8b)))
##                            2.5 %     97.5 %
## (Intercept) 13.5959916 9.1614552 19.2715805
## hienghigh    0.5203599 0.2829171  0.9328392

(c) Adjust for BMi and job. Is the effect of energy intake on CHD confounded by BMI and job?

## Create BMI variable
diet$bmi <- diet$weight/((diet$height/100)^2)

## Create orderly varable instead of categorical, start at zero
diet <- diet %>% mutate(jobNumber = match(job, c("driver","conductor","bank")) -1)

poisson8c <- glm( chd ~ hieng + jobNumber + bmi + offset( log( y1k ) ), family=poisson, data=diet)
summary(poisson8c)
## 
## Call:
## glm(formula = chd ~ hieng + jobNumber + bmi + offset(log(y1k)), 
##     family = poisson, data = diet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8376  -0.6007  -0.4815  -0.3919   3.0614  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.49616    1.18209   1.266   0.2056  
## hienghigh   -0.69995    0.30986  -2.259   0.0239 *
## jobNumber   -0.08706    0.17170  -0.507   0.6121  
## bmi          0.05091    0.04757   1.070   0.2845  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 255.01  on 331  degrees of freedom
## Residual deviance: 249.03  on 328  degrees of freedom
##   (5 observations deleted due to missingness)
## AIC: 347.03
## 
## Number of Fisher Scoring iterations: 6
exp(cbind(coef(poisson8c),confint(poisson8c)))
##                           2.5 %    97.5 %
## (Intercept) 4.4644913 0.4223726 43.679658
## hienghigh   0.4966098 0.2663418  0.905103
## jobNumber   0.9166234 0.6563978  1.292278
## bmi         1.0522316 0.9579153  1.154651
##########################################################

(d) Now fit the model for CHD, both without and with the adjustment for job and bmi. Is the effect of hieng on CHD confounded by age, BMI or job? Write the linear predictors using pen and paper.

Firstly, let’s adjust for the timescale attained age. To do this in Poisson regression you must split the data on timescale age. The risktime variable contains the correct amount of risktime for each timeband.

## Create a variable for age at entry
diet <- mutate(diet, entry_age = as.numeric( doe - dob ) / 365.24)

## Split time at 30,50,60 and 72 with time scale age at entry to attained age, the zero=0 makes output be in absolute age
diet.spl.dob <- survSplit(diet, cut=c(30,50,60,72), end="att_age", start="entry_age", event="chd", zero = 0)

## Tabulate ageband
diet.spl.dob %>% select(id, entry_age, att_age, y) %>% filter(id<=3) %>% arrange(id, entry_age)
##   id entry_age  att_age        y
## 1  1  49.61669 50.00000 12.29295
## 2  1  50.00000 60.00000 12.29295
## 3  1  60.00000 61.90998 12.29295
## 4  2  50.53937 60.00000 11.95893
## 5  2  60.00000 62.49863 11.95893
## 6  3  58.78600 60.00000 11.04175
## 7  3  60.00000 69.82806 11.04175
## Create an agespan variable
diet.spl.dob <- mutate(diet.spl.dob,
                       agespan = NA,
                       agespan = ifelse( entry_age>=30 & entry_age <50 , 30, agespan),
                       agespan = ifelse( entry_age>=50 & entry_age <60 , 50, agespan),
                       agespan = ifelse( entry_age>=60 & entry_age <72, 60, agespan))

## Make the numeric variables factors since we want to model them with dummie variables and calculate time at risk
diet.spl.dob <- mutate(diet.spl.dob,
                       agespan = as.factor(agespan),
                       jobNumber = as.factor(jobNumber),
                       risk_time = (att_age - entry_age))

## Tabulate ageband including risk_time
diet.spl.dob %>% select(id,  entry_age, att_age, y,risk_time) %>% filter(id<=3) %>% arrange(id, entry_age)
##   id entry_age  att_age        y  risk_time
## 1  1  49.61669 50.00000 12.29295  0.3833096
## 2  1  50.00000 60.00000 12.29295 10.0000000
## 3  1  60.00000 61.90998 12.29295  1.9099770
## 4  2  50.53937 60.00000 11.95893  9.4606286
## 5  2  60.00000 62.49863 11.95893  2.4986310
## 6  3  58.78600 60.00000 11.04175  1.2139963
## 7  3  60.00000 69.82806 11.04175  9.8280583
## Tabulate number of events per agespan
diet.spl.dob %>%
    group_by(agespan) %>%
    summarise(noFailure=sum(chd==0), Failure=sum(chd), Total=n())
## Source: local data frame [3 x 4]
## 
##   agespan noFailure Failure Total
## 1      30       190       6   196
## 2      50       275      18   293
## 3      60       218      22   240

Fitting the model for CHD, without adjustment for job and bmi.

poisson8d <- glm( chd ~ hieng + agespan + offset( log( risk_time) ),
                 family=poisson,
                 data=diet.spl.dob)

summary(poisson8d)
## 
## Call:
## glm(formula = chd ~ hieng + agespan + offset(log(risk_time)), 
##     family = poisson, data = diet.spl.dob)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6253  -0.4312  -0.3246  -0.2050   2.8938  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.7798     0.4320 -11.065   <2e-16 ***
## hienghigh    -0.6233     0.3026  -2.060   0.0394 *  
## agespan50     0.3025     0.4721   0.641   0.5217    
## agespan60     0.8451     0.4613   1.832   0.0670 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 321.05  on 728  degrees of freedom
## Residual deviance: 311.40  on 725  degrees of freedom
## AIC: 411.4
## 
## Number of Fisher Scoring iterations: 6
IRR(poisson8d)
##                     IRR Std. err    CI_lower   CI_upper
## (Intercept) 0.008397598 1.540327 0.003601089 0.01958287
## hienghigh   0.536168916 1.353438 0.296266536 0.97033270
## agespan50   1.353254649 1.603338 0.536451590 3.41372489
## agespan60   2.328213936 1.586165 0.942628010 5.75049762

Fitting the model for CHD, with adjustment for job and bmi.

poisson8d <- glm( chd ~ hieng + agespan + jobNumber + bmi + offset( log( risk_time) ),
                 family=poisson,
                 data=diet.spl.dob)

summary(poisson8d)
## 
## Call:
## glm(formula = chd ~ hieng + agespan + jobNumber + bmi + offset(log(risk_time)), 
##     family = poisson, data = diet.spl.dob)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7714  -0.4027  -0.3083  -0.1940   3.0174  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.79183    1.31326  -5.172 2.32e-07 ***
## hienghigh   -0.71303    0.31387  -2.272   0.0231 *  
## agespan50    0.53692    0.50868   1.056   0.2912    
## agespan60    1.07421    0.49668   2.163   0.0306 *  
## jobNumber1   0.43510    0.40670   1.070   0.2847    
## jobNumber2  -0.13791    0.37184  -0.371   0.7107    
## bmi          0.07388    0.04852   1.523   0.1278    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 313.24  on 718  degrees of freedom
## Residual deviance: 298.77  on 712  degrees of freedom
##   (10 observations deleted due to missingness)
## AIC: 402.77
## 
## Number of Fisher Scoring iterations: 6
IRR(poisson8d)
##                     IRR Std. err     CI_lower   CI_upper
## (Intercept) 0.001122909 3.718282 8.559996e-05 0.01473044
## hienghigh   0.490157708 1.368719 2.649476e-01 0.90680021
## agespan50   1.710737260 1.663102 6.312223e-01 4.63643632
## agespan60   2.927691811 1.643254 1.105972e+00 7.75008735
## jobNumber1  1.545112410 1.501852 6.962588e-01 3.42885793
## jobNumber2  0.871175509 1.450398 4.203306e-01 1.80559478
## bmi         1.076677746 1.049711 9.790139e-01 1.18408426

The effect of high energy intake is somewhat confounded by age, but also confounded by job and bmi. What assumption is being made about the shape of the baseline hazard (HINT: the baseline hazard takes the shape of the timescale)?

(e) Secondly, do the same analysis, but now adjust for the timescale time-since-entry.

diet.spl.t_entry <- survSplit(diet, cut=c(0, 5, 10, 15, 22), end="t_entry", start="start", event="chd")

##Tabulate ageband
diet.spl.t_entry %>% select(id, start, t_entry, y) %>% filter(id<=3) %>% arrange(id, t_entry)
##   id start  t_entry        y
## 1  1     0  5.00000 12.29295
## 2  1     5 10.00000 12.29295
## 3  1    10 12.29329 12.29295
## 4  2     0  5.00000 11.95893
## 5  2     5 10.00000 11.95893
## 6  2    10 11.95926 11.95893
## 7  3     0  5.00000 11.04175
## 8  3     5 10.00000 11.04175
## 9  3    10 11.04205 11.04175
diet.spl.t_entry <- mutate(diet.spl.t_entry,
                           fu = as.factor(start) ,
                           risk_time = (t_entry-start))

##Tabulate ageband including risk_time
diet.spl.t_entry %>% select(id, start, t_entry, y, risk_time) %>% filter(id<=3) %>% arrange(id, t_entry)
##   id start  t_entry        y risk_time
## 1  1     0  5.00000 12.29295  5.000000
## 2  1     5 10.00000 12.29295  5.000000
## 3  1    10 12.29329 12.29295  2.293287
## 4  2     0  5.00000 11.95893  5.000000
## 5  2     5 10.00000 11.95893  5.000000
## 6  2    10 11.95926 11.95893  1.959260
## 7  3     0  5.00000 11.04175  5.000000
## 8  3     5 10.00000 11.04175  5.000000
## 9  3    10 11.04205 11.04175  1.042055
poisson8e1 <- glm( chd ~ fu + hieng + offset( log( risk_time) ),
                 family=poisson,
                 data=diet.spl.t_entry )

summary(poisson8e1)
## 
## Call:
## glm(formula = chd ~ fu + hieng + offset(log(risk_time)), family = poisson, 
##     data = diet.spl.t_entry)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3994  -0.3344  -0.2717  -0.2417   4.2937  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.25958    0.26933 -15.815   <2e-16 ***
## fu5         -0.23369    0.37705  -0.620   0.5354    
## fu10         0.12151    0.36841   0.330   0.7415    
## fu15        -0.05012    0.55562  -0.090   0.9281    
## hienghigh   -0.64923    0.30212  -2.149   0.0316 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 391.17  on 1099  degrees of freedom
## Residual deviance: 385.52  on 1095  degrees of freedom
## AIC: 487.52
## 
## Number of Fisher Scoring iterations: 6
IRR(poisson8e1)
##                    IRR Std. err    CI_lower   CI_upper
## (Intercept) 0.01412826 1.309093 0.008333485 0.02395249
## fu5         0.79160514 1.457978 0.378056495 1.65752659
## fu10        1.12919965 1.445431 0.548498785 2.32469404
## fu15        0.95111381 1.743027 0.320093214 2.82610640
## hienghigh   0.52244901 1.352725 0.288983643 0.94452739

Compare the results with the analysis adjusted for attained age. Are there any differences? Why (or why not)? Go back to the graphs at the beginning of the exercise and look for explanations.