Exercise 8. Diet data: Using Poisson regression to study the effect of energy intake adjusting for confounders on two different timescales

Use Poisson regression to study the association between energy intake (hieng) and CHD adjusted for potential confounders (job, BMI). We know that people who expend a lot of energy (i.e., are physically active) require a higher energy intake. We do not have data on physical activity but we are hoping that occupation (job) will serve as a surrogate measure of work-time physical activity (conductors on London double-decker busses expend energy walking up and down the stairs all day).

Fit models both without adjusting for ‘time’ and by adjusting for attained age (you will need to split the data) and time-since-entry and compare the results.

Load the diet data using time-on-study as the timescale.

You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest") for each package you require.

library(biostat3) 
library(bshazard)  # for bshazard
library(knitr)     # kable
library(broom)     # tidy

Load diet data and explore it.

diet <- transform(biostat3::diet, bmi = weight/((height/100)^2))
head(diet) |> kable("html")

id	chd	y	hieng	energy	job	month	height	weight	doe	dox	dob	yoe	yox	yob	bmi
127	0	16.791239	low	2023.25	conductor	2	173.9900	61.46280	1960-02-16	1976-12-01	1910-09-27	1960.126	1976.917	1910.737	20.30316
200	0	19.958933	low	2448.68	bank	12	177.8000	73.48320	1956-12-16	1976-12-01	1909-06-18	1956.958	1976.917	1909.460	23.24473
198	0	19.958933	low	2281.38	bank	12	NA	NA	1956-12-16	1976-12-01	1910-06-30	1956.958	1976.917	1910.493	NA
222	0	15.394935	low	2467.95	bank	2	158.7500	58.24224	1957-02-16	1972-07-10	1902-07-11	1957.126	1972.523	1902.523	23.11057
305	1	1.494867	low	2362.93	bank	1	NA	NA	1960-01-16	1961-07-15	1913-06-30	1960.041	1961.534	1913.493	NA
173	0	15.958932	low	2380.11	conductor	12	164.4904	79.01712	1960-12-16	1976-12-01	1915-06-28	1960.958	1976.917	1915.487	29.20385

The rates plotted on timescale attained age show a clear increasing trend as age increases, which is to be expected (older persons are more likely to suffer from CHD). The rates plotted on timescale time-since-entry are almost constant (if you have some imagination you can see that the rates are flat).

(a)

scale <- 365.24
plot(bshazard(Surv((doe - dob)/scale, (dox - dob)/scale, chd) ~ 1,
              verbose=FALSE,
              data=subset(diet,hieng=="low")),
     ylim=c(0,0.03), conf.int=FALSE, xlab="Attained age (years)")
lines(bshazard(Surv((doe - dob)/scale, (dox - dob)/scale, chd) ~ 1,
               verbose=FALSE,
               data=subset(diet,hieng=="high")),
      col="red", conf.int=FALSE)
legend("topleft", legend=c('hieng=="low"','hieng=="high"'), col=1:2, lty=1, bty="n")

par(mfrow=1:2)
for (level in levels(diet$hieng))
    plot(bshazard(Surv((doe - dob)/scale, (dox - dob)/scale, chd) ~ 1,
                  verbose=FALSE,
                  data=subset(diet,hieng==level)),
         ylim=c(0,0.03), xlab="Attained age (years)",
         main=sprintf('hieng == "%s"', level))

if (requireNamespace("muhaz")) {
    scale <- 365.24
    plot(muhaz2(Surv((dox - doe)/scale, chd) ~ hieng, data=diet), lty=2,
         xlab="Time since study entry (years)", ylim=c(0,0.025),
         legend=FALSE)
    lines(bshazard(Surv((dox - doe)/scale, chd) ~ 1, v=F, data=subset(diet,hieng=="low")),
          conf.int=FALSE)
    lines(bshazard(Surv((dox - doe)/scale, chd) ~ 1, v=F, data=subset(diet,hieng=="high")),
          col="red", conf.int=FALSE)
    legend("topleft", legend=c('hieng=="low" (bshazard)','hieng=="high" (bshazard)',
                               'hieng=="low" (muhaz)','hieng=="high" (muhaz)'),
           col=1:2, lty=c(1,1,2,2), bty="n")
}

scale <- 365.24
par(mfrow=1:2)
for (level in levels(diet$hieng))
    plot(bshazard(Surv((dox - doe)/scale, chd) ~ 1,
                  verbose=FALSE,
                  data=subset(diet,hieng==level)),
         ylim=c(0,0.03), xlab="Time since study entry (years)",
         main=sprintf('hieng == "%s"', level))

(b)

Patients with high energy intake have 48% less CHD rate. The underlying shape of the rates is assumed to be constant (i.e. the baseline is flat) over time.

poisson8b <- glm(chd ~ hieng + offset(log(y/1000)), family=poisson, data=diet)
summary(poisson8b)

## 
## Call:
## glm(formula = chd ~ hieng + offset(log(y/1000)), family = poisson, 
##     data = diet)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.6098     0.1890  13.811   <2e-16 ***
## hienghigh    -0.6532     0.3021  -2.162   0.0306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 258.00  on 335  degrees of freedom
## AIC: 354
## 
## Number of Fisher Scoring iterations: 6

broom::tidy(poisson8b, conf.int=TRUE, exponentiate=TRUE) # tidyverse (profile-based CIs)

## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   13.6       0.189     13.8  2.20e-43    9.16     19.3  
## 2 hienghigh      0.520     0.302     -2.16 3.06e- 2    0.283     0.933

biostat3::eform(poisson8b) # Wald-based CIs

##              exp(beta)     2.5 %     97.5 %
## (Intercept) 13.5959916 9.3877130 19.6907371
## hienghigh    0.5203599 0.2878432  0.9407012

(c)

The effect of high energy intake is slightly confounded by bmi and job, since the point estimate changes a little.

## Create BMI variable
diet <- transform(diet, bmi=weight/((height/100)^2))

## Create orderly varable instead of categorical, start at zero
levels(diet$job)

## [1] "driver"    "conductor" "bank"

diet <- transform(diet, jobNumber = unclass(job) - 1)

poisson8c <- glm(chd ~ hieng + jobNumber + bmi + offset(log(y/1000)),
                 family=poisson, data=diet)
summary(poisson8c)

## 
## Call:
## glm(formula = chd ~ hieng + jobNumber + bmi + offset(log(y/1000)), 
##     family = poisson, data = diet)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.49616    1.18209   1.266   0.2056  
## hienghigh   -0.69995    0.30986  -2.259   0.0239 *
## jobNumber   -0.08706    0.17170  -0.507   0.6121  
## bmi          0.05091    0.04757   1.070   0.2845  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 255.01  on 331  degrees of freedom
## Residual deviance: 249.03  on 328  degrees of freedom
##   (5 observations deleted due to missingness)
## AIC: 347.03
## 
## Number of Fisher Scoring iterations: 6

eform(poisson8c)

##             exp(beta)     2.5 %     97.5 %
## (Intercept) 4.4644913 0.4401212 45.2868008
## hienghigh   0.4966098 0.2705588  0.9115256
## jobNumber   0.9166234 0.6546985  1.2833364
## bmi         1.0522316 0.9585556  1.1550622

##########################################################

(d)

The y variable is not correct since it is kept for all splitted records, and contains the complete follow-up rather than the risktime in that specific timeband. The risktime variable contains the correct amount of risktime for each timeband. The event variable chd is not correct since it is kept constant for all splitted records, while it should only be 1 for the last record (if the person has the event). For all other records (timebands) for that person it should be 0.

## Split time at 30,50,60 and 72 with time scale age at entry to attained age
scale <- 365.24
age.cuts <- c(30,50,60,72)
diet.spl.dob <- survSplit(Surv((doe - dob)/scale, (dox - dob)/scale, chd) ~ .,
                          data=diet, cut=age.cuts,start="tstart",end="tstop")


## Tabulate ageband
diet.spl.dob |> subset(id<=3, select=c(id, tstart, tstop, y))

##     id   tstart    tstop        y
## 60   2 50.53937 60.00000 11.95893
## 61   2 60.00000 62.49863 11.95893
## 316  3 58.78600 60.00000 11.04175
## 317  3 60.00000 69.82806 11.04175
## 460  1 49.61669 50.00000 12.29295
## 461  1 50.00000 60.00000 12.29295
## 462  1 60.00000 61.90998 12.29295

## Create an agespan variable
diet.spl.dob <- transform(diet.spl.dob,
                          agespan = cut(tstop, age.cuts))

## Make the numeric variables factors since we want to model them with dummie variables and calculate time at risk
diet.spl.dob <- transform(diet.spl.dob,
                          jobNumber = as.factor(jobNumber),
                          risk_time = (tstop-tstart))

## Tabulate ageband including risk_time
diet.spl.dob |> subset(id<=3, select=c(id,  tstart, tstop, y,risk_time))

##     id   tstart    tstop        y  risk_time
## 60   2 50.53937 60.00000 11.95893  9.4606286
## 61   2 60.00000 62.49863 11.95893  2.4986310
## 316  3 58.78600 60.00000 11.04175  1.2139963
## 317  3 60.00000 69.82806 11.04175  9.8280583
## 460  1 49.61669 50.00000 12.29295  0.3833096
## 461  1 50.00000 60.00000 12.29295 10.0000000
## 462  1 60.00000 61.90998 12.29295  1.9099770

## Tabulate number of events per agespan
xtabs(~agespan+chd,diet.spl.dob)

##          chd
## agespan     0   1
##   (30,50] 190   6
##   (50,60] 275  18
##   (60,72] 218  22

The effect of high energy intake is somewhat confounded by age, but also confounded by job and bmi.

poisson8d <- glm( chd ~ hieng + agespan + offset( log( risk_time) ),
                 family=poisson,
                 data=diet.spl.dob)

summary(poisson8d)

## 
## Call:
## glm(formula = chd ~ hieng + agespan + offset(log(risk_time)), 
##     family = poisson, data = diet.spl.dob)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -4.7798     0.4320 -11.065   <2e-16 ***
## hienghigh       -0.6233     0.3026  -2.060   0.0394 *  
## agespan(50,60]   0.3025     0.4721   0.641   0.5217    
## agespan(60,72]   0.8451     0.4613   1.832   0.0670 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 321.05  on 728  degrees of freedom
## Residual deviance: 311.40  on 725  degrees of freedom
## AIC: 411.4
## 
## Number of Fisher Scoring iterations: 6

eform(poisson8d)

##                  exp(beta)       2.5 %     97.5 %
## (Intercept)    0.008397598 0.003601145 0.01958257
## hienghigh      0.536168916 0.296269765 0.97032212
## agespan(50,60] 1.353254649 0.536460711 3.41366685
## agespan(60,72] 2.328213936 0.942643672 5.75040208

Fitting the model for CHD, with adjustment for job and bmi.

poisson8d <- glm( chd ~ hieng + agespan + jobNumber + bmi + offset( log( risk_time) ),
                 family=poisson,
                 data=diet.spl.dob)

summary(poisson8d)

## 
## Call:
## glm(formula = chd ~ hieng + agespan + jobNumber + bmi + offset(log(risk_time)), 
##     family = poisson, data = diet.spl.dob)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.79183    1.31326  -5.172 2.32e-07 ***
## hienghigh      -0.71303    0.31387  -2.272   0.0231 *  
## agespan(50,60]  0.53692    0.50868   1.056   0.2912    
## agespan(60,72]  1.07421    0.49668   2.163   0.0306 *  
## jobNumber1      0.43510    0.40670   1.070   0.2847    
## jobNumber2     -0.13791    0.37184  -0.371   0.7107    
## bmi             0.07388    0.04852   1.523   0.1278    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 313.24  on 718  degrees of freedom
## Residual deviance: 298.77  on 712  degrees of freedom
##   (10 observations deleted due to missingness)
## AIC: 402.77
## 
## Number of Fisher Scoring iterations: 6

eform(poisson8d)

##                  exp(beta)        2.5 %     97.5 %
## (Intercept)    0.001122909 8.560401e-05 0.01472974
## hienghigh      0.490157708 2.649506e-01 0.90678996
## agespan(50,60] 1.710737260 6.312339e-01 4.63635138
## agespan(60,72] 2.927691811 1.105992e+00 7.74994872
## jobNumber1     1.545112410 6.962690e-01 3.42880771
## jobNumber2     0.871175509 4.203362e-01 1.80557060
## bmi            1.076677746 9.790156e-01 1.18408219

(e)

time.cuts <- c(0, 5, 10, 15, 22)
diet.spl.t_entry <- survSplit(Surv((dox-doe)/365.24, chd) ~ ., data=diet,
                              cut=time.cuts, end="tstop", start="tstart", event="chd")

##Tabulate ageband
diet.spl.t_entry |> subset(id<=3, select=c(id, tstart, tstop, y))

##     id tstart    tstop        y
## 93   2      0  5.00000 11.95893
## 94   2      5 10.00000 11.95893
## 95   2     10 11.95926 11.95893
## 468  3      0  5.00000 11.04175
## 469  3      5 10.00000 11.04175
## 470  3     10 11.04205 11.04175
## 687  1      0  5.00000 12.29295
## 688  1      5 10.00000 12.29295
## 689  1     10 12.29329 12.29295

diet.spl.t_entry <-
    transform(diet.spl.t_entry,
              fu = cut(tstop, time.cuts),
              risk_time = (tstop-tstart))

##Tabulate ageband including risk_time
diet.spl.t_entry |> subset(id<=3, select=c(id, fu, tstart, y, risk_time))

##     id      fu tstart        y risk_time
## 93   2   (0,5]      0 11.95893  5.000000
## 94   2  (5,10]      5 11.95893  5.000000
## 95   2 (10,15]     10 11.95893  1.959260
## 468  3   (0,5]      0 11.04175  5.000000
## 469  3  (5,10]      5 11.04175  5.000000
## 470  3 (10,15]     10 11.04175  1.042055
## 687  1   (0,5]      0 12.29295  5.000000
## 688  1  (5,10]      5 12.29295  5.000000
## 689  1 (10,15]     10 12.29295  2.293287

poisson8e1 <- glm(chd ~ fu + hieng + offset(log(risk_time)),
                 family=poisson,
                 data=diet.spl.t_entry)
summary(poisson8e1)

## 
## Call:
## glm(formula = chd ~ fu + hieng + offset(log(risk_time)), family = poisson, 
##     data = diet.spl.t_entry)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.25958    0.26933 -15.815   <2e-16 ***
## fu(5,10]    -0.23369    0.37705  -0.620   0.5354    
## fu(10,15]    0.12151    0.36841   0.330   0.7415    
## fu(15,22]   -0.05012    0.55562  -0.090   0.9281    
## hienghigh   -0.64923    0.30212  -2.149   0.0316 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 391.17  on 1099  degrees of freedom
## Residual deviance: 385.52  on 1095  degrees of freedom
## AIC: 487.52
## 
## Number of Fisher Scoring iterations: 6

eform(poisson8e1)

##              exp(beta)       2.5 %     97.5 %
## (Intercept) 0.01412826 0.008333566 0.02395225
## fu(5,10]    0.79160514 0.378061629 1.65750408
## fu(10,15]   1.12919965 0.548506063 2.32466319
## fu(15,22]   0.95111381 0.320099619 2.82604985
## hienghigh   0.52244901 0.288986787 0.94451712

There seems to be no confounding by time-since-entry, but there is confounding by BMI and job.

poisson8e2 <- glm(chd ~ fu + hieng + job + bmi + offset(log(risk_time)),
                 family=poisson,
                 data=diet.spl.t_entry)
summary(poisson8e2)

## 
## Call:
## glm(formula = chd ~ fu + hieng + job + bmi + offset(log(risk_time)), 
##     family = poisson, data = diet.spl.t_entry)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -6.02331    1.26284  -4.770 1.85e-06 ***
## fu(5,10]     -0.16826    0.38194  -0.441   0.6595    
## fu(10,15]     0.21932    0.37484   0.585   0.5585    
## fu(15,22]     0.13312    0.56448   0.236   0.8136    
## hienghigh    -0.71425    0.31171  -2.291   0.0219 *  
## jobconductor  0.46008    0.40646   1.132   0.2577    
## jobbank      -0.13790    0.37261  -0.370   0.7113    
## bmi           0.06876    0.04872   1.411   0.1581    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 383.36  on 1083  degrees of freedom
## Residual deviance: 374.22  on 1076  degrees of freedom
##   (16 observations deleted due to missingness)
## AIC: 480.22
## 
## Number of Fisher Scoring iterations: 6

eform(poisson8e2)

##                exp(beta)        2.5 %     97.5 %
## (Intercept)  0.002421633 0.0002037867 0.02877669
## fu(5,10]     0.845132733 0.3997784786 1.78661277
## fu(10,15]    1.245226372 0.5972864235 2.59605552
## fu(15,22]    1.142385874 0.3778543787 3.45383184
## hienghigh    0.489559573 0.2657536762 0.90184482
## jobconductor 1.584205119 0.7142218469 3.51390240
## jobbank      0.871181912 0.4197020639 1.80832545
## bmi          1.071175374 0.9736261715 1.17849819

Biostatistics III in R

Code by Annika Tillander, Andreas Karlsson, Benedicte Delcoigne, Johan Zetterqvist and Mark Clements

Exercise 8. Diet data: Using Poisson regression to study the effect of energy intake adjusting for confounders on two different timescales

(a)

(b)

(c)

(d)

(e)