Exercise 8. Diet data: Using Poisson regression to study the effect of energy intake adjusting for confounders on two different timescales
Use Poisson regression to study the association between energy intake (hieng) and CHD adjusted for potential confounders (job, BMI). We know that people who expend a lot of energy (i.e., are physically active) require a higher energy intake. We do not have data on physical activity but we are hoping that occupation (job) will serve as a surrogate measure of work-time physical activity (conductors on London double-decker busses expend energy walking up and down the stairs all day).
Fit models both without adjusting for ‘time’ and by adjusting for attained age (you will need to split the data) and time-since-entry and compare the results.
Load the diet data using time-on-study as the timescale.
You may have to install the required packages the first time you use
them. You can install a package by
install.packages("package_of_interest")
for each package
you require.
Load diet data and explore it.
id | chd | y | hieng | energy | job | month | height | weight | doe | dox | dob | yoe | yox | yob | bmi |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
127 | 0 | 16.791239 | low | 2023.25 | conductor | 2 | 173.9900 | 61.46280 | 1960-02-16 | 1976-12-01 | 1910-09-27 | 1960.126 | 1976.917 | 1910.737 | 20.30316 |
200 | 0 | 19.958933 | low | 2448.68 | bank | 12 | 177.8000 | 73.48320 | 1956-12-16 | 1976-12-01 | 1909-06-18 | 1956.958 | 1976.917 | 1909.460 | 23.24473 |
198 | 0 | 19.958933 | low | 2281.38 | bank | 12 | NA | NA | 1956-12-16 | 1976-12-01 | 1910-06-30 | 1956.958 | 1976.917 | 1910.493 | NA |
222 | 0 | 15.394935 | low | 2467.95 | bank | 2 | 158.7500 | 58.24224 | 1957-02-16 | 1972-07-10 | 1902-07-11 | 1957.126 | 1972.523 | 1902.523 | 23.11057 |
305 | 1 | 1.494867 | low | 2362.93 | bank | 1 | NA | NA | 1960-01-16 | 1961-07-15 | 1913-06-30 | 1960.041 | 1961.534 | 1913.493 | NA |
173 | 0 | 15.958932 | low | 2380.11 | conductor | 12 | 164.4904 | 79.01712 | 1960-12-16 | 1976-12-01 | 1915-06-28 | 1960.958 | 1976.917 | 1915.487 | 29.20385 |
The rates plotted on timescale attained age show a clear increasing trend as age increases, which is to be expected (older persons are more likely to suffer from CHD). The rates plotted on timescale time-since-entry are almost constant (if you have some imagination you can see that the rates are flat).
(a)
scale <- 365.24
plot(bshazard(Surv((doe - dob)/scale, (dox - dob)/scale, chd) ~ 1,
verbose=FALSE,
data=subset(diet,hieng=="low")),
ylim=c(0,0.03), conf.int=FALSE, xlab="Attained age (years)")
lines(bshazard(Surv((doe - dob)/scale, (dox - dob)/scale, chd) ~ 1,
verbose=FALSE,
data=subset(diet,hieng=="high")),
col="red", conf.int=FALSE)
legend("topleft", legend=c('hieng=="low"','hieng=="high"'), col=1:2, lty=1, bty="n")
par(mfrow=1:2)
for (level in levels(diet$hieng))
plot(bshazard(Surv((doe - dob)/scale, (dox - dob)/scale, chd) ~ 1,
verbose=FALSE,
data=subset(diet,hieng==level)),
ylim=c(0,0.03), xlab="Attained age (years)",
main=sprintf('hieng == "%s"', level))
if (requireNamespace("muhaz")) {
scale <- 365.24
plot(muhaz2(Surv((dox - doe)/scale, chd) ~ hieng, data=diet), lty=2,
xlab="Time since study entry (years)", ylim=c(0,0.025),
legend=FALSE)
lines(bshazard(Surv((dox - doe)/scale, chd) ~ 1, v=F, data=subset(diet,hieng=="low")),
conf.int=FALSE)
lines(bshazard(Surv((dox - doe)/scale, chd) ~ 1, v=F, data=subset(diet,hieng=="high")),
col="red", conf.int=FALSE)
legend("topleft", legend=c('hieng=="low" (bshazard)','hieng=="high" (bshazard)',
'hieng=="low" (muhaz)','hieng=="high" (muhaz)'),
col=1:2, lty=c(1,1,2,2), bty="n")
}
(b)
Patients with high energy intake have 48% less CHD rate. The underlying shape of the rates is assumed to be constant (i.e. the baseline is flat) over time.
##
## Call:
## glm(formula = chd ~ hieng + offset(log(y/1000)), family = poisson,
## data = diet)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.6098 0.1890 13.811 <2e-16 ***
## hienghigh -0.6532 0.3021 -2.162 0.0306 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 262.82 on 336 degrees of freedom
## Residual deviance: 258.00 on 335 degrees of freedom
## AIC: 354
##
## Number of Fisher Scoring iterations: 6
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 13.6 0.189 13.8 2.20e-43 9.16 19.3
## 2 hienghigh 0.520 0.302 -2.16 3.06e- 2 0.283 0.933
## exp(beta) 2.5 % 97.5 %
## (Intercept) 13.5959916 9.3877130 19.6907371
## hienghigh 0.5203599 0.2878432 0.9407012
(c)
The effect of high energy intake is slightly confounded by bmi and job, since the point estimate changes a little.
## Create BMI variable
diet <- transform(diet, bmi=weight/((height/100)^2))
## Create orderly varable instead of categorical, start at zero
levels(diet$job)
## [1] "driver" "conductor" "bank"
diet <- transform(diet, jobNumber = unclass(job) - 1)
poisson8c <- glm(chd ~ hieng + jobNumber + bmi + offset(log(y/1000)),
family=poisson, data=diet)
summary(poisson8c)
##
## Call:
## glm(formula = chd ~ hieng + jobNumber + bmi + offset(log(y/1000)),
## family = poisson, data = diet)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.49616 1.18209 1.266 0.2056
## hienghigh -0.69995 0.30986 -2.259 0.0239 *
## jobNumber -0.08706 0.17170 -0.507 0.6121
## bmi 0.05091 0.04757 1.070 0.2845
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 255.01 on 331 degrees of freedom
## Residual deviance: 249.03 on 328 degrees of freedom
## (5 observations deleted due to missingness)
## AIC: 347.03
##
## Number of Fisher Scoring iterations: 6
## exp(beta) 2.5 % 97.5 %
## (Intercept) 4.4644913 0.4401212 45.2868008
## hienghigh 0.4966098 0.2705588 0.9115256
## jobNumber 0.9166234 0.6546985 1.2833364
## bmi 1.0522316 0.9585556 1.1550622
(d)
The y variable is not correct since it is kept for all splitted records, and contains the complete follow-up rather than the risktime in that specific timeband. The risktime variable contains the correct amount of risktime for each timeband. The event variable chd is not correct since it is kept constant for all splitted records, while it should only be 1 for the last record (if the person has the event). For all other records (timebands) for that person it should be 0.
## Split time at 30,50,60 and 72 with time scale age at entry to attained age
scale <- 365.24
age.cuts <- c(30,50,60,72)
diet.spl.dob <- survSplit(Surv((doe - dob)/scale, (dox - dob)/scale, chd) ~ .,
data=diet, cut=age.cuts,start="tstart",end="tstop")
## Tabulate ageband
diet.spl.dob |> subset(id<=3, select=c(id, tstart, tstop, y))
## id tstart tstop y
## 60 2 50.53937 60.00000 11.95893
## 61 2 60.00000 62.49863 11.95893
## 316 3 58.78600 60.00000 11.04175
## 317 3 60.00000 69.82806 11.04175
## 460 1 49.61669 50.00000 12.29295
## 461 1 50.00000 60.00000 12.29295
## 462 1 60.00000 61.90998 12.29295
## Create an agespan variable
diet.spl.dob <- transform(diet.spl.dob,
agespan = cut(tstop, age.cuts))
## Make the numeric variables factors since we want to model them with dummie variables and calculate time at risk
diet.spl.dob <- transform(diet.spl.dob,
jobNumber = as.factor(jobNumber),
risk_time = (tstop-tstart))
## Tabulate ageband including risk_time
diet.spl.dob |> subset(id<=3, select=c(id, tstart, tstop, y,risk_time))
## id tstart tstop y risk_time
## 60 2 50.53937 60.00000 11.95893 9.4606286
## 61 2 60.00000 62.49863 11.95893 2.4986310
## 316 3 58.78600 60.00000 11.04175 1.2139963
## 317 3 60.00000 69.82806 11.04175 9.8280583
## 460 1 49.61669 50.00000 12.29295 0.3833096
## 461 1 50.00000 60.00000 12.29295 10.0000000
## 462 1 60.00000 61.90998 12.29295 1.9099770
## chd
## agespan 0 1
## (30,50] 190 6
## (50,60] 275 18
## (60,72] 218 22
The effect of high energy intake is somewhat confounded by age, but also confounded by job and bmi.
poisson8d <- glm( chd ~ hieng + agespan + offset( log( risk_time) ),
family=poisson,
data=diet.spl.dob)
summary(poisson8d)
##
## Call:
## glm(formula = chd ~ hieng + agespan + offset(log(risk_time)),
## family = poisson, data = diet.spl.dob)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.7798 0.4320 -11.065 <2e-16 ***
## hienghigh -0.6233 0.3026 -2.060 0.0394 *
## agespan(50,60] 0.3025 0.4721 0.641 0.5217
## agespan(60,72] 0.8451 0.4613 1.832 0.0670 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 321.05 on 728 degrees of freedom
## Residual deviance: 311.40 on 725 degrees of freedom
## AIC: 411.4
##
## Number of Fisher Scoring iterations: 6
## exp(beta) 2.5 % 97.5 %
## (Intercept) 0.008397598 0.003601145 0.01958257
## hienghigh 0.536168916 0.296269765 0.97032212
## agespan(50,60] 1.353254649 0.536460711 3.41366685
## agespan(60,72] 2.328213936 0.942643672 5.75040208
Fitting the model for CHD, with adjustment for job and bmi.
poisson8d <- glm( chd ~ hieng + agespan + jobNumber + bmi + offset( log( risk_time) ),
family=poisson,
data=diet.spl.dob)
summary(poisson8d)
##
## Call:
## glm(formula = chd ~ hieng + agespan + jobNumber + bmi + offset(log(risk_time)),
## family = poisson, data = diet.spl.dob)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.79183 1.31326 -5.172 2.32e-07 ***
## hienghigh -0.71303 0.31387 -2.272 0.0231 *
## agespan(50,60] 0.53692 0.50868 1.056 0.2912
## agespan(60,72] 1.07421 0.49668 2.163 0.0306 *
## jobNumber1 0.43510 0.40670 1.070 0.2847
## jobNumber2 -0.13791 0.37184 -0.371 0.7107
## bmi 0.07388 0.04852 1.523 0.1278
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 313.24 on 718 degrees of freedom
## Residual deviance: 298.77 on 712 degrees of freedom
## (10 observations deleted due to missingness)
## AIC: 402.77
##
## Number of Fisher Scoring iterations: 6
## exp(beta) 2.5 % 97.5 %
## (Intercept) 0.001122909 8.560401e-05 0.01472974
## hienghigh 0.490157708 2.649506e-01 0.90678996
## agespan(50,60] 1.710737260 6.312339e-01 4.63635138
## agespan(60,72] 2.927691811 1.105992e+00 7.74994872
## jobNumber1 1.545112410 6.962690e-01 3.42880771
## jobNumber2 0.871175509 4.203362e-01 1.80557060
## bmi 1.076677746 9.790156e-01 1.18408219
(e)
time.cuts <- c(0, 5, 10, 15, 22)
diet.spl.t_entry <- survSplit(Surv((dox-doe)/365.24, chd) ~ ., data=diet,
cut=time.cuts, end="tstop", start="tstart", event="chd")
##Tabulate ageband
diet.spl.t_entry |> subset(id<=3, select=c(id, tstart, tstop, y))
## id tstart tstop y
## 93 2 0 5.00000 11.95893
## 94 2 5 10.00000 11.95893
## 95 2 10 11.95926 11.95893
## 468 3 0 5.00000 11.04175
## 469 3 5 10.00000 11.04175
## 470 3 10 11.04205 11.04175
## 687 1 0 5.00000 12.29295
## 688 1 5 10.00000 12.29295
## 689 1 10 12.29329 12.29295
diet.spl.t_entry <-
transform(diet.spl.t_entry,
fu = cut(tstop, time.cuts),
risk_time = (tstop-tstart))
##Tabulate ageband including risk_time
diet.spl.t_entry |> subset(id<=3, select=c(id, fu, tstart, y, risk_time))
## id fu tstart y risk_time
## 93 2 (0,5] 0 11.95893 5.000000
## 94 2 (5,10] 5 11.95893 5.000000
## 95 2 (10,15] 10 11.95893 1.959260
## 468 3 (0,5] 0 11.04175 5.000000
## 469 3 (5,10] 5 11.04175 5.000000
## 470 3 (10,15] 10 11.04175 1.042055
## 687 1 (0,5] 0 12.29295 5.000000
## 688 1 (5,10] 5 12.29295 5.000000
## 689 1 (10,15] 10 12.29295 2.293287
poisson8e1 <- glm(chd ~ fu + hieng + offset(log(risk_time)),
family=poisson,
data=diet.spl.t_entry)
summary(poisson8e1)
##
## Call:
## glm(formula = chd ~ fu + hieng + offset(log(risk_time)), family = poisson,
## data = diet.spl.t_entry)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.25958 0.26933 -15.815 <2e-16 ***
## fu(5,10] -0.23369 0.37705 -0.620 0.5354
## fu(10,15] 0.12151 0.36841 0.330 0.7415
## fu(15,22] -0.05012 0.55562 -0.090 0.9281
## hienghigh -0.64923 0.30212 -2.149 0.0316 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 391.17 on 1099 degrees of freedom
## Residual deviance: 385.52 on 1095 degrees of freedom
## AIC: 487.52
##
## Number of Fisher Scoring iterations: 6
## exp(beta) 2.5 % 97.5 %
## (Intercept) 0.01412826 0.008333566 0.02395225
## fu(5,10] 0.79160514 0.378061629 1.65750408
## fu(10,15] 1.12919965 0.548506063 2.32466319
## fu(15,22] 0.95111381 0.320099619 2.82604985
## hienghigh 0.52244901 0.288986787 0.94451712
There seems to be no confounding by time-since-entry, but there is confounding by BMI and job.
poisson8e2 <- glm(chd ~ fu + hieng + job + bmi + offset(log(risk_time)),
family=poisson,
data=diet.spl.t_entry)
summary(poisson8e2)
##
## Call:
## glm(formula = chd ~ fu + hieng + job + bmi + offset(log(risk_time)),
## family = poisson, data = diet.spl.t_entry)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.02331 1.26284 -4.770 1.85e-06 ***
## fu(5,10] -0.16826 0.38194 -0.441 0.6595
## fu(10,15] 0.21932 0.37484 0.585 0.5585
## fu(15,22] 0.13312 0.56448 0.236 0.8136
## hienghigh -0.71425 0.31171 -2.291 0.0219 *
## jobconductor 0.46008 0.40646 1.132 0.2577
## jobbank -0.13790 0.37261 -0.370 0.7113
## bmi 0.06876 0.04872 1.411 0.1581
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 383.36 on 1083 degrees of freedom
## Residual deviance: 374.22 on 1076 degrees of freedom
## (16 observations deleted due to missingness)
## AIC: 480.22
##
## Number of Fisher Scoring iterations: 6
## exp(beta) 2.5 % 97.5 %
## (Intercept) 0.002421633 0.0002037867 0.02877669
## fu(5,10] 0.845132733 0.3997784786 1.78661277
## fu(10,15] 1.245226372 0.5972864235 2.59605552
## fu(15,22] 1.142385874 0.3778543787 3.45383184
## hienghigh 0.489559573 0.2657536762 0.90184482
## jobconductor 1.584205119 0.7142218469 3.51390240
## jobbank 0.871181912 0.4197020639 1.80832545
## bmi 1.071175374 0.9736261715 1.17849819