Exercise 6. Diet data: tabulating incidence rates and modelling with Poisson regression

You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest") for each package you require.

library(biostat3) # diet dataset
library(dplyr)    # for data manipulation
library(knitr)    # kable()
library(broom)    # tidy()

Load the diet data using time-on-study as the timescale. We look at the first six rows of the data using head and look at a summary for the variables using summary:

head(diet) |> kable("html")

id	chd	y	hieng	energy	job	month	height	weight	doe	dox	dob	yoe	yox	yob
127	0	16.791239	low	2023.25	conductor	2	173.9900	61.46280	1960-02-16	1976-12-01	1910-09-27	1960.126	1976.917	1910.737
200	0	19.958933	low	2448.68	bank	12	177.8000	73.48320	1956-12-16	1976-12-01	1909-06-18	1956.958	1976.917	1909.460
198	0	19.958933	low	2281.38	bank	12	NA	NA	1956-12-16	1976-12-01	1910-06-30	1956.958	1976.917	1910.493
222	0	15.394935	low	2467.95	bank	2	158.7500	58.24224	1957-02-16	1972-07-10	1902-07-11	1957.126	1972.523	1902.523
305	1	1.494867	low	2362.93	bank	1	NA	NA	1960-01-16	1961-07-15	1913-06-30	1960.041	1961.534	1913.493
173	0	15.958932	low	2380.11	conductor	12	164.4904	79.01712	1960-12-16	1976-12-01	1915-06-28	1960.958	1976.917	1915.487

summary(diet) |> kable("html")

id	chd	y	hieng	energy	job	month	height	weight	doe	dox	dob	yoe	yox	yob
Min. : 1	Min. :0.0000	Min. : 0.2875	low :155	Min. :1748	driver :102	Min. : 1.000	Min. :152.4	Min. : 46.72	Min. :1956-11-16	Min. :1958-08-29	Min. :1892-01-10	Min. :1957	Min. :1959	Min. :1892
1st Qu.: 85	1st Qu.:0.0000	1st Qu.:10.7762	high:182	1st Qu.:2537	conductor: 84	1st Qu.: 3.000	1st Qu.:168.9	1st Qu.: 64.64	1st Qu.:1959-01-16	1st Qu.:1972-09-29	1st Qu.:1906-01-18	1st Qu.:1959	1st Qu.:1973	1st Qu.:1906
Median :169	Median :0.0000	Median :15.4606	NA	Median :2803	bank :151	Median : 6.000	Median :173.0	Median : 72.80	Median :1960-02-16	Median :1976-12-01	Median :1911-02-25	Median :1960	Median :1977	Median :1911
Mean :169	Mean :0.1365	Mean :13.6607	NA	Mean :2829	NA	Mean : 6.231	Mean :173.4	Mean : 72.54	Mean :1960-06-22	Mean :1974-02-19	Mean :1911-01-04	Mean :1960	Mean :1974	Mean :1911
3rd Qu.:253	3rd Qu.:0.0000	3rd Qu.:17.0431	NA	3rd Qu.:3110	NA	3rd Qu.:10.000	3rd Qu.:177.8	3rd Qu.: 79.83	3rd Qu.:1961-06-16	3rd Qu.:1976-12-01	3rd Qu.:1915-01-30	3rd Qu.:1961	3rd Qu.:1977	3rd Qu.:1915
Max. :337	Max. :1.0000	Max. :20.0411	NA	Max. :4396	NA	Max. :12.000	Max. :190.5	Max. :106.14	Max. :1966-09-16	Max. :1976-12-01	Max. :1930-09-19	Max. :1967	Max. :1977	Max. :1931
NA	NA	NA	NA	NA	NA	NA	NA’s :5	NA’s :4	NA	NA	NA	NA	NA	NA

(a)

diet <- biostat3::diet

diet.ir6a <- survRate(Surv(y/1000,chd) ~ hieng, data=diet)
kable(diet.ir6a, "html")

	hieng	tstop	event	rate	lower	upper
hieng=low	low	2.059430	28	13.595992	9.034438	19.64999
hieng=high	high	2.544238	18	7.074809	4.192980	11.18125

## or
diet |>
    group_by(hieng) |>
    summarise(Event = sum(chd), Time = sum(y/1000), Rate = Event/Time,      # group sums
              lower.ci = poisson.test(Event,Time)$conf.int[1],
              upper.ci = poisson.test(Event,Time)$conf.int[2]) |> kable("html")

hieng	Event	Time	Rate	lower.ci	upper.ci
low	28	2.059430	13.595992	9.034438	19.64999
high	18	2.544238	7.074809	4.192980	11.18125

## IRR
with(diet.ir6a, poisson.test(event,tstop))

## 
##  Comparison of Poisson rates
## 
## data:  event time base: tstop
## count1 = 28, expected count1 = 20.578, p-value = 0.03681
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  1.026173 3.688904
## sample estimates:
## rate ratio 
##   1.921747

with(diet.ir6a, poisson.test(rev(event),rev(tstop)))

## 
##  Comparison of Poisson rates
## 
## data:  rev(event) time base: rev(tstop)
## count1 = 18, expected count1 = 25.422, p-value = 0.03681
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.2710832 0.9744943
## sample estimates:
## rate ratio 
##  0.5203599

We see that individuals with a high energy intake have a lower CHD incidence rate. The estimated crude incidence rate ratio is 0.52 (95% CI: 0.27, 0.97).

(b)

The code is:

poisson6b <- glm(chd ~ hieng + offset( log(y/1000)), family=poisson, data=diet)
summary(poisson6b)

## 
## Call:
## glm(formula = chd ~ hieng + offset(log(y/1000)), family = poisson, 
##     data = diet)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.6098     0.1890  13.811   <2e-16 ***
## hienghigh    -0.6532     0.3021  -2.162   0.0306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 258.00  on 335  degrees of freedom
## AIC: 354
## 
## Number of Fisher Scoring iterations: 6

eform(poisson6b)

##              exp(beta)     2.5 %     97.5 %
## (Intercept) 13.5959916 9.3877130 19.6907371
## hienghigh    0.5203599 0.2878432  0.9407012

eform(poisson6b, method="Profile")

## Waiting for profiling to be done...

##              exp(beta)     2.5 %     97.5 %
## (Intercept) 13.5959916 9.1614552 19.2715805
## hienghigh    0.5203599 0.2829171  0.9328392

## using tidyverse...
tidy(poisson6b, conf.int=TRUE, exponentiate=TRUE)

## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   13.6       0.189     13.8  2.20e-43    9.16     19.3  
## 2 hienghigh      0.520     0.302     -2.16 3.06e- 2    0.283     0.933

The point estimate for the IRR calculated by the Poisson regression is the same as the IRR calculated in 6(a). The regression model can be defined by:

\[\begin{align*} E(\text{chd}) &= \frac{y}{1000}\exp\left(\beta_0 + \beta_1I(\text{hieng}="high") \right) \\ &= \exp\left(\beta_0 + \beta_1I(\text{hieng}="high") + \log(y/1000) \right) \end{align*}\]

where \(E(\text{chd})\) is the expected count for CHD, \(\beta_0\) is the intercept parameter for the log rate and \(\beta_1\) is the parameter for the log rate ratio for high energy diets. We have also used \(I(\text{condition})\) as an indicator function, which will take a value of 1 when the condition is true and 0 otherwise.

A theoretical observation: if we consider the data as being cross-classified solely by hieng then the Poisson regression model with one parameter is a saturated model (that is, the number of observations is equal to the number of parameters) so the IRR estimated from the model will be identical to the ‘observed’ IRR. That is, the model is a perfect fit.

(c)

hist(diet$energy, breaks=25, probability=TRUE, xlab="Energy (units)")
curve(dnorm(x, mean=mean(diet$energy), sd=sd(diet$energy)), col = "red", add=TRUE)
lines(density(diet$energy), col = "blue")
legend("topright", legend=c("Normal density","Smoothed density"),
       lty=1, col=c("red", "blue"), bty="n")

quantile(diet$energy, probs=c(0.01,0.05,0.1,0.25,0.5,0.75,0.90,0.95,0.99))

##       1%       5%      10%      25%      50%      75%      90%      95% 
## 1887.268 2177.276 2314.114 2536.690 2802.980 3109.660 3365.644 3588.178 
##      99% 
## 4046.820

# For kurtosis and skewness, see package e1071

The histogram and density curve gives us an idea of the distribution of energy intake. The normal curve suggests that a normal approximation may be acceptable. We can also tabulate moments and percentiles of the distribution.

(d)

diet <- transform(diet, eng3=cut(energy, breaks=c(1500,2500,3000,4500),
                                 labels=c("low","medium","high"), 
                                 right = FALSE))
cbind(Freq=table(diet$eng3),
      Prop=proportions(table(diet$eng3))) |> kable("html")

	Freq	Prop
low	75	0.2225519
medium	150	0.4451039
high	112	0.3323442

This could also be done using dplyr::mutate.

(e)

## rates and IRRs
diet.ir6e <- survRate(Surv(y/1000,chd) ~ eng3, data=diet)
print(diet.ir6e)

##               eng3     tstop event      rate    lower     upper
## eng3=low       low 0.9466338    16 16.901995 9.660951 27.447781
## eng3=medium medium 2.0172621    22 10.905871 6.834651 16.511619
## eng3=high     high 1.6397728     8  4.878725 2.106287  9.613033

# calculate IRR and confidence intervals
with(diet.ir6e, rate[eng3=="medium"] / rate[eng3=="low"])

## [1] 0.6452416

with(diet.ir6e[c(2,1),], { # compare second row with first row
  poisson.test(event, tstop)
})

## 
##  Comparison of Poisson rates
## 
## data:  event time base: tstop
## count1 = 22, expected count1 = 25.863, p-value = 0.2221
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.3237007 1.3143509
## sample estimates:
## rate ratio 
##  0.6452416

with(diet.ir6e, rate[eng3=="high"] / rate[eng3=="low"])

## [1] 0.2886479

with(diet.ir6e[c(3,1),], { # compare third row with first row
  poisson.test(event, tstop)
})

## 
##  Comparison of Poisson rates
## 
## data:  event time base: tstop
## count1 = 8, expected count1 = 15.216, p-value = 0.004579
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.1069490 0.7148284
## sample estimates:
## rate ratio 
##  0.2886479

We see that the CHD incidence rate decreases as the level of total energy intake increases. These calculations are usually better done using a regression model as they are less fiddly and can be adjusted for other covariates.

(f)

diet <- transform(diet, 
                  X1 = as.numeric(eng3 == "low"),
                  X2 = as.numeric(eng3 == "medium"),
                  X3 = as.numeric(eng3 == "high"))
## or (names eng3low, eng3medium and eng3high)
diet = cbind(diet, model.matrix(~eng3+0, diet))
colSums(diet[c("X1","X2","X3","eng3low","eng3medium","eng3high")])

    X1         X2         X3    eng3low eng3medium   eng3high 
    75        150        112         75        150        112

(g)

subset(diet, eng3=="low", select=c(energy,eng3,X1,X2,X3)) |> head() |> kable("html")

energy	eng3	X1
2023.25	low	1
2448.68	low	1
2281.38	low	1
2467.95	low	1
2362.93	low	1
2380.11	low	1

subset(diet, eng3=="medium", select=c(energy,eng3,X1,X2,X3)) |> head() |> kable("html")

	energy	eng3	X2
76	2664.64	medium	1
77	2533.33	medium	1
78	2854.08	medium	1
79	2673.77	medium	1
80	2766.88	medium	1
81	2586.69	medium	1

subset(diet, eng3=="high", select=c(energy,eng3,X1,X2,X3)) |> head() |> kable("html")

	energy	eng3	X3
226	3067.36	high	1
227	3298.95	high	1
228	3147.60	high	1
229	3180.47	high	1
230	3045.81	high	1
231	3060.03	high	1

(h)

poisson6h <- glm(chd ~ X2 + X3 + offset(log(y/1000)), family=poisson, data=diet)
summary(poisson6h)

## 
## Call:
## glm(formula = chd ~ X2 + X3 + offset(log(y/1000)), family = poisson, 
##     data = diet)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.8274     0.2500  11.312  < 2e-16 ***
## X2           -0.4381     0.3285  -1.334  0.18233    
## X3           -1.2425     0.4330  -2.870  0.00411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 253.62  on 334  degrees of freedom
## AIC: 351.62
## 
## Number of Fisher Scoring iterations: 6

tidy(poisson6h, conf.int=TRUE, exponentiate=TRUE)

## # A tibble: 3 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   16.9       0.250     11.3  1.15e-29    9.91     26.6  
## 2 X2             0.645     0.329     -1.33 1.82e- 1    0.341     1.25 
## 3 X3             0.289     0.433     -2.87 4.11e- 3    0.117     0.656

Level 1 of the categorized total energy is the reference category. The estimated rate ratio comparing level 2 to level 1 is 0.6452 and the estimated rate ratio comparing level 3 to level 1 is 0.2886.

The regression equation can be represented by:

\[\begin{align*} E(\text{chd}) &= \frac{y}{1000}\exp\left(\beta_0 + \beta_2 X_2 + \beta_3 X_3 \right) \\ &= \exp\left(\beta_0 + \beta_2 X_2 + \beta_3 X_3 + \log(y/1000) \right) \end{align*}\] where \(\beta_2\) and \(\beta_3\) are the log rate ratios for \(X_2=\text{X2}\) and \(X_3=\text{X3}\), respectively.

(i)

poisson6i <- glm(chd ~ X1 + X3 + offset(log(y/1000)), family=poisson, data=diet)
# or 
poisson6i <- glm(chd ~ I(eng3=="low") + I(eng3=="high") + offset(log(y/1000)),
                 family=poisson, data=diet)
summary(poisson6i)

## 
## Call:
## glm(formula = chd ~ I(eng3 == "low") + I(eng3 == "high") + offset(log(y/1000)), 
##     family = poisson, data = diet)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             2.3893     0.2132  11.207   <2e-16 ***
## I(eng3 == "low")TRUE    0.4381     0.3285   1.334   0.1823    
## I(eng3 == "high")TRUE  -0.8044     0.4129  -1.948   0.0514 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 253.62  on 334  degrees of freedom
## AIC: 351.62
## 
## Number of Fisher Scoring iterations: 6

tidy(poisson6i, conf.int=TRUE, exponentiate=TRUE)

## # A tibble: 3 × 7
##   term                  estimate std.error statistic  p.value conf.low conf.high
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 "(Intercept)"           10.9       0.213     11.2  3.77e-29    6.96     16.1  
## 2 "I(eng3 == \"low\")T…    1.55      0.329      1.33 1.82e- 1    0.800     2.94 
## 3 "I(eng3 == \"high\")…    0.447     0.413     -1.95 5.14e- 2    0.187     0.966

Now use level 2 as the reference (by omitting X2 but including X1 and X3). The estimated rate ratio comparing level 1 to level 2 is 1.5498 and the estimated rate ratio comparing level 3 to level 2 is 0.4473.

(j)

poisson6j <- glm(chd ~ eng3 + offset(log(y/1000)), family=poisson, data=diet)
summary(poisson6j)

## 
## Call:
## glm(formula = chd ~ eng3 + offset(log(y/1000)), family = poisson, 
##     data = diet)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.8274     0.2500  11.312  < 2e-16 ***
## eng3medium   -0.4381     0.3285  -1.334  0.18233    
## eng3high     -1.2425     0.4330  -2.870  0.00411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 253.62  on 334  degrees of freedom
## AIC: 351.62
## 
## Number of Fisher Scoring iterations: 6

tidy(poisson6j, conf.int=TRUE, exponentiate=TRUE)

## # A tibble: 3 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   16.9       0.250     11.3  1.15e-29    9.91     26.6  
## 2 eng3medium     0.645     0.329     -1.33 1.82e- 1    0.341     1.25 
## 3 eng3high       0.289     0.433     -2.87 4.11e- 3    0.117     0.656

The estimates are identical (as we would hope) when we have R create indicator variables for us.

(k)

Somehow (there are many different alternatives) you’ll need to calculate the total number of events and the total person-time at risk and then calculate the incidence rate as events/person-time. As examples,

summarise(diet, rate = sum(chd) / sum(y))

##          rate
## 1 0.009992031

## or
survRate(Surv(y/1000,chd)~1, data=diet)

##      tstop event     rate    lower    upper
## 1 4.603669    46 9.992031 7.315422 13.32797

The estimated incidence rate is 0.00999 events per person-year.

Biostatistics III in R

Code by Johan Zetterqvist, Andreas Karlsson and Mark Clements

Exercise 6. Diet data: tabulating incidence rates and modelling with Poisson regression

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

(k)