Exercise 10. Examining the proportional hazards hypothesis (localised melanoma)
Load the diet data using time-on-study as the timescale with a maximum of 10 years follow-up.
You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest")
for each package you require.
library(biostat3)
library(dplyr) # for data manipulation
library(ggplot2)
library(car) # car::linearHypothesis -> biostat3::lincom
Load melanoma data and explore it.
localised <- dplyr::filter(biostat3::melanoma, stage == "Localised") %>%
dplyr::mutate(death_cancer = ifelse( status == "Dead: cancer" & surv_mm <= 120, 1, 0), #censoring for > 120 months
trunc_yy = pmin(surv_mm/12,10)) #scale to years and truncate to 10 years
(a)
For the localised melanoma data with 10 years follow-up, plot the instantaneous cause-specific hazard for each calendar period.
# Using muhaz2 to smooth the Kaplan-Meier hazards by strata
hazDiaDate <- muhaz2(Surv(trunc_yy,death_cancer)~year8594, data=localised)
hazDiaDateDf <- as.data.frame(hazDiaDate)
## Comparing hazards
plot(hazDiaDate, haz.scale=1000,
xlab="Time since diagnosis (years)",
ylab="Hazard per 1000 person-years")
# or using ggplot2
ggplot(hazDiaDateDf, aes(x=x, y=y*1000, colour= strata)) + geom_line() +
xlab("Time since diagnosis (years)") +
ylab("Hazard per 1000 person-years")
(b)
Now plot the instantaneous cause-specific hazard for each calendar period using a log scale for the y axis (use the option yscale(log)). What would you expect to see if a proportional hazards assumption was appropriate? Do you see it?
(c)
Another graphical way of checking the proportional hazards assumption is to plot the log cumulative cause specific hazard function for each calendar period. These plots were not given extensive coverage in the lectures, so attempt this if you like or continue to part (d).
## Calculating -log cumulative hazards per strata
survfit1 <- survfit(Surv(trunc_yy,death_cancer)~year8594, data=localised)
plot(survfit1, col=1:2, fun=function(S) -log(-log(S)), log="x",
xlab="log(time)", ylab="-log(H)")
legend("topright",legend=levels(localised$year8594),col=1:2,lty=1)
## or we can use
biostat3::survPHplot(Surv(trunc_yy,death_cancer)~year8594, data=localised)
(d)
Compare your estimated hazard ratio from part (a) with the one from a fitted Cox model with calendar period as the only explanatory variable. Are they similar?
# Cox regression with time-since-entry as the timescale
# Note that R uses the Efron method for approximating the likelihood in the presence of ties
# whereas Stata (and some other software) use the Breslow method
cox1 <- coxph(Surv(trunc_yy, death_cancer==1) ~ year8594, data=localised)
summary(cox1)
(e)
Now fit a more complex model and use graphical methods to explore the assumption of proportional hazards by calendar period.
cox2 <- coxph(Surv(trunc_yy, death_cancer==1) ~ sex + year8594 + agegrp, data=localised)
summary(cox2)
## Plot of the scaled Schoenfeld residuals for calendar period 1985–94.
## The smooth line shows the estimated log hazard ratio as a function of time.
cox2.phtest <- cox.zph(cox2, transform="identity") #Stata appears to be using 'identity'
plot(cox2.phtest[2],resid=TRUE, se=TRUE, main="Schoenfeld residuals", ylim=c(-4,4))
What do you conclude?
(f)
Do part (a)–(e) but now for the variable agegrp. What are your conclusions regarding the assumption of proportional hazards? No written solutions for this part.
(g)
Now formally test the assumption of proportional hazards.
Are your conclusions from the test coherent with your conclusions from the graphical assessments?
(h)
Estimate separate age effects for the first two years of follow-up (and separate estimates for the remainder of the follow-up) while controlling for sex and period. Do the estimates for the effect of age differ between the two periods of follow-up? Write out the regression equation.
melanoma2p8Split <- survSplit(localised, cut=c(2), end="trunc_yy", start="start",
event="death_cancer", episode="fu") %>%
mutate(fu = as.factor(fu))
##Tabulate ageband including risk_time
melanoma2p8Split %>% select(id, start, trunc_yy) %>% filter(id<=3) %>% arrange(id, trunc_yy)
head(melanoma2p8Split)
cox2p8Split1 <- coxph(Surv(start, trunc_yy, death_cancer) ~ sex + year8594 + agegrp*fu,
data=melanoma2p8Split)
summary(cox2p8Split1)
cox2p8Split1b <- coxph(Surv(start, trunc_yy, death_cancer) ~ sex + year8594 + agegrp +
I(agegrp=="45-59" & fu=="2") + I(agegrp=="60-74" & fu=="2") +
I(agegrp=="75+" & fu=="2"), data=melanoma2p8Split)
summary(cox2p8Split1b)
We see effects of age (i.e., the hazard ratios) for the period 0–2 years subsequent to diagnosis along with the interaction effects. An advantage of the default parameterisation is that one can easily test the statistical significance of the interaction effects. Before going further, test whether the age*follow-up interaction is statistically significant (using a Wald and/or LR test).
(i)
Often we wish to see the effects of exposure (age) for each level of the modifier (time since diagnosis). That is, we would like to complete the table below with relevant hazard ratios. To get the effects of age for the period 2+ years after diagnosis, using the default parametrization, we must multiply the hazard ratios for 0–2 years by the appropriate interaction effect. Now let’s reparameterise the model to directly estimate the effects of age for each level of time since diagnosis.
0–2 years | 2+ years | |
---|---|---|
Agegrp0-44 | 1.00 | 1.00 |
Agegrp45-59 | – | – |
Agegrp60-74 | – | – |
Agegrp75+ | – | – |
cox2p8Split2 <- coxph(Surv(start, trunc_yy, death_cancer) ~ sex + year8594 + fu + fu:agegrp, data=melanoma2p8Split)
summary(cox2p8Split2)
We can also use the tt
argument in coxph
for modelling for time-varying effects:
## Alternative approach using tt():
## http://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf
cox2p8tvc2 <- coxph(Surv(trunc_yy, death_cancer) ~ sex + year8594 + agegrp +
tt(agegrp=="45-59") + tt(agegrp=="60-74") + tt(agegrp=="75+"),
data=localised,
tt = function(x, t, ...) x*(t>=2))
summary(cox2p8tvc2)
## The tt labels do not play nicely with lincom:(
cox2p8tvct <- coxph(Surv(trunc_yy, death_cancer) ~ sex + year8594 + agegrp + tt(agegrp),
data=localised,
tt = function(x, t, ...) cbind(`45-59`=(x=="45-59")*t,
`60-74`=(x=="60-74")*t,
`75+`=(x=="75+")*t))
summary(cox2p8tvct)
lincom(cox2p8tvct, "agegrp75+",eform=TRUE) # t=0
lincom(cox2p8tvct, "agegrp75+ + tt(agegrp)75+",eform=TRUE) # t=1
lincom(cox2p8tvct, "agegrp75+ + 2*tt(agegrp)75+",eform=TRUE) # t=2
cox2p8tvclogt <- coxph(Surv(trunc_yy, death_cancer) ~ sex + year8594 + agegrp +
tt(agegrp),
data=localised,
tt = function(x, t, ...) cbind(`45-59`=(x=="45-59")*log(t),
`60-74`=(x=="60-74")*log(t),
`75+`=(x=="75+")*log(t)))
summary(cox2p8tvclogt)
lincom(cox2p8tvclogt, "agegrp75+ - 0.6931472*tt(agegrp)75+",eform=TRUE) # t=0.5 => log(t)=-0.6931472
lincom(cox2p8tvclogt, "agegrp75+", eform=TRUE) # t=1 => log(t)=0
lincom(cox2p8tvclogt, "agegrp75+ + 0.6931472*tt(agegrp)75+",eform=TRUE) # t=2 => log(t)=0.6931472
Write out the regression equations for models cox2p8Split2
and cox2p8tvct
. Based on model cox2p8tvct
, write out a formula for the hazard ratio for those aged 75 years and over compared with those aged less than 45 years as a function of time.
(j)
ADVANCED: Fit an analogous Poisson regression model. Are the parameter estimates similar? HINT: You will need to split the data by time since diagnosis. Calculate and plot the rates.