Biostat III exercises in R

Laboratory exercise 10

Suggested solutions by

Author: Annika Tillander, Andreas Karlsson, 2015-03-01

Examining the proportional hazards hypothesis (localised melanoma)


Load the diet data using time-on-study as the timescale.

You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest") for each package you require.

require(foreign)  # for reading data set from Stata
require(survival) # for Surv and survfit
require(muhaz) #for hazard estimates
require(dplyr)    # for data manipulation
require(ggplot2)

Load diet data and explore it.

melanoma_raw <- read.dta("http://biostat3.net/download/melanoma.dta")
melanoma <- melanoma_raw %>%
    filter(stage == "Localised") %>%
    mutate(death_cancer = ifelse( status == "Dead: cancer" & surv_mm <= 120, 1, 0), #censuring for > 120 monts
           trunk_yy = ifelse(surv_mm <=  120, surv_mm/12, 10))  #scale to years and trunkate to 10 years

(a) For the localised melanoma data with 10 years follow-up, plot the instantaneous cause-specific hazard for each calendar period.

# Using muhaz to smooth the kaplan-meier hazards by strata the time and bandwidth options were selected based on smoother performance.
hazDiaDate <- melanoma %>%  group_by(year8594) %>%
    do( h = muhaz(times = .$trunk_yy, delta = .$death_cancer, min.time = min(.$trunk_yy[.$death_cancer==1]), max.time = max(.$trunk_yy[.$death_cancer==1]), bw.method="g", bw.grid=5)) %>%
    do( data.frame(Hazard = .$h$haz.est, Time = .$h$est.grid, Strata = .$year8594))

## Max hazard ratio
maxHaz <- hazDiaDate %>% group_by(Strata) %>%
    summarise(stratMax=max(Hazard))
print(maxHaz)
## Source: local data frame [2 x 2]
## 
##            Strata   stratMax
## 1 Diagnosed 75-84 0.04018423
## 2 Diagnosed 85-94 0.03036224
maxHaz$stratMax[2]/maxHaz$stratMax[1]
## [1] 0.755576
## Comparing hazards
ggplot(hazDiaDate, aes(x=Time, y=Hazard, colour= Strata)) +
    geom_line() + ggtitle("Smoothed Hazard estimates")

(b) Now plot the instantaneous cause-specific hazard for each calendar period using a log scale for the y axis (use the option yscale(log)). What would you expect to see if a proportional hazards assumption was appropriate? Do you see it?

## Comparing hazards on a log scales
ggplot(hazDiaDate, aes(x=Time, y=Hazard, colour= Strata)) +
    geom_line() + ggtitle("Smoothed Hazard estimates") + scale_y_continuous(trans='log')

(c) Another graphical way of checking the proportional hazards assumption is to plot the log cumulative cause specific hazard function for each calendar period. These plots were not given extensive coverage in the lectures, so attempt this if you like or continue to part (d).

## Calculating cumulative hazard per strata
hazDiaDate <- group_by(hazDiaDate, Strata) %>%
    mutate(cumHaz=cumsum(Hazard))

ggplot(hazDiaDate, aes(log(Time), -log(cumHaz), color=Strata)) + geom_line() + geom_point()

(d) Compare your estimated hazard ratio from part (a) with the one from a fitted Cox model with calendar period as the only explanatory variable. Are they similar?

# Cox regression with time-since-entry as the timescale
# Note that R uses the Efron method for approximating the likelihood in the presence
# whereas Stata (and most other software) use the Breslow method
cox1 <- coxph(Surv(trunk_yy, death_cancer==1) ~ year8594, method=c("breslow"), data=melanoma)
summary(cox1)
## Call:
## coxph(formula = Surv(trunk_yy, death_cancer == 1) ~ year8594, 
##     data = melanoma, method = c("breslow"))
## 
##   n= 5318, number of events= 960 
## 
##                             coef exp(coef) se(coef)      z Pr(>|z|)    
## year8594Diagnosed 85-94 -0.25254   0.77682  0.06579 -3.838 0.000124 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                         exp(coef) exp(-coef) lower .95 upper .95
## year8594Diagnosed 85-94    0.7768      1.287    0.6828    0.8837
## 
## Concordance= 0.533  (se = 0.008 )
## Rsquare= 0.003   (max possible= 0.949 )
## Likelihood ratio test= 14.78  on 1 df,   p=0.0001208
## Wald test            = 14.73  on 1 df,   p=0.0001238
## Score (logrank) test = 14.81  on 1 df,   p=0.0001191

(e) Now fit a more complex model and use graphical methods to explore the assumption of proportional hazards by calendar period.

cox2 <- coxph(Surv(trunk_yy, death_cancer==1) ~ sex + year8594 + agegrp, method=c("breslow"), data=melanoma)
summary(cox2)
## Call:
## coxph(formula = Surv(trunk_yy, death_cancer == 1) ~ sex + year8594 + 
##     agegrp, data = melanoma, method = c("breslow"))
## 
##   n= 5318, number of events= 960 
## 
##                             coef exp(coef) se(coef)      z Pr(>|z|)    
## sexFemale               -0.52964   0.58881  0.06545 -8.092 5.55e-16 ***
## year8594Diagnosed 85-94 -0.33284   0.71688  0.06618 -5.029 4.93e-07 ***
## agegrp45-59              0.28247   1.32640  0.09417  2.999   0.0027 ** 
## agegrp60-74              0.61914   1.85732  0.09088  6.813 9.56e-12 ***
## agegrp75+                1.21570   3.37265  0.10444 11.641  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                         exp(coef) exp(-coef) lower .95 upper .95
## sexFemale                  0.5888     1.6983    0.5179    0.6694
## year8594Diagnosed 85-94    0.7169     1.3949    0.6297    0.8162
## agegrp45-59                1.3264     0.7539    1.1028    1.5953
## agegrp60-74                1.8573     0.5384    1.5543    2.2194
## agegrp75+                  3.3727     0.2965    2.7484    4.1387
## 
## Concordance= 0.646  (se = 0.01 )
## Rsquare= 0.039   (max possible= 0.949 )
## Likelihood ratio test= 211.9  on 5 df,   p=0
## Wald test            = 217.1  on 5 df,   p=0
## Score (logrank) test = 225.9  on 5 df,   p=0
## Plot of the scaled Schoenfeld residuals for calendar period 1985–94.
## The smooth line shows the estimated log hazard ratio as a function of time.
cox2.phtest <- cox.zph(cox2, transform="identity") #Stata appears to be using 'identity'
plot(cox2.phtest[2],resid=TRUE, se=TRUE, main="Schoenfeld residuals", ylim=c(-4,4))

What do you conclude?

(f) Do part (a)–(e) but now for the variable agegrp. What are your conclusions regarding the assumption of proportional hazards? No written solutions for this part. No written solutions for this part.

(g) Now formally test the assumption of proportional hazards.

## The results from the previous proportional hazards assumption test
print(cox2.phtest)
##                             rho chisq        p
## sexFemale                0.0470  2.09 0.148200
## year8594Diagnosed 85-94  0.0488  2.28 0.130837
## agegrp45-59             -0.0443  1.89 0.169008
## agegrp60-74             -0.0825  6.48 0.010882
## agegrp75+               -0.1245 14.19 0.000165
## GLOBAL                       NA 18.29 0.002601

Are your conclusions from the test coherent with your conclusions from the graphical assessments?

(h) Estimate separate age effects for the first two years of follow-up (and separate estimates for the remainder of the follow-up) while controlling for sex and period. Do the estimates for the effect of age differ between the two periods of follow-up?

melanoma2p8Split <- survSplit(melanoma, cut=c(2), end="trunk_yy", start="start", event="death_cancer")
melanoma2p8Split <- mutate(melanoma2p8Split, fu = as.factor(start))

##Tabulate ageband including risk_time
melanoma2p8Split %>% select(id, start, trunk_yy) %>% filter(id<=3) %>% arrange(id, trunk_yy)
##   id start  trunk_yy
## 1  1     0  2.000000
## 2  1     2  2.208333
## 3  2     0  2.000000
## 4  2     2  4.625000
## 5  3     0  2.000000
## 6  3     2 10.000000
head(melanoma2p8Split)
##      sex age     stage mmdx yydx surv_mm surv_yy       status
## 1 Female  81 Localised   11 1981    26.5     2.5  Dead: other
## 2 Female  75 Localised   10 1975    55.5     4.5  Dead: other
## 3 Female  78 Localised    1 1978   177.5    14.5  Dead: other
## 4 Female  75 Localised   12 1975    19.5     1.5 Dead: cancer
## 5 Female  75 Localised   10 1975    65.5     5.5  Dead: other
## 6 Female  80 Localised    8 1980    50.5     4.5  Dead: other
##         subsite        year8594         dx       exit agegrp      bdate id
## 1 Head and Neck Diagnosed 75-84 1981-11-07 1984-01-22    75+ 1900-11-07  1
## 2 Head and Neck Diagnosed 75-84 1975-10-07 1980-05-22    75+ 1900-10-07  2
## 3         Limbs Diagnosed 75-84 1978-01-07 1992-10-22    75+ 1900-01-07  3
## 4         Trunk Diagnosed 75-84 1975-12-07 1977-07-22    75+ 1900-12-07  6
## 5 Head and Neck Diagnosed 75-84 1975-10-07 1981-03-22    75+ 1900-10-07  7
## 6 Head and Neck Diagnosed 75-84 1980-08-07 1984-10-22    75+ 1900-08-07  8
##   death_cancer trunk_yy start fu
## 1            0    2.000     0  0
## 2            0    2.000     0  0
## 3            0    2.000     0  0
## 4            1    1.625     0  0
## 5            0    2.000     0  0
## 6            0    2.000     0  0
cox2p8Split1 <- coxph(Surv(start, trunk_yy, death_cancer) ~ sex + year8594 + agegrp*fu, method=c("breslow"), data=melanoma2p8Split)
summary(cox2p8Split1)
## Call:
## coxph(formula = Surv(start, trunk_yy, death_cancer) ~ sex + year8594 + 
##     agegrp * fu, data = melanoma2p8Split, method = c("breslow"))
## 
##   n= 9856, number of events= 960 
## 
##                             coef exp(coef) se(coef)      z Pr(>|z|)    
## sexFemale               -0.52648   0.59068  0.06543 -8.047 8.88e-16 ***
## year8594Diagnosed 85-94 -0.33493   0.71539  0.06623 -5.057 4.26e-07 ***
## agegrp45-59              0.52995   1.69885  0.19634  2.699  0.00695 ** 
## agegrp60-74              0.89922   2.45767  0.18741  4.798 1.60e-06 ***
## agegrp75+                1.68631   5.39950  0.19175  8.794  < 2e-16 ***
## fu2                           NA        NA  0.00000     NA       NA    
## agegrp45-59:fu2         -0.32057   0.72573  0.22382 -1.432  0.15207    
## agegrp60-74:fu2         -0.36672   0.69300  0.21467 -1.708  0.08758 .  
## agegrp75+:fu2           -0.70699   0.49313  0.23207 -3.046  0.00232 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                         exp(coef) exp(-coef) lower .95 upper .95
## sexFemale                  0.5907     1.6930    0.5196    0.6715
## year8594Diagnosed 85-94    0.7154     1.3978    0.6283    0.8145
## agegrp45-59                1.6988     0.5886    1.1562    2.4962
## agegrp60-74                2.4577     0.4069    1.7022    3.5485
## agegrp75+                  5.3995     0.1852    3.7080    7.8627
## fu2                            NA         NA        NA        NA
## agegrp45-59:fu2            0.7257     1.3779    0.4680    1.1254
## agegrp60-74:fu2            0.6930     1.4430    0.4550    1.0555
## agegrp75+:fu2              0.4931     2.0279    0.3129    0.7771
## 
## Concordance= 0.645  (se = 0.01 )
## Rsquare= 0.022   (max possible= 0.799 )
## Likelihood ratio test= 221.8  on 8 df,   p=0
## Wald test            = 223.6  on 8 df,   p=0
## Score (logrank) test = 237.1  on 8 df,   p=0

We see effects of age (i.e., the hazard ratios) for the period 0–2 years subsequent to diagnosis along with the interaction effects. An advantage of the default parameterisation is that one can easily test the statistical significance of the interaction effects. Before going further, test whether the age*follow-up interaction is statistically significant (using a Wald and/or LR test).

(i) Often we wish to see the effects of exposure (age) for each level of the modifier (time since diagnosis). That is, we would like to complete the table below with relevant hazard ratios. To get the effects of age for the period 2+ years after diagnosis, using the default parametrization, we must multiply the hazard ratios for 0–2 years by the appropriate interaction effect. Now let’s reparameterise the model to directly estimate the effects of age for each level of time since diagnosis.

cox2p8Split2 <- coxph(Surv(start, trunk_yy, death_cancer) ~ sex + year8594 + fu + fu:agegrp, method=c("breslow"), data=melanoma2p8Split)
summary(cox2p8Split2)
## Call:
## coxph(formula = Surv(start, trunk_yy, death_cancer) ~ sex + year8594 + 
##     fu + fu:agegrp, data = melanoma2p8Split, method = c("breslow"))
## 
##   n= 9856, number of events= 960 
## 
##                             coef exp(coef) se(coef)      z Pr(>|z|)    
## sexFemale               -0.52648   0.59068  0.06543 -8.047 8.88e-16 ***
## year8594Diagnosed 85-94 -0.33493   0.71539  0.06623 -5.057 4.26e-07 ***
## fu2                           NA        NA  0.00000     NA       NA    
## fu0:agegrp45-59          0.52995   1.69885  0.19634  2.699  0.00695 ** 
## fu2:agegrp45-59          0.20938   1.23291  0.10774  1.943  0.05198 .  
## fu0:agegrp60-74          0.89922   2.45767  0.18741  4.798 1.60e-06 ***
## fu2:agegrp60-74          0.53250   1.70318  0.10479  5.082 3.74e-07 ***
## fu0:agegrp75+            1.68631   5.39950  0.19175  8.794  < 2e-16 ***
## fu2:agegrp75+            0.97932   2.66263  0.13158  7.443 9.85e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                         exp(coef) exp(-coef) lower .95 upper .95
## sexFemale                  0.5907     1.6930    0.5196    0.6715
## year8594Diagnosed 85-94    0.7154     1.3978    0.6283    0.8145
## fu2                            NA         NA        NA        NA
## fu0:agegrp45-59            1.6988     0.5886    1.1562    2.4962
## fu2:agegrp45-59            1.2329     0.8111    0.9982    1.5228
## fu0:agegrp60-74            2.4577     0.4069    1.7022    3.5485
## fu2:agegrp60-74            1.7032     0.5871    1.3870    2.0915
## fu0:agegrp75+              5.3995     0.1852    3.7080    7.8627
## fu2:agegrp75+              2.6626     0.3756    2.0574    3.4460
## 
## Concordance= 0.645  (se = 0.01 )
## Rsquare= 0.022   (max possible= 0.799 )
## Likelihood ratio test= 221.8  on 8 df,   p=0
## Wald test            = 223.6  on 8 df,   p=0
## Score (logrank) test = 237.1  on 8 df,   p=0
## Alternative approach that I did not get to work at:
## http://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf

(j) ADVANCED: Fit an analogous Poisson regression model. Are the parameter estimates similar? HINT: You will need to split the data by time since diagnosis. No written solutions for this part.