Biostatistics III in R

Code by Xing-Rong Liu and Mark Clements

Exercise 28. Flexible parametric survival models in R

Flexible parametric survival models use a smooth function to model for a transformation of survival. These models are also called Royston-Parmar models. For our model extensions, we also use the term generalized survival models.

Note that there are some differences in the Stata implementation (stpm2) and our implementation in R (rstpm2::stpm2 and rstpm::pstpm2). For the parametric models, Stata uses restricted cubic splines using truncated power basis, while we use B-splines. Moreover, we allow for most smooth functions to be used for the baseline function.

You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest") for each package you require.

We start by reading the melanoma data, restrict to localised cancer, define cause-specific death through to at most 120 months as the event, express the time in years, and then define 1/0 varibles for age groups.

(a)

First, fit a flexible parametric model adjusting for calendar period of diagnosis (year8594) with four degrees of freedom for the baseline log cumulative hazard function, and use the summary method. How do you interpret the period effect?

Also fit a similar Cox model. How similar are the points estimates and standard errors for the parameters that are in both models?

For those parameters in the stpm2 model that are not in the Cox model, what do those parameters represent?

(b)

Using the fitted model, present plots for survival and hazards by calendar period. What do you observe?

Example ggplot2 code for survival is below:

Example code using base graphics through tinyplot is below:

(c)

Plotting the hazards on the log scale. Is there evidence that the difference in predictions is similar by time? Does the plot suggest that the model assumes proportionality?

(d)

Investigate varying the degrees of freedom for the baseline using the AIC and BIC criteria.

Re-fit the model with degrees of freedom varying between 1 and 6 degrees of freedom. Calculate the AIC and BIC and extract the beta and standard error for the period term. Which model has the smallest AIC and which has the smallest BIC?

Finally, is it clear why we use the AIC and BIC criteria rather than using likelihood ratio tests?

(e) Plots of predicted baseline survivals and hazards with varying degrees of freedom

We first plot survival by varying the degrees of freedom:

Adapt the above code to work with hazards.

(f) Adjust for sex and age (in categories)

Fit a flexible parametric survival model that adjusts for calendar period, sex and age group, with four degrees of freedom for the baseline. How do you interpret the fitted parameters?

Use a likelihod ratio test to assess whether age group contributes significantly to the model fit. How do you interpret the test?

(g)

Fit a similar model and undertake a similar test using Cox regression. Are the results similar? If so, can you suggest why?

(h)

Now allow for time-varying effects for the age groups, using 0–44 years as the reference group, with two degrees of freedom for each age-group:time interaction. Using a likelihood ratio test, assess whether the time-varying effects are significant.

We could also investigate the non-proportional effect of age with penalized models, where sp is the optimal smoothing parameters estimated from models without sp argument:

(i)

Plot of baseline hazard with time-varying effects.

Plot the hazard for the youngest age group for males diagnosed in 1975–1984.

(j) Hazard ratios between age groups

Now plot the hazard ratios for each of the older age groups compared with cases diagnosed aged 0-44 years. For ease of comparison, initially plot the hazard ratios without confidence intervals. Describe the form of the hazard ratios.

For the hazard ratio comparing those diagnosed aged 75 years and over with those aged 0–44 years, plot the hazard ratio and 95% confidence intervals.

Finally, as an extension, use ggplot to plot the hazard ratios with confidence intervals using alpha transparency. How do you interpret the hazard ratios?

(k) The hazard difference for the oldest group

There are a number of other target parameters that we can estimate. One is the difference in hazards between groups. Plot the hazard for those aged 75 years and over minus the hazard for those aged 0–44 years. Describe the general pattern.

(l) The survival difference for the oldest group

Similarly, plot the survival difference between those aged 75 years and over and those aged 0–44 years at diagnosis. How will the survival difference be related to the risk difference?

(m) Fit models with 1, 2 and 3 df for time-dependent effect of age

Adapting code from (e), fit models with a varying degrees of freedom for the time-dependent effects for age groups. Also plot the time-varying hazard ratios comparing the hazards for those aged 75 years and over with the hazard for those diagnosed aged 0–44 years.