Biostat III exercises in R
===========
Laboratory exercise 28
-----------
Flexible parametric models (fpm) implemented in Stata (stpm2) and R
(rstpm2).
One difference: Restricted cubic splines using truncated power basis
(stpm2, Stata) using B-spline basis (rstpm2, R).
### Suggested solutions by
Authors: Xing-Rong Liu and Mark Clements, 2016-03-06
```{r setup, cache=FALSE, message=FALSE, echo=FALSE}
library('knitr')
read_chunk('q28.R')
opts_chunk$set(cache=FALSE, fig.width=10, fig.height=6)
```
You may have to install the required packages the first time you use
them. You can install a package by
`install.packages("package_of_interest")` for each package you
require.
```{r loadDependecies, message=FALSE}
```
We start by reading the data and then define a 1/0 varible for the
events that we are interested in.
```{r loadPreprocess}
```
**(a)** Flexible parametric model with df=4.
```{r a_flex}
```
Making a cox model for comparison.
```{r a_cox}
```
The hazard ratio, 95% confidence interval and statistical significance
are very similar to the Cox model.
**(b)** Prediction and plots of survival and hazard by calendar
period.
```{r b_surv}
```
Localised skin melanoma. Predicted survival functions from a exible
parametric model.
```{r b_haz}
```
Localised skin melanoma. Predicted hazard functions from a exible
parametric model.
**(c)** Plotting the hazards on the log scale. There is a constant
difference as the predictions are from a proportional hazards model
and a multiplicative effect becomes additive on the log scale.
```{r c_haz_log}
```
Localised skin melanoma. Predicted hazard functions on log scale from
a exible parametric model.
**(d)** Investigate DF for the baseline with criteria AIC and BIC.
```{r d_AIC_BIC}
```
The log hazard ratios (and hence the hazard ratios) from 2 df and up
are similar and for 3 df they are very similar. The main difference
is for 1 df, which is equivalent to a Weibull model. The Weibull
model enforces a monotonic hazard function and as the hazard function
in the melanoma data has a turning point it is clearly inappropriate.
The lowest AIC is for the model with 5 df and for the BIC it is the
model with 2 df. The penalty term in the AIC is twice the number of
parameters (2 x k ) whereas in the BIC it is ln( D ) > k where D is
the number of events. Since ln( D ) > k the BIC penalizes extra
parameters much more strongly than AIC. Since we have a large data set
and there are no disadvantages to including extra parameters we would
use 5df for the baseline hazard.
**(e)** Plots of predicted baseline survivals and hazards with df=1:6.
```{r e_base_surv}
```
With the exception of 1 df (the Weibull model), the survival and
hazard functions show similar shapes, so as long we have enough knots
our conclusions would be very similar.
**(f)** Adjust for sex and age (in categories).
```{r f_sex_age}
```
The estimates are similar to those obtained from the Cox model. The
Wald test yields a very highly significant result, which is similar to
that obtained from the comparable test for the Cox model.
**(g)** The estimates are so similar because very similar models are
being fitted with exactly the same covariates. The two models differ
only in the manner in which they account for the baseline hazard. In
the Cox model it is assumed arbitrary and not directly estimated. In
the flexible parametric model the baseline hazard is modelled using
splines. The 5 df spline allows sufficient flexibility to make the
model estimates virtually identical.
**(h)** Change to time-varying effect of agegrp2:4.
```{r h_time_varying}
```
There is strong evidence of a non-proportional effect of age.
**(i)** Plot of baseline hazard with fpmh. This baseline is for the
youngest age group who are male and diagnosed in 1975-1984, i.e, when
all the covariates are equal to zero.
```{r i_plot_base_haz}
```
**(j)** Hazard ratios between age groups.
```{r j_age_HR}
```
The hazard ratios decrease as a function of follow-up time. The
hazard ratio is so high during the early years of follow-up because
the hazard in the reference group is close to zero.
```{r j_oldest}
```
The hazard ratio for the oldest age group with a 95% confidence
interval.
**(k)** The hazard difference for the oldest group.
```{r k_haz_diff}
```
Localised skin melanoma. Predicted difference in hazard rates (with a
95% confidence interval) for oldest age group from a fexible parametric
model (other covariates are set to zero).
The hazard difference is small early on, despite the hazard ratio
being large, because the underlying hazard is so low.
**(l)** The survival difference for the oldest group.
```{r l_surv_diff}
```
Localised skin melanoma. Predicted difference in survival functions
(with a 95% confidence interval) for oldest age group from a flexible
parametric model (for females diagnosed in 1984-1994).
**(m)** Fit models with 1, 2 and 3 df for time-dependent effect of age.
```{r m_time_dep_eff}
```
AIC selects 2 df and BIC selects 1 df. As discussed in the previous
part, the BIC imposes a stronger penalty on aditional parameters. The
fitted time-dependent effects are similar. We would suggest 2 df for
the time-varying effect of age.