Code by Thorgerdur Palsdottir (modified by Mark Clements)
Exercise: Prediction
Begin by loading all the packages you need at the top. Note that WebR does not currently support the nlpred and glmtoolbox packages. We have also removed the dependency on tidyverse.
Read in the original data and define derived variables. Define all variables here. Note that we have used base::within() to calculate the variables.
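As a minimal sketch of the base::within() pattern mentioned above, the example below derives two variables on a made-up data frame. The column names, units, and age cut points are assumptions for illustration and are not taken from the course data.

```r
# Toy data standing in for the course data set (all values are made up)
dat <- data.frame(
  age    = c(39, 45, 52, 58),
  weight = c(70, 82, 95, 64),      # kg (assumed unit)
  height = c(175, 180, 172, 160)   # cm (assumed unit)
)

# within() evaluates the block inside the data frame and returns a copy
# with the newly assigned variables added as columns
dat <- within(dat, {
  bmi      <- weight / (height / 100)^2
  agegroup <- cut(age, breaks = c(35, 40, 45, 50, 55, 60), right = FALSE)
})

str(dat)
```

Keeping all derived variables in one within() call makes it easy to see every definition in a single place.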
Create the analysis set only including the variables that you will use.
Create a complete case version
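One way to build the complete case version in base R is complete.cases(). The sketch below uses a made-up data frame; only chd69 is a name from the exercise, and the other columns are hypothetical.

```r
# Toy analysis set with a few missing values (values are made up)
dat <- data.frame(
  chd69 = c(0, 1, 0, 0, 1),
  chol  = c(210, NA, 250, 190, 230),
  sbp   = c(120, 140, NA, 130, 150)
)

# Number of missing values per variable
colSums(is.na(dat))

# Keep only the rows with no missing value in any column
cc <- dat[complete.cases(dat), ]
nrow(cc)
```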
Extra exercise: Impute the missing data.
Since only a small fraction of the data is missing, we will use a method called predictive mean matching and create a single imputed dataset (m=1). This is not covered in the lectures. Note that this step is slow.
Here you can see how the missing values for cholesterol were replaced
1. Table 1
Create table 1. Describe all available variables.
Show both the original data and the imputed data.
2. Overall risk or overall rate
a. What is the outcome we are interested in?
Answer: chd69 (coronary heart disease)
b. What are the known risk factors for our outcome of interest?
age
systolic bp
cholesterol
bmi
smoking
arcus
c. Total number of persons
d. What is the overall risk or rate and the number of events
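For a binary outcome, the overall risk is the number of events divided by the number of persons. The sketch below shows the calculation on a made-up outcome vector, with an exact binomial confidence interval added as one possible choice of interval.

```r
# Made-up binary outcome (1 = event, 0 = no event)
chd69 <- c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0)

n      <- length(chd69)   # total number of persons
events <- sum(chd69)      # number of events
risk   <- events / n      # overall risk (proportion with the event)

# Exact (Clopper-Pearson) 95% confidence interval for the risk
ci <- binom.test(events, n)$conf.int

c(n = n, events = events, risk = risk, lower = ci[1], upper = ci[2])
```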
3. Build the prediction model: choose the model and predictors for the optimal model
a. the optimal model should be:
b. the same model with age entered as the agegroup variable, plus an interaction term with personality type
We will use model fit3.
c. create a risk prediction for all the persons in our data
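Predicted risks from a logistic regression can be obtained with predict(..., type = "response"). The sketch below uses simulated data and a hypothetical three-predictor model; it is not model fit3, and the coefficients used to simulate the outcome are made up.

```r
set.seed(1)
n <- 500

# Simulated predictors (hypothetical names and distributions)
sim <- data.frame(
  age   = runif(n, 39, 59),
  chol  = rnorm(n, 225, 40),
  smoke = rbinom(n, 1, 0.5)
)

# Simulated outcome from an assumed linear predictor
lp <- -9 + 0.07 * sim$age + 0.012 * sim$chol + 0.6 * sim$smoke
sim$chd <- rbinom(n, 1, plogis(lp))

# Fit the logistic model and predict each person's risk
fit <- glm(chd ~ age + chol + smoke, family = binomial, data = sim)
sim$risk <- predict(fit, type = "response")

summary(sim$risk)
```

With type = "response" the predictions are on the probability scale; the default type = "link" returns the linear predictor (log odds) instead.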
4. Discrimination
a. ROC curve and AUC with 95% CI
b. Plot the ROC curve, find the best threshold, and report the sensitivity and specificity at that threshold
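To make clear what the AUC measures, the sketch below computes it in base R from the rank-sum (Mann-Whitney) statistic and attaches a percentile bootstrap interval. This is a swapped-in illustration on simulated predictions; dedicated ROC packages typically report a different interval (for example DeLong), so the numbers will not match exactly.

```r
# AUC as the probability that a random event has a higher prediction
# than a random non-event (midranks handle ties)
auc <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulated outcome and predictions (made up for illustration)
set.seed(2)
y <- rbinom(300, 1, 0.3)
p <- plogis(-1 + 1.5 * y + rnorm(300))

est <- auc(p, y)

# Percentile bootstrap: resample persons, recompute the AUC each time
boot <- replicate(500, {
  i <- sample(length(y), replace = TRUE)
  auc(p[i], y[i])
})
ci <- quantile(boot, c(0.025, 0.975))

c(auc = est, ci)
```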
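One common definition of the "best" threshold is the one that maximises Youden's J = sensitivity + specificity - 1. The sketch below applies that rule to a small made-up example, using the observed predictions as candidate thresholds and classifying a person as positive when the prediction is at or above the threshold. ROC packages may use other candidate thresholds (such as midpoints), so this is an illustration under those assumptions.

```r
# Sensitivity and specificity at every observed prediction value
roc_points <- function(p, y) {
  thr  <- sort(unique(p))
  sens <- sapply(thr, function(t) mean(p[y == 1] >= t))
  spec <- sapply(thr, function(t) mean(p[y == 0] <  t))
  data.frame(threshold = thr, sens = sens, spec = spec)
}

# Made-up predictions and outcomes
p <- c(0.05, 0.10, 0.20, 0.30, 0.60, 0.70, 0.80)
y <- c(0,    0,    1,    0,    0,    1,    1)

roc <- roc_points(p, y)

# Threshold with the largest Youden index
best <- roc[which.max(roc$sens + roc$spec - 1), ]
best
```

Plotting 1 - roc$spec against roc$sens gives the points of the empirical ROC curve.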
or
c. AUC adjusted for optimism
Here we use the validate function to estimate the optimism-corrected AUC via the bootstrap method.
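The idea behind the bootstrap optimism correction can be sketched in base R as follows: refit the model in each bootstrap sample, take the difference between its AUC in the bootstrap sample and its AUC in the original data, and subtract the average difference from the apparent AUC. This is a hand-written illustration on simulated data with hypothetical predictors x1 and x2, not the internals of the validate function.

```r
auc <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulated data (made up for illustration)
set.seed(3)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1.5 + 0.8 * d$x1 + 0.4 * d$x2))

# Apparent AUC: model fitted and evaluated on the same data
fit      <- glm(y ~ x1 + x2, family = binomial, data = d)
apparent <- auc(predict(fit), d$y)

# Optimism in each bootstrap sample:
# AUC in the bootstrap sample minus AUC of the same model in the original data
optimism <- replicate(200, {
  b  <- d[sample(n, replace = TRUE), ]
  fb <- glm(y ~ x1 + x2, family = binomial, data = b)
  auc(predict(fb, newdata = b), b$y) - auc(predict(fb, newdata = d), d$y)
})

corrected <- apparent - mean(optimism)
c(apparent = apparent, optimism = mean(optimism), corrected = corrected)
```

If the validation output reports Somers' Dxy rather than the AUC, the two are related by AUC = 0.5 + Dxy / 2.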
Original and adjusted value
The AUC value doesn't seem to be too inflated due to optimism.
d. Cross-validation
We are unable to run the commented code on WebR.
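As a substitute that needs only base R, the sketch below runs a 5-fold cross-validation by hand on simulated data and computes the AUC from the pooled out-of-fold predictions. It is an assumption-laden illustration (hypothetical predictors x1 and x2, pooled rather than per-fold AUC) and is not a reproduction of the commented code.

```r
auc <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulated data (made up for illustration)
set.seed(4)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1.5 + 0.8 * d$x1 + 0.4 * d$x2))

# Randomly assign each person to one of k folds of equal size
k    <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other folds, predict the held-out fold
cv_pred <- numeric(n)
for (j in 1:k) {
  test <- fold == j
  fj   <- glm(y ~ x1 + x2, family = binomial, data = d[!test, ])
  cv_pred[test] <- predict(fj, newdata = d[test, ], type = "response")
}

cv_auc <- auc(cv_pred, d$y)
cv_auc
```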
5. Calibration
a. Plot the calibration curve, estimate the intercept and the slope, and use the Hosmer-Lemeshow goodness-of-fit test to assess calibration
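The calibration slope and intercept can be estimated by regressing the outcome on the model's linear predictor. The sketch below, on simulated data, fits the slope with a logistic regression on the linear predictor and the intercept (calibration-in-the-large) with the linear predictor as an offset. The helper cal() is hypothetical, written for this illustration.

```r
# Calibration intercept and slope for a linear predictor lp and outcome y
cal <- function(lp, y) {
  slope     <- unname(coef(glm(y ~ lp, family = binomial))[2])
  intercept <- unname(coef(glm(y ~ offset(lp), family = binomial))[1])
  c(intercept = intercept, slope = slope)
}

# Simulated data (made up for illustration)
set.seed(5)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x))

fit <- glm(y ~ x, family = binomial)
lp  <- predict(fit)              # linear predictor (log odds)

# Evaluated on its own development data the model is perfectly
# calibrated by construction: intercept 0 and slope 1
apparent <- cal(lp, y)

# Predictions that are too extreme (linear predictor doubled)
# give a slope of 0.5, the typical signature of overfitting
too_extreme <- cal(2 * lp, y)

rbind(apparent, too_extreme)
```

Because the apparent values are always 0 and 1, the slope and intercept are only informative on new data or after correction for optimism.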
b. Hosmer Lemeshow goodness of fit
Please estimate the goodness of fit by the method of Hosmer and Lemeshow. The glmtoolbox package is not available on WebR.
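Since glmtoolbox is not available, the Hosmer-Lemeshow statistic can be computed by hand. The sketch below groups persons into deciles of predicted risk and compares observed with expected events; hosmer_lemeshow() is a hypothetical helper written for this illustration, it assumes the quantile cut points are distinct (continuous predictions), and it is run on simulated data.

```r
hosmer_lemeshow <- function(p, y, g = 10) {
  # Groups defined by quantiles of the predicted risk
  breaks <- quantile(p, probs = seq(0, 1, length.out = g + 1))
  grp    <- cut(p, breaks = breaks, include.lowest = TRUE)

  obs  <- tapply(y, grp, sum)      # observed events per group
  expd <- tapply(p, grp, sum)      # expected events per group
  n_g  <- tapply(y, grp, length)   # group sizes

  # Chi-square statistic with g - 2 degrees of freedom
  stat <- sum((obs - expd)^2 / (expd * (1 - expd / n_g)))
  df   <- g - 2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated data (made up for illustration)
set.seed(7)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x))
p <- fitted(glm(y ~ x, family = binomial))

hl <- hosmer_lemeshow(p, y)
hl
```

A small p-value indicates that observed and expected event counts differ more than chance would explain, that is, poor calibration.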
c. Improvement in discrimination - difference in AUC
Create a prediction model using only the variable agegroup as a predictor and estimate the discrimination.
Please compare the discrimination to the model you used before with a statistical test and interpret the results.
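One simple way to compare two AUCs measured on the same persons is a paired bootstrap of the difference. The sketch below does this in base R on simulated data with hypothetical predictors x1 and x2; it resamples the predictions without refitting the models, which is a simplification, and it is a swapped-in alternative to the tests offered by ROC packages (such as DeLong's test).

```r
auc <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulated data (made up for illustration)
set.seed(6)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1.5 + 0.3 * d$x1 + 1.0 * d$x2))

# Predicted risks from a small and a larger model
p_small <- fitted(glm(y ~ x1,      family = binomial, data = d))
p_full  <- fitted(glm(y ~ x1 + x2, family = binomial, data = d))

diff_hat <- auc(p_full, d$y) - auc(p_small, d$y)

# Paired bootstrap: the same resampled persons are used for both models
boot <- replicate(500, {
  i <- sample(n, replace = TRUE)
  auc(p_full[i], d$y[i]) - auc(p_small[i], d$y[i])
})
ci <- quantile(boot, c(0.025, 0.975))

c(difference = diff_hat, ci)
```

An interval that excludes 0 points to a difference in discrimination between the two models.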
d. compare discrimination
The difference between the AUCs is significant; discrimination is much better for model fit3.
e. plot both roc curves in one plot
The risk model is an improvement over the model just using agegroup. The difference is significant.
6. Decision Curve Analysis
a. plot the decision curve and estimate net benefits
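The net benefit at a risk threshold pt is the proportion of true positives minus the proportion of false positives weighted by the odds pt / (1 - pt). The sketch below computes it in base R on a small made-up example and compares the model with the "treat all" strategy; net_benefit() is a hypothetical helper written for this illustration.

```r
# Net benefit of treating everyone with predicted risk >= pt
net_benefit <- function(p, y, pt) {
  pos <- p >= pt
  tp  <- sum(pos & y == 1)   # true positives
  fp  <- sum(pos & y == 0)   # false positives
  n   <- length(y)
  tp / n - fp / n * pt / (1 - pt)
}

# Made-up predicted risks and outcomes
p <- c(0.02, 0.05, 0.10, 0.20, 0.40)
y <- c(0,    0,    0,    1,    1)

nb_model     <- net_benefit(p, y, pt = 0.10)
nb_treat_all <- net_benefit(rep(1, length(y)), y, pt = 0.10)

c(model = nb_model, treat_all = nb_treat_all, treat_none = 0)

# Net benefit of the model over a grid of thresholds (one decision curve)
thresholds <- c(0.05, 0.10, 0.30)
sapply(thresholds, function(t) net_benefit(p, y, t))
```

A model is useful at a given threshold when its net benefit is higher than both "treat all" and "treat none" (net benefit 0).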
b. what are the net benefits of this model
From a risk threshold of 3% the model is clinically useful and provides a positive net benefit, up to a risk threshold of 25-35%, where its net benefit becomes negative.
c. add the model with agegroups to the plot and discuss
Our model with multiple predictors has a higher net benefit than the model using only agegroup, up to a risk threshold of 25-35%, where our model has a negative net benefit.
Clinical usefulness
Discussion points on if the clinicians are likely to use the model
How easy it is to implement the model in a clinical environment
Is there clinical value?
Next steps would be external validation studies and clinical utility studies to provide more evidence of the model's usefulness.