BDS: Biostatistics 1

Code by Thorgerdur Palsdottir (modified by Mark Clements)

Exercise: Prediction

Begin by loading all the packages you need at the top. Note that WebR does not currently support the nlpred and glmtoolbox packages. We have also removed the dependency on tidyverse.

Read in the original data and define all derived variables here. Note that we have used base::within() to calculate the variables.
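As an illustration of the base::within() approach, here is a minimal sketch on a made-up data frame. The column names, units and age breaks are assumptions for the example and may not match the course data.

```r
# Toy data standing in for the course data set (values are made up)
dat <- data.frame(
  age    = c(41, 47, 52, 58),
  weight = c(150, 180, 165, 200),  # pounds (assumed unit)
  height = c(68, 70, 66, 72)       # inches (assumed unit)
)

# within() evaluates the expressions inside the data frame
# and returns a modified copy with the new columns added
dat <- within(dat, {
  bmi      <- 703 * weight / height^2                       # BMI from pounds and inches
  agegroup <- cut(age, breaks = c(35, 45, 55, 65), right = FALSE)
})
str(dat)
```

Keeping all derived variables in one within() call makes it easy to see every definition in one place.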

Create the analysis set, including only the variables that you will use.

Create a complete-case version.
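A minimal sketch of selecting variables and building a complete-case data set in base R. The data frame and its columns are invented for the example.

```r
# Toy data with some missing values (not the course data)
raw <- data.frame(
  id    = 1:5,
  chd69 = c(0, 1, 0, 0, 1),
  chol  = c(210, NA, 250, 190, 230),
  arcus = c(0, 1, NA, 0, 1)
)

# Analysis set: keep only the variables used in the models
dat <- raw[, c("chd69", "chol", "arcus")]

# Complete-case version: drop every row with at least one missing value
cc <- dat[complete.cases(dat), ]

colSums(is.na(dat))    # number of missing values per variable
nrow(dat) - nrow(cc)   # number of rows dropped
```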

Extra exercise: Impute the missing data.

Since only a small fraction of the data is missing here, we will use a method called predictive mean matching and create a single imputed dataset (m=1). This is not covered in the lectures. This step is slow.

Here you can see how the missing values for cholesterol were replaced.
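To show the idea behind predictive mean matching, here is a simplified base R sketch on simulated data. It is not the implementation used in the exercise: a full version (for example in the mice package) also draws the regression parameters at random, which this sketch omits. The variables age and chol and the donor count k are assumptions.

```r
set.seed(1)
n    <- 200
age  <- round(runif(n, 39, 59))
chol <- round(150 + 1.5 * age + rnorm(n, sd = 30))
chol[sample(n, 10)] <- NA                  # make 10 values missing at random

miss <- is.na(chol)
fit  <- lm(chol ~ age, subset = !miss)     # imputation model fitted on observed rows
pred <- predict(fit, newdata = data.frame(age = age))

k <- 5                                     # number of candidate donors (assumed)
chol_imp <- chol
for (i in which(miss)) {
  # the k observed rows whose predicted value is closest to row i's prediction
  donors <- order(abs(pred[!miss] - pred[i]))[1:k]
  # borrow the observed cholesterol value of one randomly chosen donor
  chol_imp[i] <- chol[!miss][donors[sample.int(k, 1)]]
}
data.frame(age = age[miss], imputed_chol = chol_imp[miss])
```

Because every imputed value is copied from an observed person, the imputations always stay within the range of real measurements.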

1. Table 1

Create table 1. Describe all available variables.

Show both the original data and the imputed data.

2. Overall risk or overall rate

a. What is the outcome we are interested in?

Answer: chd69 # coronary heart disease

b. What are the known risk factors for our outcome of interest?

age
systolic bp
cholesterol
bmi
smoking
arcus

c. Total number of persons

d. What are the overall risk or rate and the number of events?
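A minimal sketch of the overall risk calculation, using a simulated 0/1 outcome rather than the course data. The exact binomial confidence interval is one possible choice; a rate would additionally need each person's follow-up time, which is not simulated here.

```r
set.seed(2)
chd69 <- rbinom(500, 1, 0.08)   # simulated 0/1 outcome (not the course data)

n_total  <- length(chd69)       # total number of persons
n_events <- sum(chd69)          # number of events
risk     <- n_events / n_total  # overall risk = events / persons

# Exact (Clopper-Pearson) 95% confidence interval for the risk
ci <- binom.test(n_events, n_total)$conf.int
c(n = n_total, events = n_events, risk = risk, lower = ci[1], upper = ci[2])
```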

3. Build the prediction model: choose the model and predictors for the optimal model

a. The optimal model should be:

b. The same model, with age entered as the agegroup variable and an interaction term with personality type

We will use model fit3.

c. Create a risk prediction for every person in our data
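A minimal sketch of fitting a logistic regression and predicting each person's risk. The data are simulated, and the model below is an illustration only; it is not the fit3 model from the exercise, and the predictors and coefficients are assumptions.

```r
set.seed(3)
n <- 1000
dat <- data.frame(
  age    = round(runif(n, 39, 59)),
  chol   = rnorm(n, 226, 43),
  smoker = rbinom(n, 1, 0.5)
)
# Simulate the outcome from an assumed logistic model
lp <- -9 + 0.07 * dat$age + 0.01 * dat$chol + 0.6 * dat$smoker
dat$chd69 <- rbinom(n, 1, plogis(lp))

# Fit the prediction model
fit <- glm(chd69 ~ age + chol + smoker, family = binomial, data = dat)

# type = "response" returns predicted probabilities rather than log-odds
dat$risk <- predict(fit, type = "response")
summary(dat$risk)
```

For a logistic model with an intercept, the mean predicted risk in the development data equals the observed event proportion, which is a quick sanity check.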

4. Discrimination

a. ROC curve and AUC with 95% CI
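A minimal base R sketch of the AUC with a 95% confidence interval, on simulated data. The helper auc() is a hypothetical name introduced here; it uses the rank (Mann-Whitney) formula. The interval is a bootstrap percentile interval, which is a swapped-in technique and may differ from the method used in the exercise.

```r
# AUC as the probability that a random case has a higher prediction
# than a random non-case (Mann-Whitney statistic; midranks handle ties)
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(4)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x))
p <- fitted(glm(y ~ x, family = binomial))

auc_hat <- auc(y, p)

# Bootstrap percentile interval: resample persons, recompute the AUC
boot <- replicate(500, {
  i <- sample(n, replace = TRUE)
  auc(y[i], p[i])
})
ci <- quantile(boot, c(0.025, 0.975))
c(auc = auc_hat, lower = ci[[1]], upper = ci[[2]])
```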

b. Plot the ROC curve, find the best threshold, and report the sensitivity and specificity at that threshold
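A minimal base R sketch of an ROC curve and a "best" threshold, on simulated data. It assumes that "best" means the threshold maximising Youden's J (sensitivity + specificity - 1); other definitions are possible.

```r
set.seed(5)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x))
p <- fitted(glm(y ~ x, family = binomial))

# Evaluate every observed predicted risk as a candidate threshold
cuts <- sort(unique(p))
sens <- sapply(cuts, function(t) mean(p[y == 1] >= t))  # true positive rate
spec <- sapply(cuts, function(t) mean(p[y == 0] <  t))  # true negative rate

best <- which.max(sens + spec - 1)                      # Youden's J
c(threshold = cuts[best], sensitivity = sens[best], specificity = spec[best])

# ROC curve with the chosen threshold marked
plot(1 - spec, sens, type = "l",
     xlab = "1 - specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)                                   # line of no discrimination
points(1 - spec[best], sens[best], pch = 19)
```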

c. AUC adjusted for optimism

Here we use the validate function to estimate the optimism-adjusted AUC via the bootstrap.
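To show what the bootstrap optimism correction does, here is a minimal base R sketch on simulated data. It is a hand-written illustration, not the validate function itself; auc() and fit_model() are hypothetical helper names, and the model and number of bootstrap samples are assumptions.

```r
auc <- function(y, p) {               # rank-based (Mann-Whitney) AUC
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(6)
n   <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-2 + 0.8 * dat$x1 + 0.4 * dat$x2))

fit_model <- function(d) glm(y ~ x1 + x2, family = binomial, data = d)

# Apparent AUC: model developed and evaluated on the same data
apparent <- auc(dat$y, predict(fit_model(dat), type = "response"))

# Optimism: refit in each bootstrap sample, then compare its AUC in the
# bootstrap sample with its AUC in the original data
optimism <- replicate(200, {
  b  <- dat[sample(n, replace = TRUE), ]
  fb <- fit_model(b)
  auc(b$y,   predict(fb, newdata = b,   type = "response")) -
  auc(dat$y, predict(fb, newdata = dat, type = "response"))
})

c(apparent = apparent,
  optimism = mean(optimism),
  adjusted = apparent - mean(optimism))
```

If the output you are reading reports Somers' Dxy rather than the AUC, the two are related by AUC = 0.5 + Dxy / 2.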

Original and adjusted value

The AUC does not appear to be much inflated by optimism.

d. Cross-validation

We are unable to run the commented code on WebR.
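As a base R alternative that needs no extra packages, here is a minimal sketch of 10-fold cross-validation of the AUC on simulated data. It is not the commented code referred to above; auc() is a hypothetical helper, and the model and number of folds are assumptions.

```r
auc <- function(y, p) {               # rank-based (Mann-Whitney) AUC
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(7)
n   <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-2 + 0.8 * dat$x1 + 0.4 * dat$x2))

k    <- 10
fold <- sample(rep(1:k, length.out = n))   # random fold label for each person

cv_pred <- numeric(n)
for (j in 1:k) {
  test  <- fold == j
  # fit on the other k - 1 folds, predict for the held-out fold
  fit_j <- glm(y ~ x1 + x2, family = binomial, data = dat[!test, ])
  cv_pred[test] <- predict(fit_j, newdata = dat[test, ], type = "response")
}

auc_cv <- auc(dat$y, cv_pred)              # AUC from out-of-fold predictions
auc_cv
```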

5. Calibration

a. Plot the calibration curve, estimate the intercept and the slope, and use the Hosmer-Lemeshow goodness-of-fit test to assess calibration
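A minimal base R sketch of the calibration slope, the calibration intercept (calibration-in-the-large) and a grouped calibration plot, on simulated data. Grouping by deciles of predicted risk is an assumption; a smoothed curve is another common choice.

```r
set.seed(8)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x))
fit <- glm(y ~ x, family = binomial)

lp <- predict(fit, type = "link")     # linear predictor (log-odds)
p  <- plogis(lp)                      # predicted risk

# Calibration slope: coefficient of the linear predictor (ideal value 1)
slope <- coef(glm(y ~ lp, family = binomial))[["lp"]]

# Calibration intercept: intercept with the linear predictor as an offset (ideal value 0)
intercept <- coef(glm(y ~ offset(lp), family = binomial))[[1]]
c(intercept = intercept, slope = slope)

# Grouped calibration plot: mean predicted vs observed risk by decile
grp <- cut(p, quantile(p, 0:10 / 10), include.lowest = TRUE)
plot(tapply(p, grp, mean), tapply(y, grp, mean),
     xlab = "Mean predicted risk", ylab = "Observed proportion")
abline(0, 1, lty = 2)                 # perfect calibration
```

When evaluated on the same data used to fit the model, the slope is 1 and the intercept is 0 by construction, so these measures are mainly informative in validation data or after an optimism correction.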

b. Hosmer-Lemeshow goodness of fit

Please assess the goodness of fit using the Hosmer-Lemeshow method. The glmtoolbox package is not available on WebR.
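Since glmtoolbox is not available, here is a minimal base R sketch of the Hosmer-Lemeshow statistic on simulated data. The function name hoslem() is hypothetical, and the choice of 10 groups with g - 2 degrees of freedom is the usual convention for development data, assumed here.

```r
# Hosmer-Lemeshow test: compare observed and expected events
# within g groups defined by quantiles of the predicted risk
hoslem <- function(y, p, g = 10) {
  grp  <- cut(p, quantile(p, 0:g / g), include.lowest = TRUE)
  obs  <- tapply(y, grp, sum)         # observed events per group
  expd <- tapply(p, grp, sum)         # expected events per group
  ng   <- tapply(y, grp, length)      # group sizes
  chisq <- sum((obs - expd)^2 / (expd * (1 - expd / ng)))
  c(chisq = chisq, df = g - 2,
    p.value = pchisq(chisq, df = g - 2, lower.tail = FALSE))
}

set.seed(9)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x))
p <- fitted(glm(y ~ x, family = binomial))

res <- hoslem(y, p)
res
```

A small p-value suggests that observed and predicted risks disagree in at least some risk groups; a large p-value does not prove good calibration.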

c. Improvement in discrimination: difference in AUC

Create a prediction model using only the agegroup variable as a predictor and estimate its discrimination.
Then compare its discrimination with that of the model you used before using a statistical test, and interpret the results.
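A minimal base R sketch of comparing the AUCs of two models fitted to the same persons, on simulated data. It uses a paired bootstrap interval for the difference in AUC, which is a swapped-in technique and may differ from the test used in the exercise. auc() is a hypothetical helper, and the models are illustrations, not fit3.

```r
auc <- function(y, p) {               # rank-based (Mann-Whitney) AUC
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(10)
n   <- 600
age <- round(runif(n, 39, 59))
dat <- data.frame(
  agegroup = cut(age, c(39, 45, 50, 55, 60), right = FALSE),
  chol     = rnorm(n, 226, 43),
  smoker   = rbinom(n, 1, 0.5)
)
dat$y <- rbinom(n, 1, plogis(-8.5 + 0.07 * age + 0.012 * dat$chol + 0.7 * dat$smoker))

p_full <- fitted(glm(y ~ agegroup + chol + smoker, family = binomial, data = dat))
p_age  <- fitted(glm(y ~ agegroup, family = binomial, data = dat))

auc_diff <- auc(dat$y, p_full) - auc(dat$y, p_age)

# Paired bootstrap: the same resampled persons are used for both models
boot <- replicate(500, {
  i <- sample(n, replace = TRUE)
  auc(dat$y[i], p_full[i]) - auc(dat$y[i], p_age[i])
})
ci <- quantile(boot, c(0.025, 0.975))
c(difference = auc_diff, lower = ci[[1]], upper = ci[[2]])
```

If the 95% interval for the difference excludes 0, the two models differ in discrimination at roughly the 5% level.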

d. Compare discrimination

There is a significant difference between the AUCs; model fit3 discriminates much better.

e. Plot both ROC curves in one plot

The risk model is an improvement over the model using only agegroup. The difference is significant.

6. Decision Curve Analysis

a. Plot the decision curve and estimate net benefits
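A minimal base R sketch of a decision curve on simulated data. Net benefit at risk threshold pt is computed as TP/n - FP/n * pt / (1 - pt), where TP and FP count the persons with predicted risk at or above pt. The function name net_benefit() and the threshold range are assumptions.

```r
# Net benefit of treating everyone whose predicted risk is >= pt
net_benefit <- function(y, p, pt) {
  n  <- length(y)
  tp <- sum(p >= pt & y == 1)         # true positives
  fp <- sum(p >= pt & y == 0)         # false positives
  tp / n - fp / n * pt / (1 - pt)
}

set.seed(11)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x))
p <- fitted(glm(y ~ x, family = binomial))

pt       <- seq(0.01, 0.35, by = 0.01)                  # risk thresholds
nb_model <- sapply(pt, function(t) net_benefit(y, p, t))
nb_all   <- mean(y) - (1 - mean(y)) * pt / (1 - pt)     # strategy: treat everyone

plot(pt, nb_model, type = "l",
     xlab = "Risk threshold", ylab = "Net benefit")
lines(pt, nb_all, lty = 2)                              # treat everyone
abline(h = 0, lty = 3)                                  # treat no one
```

The model is useful at a given threshold when its curve lies above both reference strategies, treating everyone and treating no one.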

b. What are the net benefits of this model?

From a risk threshold of about 3%, the model is clinically useful and provides positive net benefit, up to a risk threshold of about 25-35%, where its net benefit becomes negative.

c. Add the model with agegroups to the plot and discuss

Our model with multiple predictors has higher net benefit than the model using only agegroups until a risk threshold of 25-35%, where model 1 has negative net benefit.

Clinical usefulness

Discussion points: whether clinicians are likely to use the model.
How easy is it to implement the model in a clinical environment?

Is there clinical value?

Next steps would be external validation studies and clinical utility studies to provide more evidence of the model's usefulness.