Biostat III exercises in R

Laboratory exercise 1b

Suggested solutions by

Author: Annika Tillander, 2014-01-30
Edited: Andreas Karlsson, 2015-02-27, 2016-03-01


You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest") for each package you require.

require(foreign)  # for reading data set from Stata
require(dplyr)    # for data manipulation
require(KMsurv)   # for life-tables
require(survival) # for Surv and survfit

## Get the data for exercise 1
colon_sample <- read.dta("http://biostat3.net/download/colon_sample.dta")

## Create 0/1 outcome variable
colon <-colon_sample %>%
    mutate(death_cancer = ifelse( status == "Dead: cancer", 1, 0))

Number of events and number lost summarised by year.

colonByYear <- colon %>%
    mutate(year = floor(surv_yy)) %>%     # floor to whole years
    group_by(year) %>%                    # summarise within each year
    summarise(nevent = sum(death_cancer), # number of events
              nlost = n() - nevent)       # number lost to follow-up

Following are the life table estimates. Note that in the lectures, when we estimated all-cause survival, there were 8 deaths in the first interval. One of these died of a cause other than cancer so in the cause-specific survival analysis we see that there are 7 ‘deaths’ and 1 censoring (in lifetab we used the term ‘nlost’ for number lost to follow-up) in the first interval.

with(colonByYear,                           # using the colonByYear data
     lifetab(tis = c(year, tail(year,1)+1), # should be one element longer for the intervals
             ninit = nrow(colon),           # number of individuals at the start
             nlost = nlost,                 # number lost for each interval
             nevent = nevent)) %>%          # number of events for each interval
    round(2)
##      nsubs nlost nrisk nevent surv  pdf hazard se.surv se.pdf se.hazard
## 0-1     35     1  34.5      7 1.00 0.20   0.23    0.00   0.07      0.08
## 1-2     27     3  25.5      1 0.80 0.03   0.04    0.07   0.03      0.04
## 2-3     23     4  21.0      5 0.77 0.18   0.27    0.07   0.07      0.12
## 3-4     14     1  13.5      2 0.58 0.09   0.16    0.09   0.06      0.11
## 4-6     11     1  10.5      0 0.50 0.00   0.00    0.10    NaN       NaN
## 6-7     10     3   8.5      0 0.50 0.00   0.00    0.10    NaN       NaN
## 7-8      7     1   6.5      0 0.50 0.00   0.00    0.10    NaN       NaN
## 8-9      6     4   4.0      1 0.50 0.12   0.29    0.10   0.11      0.28
## 9-10     1     1   0.5      0 0.37   NA     NA    0.13     NA        NA

Following is a table of Kaplan-Meier estimates. It is not clear from the table, if the person censored at time 2 was at risk when the other person dies at time 2. Following the table is a graph of the survival function.

mfit <- survfit(Surv(surv_mm, death_cancer) ~ 1, data = colon) # make Kaplan-Meier estimates
summary(mfit)                                                  # print Kaplan-Meier table
## Call: survfit(formula = Surv(surv_mm, death_cancer) ~ 1, data = colon)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     2     35       1    0.971  0.0282        0.918        1.000
##     3     33       1    0.942  0.0398        0.867        1.000
##     5     32       1    0.913  0.0482        0.823        1.000
##     7     31       1    0.883  0.0549        0.782        0.998
##     8     30       1    0.854  0.0605        0.743        0.981
##     9     29       1    0.824  0.0652        0.706        0.962
##    11     28       1    0.795  0.0692        0.670        0.943
##    22     24       1    0.762  0.0738        0.630        0.921
##    27     22       1    0.727  0.0781        0.589        0.898
##    28     20       1    0.691  0.0823        0.547        0.872
##    32     19       2    0.618  0.0882        0.467        0.818
##    33     16       1    0.579  0.0908        0.426        0.788
##    43     13       1    0.535  0.0941        0.379        0.755
##    46     12       1    0.490  0.0962        0.334        0.720
##   102      4       1    0.368  0.1284        0.185        0.729
plot(mfit,                                                     # plot Kaplan-Meier curve
     ylab="S(t)",
     xlab="Time since diagnosis in months",
     main = "Kaplan−Meier estimates of cause−specific survival")