But what if response is time until event (eg. time of survival after surgery)?
Additional complication: event might not have happened at end of study (eg. patient still alive). But knowing that patient has “not died yet” presumably informative. Such data called censored.
… continued
Enter survival analysis, in particular the “Cox proportional hazards model”.
Explanatory variables in this context often called covariates.
Packages
Install package survival if not done. Also use broom and marginaleffects from earlier.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
12 women who have just started taking dancing lessons are followed for up to a year, to see whether they are still taking dancing lessons, or have quit. The “event” here is “quit”.
the combination of explanatory variables you are looking at
the time at which you are looking at them (when more time has passed, it is more likely that the event has happened, so the “survival probability” should be lower).
look at effect of age by comparing ages 20 and 40, and later look at the effect of treatment (values 1 and 0).
Also have to provide some times to predict for, in Months.
Effect of age
new <-datagrid(model = dance.1, Age =c(20, 40), Months =c(3, 5, 7))new
The estimated survival probabilities go down over time. For example a 20-year-old woman here has estimated probability 0.0293 of still dancing after 5 months.
A graph
We can plot the predictions over time for an experimental condition such as age. The key for plot_predictions is to put time first in the condition:
plot_predictions(dance.1, condition =c("Months", "Age"), type ="survival") +coord_cartesian(ylim =c(0, 1)) # y-axis from 0 to 1
Ignoring unknown labels:
• linetype : "Age"
Warning: Removed 60 rows containing missing values or values outside the scale range
(`geom_ribbon()`).
Comments
The plot picks some representative ages.
It is (usually) best to be up and to the right (has the highest chance of surviving longest).
Hence the oldest women have the best chance to still be dancing longest (the youngest women are most likely to quit soonest).
The effect of treatment
The same procedure will get predictions for women who did or did not go to the dance competition, at various times:
Women of this age have a high (0.879) chance of still dancing after 7 months if they went to the dance competition, but much lower (almost zero) if they did not.
A graph
Again, time first, effect of interest second (as colours):
plot_predictions(dance.1, condition =c("Months", "Treatment"), type ="survival") +coord_cartesian(ylim =c(0, 1)) -> g
The graph
g
Ignoring unknown labels:
• linetype : "Treatment"
Warning: Removed 50 rows containing missing values or values outside the scale range
(`geom_ribbon()`).
Comments
The survival curve for Treatment 1 is higher all the way along
Hence at any time, the women who went to the dance competition have a higher chance of still dancing than those who did not.
The model summary again
summary(dance.1)
Call:
coxph(formula = Surv(Months, Quit) ~ Treatment + Age, data = dance)
n= 12, number of events= 10
coef exp(coef) se(coef) z Pr(>|z|)
Treatment -4.44915 0.01169 2.60929 -1.705 0.0882 .
Age -0.36619 0.69337 0.15381 -2.381 0.0173 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
Treatment 0.01169 85.554 7.026e-05 1.9444
Age 0.69337 1.442 5.129e-01 0.9373
Concordance= 0.964 (se = 0.039 )
Likelihood ratio test= 21.68 on 2 df, p=2e-05
Wald test = 5.67 on 2 df, p=0.06
Score (logrank) test = 14.75 on 2 df, p=6e-04
Comments
The numbers in the coef column describe effect of that variable on log-hazard of quitting.
Both numbers are negative, so a higher value on both variables goes with a lower hazard of quitting:
an older woman is less likely to quit soon (more likely to be still dancing)
a woman who went to the dance competition (Treatment = 1) is less likely to quit soon vs. a woman who didn’t (more likely to be still dancing).
Model checking
With regression, usually plot residuals against fitted values.
Not quite same here (nonlinear model), but “martingale residuals” should have no pattern vs. “linear predictor”.
Use broom ideas to get them, in .resid and .fitted as below.
Martingale residuals can go very negative, so won’t always look normal.
Martingale residuals
dance.1%>%augment(dance) %>%ggplot(aes(x = .fitted, y = .resid)) +geom_point() +geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
A more realistic example: lung cancer
When you load in an R package, get data sets to illustrate functions in the package.
One such is lung. Data set measuring survival in patients with advanced lung cancer.
Along with survival time, number of “performance scores” included, measuring how well patients can perform daily activities.
Sometimes high good, but sometimes bad!
Variables below, from the data set help file (?lung).
inst time status age
Min. : 1.00 Min. : 5.0 Min. :1.000 Min. :39.00
1st Qu.: 3.00 1st Qu.: 166.8 1st Qu.:1.000 1st Qu.:56.00
Median :11.00 Median : 255.5 Median :2.000 Median :63.00
Mean :11.09 Mean : 305.2 Mean :1.724 Mean :62.45
3rd Qu.:16.00 3rd Qu.: 396.5 3rd Qu.:2.000 3rd Qu.:69.00
Max. :33.00 Max. :1022.0 Max. :2.000 Max. :82.00
NA's :1
sex ph.ecog ph.karno pat.karno
Min. :1.000 Min. :0.0000 Min. : 50.00 Min. : 30.00
1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 75.00 1st Qu.: 70.00
Median :1.000 Median :1.0000 Median : 80.00 Median : 80.00
Mean :1.395 Mean :0.9515 Mean : 81.94 Mean : 79.96
3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.: 90.00
Max. :2.000 Max. :3.0000 Max. :100.00 Max. :100.00
NA's :1 NA's :1 NA's :3
meal.cal wt.loss
Min. : 96.0 Min. :-24.000
1st Qu.: 635.0 1st Qu.: 0.000
Median : 975.0 Median : 7.000
Mean : 928.8 Mean : 9.832
3rd Qu.:1150.0 3rd Qu.: 15.750
Max. :2600.0 Max. : 68.000
NA's :47 NA's :14
inst time status age
Min. : 1.00 Min. : 5.0 Min. :1.000 Min. :39.00
1st Qu.: 3.00 1st Qu.: 174.5 1st Qu.:1.000 1st Qu.:57.00
Median :11.00 Median : 268.0 Median :2.000 Median :64.00
Mean :10.71 Mean : 309.9 Mean :1.719 Mean :62.57
3rd Qu.:15.00 3rd Qu.: 419.5 3rd Qu.:2.000 3rd Qu.:70.00
Max. :32.00 Max. :1022.0 Max. :2.000 Max. :82.00
sex ph.ecog ph.karno pat.karno
Min. :1.000 Min. :0.0000 Min. : 50.00 Min. : 30.00
1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 70.00 1st Qu.: 70.00
Median :1.000 Median :1.0000 Median : 80.00 Median : 80.00
Mean :1.383 Mean :0.9581 Mean : 82.04 Mean : 79.58
3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.: 90.00
Max. :2.000 Max. :3.0000 Max. :100.00 Max. :100.00
meal.cal wt.loss
Min. : 96.0 Min. :-24.000
1st Qu.: 619.0 1st Qu.: 0.000
Median : 975.0 Median : 7.000
Mean : 929.1 Mean : 9.719
3rd Qu.:1162.5 3rd Qu.: 15.000
Max. :2600.0 Max. : 68.000
lung.1<-coxph(Surv(time, status ==2) ~ . - inst - time - status,data = lung.complete)
Warning in terms.formula(formula, specials = ss, data = data):
'varlist' has changed (from nvar=9) to new 11 after EncodeVars() --
should no longer happen!
Warning in terms.formula(formula, special, data = data): 'varlist'
has changed (from nvar=9) to new 11 after EncodeVars() -- should no
longer happen!
Warning in coxph(Surv(time, status == 2) ~ . - inst - time - status,
data = lung.complete): a variable appears on both the left and right
sides of the formula
# A tibble: 2 × 3
term estimate p.value
<chr> <dbl> <dbl>
1 sex -0.510 0.00958
2 ph.ecog 0.483 0.000266
summary(lung.3)
Call:
coxph(formula = Surv(time, status == 2) ~ sex + ph.ecog, data = lung.complete)
n= 167, number of events= 120
coef exp(coef) se(coef) z Pr(>|z|)
sex -0.5101 0.6004 0.1969 -2.591 0.009579 **
ph.ecog 0.4825 1.6201 0.1323 3.647 0.000266 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
sex 0.6004 1.6655 0.4082 0.8832
ph.ecog 1.6201 0.6172 1.2501 2.0998
Concordance= 0.641 (se = 0.031 )
Likelihood ratio test= 19.48 on 2 df, p=6e-05
Wald test = 19.35 on 2 df, p=6e-05
Score (logrank) test = 19.62 on 2 df, p=5e-05
Check whether that was OK
anova(lung.3, lung.2)
Analysis of Deviance Table
Cox model: response is Surv(time, status == 2)
Model 1: ~ sex + ph.ecog
Model 2: ~ sex + ph.ecog + ph.karno + wt.loss
loglik Chisq Df Pr(>|Chi|)
1 -498.38
2 -495.67 5.4135 2 0.06675 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Just OK.
Commentary
OK (just) to take out those two covariates.
Both remaining variables strongly significant.
Nature of effect on survival time? Consider later.
Picture?
Plotting survival probabilities
Assess (separately) the effect of sex and ph.ecog score using plot_predictions
Don’t forget to add time (here actually called time) to the condition.
Effect of sex:
plot_predictions(lung.3, condition =c("time", "sex"), type ="survival")
Ignoring unknown labels:
• linetype : "sex"
Females (sex = 2) have better survival than males.
This graph from a mean ph.ecog score, but the male-female comparison is the same for any score.
Effect of ph.ecog score:
plot_predictions(lung.3, condition =c("time", "ph.ecog"), type ="survival")
Ignoring unknown labels:
• linetype : "ph.ecog"
Comments
A lower ph.ecog score is better.
For example, a patient with a score of 0 has almost a 50-50 chance of living 500 days, but a patient with a score of 3 has almost no chance to survive that long.
Is this for males or females? See over. (The comparison of scores is the same for both.) How many males and females did we observe?
lung %>%count(sex)
sex n
1 1 138
2 2 90
Sex and ph.ecog score
plot_predictions(lung.3, condition =c("time", "ph.ecog", "sex"), type ="survival")
Ignoring unknown labels:
• linetype : "ph.ecog"
Comments
The previous graph was (closer to) males. There were more males in the dataset (sex of 1).
This pair of graphs shows the effect of ph.ecog score (above and below on each facet), and the effect of males (left) vs. females (right).
The difference between males and females is about the same as 1 point on the ph.ecog scale (compare the red curve on the left facet with the green curve on the right facet).
A second go
Can we put the male and female survival curves on the same plot?
Try building plot ourselves:
plot_predictions(lung.3, condition =c("time", "ph.ecog", "sex"), type ="survival", draw =FALSE) %>%select(estimate, time, ph.ecog, sex) -> dddd
Comments