You can also download a PDF copy of this lecture.
Let \(T\) be a continuous random variable that is time-till-event. Four related functions are used to describe the distribution of \(T\).
The probability density function of \(T\) is \[ f(t) = \lim_{\delta t \rightarrow 0} \frac{P(t \le T < t + \delta t)}{\delta t}. \] If \(\delta\) is relatively small then \(P(t \le T < t + \delta t) \approx f(t)\delta t\) and so \(f(t) \approx P(t \le T < t + \delta t)/(\delta t)\) and thus \(f(t)\) is approximately proportional to the probability that \(T\) is between \(t\) and \(t + \delta t\). So \(f(t)\) is approximately proportional to the probability that the event will happen “near” \(t\).
For the distribution below, the probability that \(T\) is approximately 40 (say, between 39 and 41) equals the area under the curve and between 39 and 41. This probability is approximated by the rectangle, which has area \(wf(40)\), where \(w\) = 2 is the width of the rectangle and \(f(40)\) is the height of the rectangle. So the probability that \(T\) is approximately 40 is proportional to \(f(40)\).
The survival function is \[ S(t) = P(T \ge t). \] It equals the area under \(f(t)\) and between \(t\) and \(\infty\). The area under \(S(t)\) equals \(E(T)\) if \(S(0) = 1\) and \(S(\infty) = 1\).
Thus \[
S(t) = \int_t^{\infty} \!\!\! f(z)dz.
\]
The hazard function is \[ h(t) = \lim_{\delta t \rightarrow 0} \frac{P(t \le T < t + \delta t|T \ge t)}{\delta t} = \frac{f(t)}{S(t)}. \] If \(\delta\) is relatively small then \(h(t)\) is approximately proportional to the probability that \(t \le T < t + \delta t\) given survival up to \(t\) — i.e., \(T \ge t\). So \(h(t)\) is approximately proportional to the probability of the event happening at near time \(t\) if it has not yet happened.
A wide variety of distributions can be used for parametric survival models such as AFT models. Below is a list of just some of those distributions. One of the more noticeable differences between them is the shape of their hazard functions.
Log-normal. The distribution of \(\log(T_i)\) is normal. Single-peaked hazard
function. Known as lognormal
by survreg
and
flexsurvreg
, and also lnorm
by
flexsurvreg
.
Log-logistic. The distribution of \(\log(T_i)\) is logistic. Single-peaked or
decreasing hazard function. Known as loglogistic
by
survreg
and flexsurvreg
and
llogis
by flexsurvreg
.
Gamma. Monotonic or flat hazard function. Known as
gamma
to flexsurvreg
.
Weibull. Monotonic or flat hazard function. Known as
weibull
to both survreg
and
flexsurvreg
.
Exponential. Flat hazard function (“memoryless”). Known
as exp
to flexsurvreg
but also as a special
case of weibull
if scale = 1
with
survreg
.
Gompertz. Increasing hazard function. Known as
gompertz
to flexsurvreg
.
Generalized gamma. Monotonic, single-peaked, and
“bathtub” hazard functions. The exponential, Weibull, gamma, and
log-normal are special cases. Known as gengamma
to
flexsurvreg
.
Generalized F. Single-peaked or decreasing. Known as
genf
to flexsurvreg
.
The summary
function can be used to estimate the hazard
function based on a flexsurvreg
model object.
Example: Consider data from an experiment on the effects of sexual activity on the lifespan of the male fruitfly. Thorax length was used as a covariate.
library(faraway)
p <- ggplot(fruitfly, aes(x = thorax, y = longevity)) +
geom_point() + facet_wrap(~ activity, ncol = 5) +
labs(x = "Thorax Length (mm)", y = "Longevity (days)") +
theme_minimal()
plot(p)
m <- flexsurvreg(Surv(longevity) ~ activity + thorax,
data = fruitfly, dist = "gamma")
d <- data.frame(activity = levels(fruitfly$activity), thorax = 0.8)
d <- summary(m, newdata = d, t = seq(0, 100, length = 100),
type = "hazard", tidy = TRUE)
head(d)
time est lcl ucl activity thorax
1 0.00 0.00e+00 0.00e+00 0.00e+00 isolated 0.8
2 1.01 1.27e-38 4.28e-50 2.13e-30 isolated 0.8
3 2.02 1.90e-30 3.34e-39 4.36e-24 isolated 0.8
4 3.03 9.30e-26 4.18e-33 1.89e-20 isolated 0.8
5 4.04 1.70e-22 7.32e-29 6.28e-18 isolated 0.8
6 5.05 5.14e-20 1.23e-25 5.12e-16 isolated 0.8
p <- ggplot(d, aes(x = time, y = est, color = activity)) +
geom_line() + theme_minimal() +
labs(x = "Days", y = "h(t)", color = "Condition", title = "Hazard Functions")
plot(p)
d <- expand.grid(activity = levels(fruitfly$activity), thorax = c(0.7,0.8,0.9))
d <- summary(m, newdata = d, t = seq(0, 100, length = 100),
type = "hazard", tidy = TRUE)
p <- ggplot(d, aes(x = time, y = est, color = activity)) +
geom_line() + theme_minimal() +
labs(x = "Days", y = "h(t)", color = "Condition", title = "Hazard Functions") +
facet_wrap(~ thorax)
plot(p)
For comparison here are the survival functions.
d <- expand.grid(activity = levels(fruitfly$activity), thorax = c(0.7,0.8,0.9))
d <- summary(m, newdata = d, t = seq(0, 100, length = 100),
type = "survival", tidy = TRUE)
p <- ggplot(d, aes(x = time, y = est, color = activity)) +
geom_line() + theme_minimal() +
labs(x = "Days", y = "S(t)", color = "Condition", title = "Survival Functions") +
facet_wrap(~ thorax)
plot(p)
And here are the probability density functions.
d <- expand.grid(activity = levels(fruitfly$activity), thorax = c(0.7,0.8,0.9))
d <- summary(m, newdata = d, t = seq(0, 100, length = 100),
fn = function(t, ...) dgamma(t, ...), tidy = TRUE)
p <- ggplot(d, aes(x = time, y = est, color = activity)) +
geom_line() + theme_minimal() +
labs(x = "Days", y = "f(t)", color = "Condition", title = "Density Functions") +
facet_wrap(~ thorax)
plot(p)
Note that we can adapt this to other distributions by adding a
d
to the beginning of the distribution name recognized by
flexsurvreg
. This include log-normal (dlnorm
),
log-logistic (dllogis
), gamma (dgamma
),
Weibull (dweibull
), exponential (dexp
),
Gompertz (dgompertz
), generalized gamma
(dgengamma
), and generalized F
(dgenf
).
Finally we can also plot the expected survival time. This is
analogous to using predict
with
type = response
in a GLM.
d <- expand.grid(activity = levels(fruitfly$activity),
thorax = seq(0.6, 1.0, length = 100))
d <- summary(m, newdata = d, type = "mean", tidy = TRUE)
head(d)
est lcl ucl activity thorax
1 33.1 29.1 37.4 isolated 0.600
2 35.0 31.0 39.5 one 0.600
3 29.5 25.9 33.9 low 0.600
4 36.0 32.0 40.4 many 0.600
5 21.9 19.4 24.6 high 0.600
6 33.5 29.4 37.8 isolated 0.604
p <- ggplot(d, aes(x = thorax, y = est, color = activity)) +
geom_line() + theme_minimal() +
labs(x = "Thorax Length (mm)", y = "E(T)", color = "Condition",
title = "Expected Survival Time")
plot(p)
p <- ggplot(fruitfly, aes(x = thorax, y = longevity)) +
geom_point() + facet_wrap(~ activity, ncol = 5) +
labs(x = "Thorax Length (mm)", y = "Longevity (days)",
title = "Observed and Expected Survival Time") +
theme_minimal() + geom_line(aes(y = est), data = d)
plot(p)
Example: Consider an AFT model for the
leukemia
data. Note that the labels of “alive” and “dead”
are not
library(survival)
leukemia$status <- factor(leukemia$status, labels = c("in","out"))
m <- flexsurvreg(Surv(time, status == "out") ~ x, dist = "weibull", data = leukemia)
# create plot of hazard functions
d <- data.frame(x = c("Maintained","Nonmaintained"))
d <- summary(m, newdata = d, t = seq(1, 200, length = 1000),
type = "hazard", tidy = TRUE)
p <- ggplot(d, aes(x = time, y = est)) +
geom_line(aes(linetype = x)) + theme_minimal() +
labs(x = "Time", y = "h(t)", linetype = "Extended",
title = "Hazard Functions") +
theme(legend.position = "inside", legend.position.inside = c(0.7, 0.5))
p.h <- p
# create plot of survival functions
d <- data.frame(x = c("Maintained","Nonmaintained"))
d <- summary(m, newdata = d, t = seq(1, 200, length = 1000),
type = "survival", tidy = TRUE)
p <- ggplot(d, aes(x = time, y = est)) +
geom_line(aes(linetype = x)) + theme_minimal() +
labs(x = "Time", y = "S(t)", linetype = "Extended",
title = "Survival Functions") +
theme(legend.position = "inside", legend.position.inside = c(0.7, 0.7))
p.s <- p
# create plot of probability density functions
d <- data.frame(x = c("Maintained","Nonmaintained"))
d <- summary(m, newdata = d, t = seq(1, 200, length = 1000),
fn = function(t, ...) dweibull(t, ...), tidy = TRUE)
p <- ggplot(d, aes(x = time, y = est)) +
geom_line(aes(linetype = x)) + theme_minimal() +
labs(x = "Time", y = "f(t)", linetype = "Extended",
title = "Probability Density Functions") +
theme(legend.position = "inside", legend.position.inside = c(0.7, 0.7))
p.d <- p
# put the plots together into one plot
cowplot::plot_grid(p.h, p.s, p.d, ncol = 3)
We can also plot the raw data with the estimated expected survival times
and confidence intervals for the estimated expected survival time.
d <- summary(m, newdata = data.frame(x = c("Maintained","Nonmaintained")),
type = "mean", tidy = TRUE)
d
est lcl ucl x
1 56.6 32.1 99.5 Maintained
2 22.3 13.9 35.4 Nonmaintained
p <- ggplot(leukemia, aes(x = x, y = time)) +
geom_dotplot(aes(fill = status), stackdir = "center", binaxis = "y",
binwidth = 1, dotsize = 2, alpha = 0.5) + coord_flip() +
scale_fill_manual(name = "Status", values = c("white","black")) +
geom_pointrange(aes(y = est, ymin = lcl, ymax = ucl),
shape = 3, data = d) +
labs(x = "Maintained", y = "Remission Time (weeks)") +
theme_classic() +
theme(legend.position = "inside", legend.position.inside = c(0.8, 0.8))
plot(p)
A very useful feature of the flexsurv package is that a user can program their own distribution for use with the functions therein.
Let \(h_0(t)\) be the “baseline” hazard function (i.e., the hazard function when all \(x_j = 0\)). A proportional hazards model has the form \[ h_i(t) = h_0(t)e^{\beta_1 x_{i1}} e^{\beta_2 x_{i2}} \cdots e^{\beta_k x_{ik}}, \] so that \(h_i(t) \propto e^{\beta_1 x_{i1}} e^{\beta_2 x_{i2}} \cdots e^{\beta_k x_{ik}}\). Thus increasing \(x_j\) by one changes the hazard function by a factor of \(e^{\beta_j}\). This is the hazard ratio. For example, the hazard ratio for \(x_1\) is \[ \frac{h_0(t)e^{\beta_1 (x_1 + 1)} e^{\beta_2 x_2} \cdots e^{\beta_k x_k}}{h_0(t)e^{\beta_1 x_1} e^{\beta_2 x_2} \cdots e^{\beta_k x_k}} = e^{\beta_1}, \] since \(e^{\beta_1 (x_1 + 1)} = e^{\beta_1 x_1}e^{\beta_1}\).
AFT models with a Weibull distribution (or exponential, which is a special case of the Weibull distribution) are also proportional hazards models. Consider the AFT model, \[ \log T_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \sigma\epsilon_i, \] and the proportional hazards model \[ h_i(t) = h_0(t)\exp(\beta_1^* x_{i1} + \beta_2^* x_{i2} + \cdots + \beta_k^* x_{ik}), \] where in both cases \(T_i\) has a Weibull distribution. It can be shown that the models are equivalent with \[ \beta_j^* = -\beta_j/\sigma. \] The hazard ratios are \(e^{\beta_j^*}\).
An AFT model with a Weibull distribution is the only AFT model that is also a proportional hazards model. Other proportional hazards models exist, but none of the them are AFT models.
Example: We can estimate a Weibull proportional
hazards model for the leukemia
data using
survreg
as follows.
m <- survreg(Surv(time, status == "dead") ~ x, dist = "weibull", data = leukemia)
summary(m)
Call:
survreg(formula = Surv(time, status == "dead") ~ x, data = leukemia,
dist = "weibull")
Value Std. Error z p
(Intercept) NA 0.0 NA NA
xNonmaintained NA 0.0 NA NA
Log(scale) -30.9 0.0 -Inf <2e-16
Scale= 3.89e-14
Weibull distribution
Loglik(model)= -2400 Loglik(intercept only)= 0
Chisq= -4800 on 1 degrees of freedom, p= 1
Number of Newton-Raphson Iterations: 1
n= 23
The estimated hazard ratio is \(e^{\hat\beta_1^*}\) where \(\hat\beta_1^* \approx NA/0 \approx NA\) so
\(e^{\hat\beta_1^*} \approx NA\). Thus
\[
\frac{h_n(t)}{h_y(t)} = e^{\beta_1^*} \Leftrightarrow h_n(t) =
e^{\beta_1^*}h_y(t),
\] where we estimate the hazard ratio \(e^{\beta_1^*}\) to be NA. This conversion
can be done using the ConvertWeibull
function from the
SurvRegCensCov package.
library(SurvRegCensCov)
ConvertWeibull(m)
$vars
Estimate SE
lambda NA NA
gamma 2.57e+13 0
xNonmaintained NA NA
$HR
HR LB UB
xNonmaintained NA NA NA
$ETR
ETR LB UB
xNonmaintained NA NA NA
Another approach is to use dist = "weibullPH"
with
flexsurvreg
which uses a different parameterization of the
Weibull distribution so that applying the exponential function to the
parameters gives hazard ratios.
m <- flexsurvreg(Surv(time, status == "out") ~ x, dist = "weibullPH", data = leukemia)
print(m)
Call:
flexsurvreg(formula = Surv(time, status == "out") ~ x, data = leukemia,
dist = "weibullPH")
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 1.264295 0.891546 1.792889 0.225328 NA NA NA
scale NA 0.005544 0.000739 0.041565 0.005698 NA NA NA
xNonmaintained 0.521739 1.174962 0.149832 2.200092 0.523035 3.238021 1.161640 9.025845
N = 23, Events: 18, Censored: 5
Total time at risk: 678
Log-likelihood = -80.5, df = 3
AIC = 167
The proportionality can be seen when plotting the hazard functions.
d <- data.frame(x = c("Maintained","Nonmaintained"))
d <- summary(m, newdata = d, t = seq(1, 200, length = 1000),
type = "hazard", tidy = TRUE)
p <- ggplot(d, aes(x = time, y = est)) +
geom_line(aes(linetype = x)) + theme_minimal() +
labs(x = "Time", y = "h(t)", linetype = "Extended", title = "Hazard Functions") +
theme(legend.position = "inside", legend.position.inside = c(0.8, 0.5))
plot(p)
Example: Consider a Weibull proportional hazards
model for the motors
data.
m <- flexsurvreg(Surv(time, cens) ~ temp, data = MASS::motors, dist = "weibullPH")
print(m)
Call:
flexsurvreg(formula = Surv(time, cens) ~ temp, data = MASS::motors,
dist = "weibullPH")
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 2.99e+00 1.96e+00 4.56e+00 6.42e-01 NA NA NA
scale NA 6.34e-22 1.46e-30 2.76e-13 6.43e-21 NA NA NA
temp 1.82e+02 1.36e-01 7.92e-02 1.92e-01 2.87e-02 1.15e+00 1.08e+00 1.21e+00
N = 40, Events: 17, Censored: 23
Total time at risk: 140654
Log-likelihood = -147, df = 3
AIC = 301
Here we have that \[ h_{x+1}(t) = e^{\beta_1^*}h_x{(t)}, \] where \(h_x(t)\) and \(h_{x+1}(t)\) represent the hazard functions at temperatures of \(x\) and \(x+1\), respectively. The estimated hazard ratio is \(e^{\hat\beta_1^*} = 1.15\).
d <- summary(m, newdata = data.frame(temp = seq(110, 150, by = 10)),
t = seq(0, 8000, length = 1000), type = "hazard", tidy = TRUE, ci = FALSE)
p <- ggplot(d, aes(x = time, y = est, color = factor(temp))) +
geom_line() + theme_minimal() +
theme(legend.position = "inside", legend.position.inside = c(0.2, 0.6)) +
labs(x = "Hours", y = "h(t)", color = "Temperature", title = "Hazard Functions")
plot(p)