You can also download a PDF copy of this lecture.
Without much loss of generality we will limit the discussion here to right-censoring. We define an indicator variable Di such that Di={1,if the i-th observation is not censored,0,if the i-th observation is censored. The variable Di can be viewed as another response variable which depends on the actual time to event, Ti, as well as whatever is responsible for the censoring. In what follows we will let ti and di denote observed values of Ti and Di respectively.
Let ti be the actual time-till-event if di=1, and the lower-bound on the time-till-event if di=0 so that the actual time-till-event is greater than or equal to ti. Under certain assumptions about how censoring occurs, the likelihood function is L=n∏i=1f(ti)diP(Ti≥ti)1−di. where f(ti) is the probability density function of Ti, and P(Ti≥ti) is the probability that Ti is at least ti (this is also called the survival function). Note that f(ti)diP(Ti≥ti)1−di={f(ti),if di=1 (i.e., not censored),P(Ti≥ti),if di=0 (i.e., censored), so the indicator variable di simply selects the appropriate term for computing the likelihood of an observation depending on whether or not it was censored.
Surv
For right-censoring, the response variable can be specified
as Surv(t,d)
where t
is (a) the actual time to
event if there is no censoring or (b) the lower bound on time to event
if the observation is right-censored, and d
is either an
indicator variable (i.e., 0
or 1
) or a logical
variable (i.e., FALSE
or TRUE
) where we have
d = 1
or d = TRUE
if the observation is
not censored.
Example: Consider an AFT model for the
leukemia
data.
library(survival) # for leukemia and survreg
head(leukemia) # status=1 if remission ended at that time, status=0 if right-censored
time status x
1 9 1 Maintained
2 13 1 Maintained
3 13 0 Maintained
4 18 1 Maintained
5 23 1 Maintained
6 28 0 Maintained
m <- survreg(Surv(time, status) ~ x, dist = "lognormal", data = leukemia)
summary(m)$table
Value Std. Error z p
(Intercept) 3.579 0.285 12.571 3.04e-36
xNonmaintained -0.724 0.380 -1.905 5.68e-02
Log(scale) -0.145 0.170 -0.858 3.91e-01
Alternatively suppose we had a variable censored
that
told us if the observation was censored or not.
leukemia$censored <- factor(leukemia$status, labels = c("yes","no"))
head(leukemia)
time status x censored
1 9 1 Maintained no
2 13 1 Maintained no
3 13 0 Maintained yes
4 18 1 Maintained no
5 23 1 Maintained no
6 28 0 Maintained yes
Then we specify the censoring as follows.
m <- survreg(Surv(time, censored == "no") ~ x, dist = "lognormal", data = leukemia)
summary(m)$table
Value Std. Error z p
(Intercept) 3.579 0.285 12.571 3.04e-36
xNonmaintained -0.724 0.380 -1.905 5.68e-02
Log(scale) -0.145 0.170 -0.858 3.91e-01
It is useful to note that we can see how Surv
codes the
response variable for censoring. This is useful if you want to verify
that you have used Surv
correctly.
leukemia$ysurv <- Surv(leukemia$time, leukemia$censored == "no")
head(leukemia)
time status x censored ysurv
1 9 1 Maintained no 9
2 13 1 Maintained no 13
3 13 0 Maintained yes 13+
4 18 1 Maintained no 18
5 23 1 Maintained no 23
6 28 0 Maintained yes 28+
As before, interpretation is facilitated by applying the exponential function to the parameter estimates.
exp(cbind(coef(m),confint(m)))
2.5 % 97.5 %
(Intercept) 35.833 20.51 62.60
xNonmaintained 0.485 0.23 1.02
leukemia$x <- relevel(leukemia$x, ref = "Nonmaintained")
m <- survreg(Surv(time, status) ~ x, dist = "lognormal", data = leukemia)
summary(m)$table
Value Std. Error z p
(Intercept) 2.854 0.254 11.242 2.55e-29
xMaintained 0.724 0.380 1.905 5.68e-02
Log(scale) -0.145 0.170 -0.858 3.91e-01
exp(cbind(coef(m),confint(m)))
2.5 % 97.5 %
(Intercept) 17.36 10.556 28.56
xMaintained 2.06 0.979 4.35
Interval censoring occurs when Ti is only known to be between two numbers such that a<Ti<b where 0≤a<b≤∞. Note that right-censoring is a special case where b=∞, and left-censoring is a special case where a=0.
Example: Consider the following data from a study of the time till cosmetic deterioration for breast cancer patients undergoing radiotherapy alone versus radiotherapy and chemotherapy.
library(mable)
head(cosmesis, 10)
left right treat
1 45 NA RT
2 6 10 RT
3 0 7 RT
4 46 NA RT
5 46 NA RT
6 7 16 RT
7 17 NA RT
8 7 14 RT
9 37 44 RT
10 0 8 RT
Note that these data include left-censoring, interval-censoring, and
right-censoring).
Using the
Surv
function to specify censoring requires that
lower bounds of 0 and upper bounds of ∞ be replaced with
NA
.
cosmesis$left <- ifelse(cosmesis$left == 0, NA, cosmesis$left)
head(cosmesis, 10)
left right treat
1 45 NA RT
2 6 10 RT
3 NA 7 RT
4 46 NA RT
5 46 NA RT
6 7 16 RT
7 17 NA RT
8 7 14 RT
9 37 44 RT
10 NA 8 RT
tail(cosmesis, 10)
left right treat
85 14 19 RCT
86 4 8 RCT
87 34 NA RCT
88 30 36 RCT
89 18 24 RCT
90 16 60 RCT
91 35 39 RCT
92 21 NA RCT
93 11 20 RCT
94 48 NA RCT
It is also useful to note that you can accommodate an observation that is not censored by specifying equal left and right interval endpoints.
We can verify the censoring specification by looking at what
Surv
produces.
cosmesis$y <- with(cosmesis, Surv(left, right, type = "interval2"))
head(cosmesis, 10)
left right treat y
1 45 NA RT 45+
2 6 10 RT [ 6, 10]
3 NA 7 RT 7-
4 46 NA RT 46+
5 46 NA RT 46+
6 7 16 RT [ 7, 16]
7 17 NA RT 17+
8 7 14 RT [ 7, 14]
9 37 44 RT [37, 44]
10 NA 8 RT 8-
Now we can estimate an AFT model.
m <- survreg(Surv(left, right, type = "interval2") ~ treat,
dist = "lognormal", data = cosmesis)
summary(m)$table
Value Std. Error z p
(Intercept) 3.548 0.154 23.01 3.45e-117
treatRCT -0.421 0.203 -2.07 3.83e-02
Log(scale) -0.125 0.109 -1.15 2.52e-01
Applying the exponential function helps interpret the effect of the treatment.
exp(cbind(coef(m), confint(m)))
2.5 % 97.5 %
(Intercept) 34.739 25.680 46.995
treatRCT 0.656 0.441 0.978
Using flexsurvreg
produces the same information but in
one output.
library(flexsurv)
Warning: package 'flexsurv' was built under R version 4.4.3
m <- flexsurvreg(Surv(left, right, type = "interval2") ~ treat,
dist = "lognormal", data = cosmesis)
print(m)
Call:
flexsurvreg(formula = Surv(left, right, type = "interval2") ~
treat, data = cosmesis, dist = "lognormal")
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
meanlog NA 3.5479 3.2457 3.8500 0.1542 NA NA NA
sdlog NA 0.8821 0.7118 1.0933 0.0966 NA NA NA
treatRCT 0.5106 -0.4210 -0.8192 -0.0228 0.2032 0.6564 0.4408 0.9775
N = 94, Events: 0, Censored: 94
Total time at risk: 2089
Log-likelihood = -147, df = 3
AIC = 299
Again, it is sometimes helpful for interpretation to change the reference level when dealing with categorical explanatory varaibles.
cosmesis$treat <- relevel(cosmesis$treat, ref = "RCT")
m <- flexsurvreg(Surv(left, right, type = "interval2") ~ treat,
dist = "lognormal", data = cosmesis)
print(m)
Call:
flexsurvreg(formula = Surv(left, right, type = "interval2") ~
treat, data = cosmesis, dist = "lognormal")
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
meanlog NA 3.1269 2.8558 3.3980 0.1383 NA NA NA
sdlog NA 0.8821 0.7118 1.0933 0.0966 NA NA NA
treatRT 0.4894 0.4210 0.0228 0.8192 0.2032 1.5235 1.0230 2.2688
N = 94, Events: 0, Censored: 94
Total time at risk: 2089
Log-likelihood = -147, df = 3
AIC = 299
The survival function is S(t)=P(T≥t), i.e., the probability of a survival time of at least t. It is sometimes defined as S(t)=P(T>t) rather than S(t)=P(T≥t), but if time is modeled
as a continuous random variable this distinction does not matter.
Another useful property of survival functions is that the area under the
survival curve equals the expected survival time E(Ti), assuming S(0)=0 (i.e., no events have happened
at time zero) and S(∞)=1
(i.e., events eventually do happen).
Technical Explanation: Accelerated failure time models can be interpreted in terms of effects on survival functions. Let Tb=eβ0eβ1x1eβ2x2⋯eβkxkeσϵ, and let Ta=eβ1Tb as before where Ta and Tb are the survival times when the first explanatory variable assumes values of xa and xb, respectively. The survival functions for Ta and Tb are then Sa(t)=P(Ta≥t) and Sb(t)=P(Tb≥t), respectively. These survival functions are related because Sb(t)=P(Tb≥t)=P(eβ1Tb≥eβ1t)=P(Ta≥eβ1t)=Sa(eβ1t). That is, Sb(t)=Sa(eβ1t) and also Sb(t/eβ1)=Sa(t). So we can say the following.
The probability of survival past t at xb equals the probability of survival past eβ1t at xa.
The probability of survival past t at xa equals the probability of survival past t/eβ1 at xb.
It can also be shown that we can “order” the survival functions/probabilities from an AFT model because βj>0⇔eβj>1⇔Sb(t)<Sa(t),βj<0⇔eβj<1⇔Sb(t)>Sa(t). Note that with an AFT model the survival functions at two different values of an explanatory variable do not cross.
In an AFT model the explanatory variables can be viewed as “compressing” or “stretching” time which has the effect of “horizontally compressing/stretching” the survival function. Assume Ti=eβ0eβ1xi1⋯eβkxikeσϵi and let Si(t) be the survival function of Ti. Then Si(t)=P(Ti≥t)=P(eβ0eβ1xi1⋯eβkxikeσϵi≥t)=P[eβ0eσϵi≥t/(eβ1xi1⋯eβkxik)]. If all xij=0 then Ti=eβ0eσϵi with a “baseline” survival function S0(t)=P(eβ0eσϵi≥t). Then Si(t)=S0[t/(eβ1xi1⋯eβkxik)] and Si(teβ1xi1⋯eβkxik)=S0(t). So the explanatory variables effectively “horizontally” compress or stretch a (hypothetical) baseline survival function. Also in terms of the actual times, if T0=eβ0eσϵ represents a “baseline” survival time when all xij=0, the Ti=eβ1xi1⋯eβkxikT0, so that again the values of the explanatory variables have the effect of “stretching” or “compressing” time time scale.
Example: Recall the AFT model for the
lifespan
data where the model is logTi=β0+β1xi, where xi is an indicator
variable such that xi = 1 if the
species is human (so xa = 1 in the
above discussion), and xi = 0 if
the species is dog (so xb = 0 in
the above discussion). The estimate of β1 was ˆβ1 ≈ 1.946 so that eˆβ1 ≈ 7. The “baseline” survival
function is the survival function for dogs, which we can write as Sd(t). The survival function for humans
is then Sh(7t). For example, we
estimate that the probability that a dog lives for 10 or more years
equals the probability that a human will live for 70 or more years
because Sd(t)=Sh(7t) where
t = 10. The survival function of a
human is obtained by “stretching” the survival function of a dog by a
factor of 7.
If we re-parameters the model so that xi = 1 if the species is dog,
then we have that ˆβ1 ≈ 1/7, so that the “baseline”
survival function is for humans, and we have that Sh(t)=Sd(t/7). We can also say that
we estimate that the probability that a human lives to be 35 or more
equals the probability that a dog lives to be 5 or more because Sd(t/7)=Sh(t) where t = 35. The survival function of a dog is
obtained by “compressing” the survival function of a human by a factor
of 1/7.
Example: Recall the AFT model for the
motors
data where the model is logTi=β0+β1xi, where xi is temperature.
The estimate of β1 was ˆβ1 ≈ -0.047 so that eˆβ1≈0.95. Thus Sx+1(0.95t)=Sx(t) where the
subscript of x represents
temperature. Increasing by one degree “compresses” the survival function
by a factor of about 0.95 (i.e., 5%).
Estimating and plotting survival functions is relatively easy using
flexsurvreg
objects. Here the summary
function
behaves more like predict
for other model objects produced
by lm
, nls
, and glm
.
Example: The estimated survival functions for the
AFT model for the lifespan
data can be computed/plotted as
follows.
library(trtools) # for lifespan data
m <- flexsurvreg(Surv(years) ~ species, dist = "lognormal", data = lifespan)
d <- data.frame(species = c("dog","human"))
d <- summary(m, newdata = d, t = seq(0, 100, by = 0.5), type = "survival", tidy = TRUE)
head(d)
time est lcl ucl species
1 0.0 1.000 1.000 1.000 dog
2 0.5 1.000 1.000 1.000 dog
3 1.0 1.000 1.000 1.000 dog
4 1.5 0.999 0.999 0.999 dog
5 2.0 0.995 0.994 0.996 dog
6 2.5 0.987 0.984 0.990 dog
tail(d)
time est lcl ucl species
397 97.5 0.261 0.240 0.282 human
398 98.0 0.258 0.238 0.280 human
399 98.5 0.256 0.235 0.277 human
400 99.0 0.253 0.232 0.274 human
401 99.5 0.250 0.230 0.271 human
402 100.0 0.248 0.227 0.268 human
p <- ggplot(d, aes(x = time, y = est)) +
geom_line(aes(linetype = species)) +
labs(x = "Time (years)", y = "S(t)", linetype = "Species") +
theme_minimal() +
theme(legend.position = "inside", legend.position.inside = c(0.8,0.8))
plot(p)
Example: Survival functions at different
temperatures based on the AFT model for the motors
data can
be computed/plotted as follows.
library(MASS) # for motors data frame
m <- flexsurvreg(Surv(time, cens) ~ temp, dist = "lognormal", data = motors)
d <- data.frame(temp = c(150,175,200))
d <- summary(m, newdata = d, t = seq(0, 10000, length = 100),
type = "survival", tidy = TRUE)
p <- ggplot(d, aes(x = time, y = est, linetype = factor(temp))) +
geom_line() + theme_minimal() +
labs(x = "Time (Hours)", y = "S(t)", linetype = "Temperature (C)")
plot(p)
p <- ggplot(d, aes(x = time, y = est)) +
geom_line() + facet_wrap(~ temp, nrow = 1) +
geom_ribbon(aes(ymin = lcl, ymax = ucl), alpha = 0.1) +
labs(x = "Time (Hours)", y = "S(t)") + theme_minimal()
plot(p)