Friday, April 11

You can also download a PDF copy of this lecture.

In a discrete survival time model we model the hazard function $h(t) = P(T = t|T \ge t)$ (i.e., the probability of a unit not “surviving” to time $t+1$ given that it survived to time $t$ ). This is closely related to a family of models for ordered categorical response variables that are conceptualized as a series of “stages” or “phases” of some sort. But here instead we usually model $P(T > t|T \ge t)$ (i.e., the probability that a unit will transition to stage $t+1$ given that it made it to stage $t$ ). In terms of the hazard function $P(T > t|T \ge t) = 1 - P(T = t|T \ge t) = 1 - h(t).$ Warning: The VGAM package includes a function called margeff which computes instantaneous marginal effects for model objects created using the vglm function. To avoid conflicts, use trtools::margeff when using the margeff function from the trtools package if the VGAM package is loaded.

Example: The data frame pneumo from the package VGAM contains aggregated data of pneumoconiosis in coal miners.

library(VGAM)
print(pneumo)

  exposure.time normal mild severe
1           5.8     98    0      0
2          15.0     51    2      1
3          21.5     34    6      3
4          27.5     35    5      8
5          33.5     32   10      9
6          39.5     23    7      8
7          46.0     12    6     10
8          51.5      4    2      5

This kind of model can also be estimated using the vglm function from the VGAM package. With the original aggregated data we would specify the model as follows. Note that the order of the arguments to cbind is important. We want to order them from lowest/first to highest/last.

m <- vglm(cbind(normal,mild,severe) ~ exposure.time, 
    family = cratio(link = "logitlink"), data = pneumo)
summary(m)


Call:
vglm(formula = cbind(normal, mild, severe) ~ exposure.time, family = cratio(link = "logitlink"), 
    data = pneumo)

Coefficients: 
                Estimate Std. Error z value Pr(>|z|)    
(Intercept):1    -3.9664     0.4189   -9.47  < 2e-16 ***
(Intercept):2    -1.1133     0.7664   -1.45    0.146    
exposure.time:1   0.0963     0.0124    7.79  6.9e-15 ***
exposure.time:2   0.0355     0.0206    1.72    0.085 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: logitlink(P[Y>1|Y>=1]), logitlink(P[Y>2|Y>=2])

Residual deviance: 13.3 on 12 degrees of freedom

Log-likelihood: -29.2 on 12 degrees of freedom

Number of Fisher scoring iterations: 6 

Warning: Hauck-Donner effect detected in the following estimate(s):
'(Intercept):1'


Exponentiated coefficients:
exposure.time:1 exposure.time:2 
           1.10            1.04

exp(cbind(coef(m), confint(m)))

                         2.5 % 97.5 %
(Intercept):1   0.0189 0.00833 0.0431
(Intercept):2   0.3285 0.07314 1.4751
exposure.time:1 1.1011 1.07469 1.1281
exposure.time:2 1.0361 0.99517 1.0787

If the data are not aggregated (i.e., one observational unit per row) then the syntax is different. Here I disaggregate the data for demonstration.

library(tidyr)
pneumosingle <- pneumo %>% pivot_longer(c(normal,mild,severe), 
  names_to = "condition", values_to = "frequency") %>% uncount(frequency)
head(pneumosingle)

# A tibble: 6 × 2
  exposure.time condition
          <dbl> <chr>    
1           5.8 normal   
2           5.8 normal   
3           5.8 normal   
4           5.8 normal   
5           5.8 normal   
6           5.8 normal

tail(pneumosingle)

# A tibble: 6 × 2
  exposure.time condition
          <dbl> <chr>    
1          51.5 mild     
2          51.5 severe   
3          51.5 severe   
4          51.5 severe   
5          51.5 severe   
6          51.5 severe

An important step here is that we need to order the levels of condition appropriately since we cannot order it in cbind now.

pneumosingle$conditionf <- factor(pneumosingle$condition,
    levels = c("normal","mild","severe"), ordered = TRUE)
levels(pneumosingle$conditionf) # correct order

[1] "normal" "mild"   "severe"

We actually don’t need the ordered = TRUE here, as the levels argument will imply the order, but it avoids vglm throwing a warning.

Now we can specify the model as follows.

m <- vglm(conditionf ~ exposure.time, 
  family = cratio(link = "logitlink"), data = pneumosingle)
summary(m)


Call:
vglm(formula = conditionf ~ exposure.time, family = cratio(link = "logitlink"), 
    data = pneumosingle)

Coefficients: 
                Estimate Std. Error z value Pr(>|z|)    
(Intercept):1    -3.9664     0.4189   -9.47  < 2e-16 ***
(Intercept):2    -1.1133     0.7664   -1.45    0.146    
exposure.time:1   0.0963     0.0124    7.79  6.9e-15 ***
exposure.time:2   0.0355     0.0206    1.72    0.085 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: logitlink(P[Y>1|Y>=1]), logitlink(P[Y>2|Y>=2])

Residual deviance: 417 on 738 degrees of freedom

Log-likelihood: -208 on 738 degrees of freedom

Number of Fisher scoring iterations: 8 

Warning: Hauck-Donner effect detected in the following estimate(s):
'(Intercept):1'


Exponentiated coefficients:
exposure.time:1 exposure.time:2 
           1.10            1.04

exp(cbind(coef(m), confint(m)))

                         2.5 % 97.5 %
(Intercept):1   0.0189 0.00833 0.0431
(Intercept):2   0.3285 0.07314 1.4751
exposure.time:1 1.1011 1.07469 1.1281
exposure.time:2 1.0361 0.99517 1.0787

Now suppose we want to plot the model. First we can compute the probability of each condition as a function of exposure.

d <- data.frame(exposure.time = seq(5, 55, length = 100))
d <- cbind(d, predict(m, newdata = d, type = "response"))
head(d)

  exposure.time normal   mild  severe
1          5.00  0.970 0.0214 0.00838
2          5.51  0.969 0.0223 0.00890
3          6.01  0.967 0.0232 0.00945
4          6.52  0.966 0.0242 0.01003
5          7.02  0.964 0.0253 0.01064
6          7.53  0.962 0.0263 0.01129

We can use the pivot_longer function from the tidyr package to reshape the data for plotting.

library(tidyr)
d <- d %>% pivot_longer(c(normal,mild,severe), 
  names_to = "condition", values_to = "probability")
head(d)

# A tibble: 6 × 3
  exposure.time condition probability
          <dbl> <chr>           <dbl>
1          5    normal        0.970  
2          5    mild          0.0214 
3          5    severe        0.00838
4          5.51 normal        0.969  
5          5.51 mild          0.0223 
6          5.51 severe        0.00890

tail(d)

# A tibble: 6 × 3
  exposure.time condition probability
          <dbl> <chr>           <dbl>
1          54.5 normal          0.218
2          54.5 mild            0.239
3          54.5 severe          0.543
4          55   normal          0.209
5          55   mild            0.239
6          55   severe          0.552

p <- ggplot(d, aes(x = exposure.time, y = probability, color = condition)) + 
  geom_line() + theme_minimal() + 
  labs(x = "Exposure (Years)", y = "Probability", color = "Condition")
plot(p)

Alternatively we can plot the probability of passing from one condition to the next — i.e., $P(Y>y|Y \ge y)$ . But we need to compute those probabilities using the following fact. $P(Y > y|Y \ge y) = \frac{P(Y > y \text{ and } Y \ge y)}{P(Y \ge y)} = \frac{P(Y > y)}{P(Y \ge y)}.$ Note that this uses the definition of a conditional probability and the fact that if $Y > y$ and $Y \ge y$ then $Y > y$ . So $P(Y > 1|Y \ge 1) = \frac{P(Y > 1)}{P(Y \ge 1)} = \frac{P(Y=2) + P(Y=3)}{P(Y=1) + P(Y=2) + P(Y=3)},$ and $P(Y > 2|Y \ge 2) = \frac{P(Y > 2)}{P(Y \ge 2)} = \frac{P(Y=3)}{P(Y=2) + P(Y=3)}.$ So we can compute the probability by adding together category probabilities.

d <- data.frame(exposure.time = seq(5, 55, length = 100))
d <- cbind(d, predict(m, newdata = d, type = "response"))
# probability of going from normal to mild -- i.e., P(Y > normal|Y >= normal)
d$nm <- with(d, (mild + severe) / (normal + mild + severe))
# probability of going from mild to severe -- i.e., P(Y > mild|Y >= mild)
d$ms <- with(d, severe / (mild + severe))
head(d)

  exposure.time normal   mild  severe     nm    ms
1          5.00  0.970 0.0214 0.00838 0.0297 0.282
2          5.51  0.969 0.0223 0.00890 0.0312 0.285
3          6.01  0.967 0.0232 0.00945 0.0327 0.289
4          6.52  0.966 0.0242 0.01003 0.0343 0.293
5          7.02  0.964 0.0253 0.01064 0.0359 0.296
6          7.53  0.962 0.0263 0.01129 0.0376 0.300

Remove original category probabilities just for clarity, reshape the data, and plot.

d$normal <- NULL
d$mild <- NULL
d$severe <- NULL
d <- d %>% pivot_longer(c(nm,ms), names_to = "transition", values_to = "probability")
head(d)

# A tibble: 6 × 3
  exposure.time transition probability
          <dbl> <chr>            <dbl>
1          5    nm              0.0297
2          5    ms              0.282 
3          5.51 nm              0.0312
4          5.51 ms              0.285 
5          6.01 nm              0.0327
6          6.01 ms              0.289

tail(d)

# A tibble: 6 × 3
  exposure.time transition probability
          <dbl> <chr>            <dbl>
1          54.0 nm               0.774
2          54.0 ms               0.690
3          54.5 nm               0.782
4          54.5 ms               0.694
5          55   nm               0.791
6          55   ms               0.698

p <- ggplot(d, aes(x = exposure.time, y = probability, color = transition)) + 
  geom_line() + theme_minimal() + 
  labs(x = "Exposure (Years)", y = "Probability", color = "Transition")
plot(p)

Example: Consider again the firstsex data.

firstsex <- read.table("https://stats.idre.ucla.edu/stat/examples/alda/firstsex.csv", 
  sep = ",", header = TRUE)
firstsex$parent_trans <- factor(firstsex$pt, 
  levels = c(0,1), labels = c("no","yes"))

The discrete survival model can be estimated using vglm if the right-censoring is always at the highest observed time, which it is here (grade 12). We need to create a new “grade” for those cases where sex had not occurred for the first time in grade 12 (this represents first sex after HS, if at all).

firstsex$time <- ifelse(firstsex$censor == 1, 13, firstsex$time)

Probabilities of the form $P(Y = y|Y \ge y)$ can be modeled using vglm if we use the sratio family. Here we do not need to worry about the ordering of the response variable because it is implied by the ordering of the grade numbers.

m <- vglm(time ~ parent_trans, data = firstsex,
  family = sratio(link = "logitlink", parallel = TRUE))
summary(m)


Call:
vglm(formula = time ~ parent_trans, family = sratio(link = "logitlink", 
    parallel = TRUE), data = firstsex)

Coefficients: 
                Estimate Std. Error z value Pr(>|z|)    
(Intercept):1     -2.994      0.318   -9.43  < 2e-16 ***
(Intercept):2     -3.700      0.420   -8.81  < 2e-16 ***
(Intercept):3     -2.281      0.273   -8.36  < 2e-16 ***
(Intercept):4     -1.823      0.258   -7.06  1.7e-12 ***
(Intercept):5     -1.654      0.269   -6.15  7.8e-10 ***
(Intercept):6     -1.179      0.270   -4.36  1.3e-05 ***
parent_transyes    0.874      0.218    4.02  5.9e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of linear predictors:  6 

Names of linear predictors: logitlink(P[Y=1|Y>=1]), logitlink(P[Y=2|Y>=2]), 
logitlink(P[Y=3|Y>=3]), logitlink(P[Y=4|Y>=4]), logitlink(P[Y=5|Y>=5]), logitlink(P[Y=6|Y>=6])

Residual deviance: 635 on 1073 degrees of freedom

Log-likelihood: -317 on 1073 degrees of freedom

Number of Fisher scoring iterations: 5 

Warning: Hauck-Donner effect detected in the following estimate(s):
'(Intercept):2'


Exponentiated coefficients:
parent_transyes 
            2.4

Note that specifying parallel = TRUE means that the effect of pt is the same at each grade (i.e., no interaction between grade and pt). Note that the odds ratio for parenting transition (pt) is the same as what we obtained in the previous lecture.

exp(cbind(coef(m), confint(m)))

                        2.5 % 97.5 %
(Intercept):1   0.0501 0.0269 0.0933
(Intercept):2   0.0247 0.0109 0.0563
(Intercept):3   0.1022 0.0599 0.1744
(Intercept):4   0.1616 0.0974 0.2681
(Intercept):5   0.1912 0.1129 0.3240
(Intercept):6   0.3076 0.1811 0.5223
parent_transyes 2.3956 1.5641 3.6691

The other parameters are not the same because this model is parameterized differently, using indicator variables for all grades (called (Intercept) here for reasons we will see in the next lecture) and then dropping the overall intercept term.

firstsex <- read.table("https://stats.idre.ucla.edu/stat/examples/alda/firstsex.csv", 
  sep = ",", header = TRUE)
firstsex$parent_trans <- factor(firstsex$pt, 
  levels = c(0,1), labels = c("no","yes"))
firstsex$status <- ifelse(firstsex$censor == 1, 0, 1)
firstsex <- trtools::dsurvbin(firstsex, "time", "status")
m <- glm(y ~ -1 + t + parent_trans, 
  family = binomial, data = firstsex)
summary(m)$coefficients

                Estimate Std. Error z value Pr(>|z|)
t7                -2.994      0.318   -9.43 4.07e-21
t8                -3.700      0.420   -8.80 1.37e-18
t9                -2.281      0.272   -8.37 5.55e-17
t10               -1.823      0.258   -7.05 1.77e-12
t11               -1.654      0.269   -6.15 7.89e-10
t12               -1.179      0.272   -4.34 1.42e-05
parent_transyes    0.874      0.217    4.02 5.86e-05

exp(cbind(coef(m), confint(m)))

                         2.5 % 97.5 %
t7              0.0501 0.02588 0.0903
t8              0.0247 0.00991 0.0526
t9              0.1022 0.05847 0.1705
t10             0.1616 0.09543 0.2635
t11             0.1912 0.11043 0.3181
t12             0.3076 0.17740 0.5163
parent_transyes 2.3956 1.57661 3.7041

Example: We can estimate the discrete survival model for the cycles data as follows. Note that we do not need to do anything for the censoring here because all observations censored at 12 are recorded at 13 (much like we did with the firstsex data).

library(trtools)
m <- vglm(cycles ~ mother, data = cycles, 
   family = sratio(link = "logitlink", parallel = TRUE ~ 1 + mother))
summary(m)


Call:
vglm(formula = cycles ~ mother, family = sratio(link = "logitlink", 
    parallel = TRUE ~ 1 + mother), data = cycles)

Coefficients: 
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -1.242      0.116   -10.7  < 2e-16 ***
mothernonsmoker    0.541      0.129     4.2  2.7e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of linear predictors:  12 

Names of linear predictors: logitlink(P[Y=1|Y>=1]), logitlink(P[Y=2|Y>=2]), 
logitlink(P[Y=3|Y>=3]), logitlink(P[Y=4|Y>=4]), logitlink(P[Y=5|Y>=5]), logitlink(P[Y=6|Y>=6]), 
logitlink(P[Y=7|Y>=7]), logitlink(P[Y=8|Y>=8]), logitlink(P[Y=9|Y>=9]), logitlink(P[Y=10|Y>=10]), 
logitlink(P[Y=11|Y>=11]), logitlink(P[Y=12|Y>=12])

Residual deviance: 2258 on 7030 degrees of freedom

Log-likelihood: -1129 on 7030 degrees of freedom

Number of Fisher scoring iterations: 5 

No Hauck-Donner effect found in any of the estimates

exp(cbind(coef(m), confint(m)))

                      2.5 % 97.5 %
(Intercept)     0.289  0.23  0.363
mothernonsmoker 1.718  1.33  2.213

Same results as in the last lecture except confidence interval is slightly different (the confidence interval here is a Wald confidence interval as opposed to a profile likelihood interval). Note that the argument parallel = TRUE ~ 1 + mother forces the parameters to be the same across categories (i.e., cycles).