You can also download a PDF copy of this lecture.
Recall that a link function g in a generalized linear model (or extensions thereof) is a function so that g[E(Y)]=η where η=β0+β1x1+⋯+βkxk. This gives us the mean structure E(Y)=g−1(η). Note: To keep the notation clean I will omit the observation index i.
It is useful to distinguish between two families of link functions: those for binomial regression and those for regression models for positive (or non-negative) response variables.
Let Y be a proportion or binary response variable. The logit link function is defined as g[E(Y)]=log[E(Y)1−E(Y)], which implies the mean structure E(Y)=eη1+eη. Properties:
“Symmetric” in the sense that g(z)=−g(1−z). Whether we model the probability of one event or the other is only a choice of parameterization.
Allows (easy) use of odds ratios.
Connects logistic regression to a variety of models (e.g., multinomial regression) and designs (e.g., case-control studies).
Known as logit
in functions like glm
. The
inverse function ez/(1+ez) is
known to R as plogis
. This is the default link function for
the binomial
family.
Example: It is useful to note the “symmetry” of the model in that we can model either the event or its complement.
library(trtools) # for bliss data
# modeling the probability of beetle death
m <- glm(cbind(dead, exposed - dead) ~ concentration,
family = binomial(link = logit), data = bliss)
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.808 1.2898 -11.5 1.63e-30
concentration 0.249 0.0214 11.7 2.25e-31
# modeling the probability of beetle survival
m <- glm(cbind(exposed - dead, dead) ~ concentration,
family = binomial(link = logit), data = bliss)
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 14.808 1.2898 11.5 1.63e-30
concentration -0.249 0.0214 -11.7 2.25e-31
Defined as g[E(Y)]=log[−log[1−E(Y)]], which implies the mean structure E(Yi)=1−e−eηi. Known as cloglog
to functions like
glm
.
Properties:
Not symmetric. Approaches 1 faster than 0.
Variety of applications in bio-assay, epidemiology, and survival analysis.
Can be used when the observed response is assumed to depend on an underlying Poisson-distributed count. Let Z be an unobserved response with a Poisson distribution with mean eη, and let Yi={1,if Z>0,0,if Z=0. Because P(Z=z)=λze−λz! then P(Y=1)=1−P(Z=0)=1−e−λ. Assuming a Poisson regression model such that η=exp(β0+β1x1+⋯+βkxk) then P(Y=1)=1−e−eη.
Example: Suppose have the Poisson regression model E(Zi)=exp(β0+β1xi), but we only observe if the count is greater than zero so what we observe is the binary response Yi={1,if Zi>0,0,if Zi=0. We can make inferences concerning this model using Yi in a binomial regression model with the complementary log-log link function.
set.seed(123) # to reproduce results
n <- 1000 # sample size
b0 <- 0
b1 <- 1
# simulate data
d <- data.frame(x = seq(-2, 2, length = n))
d$z <- rpois(n, lambda = exp(b0 + b1*d$x))
# Poisson regression
m <- glm(z ~ x, family = poisson, data = d)
cbind(summary(m)$coefficients, confint(m))
Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
(Intercept) 0.00072 0.0382 0.0188 9.85e-01 -0.0754 0.0745
x 0.99244 0.0281 35.3029 5.31e-273 0.9379 1.0481
d$y <- ifelse(d$z > 0, 1, 0) # create binary response
# binomial regression with complementary log-log link function
m <- glm(y ~ x, family = binomial(link = cloglog), data = d)
cbind(summary(m)$coefficients, confint(m))
Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
(Intercept) -0.0489 0.0520 -0.94 3.47e-01 -0.152 0.0519
x 0.9987 0.0574 17.40 8.76e-68 0.890 1.1129
Note the similarities (and differences) in the inferences.
Defined as g[E(Y)]=log[−logE(Y)] which implies the mean structure E(Y)=e−eη. Properties:
Not symmetric. Approaches 1 slower than it approaches 0.
Not usually available in software, but is available as
loglog
from the trtools package.
Related to the complementary log-log link function in that if the probability of an event occurring is modeled using a complementary log-log link function, then the probability of an event not occurring can be modeled using the log-log link function.
Example: The following two models are equivalent.
library(trtools) # for bliss data and loglog link function
m <- glm(cbind(dead, exposed - dead) ~ concentration,
family = binomial(link = loglog), data = bliss)
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.058 0.7281 -12.4 1.56e-35
concentration 0.162 0.0127 12.7 3.36e-37
m <- glm(cbind(exposed - dead, dead) ~ concentration,
family = binomial(link = cloglog), data = bliss)
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.058 0.7281 12.4 1.56e-35
concentration -0.162 0.0127 -12.7 3.36e-37
Example: Consider again the bliss
data.
The link function is specified through
family = binomial(link = X)
where X
is
logit
, cloglog
, and loglog
. Note
that we can use these link functions for other “families” such as
quasi
and quasibinomial
.
There are other link functions available in R including
probit
and cauchit
. In theory, there are an
infinity of link functions that could be considered. But here we will
consider some nicely motivated link functions that have been used in
practice.
Logistic regression has the mean structure E(Y)=eη1+eη, such that E(Y)→1 as η→∞ and E(Y)→0 as η→−∞. And recall that E(Y) is also the probability of the event. In some cases it may not be true that E(Y)→0 as η→−∞.
For simplicity, consider the case where Y is a binary response so that Y=0 or Y=1. Let N be the event that we observe a natural response and ˉN the event that we do not observe a natural response. Then P(Y=1)=P(Y=1|N)P(N)+P(Y=1|ˉN)P(ˉN)=P(N)+P(Y=1|ˉN)P(ˉN), noting that P(Y=1|N)=1. If we let P(N)=γ and assume that the “non-natural” response follows a logistic regression model so that P(Y=1|ˉN)=eη1+eη, then P(Y=1)=γ+(1−γ)eη1+eη, noting that P(ˉN)=1−P(N)=1−γ. In general we can write E(Y)=γ+(1−γ)eη1+eη,
or log[E(Y)−γ1−E(Y)]=η. This effectively requires us to modify the link function if γ>0. For making inferences we must consider two cases: (1) γ is known and can be specified as part of the link function, and (2) γ is unknown and must be estimated as an additional parameter. Also note that we can have a natural response variants of other link functions.
Example: Consider again the bliss
data.
Suppose we know that γ = 0.1 (i.e., there is a
probability of 0.1 that a beetle will die naturally during the
experiment and not due to exposure to carbon disulphide).
m <- glm(cbind(dead, exposed - dead) ~ concentration,
data = bliss, family = binomial)
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.808 1.2898 -11.5 1.63e-30
concentration 0.249 0.0214 11.7 2.25e-31
Using the natural response link function sometimes requires starting values. An easy way to specify them is to take estimates when using the logit link function.
m.nr <- glm(cbind(dead, exposed - dead) ~ concentration, data = bliss,
family = binomial(link = logitnr(0.1)), start = c(-14.8, 0.25))
summary(m.nr)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -19.128 2.123 -9.01 2.03e-19
concentration 0.314 0.034 9.22 2.98e-20
Note that logitnr
is from the trtools
package. The interpretation of an odds ratio for this model would be for
the odds of death due to carbon disulphide (rather than
“natural causes”).
d <- data.frame(concentration = seq(40, 80, length = 100))
d$yhat <- predict(m, newdata = d, type = "response")
d$yhat.nr <- predict(m.nr, newdata = d, type = "response")
p <- ggplot(bliss, aes(x = concentration, y = dead/exposed)) +
geom_point() + ylim(0,1) + labs(x = "Concentration", y = "Proportion") +
geom_line(aes(y = yhat), data = d) + theme_classic() +
geom_line(aes(y = yhat.nr), data = d, linetype = 2) +
geom_hline(yintercept = 0.1, color = grey(0.5))
plot(p)
When γ is unknown the model is
no longer a traditional GLM because there is an unknown parameter in the
link function. But we can make inferences using maximum likelihood. This
requires a bit of programming.
Example: Consider again the bliss
data
and model with the logit link function with a natural response. The
log-likelihood function is logL=n∑i=1log[mi!ri!(mi−ri)!prii(1−pi)mi−ri] where pi=γ+(1−γ)eηi1+eηi and ηi=β0+β1di, where di is dose.
We can program the negative log-likelihood function.
nloglik <- function(beta0, beta1, gamma) {
eta <- beta0 + beta1 * concentration
prb <- gamma + (1 - gamma) * plogis(eta)
return(-sum(dbinom(dead, exposed, prb, log = TRUE)))
}
Now we can use another function (mle2
) that will solve
the maximization (minimization) problem.
library(bbmle)
Warning: package 'bbmle' was built under R version 4.4.3
m <- mle2(nloglik, data = bliss, method = "L-BFGS-B",
start = list(beta0 = -15, beta1 = 0.25, gamma = 0.1),
lower = list(beta0 = -Inf, beta1 = -Inf, gamma = 0),
upper = list(beta0 = Inf, beta1 = Inf, gamma = 1))
summary(m)
Maximum likelihood estimation
Call:
mle2(minuslogl = nloglik, start = list(beta0 = -15, beta1 = 0.25,
gamma = 0.1), method = "L-BFGS-B", data = bliss, lower = list(beta0 = -Inf,
beta1 = -Inf, gamma = 0), upper = list(beta0 = Inf, beta1 = Inf,
gamma = 1))
Coefficients:
Estimate Std. Error z value Pr(z)
beta0 -19.6470 3.1590 -6.22 5e-10 ***
beta1 0.3214 0.0487 6.60 4e-11 ***
gamma 0.1115 0.0495 2.25 0.024 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-2 log L: 50.5
trtools::lincon(m, a = c(0,1,0), tf = exp, fest = coef, fcov = vcov) # odds ratio
estimate lower upper
(0,1,0),0 1.38 1.25 1.52
We can also consider a variation on the “natural response” link
function where the upper asymptote is δ<1, or both lower and upper
asymptotes of 0<γ<δ<1. The model is then E(Y)=γ+(δ−γ)eη1+eη. A model with just an upper asymptote of δ<1 would be E(Y)=δeη1+eη. An example of an application with an upper asymptote of δ < 0 might be seed germination
(some seeds are “duds” and will never germinate). In the
trtools package there is a link function called
logitnnr
which works like logitnr
except it
specifies an upper asymptote of δ<0.
Using a log link function for binomial regression allows us to interpret exponentiated parameters/contrasts in terms of risk ratios (i.e., ratios of probabilities) instead of odds ratios.
Example: Consider the aflatoxicol study.
library(Sleuth3)
p <- ggplot(ex2116, aes(x = Dose, y = Tumor/Total)) +
geom_point(alpha = 0.5) + theme_classic() + ylim(0, 1) +
labs(x = "Dose (ppm)", y = "Proportion of Trout With Liver Tumors")
plot(p)
Here we will treat dose as a factor. With the logit link function we can
make inferences using odds ratios.
m <- glm(cbind(Tumor, Total - Tumor) ~ factor(Dose),
family = binomial(link = logit), data = ex2116)
trtools::contrast(m, tf = exp,
a = list(Dose = c(0.025,0.05,0.1,0.25)),
b = list(Dose = 0.01),
cnames = paste(c(0.025,0.05,0.1,0.25), "vs 0.01"))
estimate lower upper
0.025 vs 0.01 7.94 5.01 12.6
0.05 vs 0.01 22.92 14.45 36.4
0.1 vs 0.01 48.91 30.24 79.1
0.25 vs 0.01 70.84 42.84 117.1
But using a log link gives us risk ratios.
m <- glm(cbind(Tumor, Total - Tumor) ~ factor(Dose),
family = binomial(link = log), data = ex2116)
trtools::contrast(m, tf = exp,
a = list(Dose = c(0.025,0.05,0.1,0.25)),
b = list(Dose = 0.01),
cnames = paste(c(0.025,0.05,0.1,0.25), "vs 0.01"))
estimate lower upper
0.025 vs 0.01 5.30 3.55 7.91
0.05 vs 0.01 8.89 6.04 13.07
0.1 vs 0.01 10.99 7.50 16.09
0.25 vs 0.01 11.74 8.03 17.18
But be careful using the log link function with binomial regression, particularly with quantitative explanatory variables. It can produce estimated “probabilities” outside (0,1).
Suppose that the probability of “survival” each time unit (e.g., day)
is π, and assume that π=eη1+eη, where η=β0+β1x1+β2x2+⋯+βkxk as in a logistic
regression model. But we do not observe survival for each time unit,
only if there is survival for at least t units. Let Y = 1 if there is survival for at least
t units, and Y = 0 otherwise. Then it can be shown
that E(Y)=P(Y=1)=(eη1+eη)t. This implies the logistic-exposure link function
log[E(Y)1/t1−E(Y)1/t]=η. I have added this link function to trtools as
logitexp
.
Example: Suppose we have the following data.
library(dplyr)
set.seed(123)
b0 <- 2
b1 <- 1
d <- data.frame(x = seq(0, 3, length = 1000)) %>%
mutate(days = sample(1:10, n(), replace = TRUE)) %>%
mutate(y = rbinom(n(), 1, plogis(b0 + b1*x)^days))
head(d)
x days y
1 0.00000 3 1
2 0.00300 3 1
3 0.00601 10 0
4 0.00901 2 0
5 0.01201 6 1
6 0.01502 5 0
We can estimate the logistic-exposure model as follows.
library(trtools) # for logitexp link function
m <- glm(y ~ x, family = binomial(link = logitexp(exposure = d$days)), data = d)
cbind(summary(m)$coefficients, confint(m))
Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
(Intercept) 1.90 0.12 15.8 2.40e-56 1.662 2.13
x 1.04 0.10 10.3 4.32e-25 0.844 1.24
Suppose Z has a binomial distribution consistent with a logistic regression model so that E(Z/m)=eη1+eη, where η=β0+β1x1+β2x2+⋯+βkxk. But we only observe Y where Y={1,if Z>0,0,if Z=0. This implies that E(Y)=P(Y=1)=1−(1−π)m, where π=E(Z/m). Then the link function for Y is log[(1−π)1/m−1]=η. This can be applied in composite sampling designs where a composite of m samples (e.g., blood, water, soil) will test positive collectively if any one of them would test positive individually.
Example: Suppose we have the following data.
library(dplyr)
set.seed(123)
d <- data.frame(m = sample(c(25,50,100,400), 100, replace = TRUE)) %>%
mutate(x = seq(-2, 2, length = n())) %>%
mutate(z = rbinom(n(), m, plogis(-5 + x/5)))
head(d)
m x z
1 100 -2.00 0
2 100 -1.96 0
3 100 -1.92 0
4 50 -1.88 1
5 100 -1.84 0
6 50 -1.80 1
And assume that the model for Z is a logistic regression model.
m <- glm(cbind(z, m - z) ~ x, family = binomial(link = logit), data = d)
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.031 0.118 -42.57 0.0000
x 0.244 0.102 2.39 0.0171
But now suppose we only observe Y as defined above.
d <- d %>% mutate(y = ifelse(z > 0, 1, 0))
head(d)
m x z y
1 100 -2.00 0 0
2 100 -1.96 0 0
3 100 -1.92 0 0
4 50 -1.88 1 1
5 100 -1.84 0 0
6 50 -1.80 1 1
The model for Y is a binomial regression model with the link function given above.
library(trtools) # for the logitcomp link function
m <- glm(y ~ x, family = binomial(link = logitcomp(d$m)), data = d)
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.015 0.173 -28.986 9.76e-185
x 0.143 0.148 0.969 3.33e-01
Suppose that Z is a binary variable that is an indicator variable for the presence of a disease such that Z={1,if the subject has the disease,0,if the subject does not have the disease. But assume we do not observe Z but instead observe only Y which is an indicator variable for a diagnostic test for the disease such that Y={1,if the subject tests positive for the disease,0,if the subject tests negative for the disease. Diagnostic test accuracy is defined in terms of the sensitivity and specificity of the test, where sensitivity=P(test positive|disease)=P(Y=1|Z=1) and specificity=P(test negative|no disease)=P(Y=0|Z=0). This implies that P(Y=1)=P(Y=1|Z=1)⏟sensitivityP(Z=1)+[1−P(Y=0|Z=0)⏟specificity]P(Z=0). So if the model for the disease is a binary logistic regression model such that E(Z)=P(Z=1)=eη1+eη, where η=β0+β1x1+⋯+βkxk, then we have that the model for the test result is E(Y)=P(Y=1)=sensitivity×eη1+eη+(1−specificity)(1−eη1+eη). The link function can then be written as log[E(Y)−(1−specificity)sensitivity−E(Y)]=η. Example: Suppose we have the following data.
library(dplyr)
set.seed(111)
d <- data.frame(x = seq(-3, 3, length = 1000)) %>%
mutate(z = rbinom(n(), 1, plogis(x)))
head(d)
x z
1 -3.00 0
2 -2.99 0
3 -2.99 0
4 -2.98 0
5 -2.98 0
6 -2.97 0
tail(d)
x z
995 2.97 1
996 2.98 0
997 2.98 1
998 2.99 1
999 2.99 1
1000 3.00 1
Here is a model for the disease indicator.
m <- glm(z ~ x, family = binomial(link = logit), data = d)
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.122 0.0826 1.48 1.39e-01
x 1.034 0.0618 16.73 7.47e-63
But suppose we have to use a diagnostic test with a sensitivity of 0.8 and a specificity of 0.9. Here is the data with the diagnostic test result.
d <- d %>%
mutate(y = rbinom(n(), 1, ifelse(z == 1, 0.8, 1 - 0.9)))
head(d)
x z y
1 -3.00 0 1
2 -2.99 0 0
3 -2.99 0 0
4 -2.98 0 0
5 -2.98 0 0
6 -2.97 0 0
tail(d)
x z y
995 2.97 1 1
996 2.98 0 0
997 2.98 1 1
998 2.99 1 0
999 2.99 1 1
1000 3.00 1 1
Here is the model using the diagnostic test as the response variable.
library(trtools) # for logiterr link function
m <- glm(y ~ x, family = binomial(link = logiterr(0.8,0.9)), data = d)
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0844 0.127 0.663 5.07e-01
x 0.9658 0.106 9.083 1.06e-19
The “power” family of link functions is defined as g[E(Y)]={E(Y)q,if q≠0,logE(Y)if q=0, so that E(Y)={η1/q,if q≠0,eη,if q=0. This includes several common link functions such as log (q = 0), identity (q = 1), reciprocal (q = -1), and square root (q = 0.5). These can be specified as
link = log
, link = identity
,
link = reciprocal
, and link = sqrt
,
respectively. Alternatively you can use
family = tweedie(link = q, variance = p)
for any value of
q (recall that tweedie
is from the statmod package).
Example: Here are models with different power link
functions for the ceriodaphniastrain
data.
One could try to estimate the power of the link function using maximum likelihood (much like we did for the natural response link function). But in that case it is useful to use the link function g[E(Y)]={E(Y)q−1q,if q≠0,logE(Y),if q=0 to make the function continuous with respect to q. This particular form uses the fact that lim It can be shown that this is a re-parameterization of the power link function given above, but this form is better for numerical reasons when trying to estimate q as a parameter.