The likelihood function computes the likelihood of \(y_1, y_2, \dots, y_n\) as a function of all unknown parameters. If the distribution of the response variable is discrete then the likelihood is the same as the probability of \(y_1, y_2, \dots, y_n\).
Example: Suppose \(Y_i\) has a Poisson distribution such that \[ P(Y_i = y) = \frac{\lambda_i^ye^{-\lambda_i}}{y!} \] where \[ \lambda_i = \exp(\beta_0 + \beta_1 x_i), \] so that the model for \(Y_i\) is a Poisson regression model. The likelihood function is then \[ L(\beta_0,\beta_1) = \frac{\lambda_1^{y_1}e^{-\lambda_1}}{y_1!} \times \frac{\lambda_2^{y_2}e^{-\lambda_2}}{y_2!} \times \cdots \times \frac{\lambda_n^{y_n}e^{-\lambda_n}}{y_n!}, \] or \[ L(\beta_0,\beta_1) = \prod_{i=1}^n \frac{\lambda_i^{y_i}e^{-\lambda_i}}{y_i!}, \] where, again, \[ \lambda_i = \exp(\beta_0 + \beta_1 x_i). \]
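As a quick illustration, here is a minimal sketch (using made-up data and the hypothetical names x and y) of the Poisson regression likelihood written as an R function of \(\beta_0\) and \(\beta_1\) using dpois.
x <- c(0, 0.5, 1, 1.5, 2)  # hypothetical covariate values
y <- c(90, 45, 20, 8, 4)   # hypothetical observed counts
# likelihood as a function of the two parameters
poisson.lik <- function(beta0, beta1) {
  lambda <- exp(beta0 + beta1 * x)
  prod(dpois(y, lambda))
}
poisson.lik(4.5, -1.5)  # likelihood at one candidate pair of parameter values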
Example: Suppose \(Y_i\) has a binomial distribution such that \[ P(Y_i = y) = \frac{m_i!}{y!(m_i-y)!}p_i^y(1-p_i)^{m_i-y}, \] where \[ p_i = \frac{e^{\beta_0+\beta_1x_i}}{1 + e^{\beta_0+\beta_1x_i}}, \] so that the model for \(Y_i\) is a logistic regression model. The likelihood function is then \[ L(\beta_0,\beta_1) = \frac{m_1!}{y_1!(m_1-y_1)!}p_1^{y_1}(1-p_1)^{m_1-y_1} \times \frac{m_2!}{y_2!(m_2-y_2)!}p_2^{y_2}(1-p_2)^{m_2-y_2} \times \cdots \times \frac{m_n!}{y_n!(m_n-y_n)!}p_n^{y_n}(1-p_n)^{m_n-y_n} \] or \[ L(\beta_0,\beta_1) = \prod_{i=1}^n \frac{m_i!}{y_i!(m_i-y_i)!}p_i^{y_i}(1-p_i)^{m_i-y_i}, \] where, again, \[ p_i = \frac{e^{\beta_0+\beta_1x_i}}{1 + e^{\beta_0+\beta_1x_i}}. \]
Note: If the response variable is discrete as it is in Poisson and logistic regression, then the likelihood function gives the probability of the observed responses as a function of the model parameters.
Example: Suppose \(Y_i\) has a normal distribution where the probability density function of \(Y_i\) is \[ f(y) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y-\mu_i)^2}{2\sigma^2}}, \] where \[ \mu_i = \beta_0 + \beta_1 x_i, \] so that the model for \(Y_i\) is a normal linear regression model. The likelihood function is then \[ L(\beta_0,\beta_1,\sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i-\mu_i)^2}{2\sigma^2}}. \] Frequently we use the log of the likelihood function or the log-likelihood function.
Example: The log-likelihood for the Poisson regression model above is \[ \log L(\beta_0,\beta_1) = \sum_{i=1}^n \log\left(\frac{\lambda_i^{y_i}e^{-\lambda_i}}{y_i!} \right) = \sum_{i=1}^n y_i\log(\lambda_i) - \sum_{i=1}^n \lambda_i - \sum_{i=1}^n \log(y_i!). \] Example: The log-likelihood for the normal regression model above is \[ \log L(\beta_0,\beta_1,\sigma^2) = -\frac{n}{2}\log(2\pi) - n\log(\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \mu_i)^2. \]
The maximum likelihood estimates (MLEs) of the model parameters are those values of the parameters that maximize the likelihood of the data — i.e., the parameter values that make the values of \(y_1, y_2, \dots, y_n\) we observed the most likely values to have occurred. Finding these estimates is an optimization problem — i.e., find the values of the parameters that maximize the likelihood (or log-likelihood).
Example: In the normal linear model above, it can be shown that the MLEs of \(\beta_0\) and \(\beta_1\) are those values that minimize \[ \sum_{i=1}^n (y_i - \mu_i)^2 \] where \(\mu_i = \beta_0 + \beta_1 x_i\). So the MLEs are also the least squares estimators. The MLE of \(\sigma^2\), however, is \[ \hat\sigma^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n} \] where \(\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i\), whereas in practice we typically use the unbiased estimator \[ \hat\sigma^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-p}, \] where \(p\) is the number of \(\beta_j\) parameters (\(p = 2\) in the model above).
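The difference between the two estimators of \(\sigma^2\) can be seen with a small sketch (simulated data and assumed names below); note that the sigma function applied to an lm object returns the square root of the \(n - p\) version.
set.seed(1)
x <- runif(30)
y <- 1 + 2 * x + rnorm(30, sd = 0.5)
fit <- lm(y ~ x)
n <- length(y); p <- 2
sum(resid(fit)^2) / n        # MLE of the error variance
sum(resid(fit)^2) / (n - p)  # unbiased estimator, equals sigma(fit)^2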
Except in the case of normal linear models and a few very simple nonlinear models, MLEs must be found numerically using iterative algorithms (similar to those used in nonlinear regression).
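Here is a minimal sketch (with simulated data and assumed names) of what such a numerical approach looks like for a Poisson regression model: minimize the negative log-likelihood with optim and compare the result with the estimates from glm, which uses Fisher scoring.
set.seed(2)
x <- runif(100, 0, 2)
y <- rpois(100, exp(4 - 1.5 * x))
# negative log-likelihood of the Poisson regression model
negloglik <- function(beta) {
  lambda <- exp(beta[1] + beta[2] * x)
  -sum(dpois(y, lambda, log = TRUE))
}
optim(c(0, 0), negloglik)$par       # numerical MLEs of beta0 and beta1
coef(glm(y ~ x, family = poisson))  # MLEs computed by glm (approximately equal)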
It is usually not possible to determine an exact method for estimating the standard error of an MLE, computing a confidence interval based on ML, or conducting a significance test based on ML. However, there exist several asymptotic methods that give approximate results:
Likelihood ratio tests and profile likelihood confidence intervals.
Wald tests and confidence intervals.
Score (Lagrange multiplier) tests (not discussed).
The likelihood ratio test statistic is \[ 2\log\left(L/L_n\right) = 2\log(L) - 2\log(L_n) \stackrel{a}{\sim} \chi^2_{(r)}, \] where \(L\) and \(L_n\) are the likelihood functions for the model and the null model, respectively, evaluated at the MLEs, and \(r\) is the degrees of freedom, equal to the number of parameters constrained by the null hypothesis (i.e., the difference in the number of parameters between the two models).
Example: Consider the Poisson regression model for the ceriodaphniastrain data and the null hypothesis that there is no difference in the expected number of daphnia between the two strains when controlling for dose.
library(trtools) # for ceriodaphniastrain data
ceriodaphniastrain$strain <- factor(ceriodaphniastrain$strain, labels = c("a","b"))
m <- glm(count ~ strain + concentration, family = poisson, data = ceriodaphniastrain)
m.null <- glm(count ~ concentration, family = poisson, data = ceriodaphniastrain)
anova(m.null, m, test = "LRT") # has very little to do with ANOVA
Analysis of Deviance Table
Model 1: count ~ concentration
Model 2: count ~ strain + concentration
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 68 119.0
2 67 86.4 1 32.6 1.1e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
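The same test statistic and p-value can be reproduced directly from the log-likelihoods of the two models, which is a sketch of what anova is doing here.
tstat <- as.numeric(2 * (logLik(m) - logLik(m.null)))
tstat                                      # matches the Deviance column above
pchisq(tstat, df = 1, lower.tail = FALSE)  # matches Pr(>Chi) above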
The “residual deviance” (or just “deviance”) is related to the log-likelihood of the model. It is defined as \[ D = 2\log L_s - 2\log L \ge 0, \] where \(L\) and \(L_s\) are the likelihoods of the model in question and a “saturated” model, respectively. The saturated model is a “best possible” model where each unique combination of covariate values is a distinct level of a factor. Thus the deviance can be viewed in a way as the “lack of fit” of the model to the data. The residual deviance can sometimes be obtained using the summary or deviance functions.
summary(m) # shows residual deviance
Call:
glm(formula = count ~ strain + concentration, family = poisson,
data = ceriodaphniastrain)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.4546 0.0391 113.82 < 2e-16 ***
strainb -0.2750 0.0484 -5.68 1.3e-08 ***
concentration -1.5431 0.0466 -33.11 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1359.381 on 69 degrees of freedom
Residual deviance: 86.376 on 67 degrees of freedom
AIC: 416
Number of Fisher Scoring iterations: 4
deviance(m) # extracts only residual deviance
[1] 86.4
The likelihood ratio test statistic can be expressed in terms of the (residual) deviance of the model and the null model: \[ \underbrace{2(\log L_s - \log L_n)}_{D_n} - \underbrace{2(\log L_s - \log L)}_{D} = 2 \log L - 2 \log L_n \stackrel{a}{\sim} \chi^2_{(r)}, \] where \(D\) and \(D_n\) are the deviance of the model and null model, respectively. The degrees of freedom (\(r\)) is the difference in the residual degrees of freedom between the two models. The residual degrees of freedom equal \(n\) minus the number of parameters for the expected response.
The following shows how the deviance relates to what is done by anova.
anova(m.null, m, test = "LRT")
Analysis of Deviance Table
Model 1: count ~ concentration
Model 2: count ~ strain + concentration
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 68 119.0
2 67 86.4 1 32.6 1.1e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(m) # (residual) deviance of model
[1] 86.4
deviance(m.null) # (residual) deviance of null model
[1] 119
deviance(m.null) - deviance(m) # test statistic
[1] 32.6
Note: For models where \(Y_i\) is assumed to have a normal distribution, the residual deviance is essentially the residual (error) sum of squares, so in some sense the “analysis of deviance” can be viewed as a generalization of the “analysis of variance” in linear models.
Consider a significance test with hypotheses \(H_0\!: \theta = \theta_0\) versus \(H_a\!: \theta \neq \theta_0\) where \(\theta\) is some parameter of the model. A confidence interval \((\theta_l,\theta_h)\) for \(\theta\) can be used to conduct the test.
If \(\theta_0\) is in \((\theta_l, \theta_h)\), do not reject \(H_0\).
If \(\theta_0\) is not in \((\theta_l, \theta_h)\), reject \(H_0\).
Thus a confidence interval for \(\theta\) can be defined as all the values of \(\theta_0\) that are not rejected by a test of \(H_0\!: \theta = \theta_0\) versus \(H_a\!: \theta \neq \theta_0\).
A profile likelihood confidence interval is a confidence interval based on the likelihood ratio test. It is the set/range of all values of a parameter that would not be rejected by the likelihood ratio test.
Example: Consider again the ceriodaphniastrain data.
confint(m)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 4.38 4.53
strainb -0.37 -0.18
concentration -1.63 -1.45
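The following sketch illustrates the idea behind these intervals (using the m object and ceriodaphniastrain data from above, and an offset term to hold the strainb coefficient fixed at a hypothesized value): a value is inside the 95% profile likelihood interval exactly when the likelihood ratio test does not reject it at the 5% level.
# likelihood ratio test statistic for fixing the strainb coefficient at b0
lrt.stat <- function(b0) {
  m.fixed <- glm(count ~ concentration + offset(b0 * (strain == "b")),
    family = poisson, data = ceriodaphniastrain)
  as.numeric(2 * (logLik(m) - logLik(m.fixed)))
}
lrt.stat(coef(m)["strainb"]) < qchisq(0.95, df = 1)  # TRUE: the MLE is inside the interval
lrt.stat(0) < qchisq(0.95, df = 1)                   # FALSE: zero is outside the interval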
For glm objects (and some others we will talk about in the future) confint will compute profile likelihood confidence intervals by default. A similar approach is used by nls, and by glm when using quasi-likelihood (more on that later), but it is based on an \(F\) test statistic.
Wald tests and confidence intervals are based on the fact that MLEs have approximately normal sampling distributions. The Wald test statistic for a parameter \(\beta_j\) is \[ z = \frac{\hat{\beta}_j - \beta_j}{\widehat{\text{SE}}(\hat{\beta}_j)} \stackrel{a}{\sim} \text{N(0,1)} \ \ \ \text{or} \ \ \ z^2 = \left[\frac{\hat{\beta}_j - \beta_j}{\widehat{\text{SE}}(\hat{\beta}_j)}\right]^2 \stackrel{a}{\sim} \chi^2_{(1)}, \] where \(\beta_j\) is the value hypothesized by the null (frequently \(\beta_j = 0\)). The Wald confidence interval for \(\beta_j\) is \[ \hat{\beta}_j \pm z \times \widehat{\text{SE}}(\hat{\beta}_j). \] We can also apply this to a linear combination of parameters by replacing \(\beta_j\) and \(\hat\beta_j\) with \(\ell\) and \(\hat\ell\), respectively.
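As a sketch of these calculations (using the additive Poisson regression model m fit earlier), the Wald test and confidence interval for the strainb coefficient can be computed directly from its estimate and standard error.
est <- coef(m)["strainb"]
se <- sqrt(vcov(m)["strainb", "strainb"])
est / se                            # Wald z statistic for the null that the parameter is zero
2 * pnorm(-abs(est / se))           # two-sided p-value
est + c(-1, 1) * qnorm(0.975) * se  # 95% Wald confidence interval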
Wald test statistics and confidence intervals are reported by contrast, lincon, glmint, and nlsint from the trtools package, and by the functions in the emmeans package. The summary function reports Wald test statistics.
Note: In some cases a \(t\)-distribution is used as the approximate distribution of a Wald test statistic, and is also used when calibrating a confidence interval. In generalized linear models this usually happens for models in which we need to estimate the dispersion parameter. We do not need to estimate the dispersion parameter for Poisson or binomial/logistic regression.
Example: Consider again the ceriodaphniastrain data.
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.455 0.0391 113.82 0.00e+00
strainb -0.275 0.0484 -5.68 1.31e-08
concentration -1.543 0.0466 -33.11 2.06e-240
lincon(m)
estimate se lower upper tvalue df pvalue
(Intercept) 4.455 0.0391 4.38 4.53 113.82 Inf 0.00e+00
strainb -0.275 0.0484 -0.37 -0.18 -5.68 Inf 1.31e-08
concentration -1.543 0.0466 -1.63 -1.45 -33.11 Inf 2.06e-240
trtools::contrast(m,
a = list(strain = "a", concentration = c(0,1,2)),
b = list(strain = "b", concentration = c(0,1,2)),
cnames = c("0%","1%","2%"))
estimate se lower upper tvalue df pvalue
0% 0.275 0.0484 0.18 0.37 5.68 Inf 1.31e-08
1% 0.275 0.0484 0.18 0.37 5.68 Inf 1.31e-08
2% 0.275 0.0484 0.18 0.37 5.68 Inf 1.31e-08
trtools::contrast(m, tf = exp,
a = list(strain = "a", concentration = c(0,1,2)),
b = list(strain = "b", concentration = c(0,1,2)),
cnames = c("0%","1%","2%"))
estimate lower upper
0% 1.32 1.2 1.45
1% 1.32 1.2 1.45
2% 1.32 1.2 1.45
library(emmeans)
pairs(emmeans(m, ~strain|concentration,
at = list(concentration = c(0,1,2))),
type = "response", infer = TRUE)
concentration = 0:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
a / b 1.32 0.0637 Inf 1.2 1.45 1 5.680 <.0001
concentration = 1:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
a / b 1.32 0.0637 Inf 1.2 1.45 1 5.680 <.0001
concentration = 2:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
a / b 1.32 0.0637 Inf 1.2 1.45 1 5.680 <.0001
Confidence level used: 0.95
Intervals are back-transformed from the log scale
Tests are performed on the log scale
For Wald tests involving joint hypotheses, you can use the waldtest function from the lmtest package, which works similarly to how the anova function can be used for likelihood ratio tests.
m <- glm(count ~ strain * concentration, family = poisson, data = ceriodaphniastrain)
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.481 0.0435 103.01 0.00e+00
strainb -0.337 0.0670 -5.02 5.11e-07
concentration -1.598 0.0624 -25.59 1.86e-144
strainb:concentration 0.125 0.0939 1.34 1.82e-01
m.null <- glm(count ~ concentration, family = poisson, data = ceriodaphniastrain)
summary(m.null)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.33 0.0331 130.7 0.00e+00
concentration -1.54 0.0466 -33.1 2.06e-240
library(lmtest)
waldtest(m.null, m, test = "Chisq")
Wald test
Model 1: count ~ concentration
Model 2: count ~ strain * concentration
Res.Df Df Chisq Pr(>Chisq)
1 68
2 66 2 34 4.1e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Wald and likelihood ratio approaches are asymptotically equivalent and tend to produce relatively similar results for larger \(n\), but for smaller \(n\) they may produce somewhat different results.
Likelihood ratio tests/intervals are generally more accurate than Wald tests/intervals. Wald tests/intervals assume that the sampling distribution of the estimator is approximately normal, whereas likelihood ratio tests/intervals assume that the sampling distribution of the likelihood ratio test statistic has a \(\chi^2\) distribution. The latter assumption is easier to meet since it does not depend on the parameterization of the model.
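One way to see the difference is to compare the two kinds of intervals directly. A minimal sketch using the model m fit above: for glm objects confint returns profile likelihood intervals, while confint.default returns Wald intervals.
confint(m)          # profile likelihood intervals
confint.default(m)  # Wald intervals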
Likelihood ratio tests/intervals are a bit more computationally expensive and difficult to program in the case of functions of model parameters (e.g., linear combinations where we use lincon, contrast, or functions from the emmeans package).
Wald tests/intervals can be used in cases where a likelihood function is not specified because we do not or cannot assume a particular distribution for the response variable.
For inferences based on maximum likelihood to be correct, we are assuming we have the correct likelihood function. It is useful to note three aspects of the likelihood function where we need to be correct.
The distribution of the response variable.
The relationship between the parameter(s) of the distribution and the explanatory variable(s).
The independence of the \(n\) observations.
Interestingly, for GLMs the distribution actually isn’t something that we need to be correct about, provided that we are correct about the following:
\(E(Y_i)\) is related to the explanatory variables in the assumed way — e.g., \(g[E(Y_i)] = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}\).
\(\text{Var}(Y_i)\) depends on \(E(Y_i)\) in the assumed way — i.e., through the variance structure \(\text{Var}(Y_i) = \phi V[E(Y_i)]\).
The observations are independent (although there are some special cases where we can violate this assumption and still get “asymptotically correct” inferences).
When you specify a distribution via the family argument to the glm function, all that matters is the variance structure implied by that family.
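As a brief preview (a sketch using the ceriodaphniastrain data), fitting the same model with family = poisson and family = quasipoisson gives identical estimates of the regression parameters, because the two families imply the same variance structure up to the dispersion parameter; only the standard errors differ.
m.pois <- glm(count ~ strain + concentration, family = poisson, data = ceriodaphniastrain)
m.quasi <- glm(count ~ strain + concentration, family = quasipoisson, data = ceriodaphniastrain)
cbind(coef(m.pois), coef(m.quasi))  # identical point estimates
cbind(summary(m.pois)$coefficients[,2], summary(m.quasi)$coefficients[,2])  # different standard errors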