Wednesday, January 15

You can also download a PDF copy of this lecture.

Quantitative and Categorical Explanatory Variables

library(MASS) # for the whiteside data frame
head(whiteside)

   Insul Temp Gas
1 Before -0.8 7.2
2 Before -0.7 6.9
3 Before  0.4 6.4
4 Before  2.5 6.0
5 Before  2.9 5.8
6 Before  3.2 5.8

Here is a plot of the data (we will consider how to create such plots later). Consider the following model specified using the lm function.

m <- lm(Gas ~ Insul + Temp, data = whiteside)
summary(m)$coefficients

             Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  6.551329 0.11808641  55.47911 1.256860e-48
InsulAfter  -1.565205 0.09705338 -16.12726 4.310904e-22
Temp        -0.336697 0.01776311 -18.95484 2.864348e-25

The model specified above can be written as \[ E(Y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}, \] or \[ E(G_i) = \beta_0 + \beta_1 d_i + \beta_2 t_i, \] where $G_i$ is gas consumption, $t_i$ is temperature, and $d_i$ is defined as \[ d_i = \begin{cases} 1, & \text{if the $i$-th observation is after insulation}, \\ 0, & \text{otherwise}. \end{cases} \] Thus we can also write the model case-wise as \[ E(G_i) = \begin{cases} \beta_0 + \beta_2t_i, & \text{if the $i$-th observation is before insulation}, \\ \beta_0 + \beta_1 + \beta_2t_i, & \text{if the $i$-th observation is after insulation.} \end{cases} \] If we plot expected gas consumption versus temperature, we get a line with a slope of $\beta_2$ and an intercept of $\beta_0$ before insulation and $\beta_0 + \beta_1$ after insulation.

For the model \[ E(G_i) = \begin{cases} \beta_0 + \beta_2t_i, & \text{if the $i$-th observation is before insulation}, \\ \beta_0 + \beta_1 + \beta_2t_i, & \text{if the $i$-th observation is after insulation,} \end{cases} \] how can we write the following as functions of $\beta_0$, $\beta_1$, and/or $\beta_2$?

The change in the expected gas consumption per unit increase in temperature before insulation.
The change in the expected gas consumption per unit increase in temperature after insulation.
The difference in the two rates of change described above.
The expected gas consumption before and after insulation at 5C.
The difference in expected gas consumption between before and after insulation at 5C.

Now consider the following model specified using the lm function.

m <- lm(Gas ~ Insul + Temp + Insul:Temp, data = whiteside)
# Note: A shortcut for the model formula a + b + a:b is a*b, 
# so we could also write the model formula as Gas ~ Insul*Temp.
summary(m)$coefficients

                  Estimate Std. Error    t value     Pr(>|t|)
(Intercept)      6.8538277 0.13596397  50.409146 7.997414e-46
InsulAfter      -2.1299780 0.18009172 -11.827185 2.315921e-16
Temp            -0.3932388 0.02248703 -17.487358 1.976009e-23
InsulAfter:Temp  0.1153039 0.03211212   3.590665 7.306852e-04

The model specified above can be written as \[ E(Y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}, \] or \[ E(G_i) = \beta_0 + \beta_1 d_i + \beta_2 t_i + \beta_3 d_i t_i, \] where again $G_i$ is gas consumption, $t_i$ is temperature, and $d_i$ is defined as \[ d_i = \begin{cases} 1, & \text{if the $i$-th observation is after insulation}, \\ 0, & \text{otherwise}. \end{cases} \] Thus we can also write the model case-wise as \[ E(G_i) = \begin{cases} \beta_0 + \beta_2t_i, & \text{if the $i$-th observation is before insulation}, \\ \beta_0 + \beta_1 + (\beta_2 + \beta_3)t_i, & \text{if the $i$-th observation is after insulation.} \end{cases} \] If we plot expected gas consumption versus temperature, we get a line with a slope of $\beta_2$ before insulation and $\beta_2 + \beta_3$ after insulation, and an intercept of $\beta_0$ before insulation and $\beta_0 + \beta_1$ after insulation.

For the model \[ E(G_i) = \begin{cases} \beta_0 + \beta_2t_i, & \text{if the $i$-th observation is before insulation}, \\ \beta_0 + \beta_1 + (\beta_2 + \beta_3)t_i, & \text{if the $i$-th observation is after insulation,} \end{cases} \] how can we write the following as functions of $\beta_0$, $\beta_1$, $\beta_2$ and/or $\beta_3$?

The rate of change in the expected gas consumption per unit increase in temperature before insulation.
The rate of change in the expected gas consumption per unit increase in temperature after insulation.
The difference in the two rates of change described above.
The expected gas consumption before and after insulation at 5C.
The difference in expected gas consumption between before and after insulation at 5C.

Identifying Model Terms in Linear Models in R

A linear model can be written as \[ E(Y) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k \] that has $k$ terms, where the $j$-th term is $\beta_jx_j$. The row labels returned by summary such as

summary(m)$coefficients

                  Estimate Std. Error    t value     Pr(>|t|)
(Intercept)      6.8538277 0.13596397  50.409146 7.997414e-46
InsulAfter      -2.1299780 0.18009172 -11.827185 2.315921e-16
Temp            -0.3932388 0.02248703 -17.487358 1.976009e-23
InsulAfter:Temp  0.1153039 0.03211212   3.590665 7.306852e-04

let us identify each term in the model using the following rules.

If the label is (Intercept) them the first term is $\beta_0$. Otherwise $\beta_0$ is excluded.
If the label is the name of a quantitative variable (e.g., Temp), then the model includes the term $\beta_jx_j$ where $x_j$ is the value of that variable for a given observation.
If the label is the name of a categorical/factor variable with a category/level “suffix” (e.g., InsulAfter), then the model includes the term $\beta_jx_j$ where $x_j$ is an indicator variable for when the observation of the categorical/factor variable equals that category/level.¹
If the label is has a colon (:) between two labels (e.g., InsulAfter:Temp), then the model includes the term $\beta_jx_j$ where $x_j$ is the product of the values of two $x$ variables defined by the two labels (i.e., $x$ defined by InsulAfter times the $x$ defined by Temp).

Example: Here’s an alternative parameterization of the model used above.

m <- lm(Gas ~ -1 + Insul + Insul:Temp, data = whiteside)
summary(m)$coefficients

                   Estimate Std. Error   t value     Pr(>|t|)
InsulBefore       6.8538277 0.13596397  50.40915 7.997414e-46
InsulAfter        4.7238497 0.11809668  39.99985 9.918382e-41
InsulBefore:Temp -0.3932388 0.02248703 -17.48736 1.976009e-23
InsulAfter:Temp  -0.2779350 0.02292426 -12.12405 8.936039e-17

How would we write this model?

Linear Functions of Parameters

“The default output isn’t necessarily useful, and what is useful is not necessarily the default output.” — Me

For a regression model like \[ E(Y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}, \] a quantity of interest can often be written as a linear function of the model parameters (sometimes also called a linear combination or linear contrast of the parameters). This can be written in general as \[ \ell = a_0\beta_0 + a_1\beta_1 + a_2\beta_2 + \cdots + a_k\beta_k + b, \] where $a_0, a_1, \dots, a_k$ and $b$ are specified coefficients. There are several ways to do this on R. One way is to use the lincon function (for linear contrast) from the trtools package.

Example: Consider the following model.

m <- lm(Gas ~ Insul + Temp + Insul:Temp, data = whiteside)
summary(m)$coefficients

                  Estimate Std. Error    t value     Pr(>|t|)
(Intercept)      6.8538277 0.13596397  50.409146 7.997414e-46
InsulAfter      -2.1299780 0.18009172 -11.827185 2.315921e-16
Temp            -0.3932388 0.02248703 -17.487358 1.976009e-23
InsulAfter:Temp  0.1153039 0.03211212   3.590665 7.306852e-04

library(trtools)    # so we can use lincon below
options(digits = 4) # for display purposes

How do we write the following as linear combinations and then make inferences about that linear combination?

The rate of change in expected gas consumption before insulation is $\beta_2$.

lincon(m, a = c(0,0,1,0), b = 0)

            estimate      se   lower   upper tvalue df    pvalue
(0,0,1,0),0  -0.3932 0.02249 -0.4384 -0.3481 -17.49 52 1.976e-23

The rate of change in expected gas consumption after insulation is $\beta_2 + \beta_3$.

lincon(m, a = c(0,0,1,1), b = 0)

            estimate      se   lower   upper tvalue df    pvalue
(0,0,1,1),0  -0.2779 0.02292 -0.3239 -0.2319 -12.12 52 8.936e-17

The difference in the two rates of change described above (after minus before) is $\beta_3$.

lincon(m, a = c(0,0,0,1), b = 0)

            estimate      se   lower  upper tvalue df    pvalue
(0,0,0,1),0   0.1153 0.03211 0.05087 0.1797  3.591 52 0.0007307

The expected gas consumption before and after insulation at 5C are $\beta_0 + \beta_25$ and $\beta_0 + \beta_1 + (\beta_2 + \beta_3)5$, respectively.

lincon(m, a = c(1,0,5,0), b = 0) # before

            estimate      se lower upper tvalue df    pvalue
(1,0,5,0),0    4.888 0.06383  4.76 5.016  76.57 52 3.885e-55

lincon(m, a = c(1,1,5,5), b = 0) # after

            estimate      se lower upper tvalue df    pvalue
(1,1,5,5),0    3.334 0.06024 3.213 3.455  55.35 52 6.772e-48

The difference in expected gas consumption between before and after insulation at 5C is $\beta_1 + \beta_35$.

lincon(m, a = c(0,1,0,5), b = 0)

            estimate      se lower  upper tvalue df    pvalue
(0,1,0,5),0   -1.553 0.08777 -1.73 -1.377  -17.7 52 1.155e-23

Note: Because in many cases we have $b = 0$, the b argument has this as a default value and can be omitted if $b = 0$. So we can write lincon(m, a = c(0,1,0,5)) instead of lincon(m, a = c(0,1,0,5), b = 0).

There are alternative “systems” to indicator variables when dealing with categorical explanatory variables. One example is where the reference category/level gets a value of -1 instead of 0. These have some applications when using specialized parameterizations of linear models and simplify some calculations, but they can often be avoided. Most software (including R) uses the indicator/dummy variable system by default.↩︎