The i-th observed rate Ri can be written as Ri=Ci/Si, where Ci is a count and Si is the “size” of the interval in which the counts are observed. Examples include fish per minute, epileptic episodes per day, or defects per (square) meter. In some cases Si is referred to as the “exposure” of the i-th observation.
Assume that the count Ci has a Poisson distribution with E(Ci) = Si exp(β0 + β1xi1 + ⋯ + βkxik), where λi = exp(β0 + β1xi1 + ⋯ + βkxik) is the expected count per unit of exposure (e.g., per minute), so that Siλi is the expected count over an exposure of size Si (e.g., per hour if Si = 60, per day if Si = 1440, or per second if Si = 1/60). The expected rate is then E(Ri) = E(Ci/Si) = E(Ci)/Si = exp(β0 + β1xi1 + ⋯ + βkxik), if we treat the exposure as fixed (like we do xi1, xi2, …, xik). But rather than using Ri as the response variable we can use Ci as the response variable in a Poisson regression model where E(Ci) = Si exp(β0 + β1xi1 + ⋯ + βkxik) = exp(β0 + β1xi1 + ⋯ + βkxik + log Si), and where log Si is an “offset” variable (i.e., basically an explanatory variable whose coefficient is “fixed” at one).
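To see that an offset is just an explanatory variable whose coefficient is fixed at one, we can simulate some rate data (a hypothetical example, not data used in this lecture) and instead let the coefficient of log Si be estimated freely, in which case its estimate should come out near one.

```r
# Hypothetical simulated data: counts over exposures of varying size.
set.seed(101)
n <- 500
x <- runif(n)                                 # one explanatory variable
S <- sample(c(1, 10, 60), n, replace = TRUE)  # exposure sizes
C <- rpois(n, S * exp(-1 + 2 * x))            # E(C) = S * exp(b0 + b1 * x)

# Let the coefficient of log(S) be free: its estimate should be near one.
m.free <- glm(C ~ x + log(S), family = poisson)
coef(m.free)["log(S)"]

# Fixing that coefficient at exactly one is what offset() does.
m.offset <- glm(C ~ x + offset(log(S)), family = poisson)
```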
Note: If Si is a constant for all observations, so that Si = S, then we can write the model as E(Ci) = exp(β0 + β1xi1 + ⋯ + βkxik + log S) = exp(β0* + β1xi1 + β2xi2 + ⋯ + βkxik), where β0* = log(S) + β0, so that the offset is “absorbed” into the intercept and we do not need to be concerned about it. Including an offset is only necessary if Si is not the same for all observations.
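A quick sketch with simulated data (again hypothetical, not data from this lecture) confirming that a constant exposure only shifts the intercept by log S and leaves the other coefficients unchanged:

```r
# Hypothetical simulated data with the same exposure S = 60 everywhere.
set.seed(1)
n <- 300
x <- runif(n)
S <- rep(60, n)
C <- rpois(n, S * exp(0.5 + 1.5 * x))

m1 <- glm(C ~ x + offset(log(S)), family = poisson)  # with the offset
m2 <- glm(C ~ x, family = poisson)                   # offset absorbed

coef(m2)["x"] - coef(m1)["x"]                        # essentially zero
coef(m2)["(Intercept)"] - coef(m1)["(Intercept)"]    # essentially log(60)
```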
Using rates as response variables in a linear or nonlinear model without accounting for Si is not advisable because of heteroscedasticity due to unequal Si.
We have that E(Ri) = E(Ci)/Si. But Var(Ri) = Var(Ci/Si) = Var(Ci)/Si² = E(Ri)Si/Si² = E(Ri)/Si, because (a) Var(Y/c) = Var(Y)/c² if c is a constant, (b) Var(Ci) = E(Ci) because Ci has a Poisson distribution, and (c) E(Ci) = E(Ri)Si. Thus the variance of the rate depends on the expected rate and on Si: the larger the Si, the smaller the variance of Ri (and vice versa).
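A small simulation (with hypothetical numbers) makes the variance result concrete: with λ = 2 events per minute and Si = 60 minutes, the rate Ri = Ci/Si has mean λ and variance λ/Si.

```r
# Simulate C ~ Poisson(S * lambda) and check the mean and variance of R = C/S.
set.seed(2)
S <- 60; lambda <- 2
C <- rpois(100000, S * lambda)
R <- C / S
c(mean = mean(R), target = lambda)    # mean(R) is close to lambda
c(var = var(R), target = lambda / S)  # var(R) is close to lambda/S
```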
We can deal with this heteroscedasticity either by (a) using an appropriate offset variable in Poisson regression or a related model, or (b) using weights of wi = Si/E(Ri) in weighted least squares, with E(Ri) estimated via iteratively weighted least squares so that in practice wi = Si/ŷi.
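Options (a) and (b) are in fact connected: for a Poisson GLM, using the rate Ci/Si as the response with prior weights Si gives exactly the same coefficient estimates as using Ci with an offset of log Si. Here is a sketch with simulated (hypothetical) data; R will warn about non-integer "counts" in the second fit, which is harmless here.

```r
# Hypothetical simulated rate data.
set.seed(3)
n <- 200
x <- runif(n)
S <- sample(10:100, n, replace = TRUE)
C <- rpois(n, S * exp(-2 + x))

m.offset <- glm(C ~ x + offset(log(S)), family = poisson)
m.rate <- glm(C/S ~ x, family = poisson, weights = S)  # warns, but fits

cbind(offset = coef(m.offset), weighted = coef(m.rate))  # identical columns
```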
Software for GLMs (and sometimes linear models) will often permit specification of an offset variable. In R this is done using offset() in the model formula.
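In glm() the offset can be given either in the formula with offset() or through the offset argument; the two are equivalent. A quick check with simulated (hypothetical) data:

```r
# Hypothetical simulated data.
set.seed(4)
n <- 100
x <- runif(n)
S <- sample(1:10, n, replace = TRUE)
C <- rpois(n, S * exp(1 + x))

m1 <- glm(C ~ x + offset(log(S)), family = poisson)
m2 <- glm(C ~ x, offset = log(S), family = poisson)
all.equal(coef(m1), coef(m2))  # TRUE
```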
Example: Consider the following data from an observational study of auto accidents.
library(trtools)
head(accidents)
accidents years location treatment
1 13 9 a before
2 6 9 b before
3 30 8 c before
4 20 8 d before
5 10 9 e before
6 15 8 f before
library(ggplot2)
p <- ggplot(accidents, aes(x = location, y = accidents/years)) +
geom_point(aes(size = years, color = treatment)) +
labs(x = "Location", y = "Accidents per Year",
size = "Years", color = "Treatment") + theme_minimal()
plot(p)
m <- glm(accidents ~ location + treatment + offset(log(years)),
data = accidents, family = poisson)
cbind(summary(m)$coefficients, confint(m))
Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
(Intercept) -0.510 0.373 -1.366 0.17207 -1.2924 0.177
locationb -0.486 0.449 -1.080 0.27994 -1.4122 0.378
locationc 1.018 0.326 3.117 0.00182 0.4027 1.694
locationd 0.537 0.356 1.507 0.13168 -0.1510 1.260
locatione -0.262 0.421 -0.624 0.53279 -1.1136 0.559
locationf 0.586 0.353 1.660 0.09690 -0.0939 1.304
locationg -0.486 0.449 -1.080 0.27994 -1.4122 0.378
locationh 0.199 0.379 0.526 0.59921 -0.5459 0.958
treatmentbefore 0.781 0.275 2.834 0.00459 0.2741 1.362
exp(cbind(coef(m), confint(m)))
2.5 % 97.5 %
(Intercept) 0.601 0.275 1.19
locationb 0.615 0.244 1.46
locationc 2.767 1.496 5.44
locationd 1.711 0.860 3.53
locatione 0.769 0.328 1.75
locationf 1.797 0.910 3.68
locationg 0.615 0.244 1.46
locationh 1.221 0.579 2.61
treatmentbefore 2.183 1.315 3.90
When using other tools like contrast or functions from the emmeans package, be sure to specify the offset. Typically we would use a value of one corresponding to one unit of whatever the offset represents (e.g., space or time). Here are the rate ratios for the treatment.
trtools::contrast(m,
a = list(treatment = "before", location = letters[1:8], years = 1),
b = list(treatment = "after", location = letters[1:8], years = 1),
cnames = letters[1:8], tf = exp)
estimate lower upper
a 2.18 1.27 3.75
b 2.18 1.27 3.75
c 2.18 1.27 3.75
d 2.18 1.27 3.75
e 2.18 1.27 3.75
f 2.18 1.27 3.75
g 2.18 1.27 3.75
h 2.18 1.27 3.75
trtools::contrast(m,
a = list(treatment = "after", location = letters[1:8], years = 1),
b = list(treatment = "before", location = letters[1:8], years = 1),
cnames = letters[1:8], tf = exp)
estimate lower upper
a 0.458 0.267 0.786
b 0.458 0.267 0.786
c 0.458 0.267 0.786
d 0.458 0.267 0.786
e 0.458 0.267 0.786
f 0.458 0.267 0.786
g 0.458 0.267 0.786
h 0.458 0.267 0.786
Here are the estimated expected numbers of accidents per year at location a.
trtools::contrast(m, a = list(treatment = c("before","after"), location = "a", years = 1),
cnames = c("before","after"), tf = exp)
estimate lower upper
before 1.311 0.759 2.26
after 0.601 0.289 1.25
Here are the estimated expected numbers of accidents per decade at location a.
trtools::contrast(m, a = list(treatment = c("before","after"), location = "a", years = 10),
cnames = c("before","after"), tf = exp)
estimate lower upper
before 13.11 7.59 22.6
after 6.01 2.89 12.5
When using functions from the emmeans package we use the offset argument with the value specified on the log scale. Here are the estimated numbers of accidents per decade.
library(emmeans)
emmeans(m, ~treatment|location, type = "response", offset = log(10))
location = a:
treatment rate SE df asymp.LCL asymp.UCL
after 6.0 2.24 Inf 2.89 12.5
before 13.1 3.65 Inf 7.59 22.6
location = b:
treatment rate SE df asymp.LCL asymp.UCL
after 3.7 1.60 Inf 1.58 8.6
before 8.1 2.86 Inf 4.03 16.2
location = c:
treatment rate SE df asymp.LCL asymp.UCL
after 16.6 4.83 Inf 9.39 29.4
before 36.3 6.39 Inf 25.68 51.2
location = d:
treatment rate SE df asymp.LCL asymp.UCL
after 10.3 3.42 Inf 5.35 19.7
before 22.4 5.06 Inf 14.42 34.9
location = e:
treatment rate SE df asymp.LCL asymp.UCL
after 4.6 1.86 Inf 2.10 10.2
before 10.1 3.20 Inf 5.42 18.8
location = f:
treatment rate SE df asymp.LCL asymp.UCL
after 10.8 3.56 Inf 5.65 20.6
before 23.6 5.18 Inf 15.30 36.3
location = g:
treatment rate SE df asymp.LCL asymp.UCL
after 3.7 1.60 Inf 1.58 8.6
before 8.1 2.86 Inf 4.03 16.2
location = h:
treatment rate SE df asymp.LCL asymp.UCL
after 7.3 2.56 Inf 3.70 14.5
before 16.0 4.18 Inf 9.59 26.7
Confidence level used: 0.95
Intervals are back-transformed from the log scale
Here is the rate ratio for the effect of treatment.
pairs(emmeans(m, ~treatment|location, type = "response", offset = log(10)), infer = TRUE)
location = a:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
after / before 0.458 0.126 Inf 0.267 0.786 1 -2.834 0.0046
location = b:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
after / before 0.458 0.126 Inf 0.267 0.786 1 -2.834 0.0046
location = c:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
after / before 0.458 0.126 Inf 0.267 0.786 1 -2.834 0.0046
location = d:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
after / before 0.458 0.126 Inf 0.267 0.786 1 -2.834 0.0046
location = e:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
after / before 0.458 0.126 Inf 0.267 0.786 1 -2.834 0.0046
location = f:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
after / before 0.458 0.126 Inf 0.267 0.786 1 -2.834 0.0046
location = g:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
after / before 0.458 0.126 Inf 0.267 0.786 1 -2.834 0.0046
location = h:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
after / before 0.458 0.126 Inf 0.267 0.786 1 -2.834 0.0046
Confidence level used: 0.95
Intervals are back-transformed from the log scale
Tests are performed on the log scale
Use reverse = TRUE to “flip” the rate ratio. Also, for rate ratios the size of the offset does not matter, since it “cancels out” in the ratio. And because there is no interaction in this model, the rate ratio does not depend on location, so we can omit location when using emmeans (but not contrast).
pairs(emmeans(m, ~treatment, type = "response"), infer = TRUE)
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
after / before 0.458 0.126 Inf 0.267 0.786 1 -2.834 0.0046
Results are averaged over the levels of: location
Confidence level used: 0.95
Intervals are back-transformed from the log scale
Tests are performed on the log scale
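That the offset cancels in a rate ratio can be checked directly with predict(): multiplying the exposure by any constant scales both predicted rates by the same factor, which drops out of the ratio. Here is a sketch with simulated (hypothetical) before/after data rather than the accident data:

```r
# Hypothetical before/after data with varying exposures.
set.seed(5)
n <- 100
g <- gl(2, n/2, labels = c("after", "before"))
S <- sample(5:15, n, replace = TRUE)
C <- rpois(n, S * exp(ifelse(g == "before", 0.5, 0)))
m <- glm(C ~ g + offset(log(S)), family = poisson)

r1  <- predict(m, newdata = data.frame(g = c("after", "before"), S = 1),
               type = "response")
r10 <- predict(m, newdata = data.frame(g = c("after", "before"), S = 10),
               type = "response")
r1[1] / r1[2]    # after/before rate ratio with an exposure of one
r10[1] / r10[2]  # the same ratio with an exposure of ten
```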
When using predict we need to be sure to also include the offset amount. Again, we would use a value of one assuming we want the number of events per unit space/time.
d <- expand.grid(treatment = c("before","after"), location = letters[1:8], years = 1)
d$yhat <- predict(m, newdata = d, type = "response")
head(d)
treatment location years yhat
1 before a 1 1.311
2 after a 1 0.601
3 before b 1 0.807
4 after b 1 0.370
5 before c 1 3.627
6 after c 1 1.662
p <- ggplot(accidents, aes(x = location, y = accidents/years)) +
geom_point(aes(size = years, color = treatment)) +
labs(x = "Location", y = "Accidents per Year",
size = "Years", color = "Treatment") + theme_minimal() +
geom_point(aes(y = yhat, color = treatment), data = d) +
geom_line(aes(y = yhat, group = treatment, color = treatment), data = d)
plot(p)
We can use the glmint function from the trtools package if we want to produce confidence intervals for plots.
d <- expand.grid(treatment = c("before","after"), location = letters[1:8], years = 1)
d$yhat <- predict(m, newdata = d, type = "response")
glmint(m, newdata = d)
fit low upp
1 1.311 0.759 2.263
2 0.601 0.289 1.248
3 0.807 0.403 1.616
4 0.370 0.158 0.864
5 3.627 2.568 5.123
6 1.662 0.939 2.939
7 2.243 1.442 3.489
8 1.028 0.535 1.975
9 1.008 0.542 1.878
10 0.462 0.210 1.018
11 2.355 1.530 3.625
12 1.079 0.565 2.059
13 0.807 0.403 1.616
14 0.370 0.158 0.864
15 1.600 0.959 2.671
16 0.733 0.370 1.453
cbind(d, glmint(m, newdata = d))
treatment location years yhat fit low upp
1 before a 1 1.311 1.311 0.759 2.263
2 after a 1 0.601 0.601 0.289 1.248
3 before b 1 0.807 0.807 0.403 1.616
4 after b 1 0.370 0.370 0.158 0.864
5 before c 1 3.627 3.627 2.568 5.123
6 after c 1 1.662 1.662 0.939 2.939
7 before d 1 2.243 2.243 1.442 3.489
8 after d 1 1.028 1.028 0.535 1.975
9 before e 1 1.008 1.008 0.542 1.878
10 after e 1 0.462 0.462 0.210 1.018
11 before f 1 2.355 2.355 1.530 3.625
12 after f 1 1.079 1.079 0.565 2.059
13 before g 1 0.807 0.807 0.403 1.616
14 after g 1 0.370 0.370 0.158 0.864
15 before h 1 1.600 1.600 0.959 2.671
16 after h 1 0.733 0.733 0.370 1.453
d <- cbind(d, glmint(m, newdata = d))
p <- ggplot(accidents, aes(x = location)) +
geom_point(aes(y = accidents/years, size = years), shape = 21, fill = "white") +
facet_wrap(~ treatment) + theme_minimal() +
labs(x = "Location", y = "Accidents per Year", size = "Years") +
geom_errorbar(aes(ymin = low, ymax = upp), data = d, width = 0.5) +
geom_point(aes(y = fit), data = d)
plot(p)
p <- ggplot(accidents, aes(x = location, color = treatment)) +
geom_point(aes(y = accidents/years, size = years),
position = position_dodge(width = 0.6), shape = 21, fill = "white") +
labs(x = "Location", y = "Accidents per Year",
size = "Years", color = "Treatment") + theme_minimal() +
geom_errorbar(aes(ymin = low, ymax = upp), data = d,
position = position_dodge(width = 0.6), width = 0.5) +
geom_point(aes(y = fit), data = d, position = position_dodge(width = 0.6))
plot(p)
Example: Consider the following data from an observational study that investigated the possible effect of the development of a commercial fishery on deep sea fish abundance. The figure below shows the number of fish per square meter of swept area from 147 trawls by mean depth in meters, and by whether the trawl was during one of two periods. The 1977-1989 period was from before the development of a commercial fishery, and the period 2000-2002 was when the fishery was active.
library(COUNT)
data(fishing)
head(fishing)
site totabund density meandepth year period sweptarea
1 1 76 0.002070 804 1978 1977-1989 36710
2 2 161 0.003520 808 2001 2000-2002 45741
3 3 39 0.000981 809 2001 2000-2002 39775
4 4 410 0.008039 848 1979 1977-1989 51000
5 5 177 0.005933 853 2002 2000-2002 29831
6 6 695 0.021801 960 1980 1977-1989 31880
p <- ggplot(fishing, aes(x = meandepth, y = totabund/sweptarea)) +
geom_point(alpha = 0.5) + facet_wrap(~ period) + theme_minimal() +
labs(x = "Mean Trawl Depth (meters)",
y = "Fish Caught Per Square Meter Trawled")
plot(p)
An appropriate model for these data might be as follows.
m <- glm(totabund ~ period * meandepth + offset(log(sweptarea)),
family = poisson, data = fishing)
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.422819 1.49e-02 -229.67 0.00e+00
period2000-2002 -0.771117 2.97e-02 -25.94 2.55e-148
meandepth -0.000971 7.96e-06 -121.94 0.00e+00
period2000-2002:meandepth 0.000132 1.52e-05 8.65 5.09e-18
d <- expand.grid(sweptarea = 1, period = c("1977-1989","2000-2002"),
meandepth = seq(800, 5000, length = 100))
d$yhat <- predict(m, newdata = d, type = "response")
p <- p + geom_line(aes(y = yhat), data = d)
plot(p)
What is the expected number of fish per square meter in 1977-1989 at
depths of 1000, 2000, 3000, 4000, and 5000 meters? What is it in
2000-2002?
trtools::contrast(m,
a = list(sweptarea = 1,
meandepth = c(1000,2000,3000,4000,5000), period = "1977-1989"),
cnames = c("1000m","2000m","3000m","4000m","5000m"), tf = exp)
estimate lower upper
1000m 0.012350 0.012147 0.012556
2000m 0.004676 0.004613 0.004739
3000m 0.001770 0.001728 0.001813
4000m 0.000670 0.000645 0.000696
5000m 0.000254 0.000241 0.000268
trtools::contrast(m,
a = list(sweptarea = 1,
meandepth = c(1000,2000,3000,4000,5000), period = "2000-2002"),
cnames = c("1000m","2000m","3000m","4000m","5000m"), tf = exp)
estimate lower upper
1000m 0.006517 0.006325 0.006714
2000m 0.002815 0.002751 0.002881
3000m 0.001216 0.001170 0.001263
4000m 0.000525 0.000494 0.000558
5000m 0.000227 0.000208 0.000247
Here is how we can do that with emmeans.
library(emmeans)
emmeans(m, ~meandepth|period, at = list(meandepth = seq(1000, 5000, by = 1000)),
type = "response", offset = log(1))
period = 1977-1989:
meandepth rate SE df asymp.LCL asymp.UCL
1000 0.01235 1.04e-04 Inf 0.01215 0.01256
2000 0.00468 3.23e-05 Inf 0.00461 0.00474
3000 0.00177 2.17e-05 Inf 0.00173 0.00181
4000 0.00067 1.31e-05 Inf 0.00065 0.00070
5000 0.00025 6.90e-06 Inf 0.00024 0.00027
period = 2000-2002:
meandepth rate SE df asymp.LCL asymp.UCL
1000 0.00652 9.91e-05 Inf 0.00633 0.00671
2000 0.00281 3.31e-05 Inf 0.00275 0.00288
3000 0.00122 2.38e-05 Inf 0.00117 0.00126
4000 0.00053 1.63e-05 Inf 0.00049 0.00056
5000 0.00023 9.80e-06 Inf 0.00021 0.00025
Confidence level used: 0.95
Intervals are back-transformed from the log scale
Note that we can change the units of swept area very easily here. There are 10,000 square meters in a hectare. Here are the expected number of fish per hectare.
trtools::contrast(m,
a = list(sweptarea = 10000,
meandepth = c(1000,2000,3000,4000,5000), period = "1977-1989"),
cnames = c("1000m","2000m","3000m","4000m","5000m"), tf = exp)
estimate lower upper
1000m 123.50 121.47 125.56
2000m 46.76 46.13 47.39
3000m 17.70 17.28 18.13
4000m 6.70 6.45 6.96
5000m 2.54 2.41 2.68
trtools::contrast(m,
a = list(sweptarea = 10000,
meandepth = c(1000,2000,3000,4000,5000), period = "2000-2002"),
cnames = c("1000m","2000m","3000m","4000m","5000m"), tf = exp)
estimate lower upper
1000m 65.17 63.25 67.14
2000m 28.15 27.51 28.81
3000m 12.16 11.70 12.63
4000m 5.25 4.94 5.58
5000m 2.27 2.08 2.47
emmeans(m, ~meandepth|period, at = list(meandepth = seq(1000, 5000, by = 1000)),
type = "response", offset = log(10000))
period = 1977-1989:
meandepth rate SE df asymp.LCL asymp.UCL
1000 123.5 1.040 Inf 121.5 125.6
2000 46.8 0.323 Inf 46.1 47.4
3000 17.7 0.217 Inf 17.3 18.1
4000 6.7 0.131 Inf 6.5 7.0
5000 2.5 0.069 Inf 2.4 2.7
period = 2000-2002:
meandepth rate SE df asymp.LCL asymp.UCL
1000 65.2 0.991 Inf 63.3 67.1
2000 28.1 0.331 Inf 27.5 28.8
3000 12.2 0.238 Inf 11.7 12.6
4000 5.3 0.163 Inf 4.9 5.6
5000 2.3 0.098 Inf 2.1 2.5
Confidence level used: 0.95
Intervals are back-transformed from the log scale
What is the rate ratio of fish per square meter in 2000-2002 versus 1977-1989 at 1000, 2000, 3000, 4000, and 5000 meters?
trtools::contrast(m,
a = list(sweptarea = 1, meandepth = c(1000,2000,3000,4000,5000), period = "2000-2002"),
b = list(sweptarea = 1, meandepth = c(1000,2000,3000,4000,5000), period = "1977-1989"),
cnames = c("1000m","2000m","3000m","4000m","5000m"), tf = exp)
estimate lower upper
1000m 0.528 0.510 0.546
2000m 0.602 0.586 0.618
3000m 0.687 0.656 0.719
4000m 0.784 0.729 0.842
5000m 0.894 0.809 0.989
Here it is for 1977-1989 versus 2000-2002.
trtools::contrast(m,
a = list(sweptarea = 1, meandepth = c(1000,2000,3000,4000,5000), period = "1977-1989"),
b = list(sweptarea = 1, meandepth = c(1000,2000,3000,4000,5000), period = "2000-2002"),
cnames = c("1000m","2000m","3000m","4000m","5000m"), tf = exp)
estimate lower upper
1000m 1.90 1.83 1.96
2000m 1.66 1.62 1.71
3000m 1.46 1.39 1.52
4000m 1.28 1.19 1.37
5000m 1.12 1.01 1.24
Now using emmeans.
pairs(emmeans(m, ~meandepth*period, at = list(meandepth = seq(1000, 5000, by = 1000)),
type = "response", offset = log(1)), by = "meandepth", infer = TRUE)
meandepth = 1000:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
(1977-1989) / (2000-2002) 1.90 0.0330 Inf 1.83 1.96 1 36.700 <.0001
meandepth = 2000:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
(1977-1989) / (2000-2002) 1.66 0.0227 Inf 1.62 1.71 1 37.200 <.0001
meandepth = 3000:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
(1977-1989) / (2000-2002) 1.46 0.0336 Inf 1.39 1.52 1 16.300 <.0001
meandepth = 4000:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
(1977-1989) / (2000-2002) 1.28 0.0468 Inf 1.19 1.37 1 6.600 <.0001
meandepth = 5000:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
(1977-1989) / (2000-2002) 1.12 0.0573 Inf 1.01 1.24 1 2.200 0.0288
Confidence level used: 0.95
Intervals are back-transformed from the log scale
Tests are performed on the log scale
How does the expected number of fish per square meter change per 1000m of depth?
# increasing depth by 1000m
trtools::contrast(m,
a = list(sweptarea = 1, meandepth = 2000, period = c("1977-1989","2000-2002")),
b = list(sweptarea = 1, meandepth = 1000, period = c("1977-1989","2000-2002")),
cnames = c("1977-1989","2000-2002"), tf = exp)
estimate lower upper
1977-1989 0.379 0.373 0.385
2000-2002 0.432 0.421 0.443
# decreasing depth by 1000m
trtools::contrast(m,
a = list(sweptarea = 1, meandepth = 1000, period = c("1977-1989","2000-2002")),
b = list(sweptarea = 1, meandepth = 2000, period = c("1977-1989","2000-2002")),
cnames = c("1977-1989","2000-2002"), tf = exp)
estimate lower upper
1977-1989 2.64 2.60 2.68
2000-2002 2.32 2.26 2.37
Here is how to do the latter with emmeans.
pairs(emmeans(m, ~meandepth*period, at = list(meandepth = c(1000,2000)),
offset = log(1), type = "response"), by = "period", infer = TRUE)
period = 1977-1989:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
meandepth1000 / meandepth2000 2.64 0.0210 Inf 2.60 2.68 1 121.900 <.0001
period = 2000-2002:
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
meandepth1000 / meandepth2000 2.31 0.0301 Inf 2.26 2.38 1 64.600 <.0001
Confidence level used: 0.95
Intervals are back-transformed from the log scale
Tests are performed on the log scale
In epidemiology, the standardized mortality ratio (SMR) is the ratio of the observed number of deaths and the (estimated) expected number of deaths. Poisson regression with an offset can be used to model the SMR to determine if the number of deaths tends to be higher or lower than we would expect.
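As a minimal illustration with made-up numbers (not the Minnesota data used below), an intercept-only Poisson regression with offset log Ei estimates the overall SMR: the fitted exp(β0) equals the total observed count divided by the total expected count.

```r
# Hypothetical observed and expected numbers of deaths for five areas.
obs <- c(12, 30, 7, 45, 20)
expected <- c(10.2, 28.5, 9.1, 40.0, 24.3)

m <- glm(obs ~ 1 + offset(log(expected)), family = poisson)
exp(coef(m))              # the overall SMR
sum(obs) / sum(expected)  # exactly the same quantity
```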
Example: Here is an example of an observational study using a Poisson regression model to investigate the relationship between lung cancer and radon exposure in counties in Minnesota.
Note: The data manipulation and plotting is quite a bit more complicated than what you will normally see in this class, but I have included it in case you might be interested to see the code.
First we will process the data containing the observed and expected number of deaths due to lung cancer, where the latter are based on the known distribution of age and gender in the county.
library(dplyr)
lung <- read.table("http://faculty.washington.edu/jonno/book/MNlung.txt",
header = TRUE, sep = "\t") %>%
mutate(obs = obs.M + obs.F, exp = exp.M + exp.F) %>%
dplyr::select(X, County, obs, exp) %>%
rename(county = County) %>%
mutate(county = tolower(county)) %>%
mutate(county = ifelse(county == "red", "red lake", county))
head(lung)
X county obs exp
1 1 aitkin 92 76.9
2 2 anoka 677 600.5
3 3 becker 105 107.9
4 4 beltrami 101 105.7
5 5 benton 61 81.4
6 6 big stone 32 27.4
Now we will read in data to estimate the average radon exposure of residents of each county.
radon <- read.table("http://faculty.washington.edu/jonno/book/MNradon.txt",
header = TRUE) %>% group_by(county) %>%
summarize(radon = mean(radon)) %>% rename(X = county)
head(radon)
# A tibble: 6 × 2
X radon
<int> <dbl>
1 1 2.08
2 2 3.21
3 3 3.18
4 4 3.66
5 5 3.78
6 6 4.93
Next we merge the two data frames.
radon <- left_join(lung, radon) %>% dplyr::select(-X)
head(radon)
county obs exp radon
1 aitkin 92 76.9 2.08
2 anoka 677 600.5 3.21
3 becker 105 107.9 3.17
4 beltrami 101 105.7 3.66
5 benton 61 81.4 3.77
6 big stone 32 27.4 4.93
For fun we can make some plots of the data by county.
library(maps)
dstate <- map_data("state") %>%
filter(region == "minnesota")
dcounty <- map_data("county") %>%
filter(region == "minnesota") %>%
rename(county = subregion)
dcounty <- left_join(dcounty, radon) %>%
mutate(smr = obs/exp)
no_axes <- theme_minimal() + theme(
axis.text = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.title = element_blank()
)
p <- ggplot(dcounty, aes(x = long, y = lat, group = group)) + coord_fixed(1.3) +
geom_polygon(aes(fill = exp), color = "black", linewidth = 0.25) +
scale_fill_gradient(low = grey(0.95), high = grey(0.25),
trans = "log10", na.value = "pink") +
theme(legend.position = "inside", legend.position.inside = c(0.8,0.4)) +
no_axes + ggtitle("Expected Number of Cases") + labs(fill = "Cases")
plot(p)
p <- ggplot(dcounty, aes(x = long, y = lat, group = group)) + coord_fixed(1.3) +
geom_polygon(aes(fill = smr), color = "black", linewidth = 0.25) +
scale_fill_gradient(low = grey(0.95), high = grey(0.25), na.value = "pink") +
theme(legend.position = "inside", legend.position.inside = c(0.8,0.4)) +
no_axes + ggtitle("Standardized Mortality Ratio") + labs(fill = "SMR")
plot(p)
p <- ggplot(dcounty, aes(x = long, y = lat, group = group)) + coord_fixed(1.3) +
geom_polygon(aes(fill = radon), color = "black", linewidth = 0.25) +
scale_fill_gradient(low = grey(0.95), high = grey(0.25), na.value = "pink") +
theme(legend.position = "inside", legend.position.inside = c(0.8,0.4)) +
no_axes + ggtitle("Average Radon (pCi/liter)") + labs(fill = "Radon")
plot(p)
How does the expected SMR relate to radon? Consider the Poisson regression model log E(Yi/Ei) = β0 + β1ri, where Yi and Ei are the observed and expected number of lung cancer deaths (or cases), respectively, in the i-th county, and ri is the average radon exposure in the i-th county. Here Yi/Ei is the SMR for the i-th county. We can also write this model as log E(Yi) = log Ei + β0 + β1ri, so log Ei is an offset.
m <- glm(obs ~ offset(log(exp)) + radon, family = poisson, data = dcounty)
summary(m)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2107 0.00562 37.5 6.95e-308
radon -0.0421 0.00119 -35.2 4.37e-272
exp(cbind(coef(m), confint(m)))
2.5 % 97.5 %
(Intercept) 1.235 1.221 1.248
radon 0.959 0.957 0.961
We should be careful and remember the ecological fallacy which states that relationships at the group level (e.g., county) do not necessarily hold at the individual level. Radon may be related to other variables (e.g., smoking) that affect the risk of lung cancer.