Multiple imputations involves three steps:
library(tidyverse)
library(SDaA)
set.seed(124)
myotters <- otters |>
mutate(habtype = case_when(
habitat == 1 ~ "cliffs",
habitat == 2 ~ "agricultural",
habitat == 3 ~ "peat",
habitat == 4 ~ "notpeat")) |>
mutate(N = case_when(
habtype == "cliffs" ~ 89,
habtype == "agricultural" ~ 61,
habtype == "peat" ~ 40,
habtype == "notpeat" ~ 47)) |>
mutate(area = round(rnorm(n(), 5.5, 0.5), 2)) |>
select(-habitat)
head(myotters)
section holts habtype N area
1 1 6 notpeat 47 4.81
2 3 0 agricultural 61 5.52
3 4 8 cliffs 89 5.12
4 8 0 cliffs 89 5.61
5 11 0 cliffs 89 6.21
6 19 0 agricultural 61 5.87
There are no missing data, so inferences for, say, the total number of holts could be done as follows.
library(survey)
mydesign <- svydesign(id = ~1, strata = ~habtype, fpc = ~N, data = myotters)
svytotal(x = ~holts, design = mydesign)
total SE
holts 985 73.9
Let’s simulate some missing data for the habtype and
area variables.
myotters.missing <- myotters |>
mutate(habtype = ifelse(runif(n()) < 0.25, NA, habtype)) |>
mutate(area = ifelse(runif(n()) < 0.25, NA, area)) |>
mutate(habtype = factor(habtype)) |>
mutate(N = ifelse(is.na(habtype), NA, N))
head(myotters.missing)
section holts habtype N area
1 1 6 notpeat 47 4.81
2 3 0 agricultural 61 NA
3 4 8 cliffs 89 5.12
4 8 0 <NA> NA 5.61
5 11 0 <NA> NA 6.21
6 19 0 <NA> NA NA
We could do a complete-case analysis which drops those elements with unknown habitat type.
myotters.complete <- myotters.missing |> filter(!is.na(habtype))
mydesign <- svydesign(id = ~1, strata = ~habtype, fpc = ~N, data = myotters.complete)
svytotal(x = ~holts, design = mydesign)
total SE
holts 995 80.6
But that introduces two potential problems: (1) reduced sample size and (2) potential of “non-response” bias.
Two general methods of multiple imputation: joint modeling and fully conditional specification. Here we are going to use an algorithm that is based on the latter called MICE (Multivariate Imputation by Chained Equations).
library(mice)
pmat <- matrix(0, 5, 5)
pmat[3,2] <- 1
pmat[3,5] <- 1
pmat[5,2] <- 1
pmat[5,3] <- 1
myotters.imputed <- mice(myotters.missing, predictorMatrix = pmat, m = 5)
myotters.imputed
Class: mids
Number of multiple imputations: 5
Imputation methods:
section holts habtype N area
"" "" "polyreg" "pmm" "pmm"
PredictorMatrix:
section holts habtype N area
section 0 0 0 0 0
holts 0 0 0 0 0
habtype 0 1 0 0 1
N 0 0 0 0 0
area 0 1 1 0 0
We need to get the data into a form that is usable in the second step.
library(mitools)
myotters.imputed <- imputationList(complete(myotters.imputed, action = "all"))
But first, we can see each of the \(m = 5\) samples as follows, while also fixing the stratum size.
head(myotters.missing)
section holts habtype N area
1 1 6 notpeat 47 4.81
2 3 0 agricultural 61 NA
3 4 8 cliffs 89 5.12
4 8 0 <NA> NA 5.61
5 11 0 <NA> NA 6.21
6 19 0 <NA> NA NA
for (i in 1:5) {
myotters.imputed$imputations[[i]] <- myotters.imputed$imputations[[i]] |>
mutate(N = case_when(
habtype == "cliffs" ~ 89,
habtype == "agricultural" ~ 61,
habtype == "peat" ~ 40,
habtype == "notpeat" ~ 47))
print(head(myotters.imputed$imputations[[i]]))
}
section holts habtype N area
1 1 6 notpeat 47 4.81
2 3 0 agricultural 61 5.66
3 4 8 cliffs 89 5.12
4 8 0 notpeat 47 5.61
5 11 0 notpeat 47 6.21
6 19 0 agricultural 61 4.98
section holts habtype N area
1 1 6 notpeat 47 4.81
2 3 0 agricultural 61 5.49
3 4 8 cliffs 89 5.12
4 8 0 agricultural 61 5.61
5 11 0 agricultural 61 6.21
6 19 0 notpeat 47 5.06
section holts habtype N area
1 1 6 notpeat 47 4.81
2 3 0 agricultural 61 5.47
3 4 8 cliffs 89 5.12
4 8 0 agricultural 61 5.61
5 11 0 cliffs 89 6.21
6 19 0 agricultural 61 5.07
section holts habtype N area
1 1 6 notpeat 47 4.81
2 3 0 agricultural 61 5.68
3 4 8 cliffs 89 5.12
4 8 0 agricultural 61 5.61
5 11 0 agricultural 61 6.21
6 19 0 notpeat 47 5.56
section holts habtype N area
1 1 6 notpeat 47 4.81
2 3 0 agricultural 61 5.06
3 4 8 cliffs 89 5.12
4 8 0 cliffs 89 5.61
5 11 0 agricultural 61 6.21
6 19 0 agricultural 61 4.81
We define the design. The svydesign function understands
that we are passing it several data sets.
mydesign.imputed <- svydesign(id = ~1, strata = ~habtype, fpc = ~N,
data = myotters.imputed)
mydesign.imputed
Multiple (5) imputations: svydesign(id = ~1, strata = ~habtype, fpc = ~N, data = myotters.imputed)
We essentially have \(m\) surveys, and we can estimate the total number of holts from each of them.
with(mydesign.imputed, svytotal(x = ~holts))
$`1`
total SE
holts 932 62.4
$`2`
total SE
holts 954 64.3
$`3`
total SE
holts 961 67.9
$`4`
total SE
holts 946 64.9
$`5`
total SE
holts 981 66.3
attr(,"call")
with(mydesign.imputed, svytotal(x = ~holts))
But we do not want \(m\) estimates and \(m\) standard errors. We want just one of each.
We can pool (i.e., combine) the \(m\) inferences relatively simply. Let \(\hat\tau_j\) be the estimate from the \(j\)-th sample, and let \(\mbox{V}(\hat\tau_j)\) be its variance. Our pooled estimator of the population total is then \[ \hat\tau = \frac{1}{m}\sum_{j=1}^m \hat\tau_j, \] and its (approximate) variance of this pooled estimator is \[ \mbox{V}(\hat\tau) = \bar{W} + \left(1 + \frac{1}{m}\right)B, \] where \(\bar{W}\) is the “within-imputation” variance, \[ \bar{W} = \frac{1}{m}\sum_{j=1}^m \mbox{V}(\hat\tau_j), \] and \(B\) is the “between-imputation” variance, \[ B = \frac{1}{m-1}\sum_{j=1}^m (\hat\tau_j - \hat\tau)^2. \] We can also write the variance of \(\hat\tau\) is based on three components, \[ \mbox{V}(\hat\tau) = \bar{W} + B + \frac{1}{m}B. \] The second and third terms are the variance due to the uncertainty about the imputed values, and the variance due to their simulation, respectively.
These results are computed very easily for us using the
MIcombine function.
holts.total <- with(mydesign.imputed, svytotal(x = ~holts))
MIcombine(holts.total)
Multiple imputation results:
with(mydesign.imputed, svytotal(x = ~holts))
MIcombine.default(holts.total)
results se
holts 954.9 68.19
For more information about multiple imputation, see Flexible Imputation of Missing Data by Stef van Buuren.