Multiple Imputation

Multiple imputations involves three steps:

  1. Imputation of missing data to create \(m\) complete samples.
  2. Make inferences from each of the \(m\) samples.
  3. Pool the inferences from the \(m\) samples.

Setup: The Otters Stratified Random Sampling Design

library(tidyverse)
library(SDaA)
set.seed(124)

myotters <- otters |> 
  mutate(habtype = case_when(
    habitat == 1 ~ "cliffs",
    habitat == 2 ~ "agricultural",
    habitat == 3 ~ "peat",
    habitat == 4 ~ "notpeat")) |>
  mutate(N = case_when(
    habtype == "cliffs" ~ 89,
    habtype == "agricultural" ~ 61,
    habtype == "peat" ~ 40,
    habtype == "notpeat" ~ 47)) |>
  mutate(area = round(rnorm(n(), 5.5, 0.5), 2)) |> 
  select(-habitat)

head(myotters)
  section holts      habtype  N area
1       1     6      notpeat 47 4.81
2       3     0 agricultural 61 5.52
3       4     8       cliffs 89 5.12
4       8     0       cliffs 89 5.61
5      11     0       cliffs 89 6.21
6      19     0 agricultural 61 5.87

There are no missing data, so inferences for, say, the total number of holts could be done as follows.

library(survey)
mydesign <- svydesign(id = ~1, strata = ~habtype, fpc = ~N, data = myotters)
svytotal(x = ~holts, design = mydesign)
      total   SE
holts   985 73.9

Let’s simulate some missing data for the habtype and area variables.

myotters.missing <- myotters |> 
  mutate(habtype = ifelse(runif(n()) < 0.25, NA, habtype)) |>
  mutate(area = ifelse(runif(n()) < 0.25, NA, area)) |> 
  mutate(habtype = factor(habtype)) |>
  mutate(N = ifelse(is.na(habtype), NA, N))
head(myotters.missing)
  section holts      habtype  N area
1       1     6      notpeat 47 4.81
2       3     0 agricultural 61   NA
3       4     8       cliffs 89 5.12
4       8     0         <NA> NA 5.61
5      11     0         <NA> NA 6.21
6      19     0         <NA> NA   NA

We could do a complete-case analysis which drops those elements with unknown habitat type.

myotters.complete <- myotters.missing |> filter(!is.na(habtype))
mydesign <- svydesign(id = ~1, strata = ~habtype, fpc = ~N, data = myotters.complete)
svytotal(x = ~holts, design = mydesign)
      total   SE
holts   995 80.6

But that introduces two potential problems: (1) reduced sample size and (2) potential of “non-response” bias.

Step 1: Multiple Imputation

Two general methods of multiple imputation: joint modeling and fully conditional specification. Here we are going to use an algorithm that is based on the latter called MICE (Multivariate Imputation by Chained Equations).

library(mice)

pmat <- matrix(0, 5, 5)
pmat[3,2] <- 1
pmat[3,5] <- 1
pmat[5,2] <- 1
pmat[5,3] <- 1

myotters.imputed <- mice(myotters.missing, predictorMatrix = pmat, m = 5)
myotters.imputed
Class: mids
Number of multiple imputations:  5 
Imputation methods:
  section     holts   habtype         N      area 
       ""        "" "polyreg"     "pmm"     "pmm" 
PredictorMatrix:
        section holts habtype N area
section       0     0       0 0    0
holts         0     0       0 0    0
habtype       0     1       0 0    1
N             0     0       0 0    0
area          0     1       1 0    0

We need to get the data into a form that is usable in the second step.

library(mitools)
myotters.imputed <- imputationList(complete(myotters.imputed, action = "all"))

But first, we can see each of the \(m = 5\) samples as follows, while also fixing the stratum size.

head(myotters.missing)
  section holts      habtype  N area
1       1     6      notpeat 47 4.81
2       3     0 agricultural 61   NA
3       4     8       cliffs 89 5.12
4       8     0         <NA> NA 5.61
5      11     0         <NA> NA 6.21
6      19     0         <NA> NA   NA
for (i in 1:5) {
  myotters.imputed$imputations[[i]] <- myotters.imputed$imputations[[i]] |> 
    mutate(N = case_when(
      habtype == "cliffs" ~ 89,
      habtype == "agricultural" ~ 61,
      habtype == "peat" ~ 40,
      habtype == "notpeat" ~ 47))
  print(head(myotters.imputed$imputations[[i]]))
}
  section holts      habtype  N area
1       1     6      notpeat 47 4.81
2       3     0 agricultural 61 5.66
3       4     8       cliffs 89 5.12
4       8     0      notpeat 47 5.61
5      11     0      notpeat 47 6.21
6      19     0 agricultural 61 4.98
  section holts      habtype  N area
1       1     6      notpeat 47 4.81
2       3     0 agricultural 61 5.49
3       4     8       cliffs 89 5.12
4       8     0 agricultural 61 5.61
5      11     0 agricultural 61 6.21
6      19     0      notpeat 47 5.06
  section holts      habtype  N area
1       1     6      notpeat 47 4.81
2       3     0 agricultural 61 5.47
3       4     8       cliffs 89 5.12
4       8     0 agricultural 61 5.61
5      11     0       cliffs 89 6.21
6      19     0 agricultural 61 5.07
  section holts      habtype  N area
1       1     6      notpeat 47 4.81
2       3     0 agricultural 61 5.68
3       4     8       cliffs 89 5.12
4       8     0 agricultural 61 5.61
5      11     0 agricultural 61 6.21
6      19     0      notpeat 47 5.56
  section holts      habtype  N area
1       1     6      notpeat 47 4.81
2       3     0 agricultural 61 5.06
3       4     8       cliffs 89 5.12
4       8     0       cliffs 89 5.61
5      11     0 agricultural 61 6.21
6      19     0 agricultural 61 4.81

Step 2: Inferences from Each Sample

We define the design. The svydesign function understands that we are passing it several data sets.

mydesign.imputed <- svydesign(id = ~1, strata = ~habtype, fpc = ~N, 
  data = myotters.imputed)
mydesign.imputed
Multiple (5) imputations: svydesign(id = ~1, strata = ~habtype, fpc = ~N, data = myotters.imputed)

We essentially have \(m\) surveys, and we can estimate the total number of holts from each of them.

with(mydesign.imputed, svytotal(x = ~holts))
$`1`
      total   SE
holts   932 62.4

$`2`
      total   SE
holts   954 64.3

$`3`
      total   SE
holts   961 67.9

$`4`
      total   SE
holts   946 64.9

$`5`
      total   SE
holts   981 66.3

attr(,"call")
with(mydesign.imputed, svytotal(x = ~holts))

But we do not want \(m\) estimates and \(m\) standard errors. We want just one of each.

Step 3: Pool Inferences

We can pool (i.e., combine) the \(m\) inferences relatively simply. Let \(\hat\tau_j\) be the estimate from the \(j\)-th sample, and let \(\mbox{V}(\hat\tau_j)\) be its variance. Our pooled estimator of the population total is then \[ \hat\tau = \frac{1}{m}\sum_{j=1}^m \hat\tau_j, \] and its (approximate) variance of this pooled estimator is \[ \mbox{V}(\hat\tau) = \bar{W} + \left(1 + \frac{1}{m}\right)B, \] where \(\bar{W}\) is the “within-imputation” variance, \[ \bar{W} = \frac{1}{m}\sum_{j=1}^m \mbox{V}(\hat\tau_j), \] and \(B\) is the “between-imputation” variance, \[ B = \frac{1}{m-1}\sum_{j=1}^m (\hat\tau_j - \hat\tau)^2. \] We can also write the variance of \(\hat\tau\) is based on three components, \[ \mbox{V}(\hat\tau) = \bar{W} + B + \frac{1}{m}B. \] The second and third terms are the variance due to the uncertainty about the imputed values, and the variance due to their simulation, respectively.

These results are computed very easily for us using the MIcombine function.

holts.total <- with(mydesign.imputed, svytotal(x = ~holts))
MIcombine(holts.total)
Multiple imputation results:
      with(mydesign.imputed, svytotal(x = ~holts))
      MIcombine.default(holts.total)
      results    se
holts   954.9 68.19

For more information about multiple imputation, see Flexible Imputation of Missing Data by Stef van Buuren.