Multiple times you have seen me estimate the variance of an estimator for a specific design via simulation. This involves two steps:
Simulate \(R\) samples from the population using the design.
Compute the estimate \(\hat\theta_r\) from each of the \(R\) samples.
We can then approximate the variance as \[ \mbox{V}(\hat\theta) = \frac{1}{R-1}\sum_{r=1}^R(\hat\theta_r - \bar\theta)^2, \] where \(\hat\theta_r\) is the estimate computed from the \(r\)-th sample, and \[ \bar\theta = \frac{1}{R}\sum_{r=1}^R \hat\theta_r. \] But, of course, this is only applicable if we have the entire population.
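These two steps and the variance formula can be sketched in R with a toy population (the population values, sample size, and number of replications below are made up for illustration):

```r
# Simulate the sampling distribution of the sample mean under simple
# random sampling from a synthetic population.
set.seed(42)
N <- 10000                                   # population size
y <- rgamma(N, shape = 2, scale = 25000)     # synthetic population values
n <- 100                                     # sample size
R <- 1000                                    # number of simulated samples

# Step 1 and 2: draw R samples and compute the estimate from each.
theta_hat <- replicate(R, mean(sample(y, n)))

# Approximate the variance of the estimator.
v_hat <- sum((theta_hat - mean(theta_hat))^2) / (R - 1)
```

Note that the last line is exactly what `var(theta_hat)` computes.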
Suppose that rather than conducting a survey to obtain a sample of \(n\) elements, we conduct \(R\) surveys using the same design that each obtain a sample of \(n/R\) elements. Then let \[ \tilde\theta = \frac{1}{R}\sum_{r=1}^R \hat\theta_r, \] be our estimator of \(\theta\), where \(\hat\theta_r\) is the estimate from the \(r\)-th survey. The variance of \(\tilde\theta\) can be estimated as \[ \hat{\mbox{V}}(\tilde\theta) = \frac{1}{R(R-1)}\sum_{r=1}^R(\hat\theta_r - \tilde\theta)^2. \] A variation on this approach is to obtain one sample using some design, but then attempt to divide the sample into \(R\) sub-samples and apply the above.
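A minimal sketch of this random-groups idea, using a single made-up sample split into \(R\) equally-sized sub-samples:

```r
# Random groups: split one sample of n observations into R sub-samples
# and use the spread of the sub-sample estimates to estimate the
# variance of their average.
set.seed(42)
n <- 1000
R <- 10
y_sample <- rnorm(n, mean = 50, sd = 10)   # pretend survey responses
group <- sample(rep(1:R, each = n / R))    # random assignment to groups

theta_r <- tapply(y_sample, group, mean)   # estimate from each sub-sample
theta_tilde <- mean(theta_r)               # combined estimator
v_tilde <- sum((theta_r - theta_tilde)^2) / (R * (R - 1))
```

Because the groups are equally sized, \(\tilde\theta\) here equals the mean of the full sample; the payoff is the variance estimate.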
Resampling methods use some method of sampling from the sample to approximate the variance (or other properties) of the sampling distribution of an estimator.
There are several resampling methods that have been proposed.
Bootstrap.
Jackknife.
Balanced repeated replication.
Successive-difference replication.
Rather than listing \(R\) samples, we define the samples in terms of \(R\) sets of survey weights called replication weights.
Estimate a parameter of interest \(\theta\) with an estimator \(\hat\theta\) using the survey weights.
Estimate the parameter using each of \(R\) sets of replication weights. Call these estimates \(\hat\theta_1, \hat\theta_2, \dots, \hat\theta_R\).
Approximate the variance of the estimator \(\hat\theta\) as \[ \hat{\mbox{V}}(\hat\theta) = a \sum_{r = 1}^R b_r (\hat\theta_r - \hat\theta)^2, \] where \(a\) and \(b_r\) are scale factors that depend on the type of replication weights used.
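Once the replicate estimates are in hand, the last step is a one-line computation. A sketch in R, where `rep_var` is a hypothetical helper and the factor \(a = 4/80\) in the comment is the one used for ACS successive-difference replication with 80 replicates:

```r
# Generic replication variance estimator:
#   v = a * sum_r b_r * (theta_r - theta_hat)^2
rep_var <- function(theta_hat, theta_r, a, b = rep(1, length(theta_r))) {
  a * sum(b * (theta_r - theta_hat)^2)
}

# For ACS successive-difference replication with R = 80 replicates,
# a = 4/80 and b_r = 1:
# rep_var(theta_hat, theta_r, a = 4/80)
```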
The above steps could be done using something as simple as a spreadsheet, but specialized software will streamline the calculation.
Load R packages.
library(tidyverse)
library(tidycensus)
library(tigris)
library(srvyr)
Enter API key (get your own).
census_api_key("key goes here")
Download microdata.
pnw_microdata <- get_pums(variables = c("HISPEED","FINCP","PUMA"),
survey = "acs5", year = 2023, state = c("ID","WA","OR"),
rep_weights = "housing", recode = TRUE) |> rename_with(tolower)
Here we can sample a few households to see some of the variables of
interest as well as the survey weights (wgtp) and the first
few (out of 80) replication weights (wgtp1 through
wgtp80).
set.seed(123)
pnw_microdata |> slice_sample(n = 10) |>
select(state_label, hispeed_label, puma, wgtp, wgtp1:wgtp5)
# A tibble: 10 × 9
state_label hispeed_label puma wgtp wgtp1 wgtp2 wgtp3 wgtp4 wgtp5
<ord> <ord> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Washington/WA Yes 23312 47 64 81 16 52 15
2 Idaho/ID Yes 00500 9 10 11 11 9 16
3 Washington/WA Yes 26102 15 26 14 16 5 26
4 Oregon/OR N/A (GQ/vacant/no pa… 09100 1 3 3 3 2 3
5 Washington/WA Yes 20501 12 14 19 18 4 16
6 Washington/WA Yes 23304 12 14 13 3 18 12
7 Washington/WA Yes 23306 14 13 21 17 13 12
8 Oregon/OR Yes 03905 11 11 10 18 20 4
9 Washington/WA Yes 26106 10 17 18 3 11 10
10 Washington/WA Yes 20502 14 4 30 13 3 14
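As a check on what the specialized software will do for us below, the replicate-weight variance can be computed by hand. A sketch, assuming `pnw_microdata` from above and the ACS factor \(a = 4/80\) with the deviations taken around the full-sample estimate:

```r
# Estimated total number of households with hi-speed internet, with its
# standard error computed directly from the 80 replicate weights.
yes <- pnw_microdata$hispeed_label == "Yes"
theta_hat <- sum(pnw_microdata$wgtp[yes])          # full-sample estimate

theta_r <- sapply(1:80, function(r) {
  w_r <- pnw_microdata[[paste0("wgtp", r)]]        # r-th replicate weights
  sum(w_r[yes])                                    # r-th replicate estimate
})

se <- sqrt((4 / 80) * sum((theta_r - theta_hat)^2))
```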
Create survey design object by specifying the survey weights and replication weights.
pnw_design <- pnw_microdata |>
as_survey_rep(weights = wgtp, repweights = wgtp1:wgtp80,
type = "ACS", mse = TRUE)
Domain means of household income by state and by whether or not the household has hi-speed internet.
pnw_design |> group_by(state_label, hispeed_label) |>
filter(hispeed_label %in% c("Yes","No")) |>
summarise(meanfinc = survey_mean(fincp))
# A tibble: 6 × 4
# Groups: state_label [3]
state_label hispeed_label meanfinc meanfinc_se
<ord> <ord> <dbl> <dbl>
1 Idaho/ID Yes 87260. 1026.
2 Idaho/ID No 66953. 1363.
3 Oregon/OR Yes 87049. 597.
4 Oregon/OR No 56730. 1491.
5 Washington/WA Yes 109212. 535.
6 Washington/WA No 65756. 1031.
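If confidence intervals are wanted instead of standard errors, `survey_mean()` can return them via its `vartype` argument. A sketch, assuming `pnw_design` from above:

```r
# Same domain means, but with 95% confidence intervals rather than SEs.
pnw_design |> group_by(state_label, hispeed_label) |>
  filter(hispeed_label %in% c("Yes", "No")) |>
  summarise(meanfinc = survey_mean(fincp, vartype = "ci"))
```

This adds `meanfinc_low` and `meanfinc_upp` columns in place of `meanfinc_se`.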
Here is how we can get domain estimates for the proportion of households with hi-speed internet for each PUMA and make a geographic plot. First I will obtain the domain estimates.
pnw_hispeed <- pnw_design |> group_by(state, puma) |>
  summarise(prop_hispeed = survey_mean(hispeed_label == "Yes"))
Next I need to download geometry data for the plot.
fips <- as.numeric(unique(pnw_microdata$state))
geom <- vector(mode = "list", length = length(fips))
for (i in seq_along(geom)) {
  geom[[i]] <- pumas(state = fips[i], year = 2023)
}
geom <- geom |> bind_rows() |> rename_with(tolower)
Then I merge the domain estimates with the geometry data.
d <- full_join(geom, pnw_hispeed,
  by = c("statefp20" = "state", "pumace20" = "puma"))
And then, finally, the plot.
ggplot(d, aes(fill = prop_hispeed)) + geom_sf() + theme_void() +
  labs(fill = "Proportion")