This demonstration shows how calibration can be approached as a constrained optimization problem.
First I will create some artificial data, although we could use real data if it were available.
set.seed(123)
N <- 100 # population size
n <- 5 # sample size
y <- runif(N, 2, 10) # generate target variable for the whole population
x <- y + runif(N, -1, 1) # generate auxiliary variable
mux <- mean(x) # population mean of x
taux <- sum(x) # population total of x
i <- sample(1:N, n) # indices of sampled elements
i
[1] 30 94 89 16 88
y <- y[i] # values of target variable for sampled elements
x <- x[i] # values of auxiliary variable for sampled elements
cbind(y, x)
y x
[1,] 3.176909 3.556923
[2,] 7.254065 7.494052
[3,] 9.091752 9.918129
[4,] 9.198600 8.483188
[5,] 9.144409 9.273590
Here we used simple random sampling, so the design weights are all defined as \(d_i = N/n\), which is 20 for every sampled element. Consider finding a set of survey weights \(w_i\) that are “close” to the design weights in the sense that they minimize the loss function \[ \sum_{i \in \mathcal{S}} \frac{(w_i - d_i)^2}{d_i q_i}, \] where \(1/q_i\) is a “weight” for the \(i\)-th element (not the same as a design or survey weight). We can program this in R as follows.
lf <- function(w) {
  # (weighted) squared distance between the weights w and the design weights d
  sum((w - d)^2 / (d * q))
}
We want to minimize the loss function, but subject to the constraint that the sample is calibrated. Suppose we wish to calibrate the sample with respect to \(\tau_x\) so that \[ \tau_x = \sum_{i \in \mathcal{S}} w_i x_i. \] We can write the following constraint function, which takes a value of zero if and only if the constraint is met.
cf <- function(w) {
  # zero exactly when the weighted sample total of x equals the population total
  sum(w * x) - taux
}
Now suppose we specify that \(q_i = 1/x_i\) and solve the constrained optimization problem by finding the set of weights that minimize the loss function subject to the constraint defined above.
library(Rsolnp)
w <- runif(n) # initial values of weights
d <- rep(N/n, n) # design weights
q <- 1/x # specify q_i = 1/x_i in the loss function
w <- solnp(pars = w, fun = lf, eqfun = cf, eqB = 0, LB = rep(0, n))$pars
Iter: 1 fn: 38.5642 Pars: 15.53739 15.53712 15.53717 15.53722 15.53725
Iter: 2 fn: 38.5642 Pars: 15.53723 15.53720 15.53721 15.53721 15.53721
Iter: 3 fn: 38.5642 Pars: 15.53721 15.53721 15.53721 15.53721 15.53721
solnp--> Completed in 3 iterations
w
[1] 15.53721 15.53721 15.53721 15.53721 15.53721
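Note that the loss is zero at the design weights themselves, so the minimized value reported in the solver trace (fn: 38.5642) measures how far the calibrated weights had to move away from them.
lf(d)
[1] 0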
We can verify that these weights calibrate the sample with respect to \(\tau_x\).
taux
[1] 601.6922
sum(w*x)
[1] 601.6922
We can also see that these weights reproduce the ratio estimator of the population total of \(y\).
sum(w*y)
[1] 588.3279
taux * mean(y) / mean(x)
[1] 588.3279
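In fact, for this choice of \(q_i\) the solution has a closed form: setting the derivative of the Lagrangian to zero shows that every design weight is rescaled by the same factor, giving \(w_i = d_i \tau_x / \sum_{j \in \mathcal{S}} d_j x_j\), which is why all five weights above are equal. A quick check of this closed form (a derivation sketched here, not part of the solver output):
d * taux / sum(d * x) # reproduces the solnp solution (up to solver tolerance)
[1] 15.53721 15.53721 15.53721 15.53721 15.53721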
Now suppose we set \(q_i = 1\) and define a second auxiliary variable \(z_i = 1\). Note that \(N = \sum_{i=1}^N z_i\), so if we calibrate with respect to this auxiliary variable we are simply requiring that the weights “estimate” the population size exactly.
q <- 1
z <- rep(1, n)
cf <- function(w) {
  # zero exactly when the weighted sample total of z equals the population size
  sum(w * z) - N
}
w <- solnp(pars = w, fun = lf, eqfun = cf, eqB = 0, LB = rep(0, n))$pars
Iter: 1 fn: 1.4e-13 Pars: 20.00000 20.00000 20.00000 20.00000 20.00000
Iter: 2 fn: 1.174e-28 Pars: 20.00000 20.00000 20.00000 20.00000 20.00000
solnp--> Completed in 2 iterations
w
[1] 20 20 20 20 20
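These are just the design weights. Indeed, the design weights already satisfy the new constraint, as the constraint function confirms.
cf(d)
[1] 0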
The sample was already calibrated with respect to this auxiliary variable, so the optimizer had nothing to do. But now suppose we want to calibrate with respect to both \(x_i\) and \(z_i\).
cf <- function(w) {
  # two constraints: calibrate to the population size N and to the total taux
  c(sum(w * z) - N, sum(w * x) - taux)
}
w <- solnp(pars = w, fun = lf, eqfun = cf, eqB = c(0, 0), LB = rep(0, n))$pars
Iter: 1 fn: 59.2468 Pars: 48.71575 21.72178 5.10167 14.94000 9.52080
Iter: 2 fn: 59.2468 Pars: 48.71575 21.72177 5.10167 14.94000 9.52080
Iter: 3 fn: 59.2468 Pars: 48.71575 21.72178 5.10167 14.94000 9.52080
solnp--> Completed in 3 iterations
w
[1] 48.71575 21.72178 5.10167 14.94000 9.52080
We can verify that both calibration constraints are met.
c(N,taux)
[1] 100.0000 601.6922
c(sum(w*z), sum(w*x))
[1] 100.0000 601.6922
It turns out that this is equivalent to using a regression estimator.
sum(w*y)
[1] 583.209
b <- cor(y, x) * sd(y) / sd(x) # OLS slope from regressing y on x
N*mean(y) + b * (taux - N * mean(x))
[1] 583.209
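More generally, with \(q_i = 1\) this quadratic loss with linear constraints yields the familiar closed-form GREG (generalized regression) weights, so the solver is not strictly needed. Here is a minimal sketch in base R; the objects X, totals, and lambda below are my own illustration rather than part of the demonstration above.
X <- cbind(z, x) # one column of auxiliary values per constraint
totals <- c(N, taux) # the corresponding population totals
lambda <- solve(t(X) %*% (d * X), totals - colSums(d * X)) # Lagrange multipliers
w_greg <- d * (1 + as.vector(X %*% lambda)) # should match the solnp weights, up to tolerance
w_greg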