In class we discussed formulas for computing the sample sizes for optimum allocation for a stratified random sampling design. These formulas can be derived analytically using calculus and what are called Lagrange multipliers. Another approach is to solve the optimum allocation problem numerically by computer. This has a couple of advantages. One is that it lets us avoid having to do mathematical derivations which can be difficult, time-consuming, and error-prone. Of course, we do not need to re-do the derivations for optimum allocation for the cases we discussed in class since we already have the equations, but we would if we were to (a) consider a different estimator with a different variance or (b) a different cost function. Another advantage is that the numerical approach allows us to find a solution where the sample sizes are constrained such that \(2 \le n_j \le N_j\).

Optimum allocation is a constrained optimization problem. There are several R packages available for solving such problems. Here I will demonstrate using the package Rsolnp. First we need to install and load the package.

install.packages("Rsolnp")
library(Rsolnp)

Note that you only need to install the package via install.packages once per installation of R on a given computer. This downloads the package and installs it. But the library command needs to be used each time you restart R to make its contents available (namely the solnp function we will be using below).

Functions for Computing Bounds and Cost

Next we need to program functions functions to compute the variance of an estimator, the cost of the survey (if we want to have a fixed cost), and/or the bound on the error of estimation (if we want to have a fixed bound). Recall that the variance of \(\hat\mu\) is \[ V(\hat\mu) = \frac{1}{N^2}\sum_{j=1}^L N_j^2\left(1 - \frac{n_j}{N_j}\right)\frac{\sigma_j^2}{n_j}. \] We can program a function vf to compute the variance of \(\hat\mu\).

vf <- function(nj, Nj, sj, c0, cj) {
  1 / sum(Nj)^2 * sum(Nj^2 * (1 - nj / Nj) * sj^2 / nj)
}

Here nj, Nj, and sj represent the \(n_j\), \(N_j\), and \(\sigma_j\) for the \(L\) strata, respectively. But what might be confusing is that each of these variables represent the quantities for all strata — i.e., the nj represents \(n_1, n_2, \dots, n_L\). These are what are sometimes called vectors or (one-dimensional) arrays. R supports “vectorized” calculations where we can effectively program calculations on sets of values at once. Something else that may be confusing is that I have included as arguments to the function c0 and cj which represent what we called \(c_0\) and \(c_j\) for the cost of the survey. The variance is not a function of either of these quantities, but they are included here because the function we will use for the constrained optimization requires that every function passed to it has the same arguments.

Recall that we defined the cost of a survey as \[ C = c_0 + \sum_{j=1}^L n_j c_j. \] We can program this as follows.

cf <- function(nj, Nj, sj, c0, cj) {
  c0 + sum(nj * cj)
}

Finally the bound on the error of estimation is \[ B = 2\sqrt{V(\hat\mu)} \] which we can program as follows using the vf function to compute \(V(\hat\mu)\).

bf <- function(nj, Nj, sj, c0, cj) {
  2 * sqrt(vf(nj, Nj, sj, c0, cj))
}
Now we are ready for the constrained optimization to compute optimum allocations. Here I will replicate the examples from lecture for the sword fern survey using the following data.
Stratum Region \(N_j\) \(n_j\) \(\bar{y}_j\) \(s_j\)
1 Forest 30 8 287 149.1
2 Prairie 87 5 11.3 16.8
117 13

Minimizing Cost for a Fixed Bound

First consider the problem of minimizing the cost of the survey for a fixed bound on the error of estimation of \(B\) = 20 \(g/m^2\). We can find the solution as follows using the function solnp from the Rsolnp package.

tmp <- solnp(pars = c(5,5), fun = cf, eqfun = bf, eqB = 20,
  Nj = c(30,87), sj = c(149.1,16.8), cj = c(4,1), c0 = 20,
  LB = c(2,2), UB = c(30,87))

Iter: 1 fn: 60.0301  Pars:  8.63482 5.49076
Iter: 2 fn: 70.4930  Pars:  10.84835  7.09957
Iter: 3 fn: 72.5197  Pars:  11.28601  7.37569
Iter: 4 fn: 72.5719  Pars:  11.29723  7.38298
Iter: 5 fn: 72.5719  Pars:  11.29724  7.38298
Iter: 6 fn: 72.5719  Pars:  11.29724  7.38298
solnp--> Completed in 6 iterations
tmp$par
[1] 11.297240  7.382985

This may require a bit of explanation. The first argument par are “starting values” for the two sample sizes. These need not be the actual sample sizes for the optimum allocation. They are just “rough guesses” to get the algorithm started. The argument fun is the function for what we are minimizing, which is the cost of the survey. The argument eqfun is the function for what we want to fix (an equality constraint), which is the bound on the error of estimation. The argument eqB is the value we want to fix (here the bound on the error of estimation). The arguments Nj, sj, cj, and c0 pass some necessary information to these functions. And finally LB and UB are the lower-bound and upper-bound to the sample sizes for the optimum allocation. These are not needed here since the sample sizes for the optimum allocation are both between two and the strata sizes, but this is how those constraints would be included. Notice that the solution shown using tmp$par is within rounding error of what was found in lecture.

Minimizing the Bound for a Fixed Cost

Now consider the problem of minimizing the bound on the error of estimation for a fixed total cost of \(C\) = 100 and an overhead cost of \(c_0\) = 20. This can be done as follows.

tmp <- solnp(pars = c(5,5), fun = bf, eqfun = cf, eqB = 100,
  Nj = c(30,87), sj = c(149.1,16.8), cj = c(4,1), c0 = 20,
  LB = c(2,2), UB = c(30,87))

Iter: 1 fn: 13.9135  Pars:  17.19128 11.23487
Iter: 2 fn: 13.9135  Pars:  17.19128 11.23487
solnp--> Completed in 2 iterations
tmp$par
[1] 17.19128 11.23487

Note that all that has changed is the function defining what we are minimizing (bf), the function for what we are fixing (cf), and the value we are fixing (eqB). Again, the solution is within rounding error of what we found in lecture.

Other Optimum Allocation Problems

The numerical approach is useful in that it can be applied to other optimum allocation problems. Suppose we want to minimize the cost of the survey but subject to two restrictions: (1) the bound on the error of estimation for \(\mu\) must be no larger than 20 \(g/m^2\), and the bound on the error of estimation for \(\mu_f\) (i.e., the mean for the forest stratum) must be no larger than 50 \(g/m^2\). To do this first we program a function to compute both bounds which are returned as a vector of two numbers.

bf <- function(nj, Nj, sj, c0, cj) {
  bfor <- 2 * sqrt((1 - nj[1]/Nj[1]) * sj[1]^2 / nj[1])  
  ball <- 2 * sqrt(vf(nj, Nj, sj, c0, cj))
  c(bfor, ball)
}

Next we use the ineqfun argument of solnp instead of the eqfun argument, and specify the upper bound and lower bound on the bounds on the errors of estimation using the ineqLB and ineqUB arguments.

tmp <- solnp(pars = c(5,5), fun = cf, ineqfun = bf, 
  ineqLB = c(0, 0), ineqUB = c(50,20),
  Nj = c(30,87), sj = c(149.1,16.8), cj = c(4,1), c0 = 20,
  LB = c(2,2), UB = c(30,87))

Iter: 1 fn: 65.2111  Pars:  9.91073 5.56818
Iter: 2 fn: 81.8148  Pars:  14.62208  3.32651
Iter: 3 fn: 87.0642  Pars:  16.18818  2.31150
Iter: 4 fn: 87.6445  Pars:  16.27386  2.54904
Iter: 5 fn: 87.6670  Pars:  16.27406  2.57071
Iter: 6 fn: 87.6671  Pars:  16.27406  2.57087
Iter: 7 fn: 87.6671  Pars:  16.27406  2.57087
solnp--> Completed in 7 iterations
tmp$par
[1] 16.274063  2.570867

Note that if we remove the upper bound on the bound on the error of estimation of \(\mu_f\) by setting it to \(\infty\) we get what we had before.

tmp <- solnp(pars = c(5,5), fun = cf, ineqfun = bf, 
  ineqLB = c(0, 0), ineqUB = c(Inf,20),
  Nj = c(30,87), sj = c(149.1,16.8), cj = c(4,1), c0 = 20,
  LB = c(2,2), UB = c(30,87))

Iter: 1 fn: 71.0980  Pars:   6.57210 24.80965
Iter: 2 fn: 71.6839  Pars:   9.29049 14.52193
Iter: 3 fn: 72.9552  Pars:  10.53017 10.83449
Iter: 4 fn: 71.9678  Pars:  11.04156  7.80152
Iter: 5 fn: 72.5300  Pars:  11.28917  7.37329
Iter: 6 fn: 72.5719  Pars:  11.29727  7.38286
Iter: 7 fn: 72.5719  Pars:  11.29724  7.38297
Iter: 8 fn: 72.5719  Pars:  11.29724  7.38298
solnp--> Completed in 8 iterations
tmp$par
[1] 11.297240  7.382984

Here is another example. Suppose we also planned to make inferences about a second target variable from the same survey (i.e., from the same sampled quadrats). Perhaps it is the biomass of another species of plant. And assume that we guess that the stratum standard deviations for this variable are 20 and 40 in the forest and prairie strata, respectively. Finally assume that we want to design a survey that will estimate these parameters with bounds on the error of estimation no larger than 20 for the mean biomass for sword fern, and 10 for the mean biomass for the other plant species. We can write a function to compute both of these bounds as follows.

bf <- function(nj, Nj, sj1, sj2, c0, cj) {
  b1 <- 2 * sqrt(vf(nj, Nj, sj1, c0, cj))  
  b2 <- 2 * sqrt(vf(nj, Nj, sj2, c0, cj))
  c(b1, b2)
}

Note that we needed to add an extra argument since we have two sets of standard deviations. The solnp function requires that all functions have the same set of arguments (even if they are not used by a given function) so we need to rewrite the cost function even though cost does not depend on the standard deviations.

cf <- function(nj, Nj, sj1, sj2, c0, cj) {
  c0 + sum(nj * cj)
}

For this example we will assume the same costs, although we might imagine it could be higher given that we are observing the biomass of an extra species. Here is how we can numerically find the optimum allocation for this problem.

tmp <- solnp(pars = c(5,5), fun = cf, ineqfun = bf, 
  ineqLB = c(0, 0), ineqUB = c(20,10),
  Nj = c(30,87), sj1 = c(149.1,16.8), sj2 = c(20,40), cj = c(4,1), c0 = 20,
  LB = c(2,2), UB = c(30,87))

Iter: 1 fn: 63.1206  Pars:   8.05885 10.88514
Iter: 2 fn: 78.1108  Pars:   9.77870 18.99605
Iter: 3 fn: 85.2743  Pars:  10.08816 24.92167
Iter: 4 fn: 86.8187  Pars:  10.10637 26.39319
Iter: 5 fn: 86.8804  Pars:  10.10684 26.45306
Iter: 6 fn: 86.8805  Pars:  10.10684 26.45315
Iter: 7 fn: 86.8805  Pars:  10.10684 26.45315
solnp--> Completed in 7 iterations
tmp$par
[1] 10.10684 26.45315

Note that if the costs were different for the different species we could incorporate that into cost function. We might use something like this.

cf <- function(nj, Nj, sj1, sj2, c0, cj1, cj2) {
  c0 + sum(nj * cj1) + sum(nj * cj2) 
}

This assumes a cost function of the form \[ C = c_0 + \sum_{j=1}^L n_jc_{j1} + \sum_{j = 1}^L n_jc_{j2}, \] where \(c_{j1}\) and \(c_{j2}\) is the cost per quadrat for the sword fern and the other plant species, respectively.