Two-stage Sampling Design and Sample Selection with the R package R2BEAT

R2BEAT ("R 'to' Bethel Extended Allocation for Two-stage sampling") is an R package for the allocation of a sample. Compared with other software and packages dealing with allocation problems, its peculiarity lies in properly handling allocation problems for complex sampling designs with multi-domain and multi-purpose aims. These are common in many official and non-official statistical surveys, so R2BEAT can become an essential tool for planning a sample survey. The package implements the Tschuprow (1923) - Neyman (1934) method for the optimal allocation of units in stratified sampling, extending it to the multivariate case (according to Bethel's (1989) proposal), to multiple domains, and to complex sampling designs (Falorsi et al., 1998). The functions implemented in R2BEAT allow different workflows, depending on the information available on one or more variables of interest. The package covers all the phases, from the optimization of the sample to the selection of the Primary and Secondary Stage Units, and provides several outputs for evaluating the allocation results.


Introduction
National Statistical Institutes (NSIs) and other official statistics institutions usually stratify the target population into homogeneous groups (strata) defined by one or more variables. Survey estimates usually benefit from stratification, since the sampling error decreases. However, from a logistic point of view, a stratified sample can be geographically widespread, entailing a cost increase in the data collection process. To solve this issue and avoid sample dispersion, the two-stage stratified sampling design is often used for planning surveys, mainly the social ones carried out on households. This sampling design makes it possible to control the number of Primary Stage Units (PSUs) selected in the survey, for instance the municipalities or the enumeration areas to which the selected households (Secondary Stage Units, SSUs) belong. Controlling the number of municipalities remarkably reduces data collection costs, mainly for face-to-face interviews, and avoids the logistic problems of a geographically scattered sample. Nevertheless, allocating a two-stage sample among strata can be tricky: household surveys are usually multi-purpose, since they estimate many target variables; moreover, the estimates produced are provided for many estimation domains, such as the national level, geographical areas, municipality types, etc. In this context, the allocation of the whole sample size becomes a multivariate and multi-domain problem. It is important to point out that the total size is defined according to three types of constraints: precision of the estimates, budget and logistics, or more likely a combination of the three.
Once the whole sample size is tentatively defined, intended as the number of SSUs to select and interview, it has to be allocated among the strata in which the population of PSUs is partitioned. Different methods can be used for allocating the sampling units among the strata, according to the available information. The simplest methods are uniform and proportional allocation. If, however, the values and the variances of some survey target variables are known in each design stratum, from auxiliary sources such as registers or previous survey occasions, then an optimal allocation can be computed.
The idea behind the optimal allocation is that strata with larger sizes and larger variability recorded on the target variables need a larger sample size to provide better estimates. Several publications and packages focus on this aspect.
R2BEAT extends the methodology implemented in Istat's open-source software Mauss-R (Barcaroli et al., 2020), which stands for "Multivariate Allocation of Units in Sampling Surveys", is widely used for designing one-stage sample surveys, and is also implemented in the SamplingStrata package (Barcaroli, 2014).
Furthermore, it addresses the definition of the optimal allocation in the two-stage sampling design case. Its name stands for "R 'to' Bethel Extended Allocation for Two-stage". The package is thus a specific tool for designing, allocating and selecting samples in the most complex and challenging survey designs.
Furthermore, R2BEAT fills a gap in the range of statistical software concerning sample size allocation. Several R packages are available for allocating a stratified sample, such as surveyplanning (Breidaks et al., 2020), PracTools (Valliant et al., 2020), optimStrat (Bueno, 2020), and the already mentioned Mauss-R and SamplingStrata, but none of them can compute the optimal allocation among strata in such a complex sampling design context, considering both the multivariate and the multi-domain case.
In the following paragraph the methodological aspects underlying the package and its functions are presented in detail: the optimal allocation of the sample and its selection are illustrated. The third paragraph shows how to prepare, organize and check the input data needed by the package for allocating the whole sample size among strata and, finally, for selecting the units. A case study on a synthetic dataset is used as an example to test the package functions. Finally, the results are discussed in the concluding remarks.

Methodological aspects
Sample surveys carried out by National Statistical Institutes and by other institutions have multi-domain and multi-purpose objectives, so they have to provide accurate estimates for different parameters and different domains (i.e. geographical areas such as national, regional, and finer ones). However, surveys usually have budgetary constraints, so they must be carefully planned to provide high-quality estimates for the parameters of interest.
A seminal work in this perspective is due to Kish (1965), while a broad theoretical framework for optimizing surveys by maximizing data quality within budgetary constraints is provided by Biemer and Lyberg (2003) and Biemer (2010).
When designing a multi-purpose survey, several choices need to be made. They are usually not trivial, because identifying the best solution for every purpose (i.e. every variable of interest in each domain of interest) is challenging. Usually "just" a practical optimum, not the best solution, can be pursued. The search for the best solution, maximizing data quality within budgetary constraints, may raise conflicts in several areas (Kish, 1988). Among these areas, sample size and the relation of biases to sampling errors are considered the most important, because their influence ripples throughout the overall survey.
This view justifies the care and attention always given in the literature to optimal sample design (Cochran, 1977; Cicchitelli et al., 1992; Conti and Marella, 2012; Tillé, 2020). Gonzalez and Eltinge (2010) present an interesting overview of the approaches for defining optimal sampling strategies.
The optimization of a sample design usually concerns the estimation of a mean (or, equivalently, of a total) in a stratified sampling design with a fixed sample size. The optimization problem of a stratified sample design can be classified depending on whether the stratification is given, or the stratification itself also has to be optimized, before or at the same time as the allocation.
The R2BEAT package solves the optimization problem when the stratification is given and the optimization must be sought in the allocation of sampling units. Therefore, in the following, we focus only on this situation. For more details on optimization problems in which the stratification also has to be optimized see, e.g., Ballin and Barcaroli (2013) and references therein.

Optimal allocation
Let us consider a population U of size N (k = 1, . . ., N) partitioned into H subgroups, U_h (h = 1, . . ., H), called strata. Hence, each stratum contains N_h elements, where N_h is assumed to be known and such that Σ_{h=1,...,H} N_h = N. The strata can be defined in different ways on the basis of one or more qualitative variables known for all the units in the population.
Then, we assume, at least for the moment, to be interested in investigating the mean of just one variable y in the population U,

μ_y = (1/N) Σ_{k=1,...,N} y_k,    (1)

where y_k is the value of y observed on the k-th unit of the population U. The variable y can be quantitative or dichotomous, that is y ∈ {0, 1}. Please note that, even when y is dichotomous, expression (1) holds and μ_y is equal to the proportion of units in the population for which y = 1. Furthermore, assume we want to estimate μ_y through a probabilistic sample s of size n with the estimator

Ŷ = Ŷ_HT / N = (1/N) Σ_{k∈s} d_k y_k,    (2)

where Ŷ_HT is the Horvitz-Thompson estimator of the total (Horvitz and Thompson, 1952) and d_k is the design weight associated with the unit k in the sample, usually equal to the inverse of the first-order inclusion probability.
The sample size of a survey, n, is usually exogenous information, dictated by budget and, sometimes, by logistic constraints. Then, in practice, the problem comes down to the allocation of the n units among the H strata, such that Σ_{h=1,...,H} n_h = n. Therefore, let us define the mean of y in each stratum,

μ_hy = (1/N_h) Σ_{k=1,...,N} 1_h(k) y_k,

where 1_h(k) is the membership indicator of the unit k in the stratum h.
In the same way, expression (2) can be easily adapted for estimating μ_hy, that is Ŷ_h = Ŷ_HT,h / N_h, where s_h is the sample in the stratum h. The sampling variance estimator of Ŷ_h is given by

V̂(Ŷ_h) = (1 − n_h/N_h) Ŝ²_yh / n_h,    (5)

with Ŝ²_yh = Σ_{k∈s_h} (y_k − ȳ_h)² / (n_h − 1), where ȳ_h is the sample mean of the variable y in the stratum h.
In this perspective, the mean of y in (1) can also be written as μ_y = Σ_{h=1,...,H} (N_h/N) μ_hy. Therefore, the sampling variance estimator for Ŷ is

V̂(Ŷ) = Σ_{h=1,...,H} (N_h/N)² V̂(Ŷ_h).

When there is no information on y, the sample size allocated to each stratum, n_h, can be assigned by performing uniform or proportional allocation.
Uniform allocation assigns an equal number of sampling units to each stratum, that is n_h = n/H. More often, we want the sample size assigned to each stratum to be proportional to the size of the stratum in the population, that is n_h = n (N_h/N), where N_h/N is the weight of the stratum in the population, with Σ_{h=1,...,H} N_h/N = 1. When information on y, and in particular on its stratum variance S²_yh, is available in the population strata, a more favourable allocation can be performed. Alternatively, it is possible to consider a proxy variable highly correlated with y. In this case, Tschuprow (1923) demonstrated that the optimal allocation is obtained by setting

n_h = n N_h S_yh / Σ_{h=1,...,H} N_h S_yh.

However, this result is better known as Neyman allocation, after the namesake author who published the same result in 1934. The rationale behind the optimal allocation is that strata with more weight, and in which y has more variability, need more observations to reach better estimates. If the variance is the same in all the strata (S²_yh = S²_y for each h), the optimal allocation reduces to the proportional allocation.

Evidently, the computation of the population variance is a crucial point in the optimal allocation. A distinction between the types of variables and the sources from which they can be obtained is needed. When y is a dichotomous variable available from a population register, its population variance can be computed as

S²_yh = [N_h/(N_h − 1)] p_h (1 − p_h),    (6)

where p_h is the proportion of units with y = 1 in the population stratum. In the case of a quantitative variable, S²_yh is equal to

S²_yh = [1/(N_h − 1)] Σ_{k=1,...,N} 1_h(k) (y_k − μ_hy)².

When there is no population register, information on the variability can be obtained from a sample survey or a pilot survey previously carried out. Let us assume the y variable, or at least a proxy of it, has been collected on a sample s*. Then, (6) can be computed just by replacing p_h with

p̂_h = Σ_{k∈s*_h} w_k y_k / Σ_{k∈s*_h} w_k,    (7)

that is, the related estimate for each stratum obtained from the sample s*. In (7), w_k is the sampling weight associated with the unit k in the sample s*. Instead, when y is a quantitative variable, S²_yh can be estimated by Ŝ²_yh = Q̂²_hy − μ̂²_hy, where Q̂²_hy = Σ_{k∈s*_h} w_k y²_k / Σ_{k∈s*_h} w_k and μ̂_hy = Σ_{k∈s*_h} w_k y_k / Σ_{k∈s*_h} w_k are the quadratic mean and the arithmetic mean estimated on the sample s* in the h-th stratum, respectively.
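The allocation rules above can be made concrete with a short sketch. This is illustrative code, not R2BEAT code (the package itself is written in R), and the population sizes N_h, stratum standard deviations S_h, and register proportions used below are all hypothetical:

```python
# Illustrative sketch of the basic allocation rules for a stratified design.
def uniform_allocation(n, H):
    # n_h = n / H for every stratum
    return [n / H] * H

def proportional_allocation(n, N_h):
    # n_h = n * N_h / N
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]

def neyman_allocation(n, N_h, S_h):
    # n_h proportional to N_h * S_h (Tschuprow-Neyman optimal allocation)
    denom = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
    return [n * Nh * Sh / denom for Nh, Sh in zip(N_h, S_h)]

def dichotomous_variance(N_h, p_h):
    # S^2_yh for a 0/1 variable known from a register, expression (6)
    return N_h / (N_h - 1) * p_h * (1 - p_h)

N_h = [1000, 3000, 6000]
S_h = [10.0, 5.0, 2.0]
print(proportional_allocation(600, N_h))   # [60.0, 180.0, 360.0]
print(neyman_allocation(600, N_h, S_h))    # small but variable strata gain units
```

Note how the Neyman allocation shifts units from the large, homogeneous third stratum toward the small, highly variable first one, exactly the rationale described above.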
Sometimes, collecting data on units belonging to different strata can have different costs, because of difficulties in reaching them (e.g. when strata are altitude zones) or the need to use different data collection modes. Therefore, it is advisable to define the allocation taking into account also the unit costs and the budget constraints.
The global cost of the survey can be defined as C = c_0 + Σ_{h=1,...,H} c_h n_h, where c_0 is the fixed cost (not dependent on the sample size) and c_h is the unit cost for collecting data on one unit belonging to stratum h. Then, the optimal allocation under budget constraints is given by

n_h = (C − c_0) N_h S_yh / [√c_h Σ_{h=1,...,H} N_h S_yh √c_h].

The optimal allocation for just one y variable is of little practical use, unless the various variables under study are highly correlated, because an allocation that is optimal for one characteristic is generally far from optimal for others.
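A sketch of the budget-constrained variant just described, again in Python for illustration only: n_h is taken proportional to N_h S_h / √c_h and scaled so that the variable cost equals C − c_0 (all figures are invented):

```python
import math

def cost_constrained_allocation(C, c0, N_h, S_h, c_h):
    # n_h = (C - c0) * N_h * S_h / (sqrt(c_h) * sum_h N_h * S_h * sqrt(c_h))
    denom = sum(Nh * Sh * math.sqrt(ch) for Nh, Sh, ch in zip(N_h, S_h, c_h))
    return [(C - c0) * Nh * Sh / (math.sqrt(ch) * denom)
            for Nh, Sh, ch in zip(N_h, S_h, c_h)]

N_h = [1000, 3000]; S_h = [8.0, 4.0]
c_h = [4.0, 1.0]                      # stratum 1 is four times as expensive
n_h = cost_constrained_allocation(C=10000.0, c0=1000.0,
                                  N_h=N_h, S_h=S_h, c_h=c_h)
total_cost = sum(nh * ch for nh, ch in zip(n_h, c_h))
print(n_h, total_cost)                # variable cost sums exactly to C - c0
```

The expensive stratum receives proportionally fewer units than plain Neyman allocation would give it, while the total variable cost exhausts the available budget.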
Therefore, several works have been devoted to solving the problem when more than one variable of interest has to be measured on each sampled unit.All the contributions can be classified into two main approaches: the "average variance" and convex programming.
The methods under the "average variance" approach consist of defining a weight for each variable to consider, computing a weighted average of the stratum variances and finding the optimal allocation on the resulting "average variance". They are computationally simple, intuitive and can be solved under a fixed cost assumption. However, the choice of the weights is completely arbitrary and the optimal properties are not clear (see, e.g., Dalenius, 1953; Yates, 1960; Folks and Antle, 1965; Hartley, 1965; Kish, 1976, for more details).
The most important method in the convex programming approach is the Bethel algorithm (Bethel, 1989) which extends the Neyman allocation to the multivariate case.
In particular, when we are interested in investigating the means of more than one y variable (quantitative or dichotomous), namely y_1, . . ., y_i, . . ., y_J, the optimal allocation problem reduces, in practice, to the minimization of a convex function under a set of linear constraints:

min C = c_0 + Σ_{h=1,...,H} c_h n_h   subject to   ĈV(Ŷ_{i,h}) ≤ δ(Ŷ_{i,h}),    (8)

where C is the global cost of the survey and ĈV(Ŷ_{i,h}) is the estimate of the relative error, that is, the ratio between the square root of the estimated sampling variance of the mean estimator of the y_i variable (i = 1, . . ., J) in the stratum h, given by expression (5), and the related estimate. In this case ĈV(Ŷ_{i,h}) is called the expected error, and it must be less than or equal to the precision constraint defined by the user or by regulation, δ(Ŷ_{i,h}). Bethel (1989) demonstrates that a solution to this optimization problem exists and can be obtained through an algorithm applying the Lagrangian multipliers method. The solution is continuous, so it must be rounded to provide integer stratum sample sizes. The rounding clearly causes some deviations from the solution which, however, do not affect its optimality (Cochran, 1977). The Bethel algorithm is very similar to the Chromy algorithm (Chromy, 1987); however, it is preferable because, even if the Chromy algorithm is simpler, there is no proof that it converges when a solution exists.
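To make the constraint side of problem (8) concrete, the following sketch (hypothetical data, not package code, and a stand-in for the full Bethel machinery) computes the expected CV of a stratified mean estimator from expression (5) for a candidate allocation and checks it against the precision constraints:

```python
import math

def expected_cv(n_h, N_h, S2_h, mean):
    # CV of the stratified mean:
    # sqrt(sum_h (N_h/N)^2 * (1 - n_h/N_h) * S2_h / n_h) / mean
    N = sum(N_h)
    var = sum((Nh / N) ** 2 * (1 - nh / Nh) * S2h / nh
              for nh, Nh, S2h in zip(n_h, N_h, S2_h))
    return math.sqrt(var) / mean

N_h = [1000, 2000]
S2 = {"y1": [25.0, 16.0], "y2": [4.0, 9.0]}     # stratum variances, hypothetical
means = {"y1": 20.0, "y2": 5.0}                  # anticipated means
delta = {"y1": 0.03, "y2": 0.05}                 # precision constraints
n_h = [120, 180]                                 # candidate allocation

ok = all(expected_cv(n_h, N_h, S2[i], means[i]) <= delta[i] for i in S2)
print(ok)   # this candidate allocation satisfies every constraint
```

An optimizer such as Bethel's searches for the cheapest n_h for which this check passes for every variable and domain simultaneously.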
The same framework can also deal with the multi-domain problem.
Usually, the estimates of a survey are disseminated for the whole population and for sub-domains, for instance geographical areas, but not exclusively. Then, it is useful to define the optimal allocation taking these outcomes of the survey into account as well.
Sub-domain estimation is a long-established theory (Särndal et al., 2003). Expression (1) can be easily adapted just by introducing the sub-domain membership indicator variable 1_{k,d}, which is equal to 1 for all the units k in the domain d and 0 otherwise, that is

μ_dy = (1/N_d) Σ_{k=1,...,N} 1_{k,d} y_k,

where N_d is the population size in the domain d (d = 1, . . ., D). It is important to point out that domains must be aggregations of strata: they must not cut across strata. Then, it is sufficient to consider the domain estimates in the minimization problem (8) and use Bethel's algorithm for deriving the multivariate allocation in the multi-domain case. However, in official statistics, especially for household surveys, two-stage sampling designs are usually adopted.
Two-stage sampling is based on a double sampling procedure: one on the primary stage units (PSUs) and another on the secondary stage units (SSUs). For instance, in a household survey the PSUs are the municipalities, which are selected first. Then, in each selected municipality, a sample of households (the SSUs) is selected.
Two-stage sampling permits more complex sampling strategies and, moreover, it helps in the organization and cost reduction of data collection, because it reduces the interviewers' travel. However, this economic saving is paid for with a loss of efficiency of the estimates. In fact, each additional stage of selection usually entails an increase of the sampling variance of the mean estimator. This increase can be assessed by the design effect (deff), which measures how much the sampling variance of Ŷ_i under the adopted sampling design (des) is inflated with respect to a simple random sample (srs) of the same size:

deff(Ŷ_i) = V_des(Ŷ_i) / V_srs(Ŷ_i).

A rough approximation of the deff can be obtained when the clusters have the same sample size and the same inclusion probability (Cicchitelli et al., 1992):

deff(Ŷ_i) ≈ 1 + (b̄ − 1) ρ_i,    (10)

where b̄ is the average cluster (i.e. PSU) size in terms of final sampling units and ρ_i is the intra-class correlation within the cluster (PSU) for the variable y_i (i = 1, . . ., J).
The intra-class correlation provides a measure of data clustering in PSUs and SSUs. In general, if ρ_i is close to 1, the clustering is high and it is convenient to collect only a few units in each cluster. On the contrary, if ρ_i is close to 0, collecting units from the same cluster does not affect the efficiency of the estimates.
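A minimal sketch of approximation (10) and its practical consequence, the effective sample size n/deff; the planned sample size and PSU take below are hypothetical:

```python
def deff_approx(b, rho):
    # expression (10): deff = 1 + (b - 1) * rho
    return 1 + (b - 1) * rho

n = 5000                      # planned SSUs, hypothetical
b = 10                        # average number of SSUs interviewed per PSU
for rho in (0.0, 0.05, 0.3):
    d = deff_approx(b, rho)
    print(rho, d, n / d)      # effective sample size shrinks as rho grows
```

Even a modest ρ_i of 0.05 with ten interviews per PSU inflates the variance by 45%, which is why the deff must be fed back into the allocation.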
Also for computing ρ_i, we can distinguish whether or not a population register is available in which the y_i variables (i = 1, . . ., J), or at least their proxies, are observed. In the former case, a good approximation is given by the expression

ρ_i ≈ 1 − D_wi / D_yi,    (11)

where D_wi = Σ_ℓ Σ_{k∈ℓ} (y_ik − μ_iℓ)² and D_yi = Σ_k (y_ik − μ_i)² are the deviance within clusters and the global deviance of the y_i variable, respectively. Remember that D_yi = D_wi + D_bi, where D_bi is the deviance between clusters; therefore, 0 ≤ ρ_i ≤ 1. Instead, ρ_i can be estimated from a sample with expression (12), obtained by replacing the population quantities in (11) with their sample estimates.

Here we consider, directly, a more general expression for the estimate of the deff in terms of the intra-class correlation coefficient. This expression refers to a typical situation in household surveys, where PSUs are assigned to Self-Representing (SR) strata, that is, they are included in the sample for sure, or to Not-Self-Representing (NSR) strata, where they are selected by chance. In practice, this assignment is usually performed by comparing the measure of size of each PSU with the threshold

threshold = m Δ / f,    (13)

where m is the minimum number of SSUs to be interviewed in each selected PSU, f = n/N is the sampling fraction and Δ is the average dimension of the SSU in terms of elementary survey units.
Then, Δ must be set equal to 1 if, for the survey, the selection units coincide with the elementary units (that is, households-households or individuals-individuals), whereas it must be set equal to the average size of the households if the elementary units are individuals while the selection units are households. PSUs with a measure of size exceeding the threshold are identified as SR, while the remaining PSUs are identified as NSR.
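The threshold rule in (13) can be sketched as follows, with invented PSU measures of size; PSUs above the threshold become SR, the rest NSR:

```python
def sr_nsr_split(psu_sizes, m, n, N, delta=1.0):
    # expression (13): threshold = m * delta / f, with f = n / N
    threshold = m * delta / (n / N)
    sr = [s for s in psu_sizes if s > threshold]
    nsr = [s for s in psu_sizes if s <= threshold]
    return threshold, sr, nsr

psu_sizes = [50000, 12000, 3000, 800, 400]        # measures of size, invented
threshold, sr, nsr = sr_nsr_split(psu_sizes, m=8, n=1000, N=1_000_000)
print(threshold, sr, nsr)   # threshold 8000: the two largest PSUs are SR
```

Intuitively, a PSU larger than the threshold would be expected in the sample anyway under proportional-to-size selection, so it is taken with certainty.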
Then, the extended expression of the deff (see, among others, Rojas, 2016) is

deff(Ŷ_i) = (N_SR/N)² (n/n_SR) [1 + (b̄_SR − 1) ρ_i,SR] + (N_NSR/N)² (n/n_NSR) [1 + (b̄_NSR − 1) ρ_i,NSR],    (14)

where, for the SR and NSR strata:
• N_SR and N_NSR are the population sizes;
• n_SR and n_NSR are the sample sizes;
• ρ_i,SR and ρ_i,NSR are the intra-class correlation coefficients for the variable i (i = 1, . . ., J);
• b̄_SR and b̄_NSR are the average PSU sizes in terms of final sampling units.
Of course, if there are no SR strata, expression (14) reduces to expression (10). The design effect is equal to 1 under the srs design and increases with each additional stage of selection, because the intra-class correlation coefficient is usually positive. The intra-class correlation coefficient for NSR strata can be derived with expression (11) or (12), depending on whether population register data are available or not. It is not necessary to compute the intra-class correlation coefficient for SR strata, because just one PSU is selected and the intra-class correlation is 1 by definition.
Therefore, for determining the optimal allocation under a two-stage sampling design, both the number of PSUs and the number of SSUs must be determined.
The solution was proposed by Falorsi et al. (1998), in a paper published in Italian, and is obtained with an iterative use of the Bethel algorithm. At the first iteration the Bethel algorithm is applied and the optimal allocation for a stratified simple sampling design is obtained. This allocation is then used to update the threshold in (13) and the design effect in (14). The new design effect is computed and used, in turn, to inflate the S²_yh (or, equivalently, the Ŝ²_yh), which is used as input in the next iteration, in which the Bethel algorithm is applied again. The allocation so obtained is used again to update the threshold and the design effect, and a new allocation is found. The process is iterated until the difference between two consecutive iterations is lower than a predefined threshold.

Algorithm 1: R2BEAT two-stage allocation
First iteration:
1. the input deff is used to inflate the standard deviations of the target variables in the sampling strata;
2. the optimal allocation of SSUs in the sampling strata is obtained by applying the Bethel algorithm as if it were a one-stage sampling design;
3. the number of PSUs is determined on the basis of the minimum number of SSUs per PSU;
4. the threshold for the determination of self-representing PSUs is calculated;
5. the new deff is calculated and used to update the standard deviations of the target variables in the sampling strata;
Next iterations: while not convergence do
1. the optimal allocation of SSUs in the sampling strata is obtained by applying the Bethel algorithm;
2. the number of PSUs is determined on the basis of the minimum number of SSUs per PSU;
3. the threshold for the determination of self-representing PSUs is calculated;
4. the new deff is calculated and used to update the standard deviations of the target variables in the sampling strata;
5. the iteration stops if:
a. the difference between the sample sizes of two iterations is lower than 5 (default value), or
b. the maximum of the defts (square roots of the deffs) largest differences is lower than 0.06 (default value), or
c.
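A deliberately simplified sketch of this iterative scheme: a univariate Neyman allocation stands in for the Bethel step, and the deff update uses approximation (10). All inputs are hypothetical; R2BEAT's beat.2st implements the full multivariate, multi-domain version.

```python
import math

def neyman(n, N_h, S_h):
    d = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
    return [n * Nh * Sh / d for Nh, Sh in zip(N_h, S_h)]

def two_stage_allocation(n, N_h, S_h, rho, m, max_iter=20, tol=5.0):
    deft = [1.0] * len(N_h)          # starting sqrt design effects
    prev = None
    for _ in range(max_iter):
        inflated = [s * t for s, t in zip(S_h, deft)]     # inflate stratum sd
        alloc = neyman(n, N_h, inflated)                  # stand-in Bethel step
        n_psu = [max(1, round(nh / m)) for nh in alloc]   # PSUs per stratum
        b = [nh / p for nh, p in zip(alloc, n_psu)]       # average PSU take
        deft = [math.sqrt(1 + (bh - 1) * rho) for bh in b]
        if prev is not None and max(abs(a - p) for a, p in zip(alloc, prev)) < tol:
            break                                          # converged
        prev = alloc
    return alloc, n_psu

alloc, n_psu = two_stage_allocation(n=600, N_h=[1000, 3000, 6000],
                                    S_h=[10.0, 5.0, 2.0], rho=0.1, m=10)
print([round(a) for a in alloc], n_psu)
```

With a common ρ across strata the allocation stabilizes after very few iterations; with stratum-specific intra-class correlations the deff inflation shifts sample size between strata at each pass, as in the full algorithm.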
the number of iterations is higher than 20 (default value);
end

However, as pointed out by Waters and Chester (1987), different combinations of numbers of PSUs and SSUs yield the same variance and can satisfy the precision constraints δ(Ŷ_{i,h}). The optimal solution strongly depends on the budgetary constraints that limit the number of SSUs, and on the data collection organization, which influences the maximum number of PSUs that can be managed. All this discussion holds when the HT estimator is to be used. However, the estimator currently most applied in NSI surveys is the calibration estimator (Deville and Särndal, 1992; Särndal, 2007; Devaud and Tillé, 2019). The calibration estimator, through the use of auxiliary variables, usually provides better estimates than HT. Then, it can be useful to take into account, from the allocation phase, the impact on the estimates of an estimator different from the HT estimator. This can be done by inflating the S²_yh with the estimator effect and following the procedure explained above. An estimate of the estimator effect (effst) is given by

effst(Ŷ_i) = V_des(Ŷ_i,cal) / V_des(Ŷ_i,HT).

It measures how much the sampling variance of the applied estimator under the adopted design is inflated or deflated with respect to the sampling variance of the HT estimator under the same sampling design.

Sample selection
Once the optimal allocation is defined, the selection of sampling units must be performed.
In the case of a stratified two-stage sampling design two sampling selections need to be done: one for PSUs and one for SSUs.
In each stratum, the PSUs are split into SR and NSR according to the size threshold (13). PSUs with a measure of size exceeding the threshold are identified as SR: each of them is included in the sample for sure and constitutes an independent sub-stratum. Therefore, their inclusion probability (π_I) is always equal to 1. It can happen, however, that no PSU has a measure of size higher than the threshold.
The remaining PSUs (NSR PSUs) are ordered by their measure of size and divided into finer strata (sub-strata) whose sizes are approximately equal to the threshold multiplied by the number of PSUs to be selected in each stratum. In this way, sub-strata are composed of PSUs with sizes as homogeneous as possible.
The PSUs in each stratum can be selected in different ways. However, the selection of a fixed number of PSUs per stratum is usually carried out with Sampford's method (unequal probabilities, without replacement, fixed sample size). The inclusion probability of the generic ℓ-th NSR PSU in the sub-stratum h is then

π_{Iℓ} = m M_{hℓ} / N_h,

where N_h is the measure of size of the h-th sub-stratum, m is the number of NSR PSUs to be selected in the sub-stratum and M_{hℓ} is the measure of size of the ℓ-th PSU in the sub-stratum h.
Finally, the SSUs must be drawn within the selected PSUs. Also in this case the SSUs can be selected in different ways. In most cases, they are selected through a systematic sampling design, which shares several properties with srs. The inclusion probability for the second stage is then equal to

π_{IIk} = n_{hℓ} / M_{hℓ},

where n_{hℓ} is the number of SSUs to be selected in the ℓ-th PSU of the h-th sub-stratum. Then, the design weight for the unit k in the ℓ-th PSU of the h-th stratum is equal to the inverse of the product of the first-stage and second-stage inclusion probabilities,

d_k = 1 / (π_{Iℓ} π_{IIk}).

The design weights sum to the population size, Σ_{k∈s} d_k = N, and are almost constant in each stratum, which means that the sample is self-weighting.
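The two inclusion probabilities and the resulting design weights can be sketched for a single NSR sub-stratum; all measures of size below are invented for illustration:

```python
M = {"psu_a": 4000, "psu_b": 3500, "psu_c": 2500}   # measure of size per PSU
N_h = sum(M.values())                               # sub-stratum size: 10000
m = 2                                               # NSR PSUs drawn (Sampford)
n_l = 20                                            # SSUs selected in each PSU

pi_I = {l: m * Ml / N_h for l, Ml in M.items()}     # first-stage probabilities
pi_II = {l: n_l / Ml for l, Ml in M.items()}        # second-stage probabilities
d = {l: 1 / (pi_I[l] * pi_II[l]) for l in M}        # design weight of an SSU

print(d)   # identical weights across PSUs: the design is self-weighting
```

Because the first-stage probability is proportional to size while the second-stage probability is inversely proportional to it, the product is constant and every SSU in the sub-stratum carries the same weight N_h/(m n_ℓ).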
Algorithm 2: R2BEAT selection of PSUs
Input: a. the output of the allocation step (function beat.2st): universe of PSUs, measures of size of the PSUs, number of PSUs and SSUs to be selected in each stratum, threshold.
Output: a. universe of PSUs with stratum, sub-stratum, PSU first-order inclusion probability, PSU weight, sample flag, and number of SSUs to be selected in each PSU; b. sample of PSUs (flag sample = 1) with stratum, sub-stratum, PSU first-order inclusion probability, PSU weight, and number of SSUs to be selected in each PSU; c. statistics on the sample of PSUs at stratum level.
Creation of sub-strata and selection of PSUs:
1. in each stratum, PSUs are sorted in descending order according to their measure of size;
2. the measures of size of the PSUs are compared with the threshold;
3. PSUs with a measure of size exceeding the threshold are identified as SR, included in the sample for sure, and each constitutes an independent sub-stratum;
4. the remaining PSUs (NSR PSUs) are ordered in decreasing order of their measure of size and aggregated into finer strata (sub-strata);
5. sub-strata are created by adding PSUs (still in descending order of measure of size) until the sum of their measures of size is approximately equal to the threshold multiplied by the number of PSUs to be selected in each stratum;
6. in each sub-stratum a fixed number of PSUs per stratum is selected, usually with Sampford's method (unequal probabilities, without replacement, fixed sample size).

Structure of the package
The R2BEAT package provides functions for drawing complex sample designs using an optimal allocation, and also performs the selection of the PSUs and SSUs. To install the latest release version of R2BEAT from CRAN, type install.packages("R2BEAT") within R. The current development version can be downloaded and installed from GitHub by executing devtools::install_github("barcaroli/R2BEAT").
This section provides an introduction to the structure and functions associated with the package while the next section will present examples of its specific use.
The workflow to draw and select a complex sample using R2BEAT is: (1) prepare the input data, (2) check the input data, (3) define the design and obtain the allocation, and (4) select the final sample units.

Prepare the input data
As will be illustrated in detail in the next sub-sections, the R2BEAT package provides functions to define a one-stage stratified sample design (beat.1st) and a two-stage stratified sample design (beat.2st). The preparation of the input datasets differs depending on whether the former or the latter design is adopted.
In the case of a multivariate optimal allocation for different domains in a one-stage stratified sample design, the function beat.1st can be used. This function requires two inputs: a data frame containing survey strata information (stratif) and a data frame of expected CVs for each domain and each variable (errors). No functions to prepare these inputs are provided by the package, but it is possible to follow the example datasets stratif and errors to properly create the inputs for beat.1st.
In the case of a two-stage design, two functions are provided by the package to help in the creation of the input data for the function beat.2st. The functions are two because two different scenarios are possible, depending on the information initially available:
1. Only the sampling frame is available; no previous round of the survey has been carried out. In this scenario, a strict condition on the information content of the sampling frame must hold: the values of the survey target variables (or of proxy variables correlated with them) are available for each unit in the frame. This can be accomplished by considering the previous census, or by using administrative registers. In this scenario, the function prepareInputToAllocation1 can be used to create the input dataframes stratif, rho, deft, effst, des_file and psu_file.
2. Together with a sampling frame containing the units of the population of reference, a previous round of the sampling survey to be planned is also available. The function prepareInputToAllocation2 produces the same outputs as prepareInputToAllocation1, but it requires the design and/or calibrated objects of the previous sample survey, obtained using the ReGenesees package (Zardetto, 2015).
The function sensitivity_min_SSU allows analyzing the different results in terms of first-stage size (number of PSUs) and second-stage size (number of SSUs) obtained when varying the minimum number of SSUs to be selected in each PSU.
To check the coherence between the estimated population in the strata (stratif) and the population calculated from the PSUs dataset (des_file), the function check_input is provided. This function compares the strata sizes, giving information about the differences and replacing the estimated stratum size with the stratum population calculated from the PSUs dataset.

Defining the design and determining the allocation
As already introduced, the package performs the optimal allocation for both one-stage and two-stage stratified sampling. The first is implemented within the function beat.1st, which computes a multivariate optimal allocation for different domains in a one-stage stratified sample design. As described in section 3.1, in a one-stage stratified sample design there are only two inputs to be provided to beat.1st: the dataframes stratif and errors. Besides these two mandatory inputs, it is also possible to indicate the minimum number of sampling units to be selected in each stratum, by default equal to 2.
The function beat.2st performs the same multivariate optimal allocation for different domains, considering a stratified two-stage design. Together with the input data stratif and errors, the other mandatory inputs are:
• des_file: dataframe containing one row per stratum, with information on the total population, the values of the delta parameter (equal to the mean number of final SSUs contained in the clusters to be selected, for instance the mean number of individuals in a household), and the minimum number of SSUs to be selected in each PSU;
• psu_file: dataframe containing information on each PSU (identifier, stratum, measure of size);
• rho: dataframe contains a row per each stratum with the intra-class correlation coefficient both for self representing and non-self representing PSUs.
Is also possible to provide optional information about: • deft start: dataframe containing a row per each stratum with the starting values for the square root of the design effect in the stratum of each variable of interest.
• effst: dataframe containing a row per each stratum with the estimator effect for each variable of interest.
The functions beat.1st and beat.2st produce lists with respectively 4 and 9 items. The beat.1st output contains:

1. n: a vector with the optimal sample size for each stratum;
2. file_strata: a dataframe corresponding to the input dataframe stratif, with the n optimal sample size column added;
3. alloc: a dataframe with the optimal (ALLOC), proportional (PROP) and equal (EQUAL) sample size allocations;
4. sensitivity: a dataframe with a summary of the planned coefficients of variation (Planned CV), the expected ones under the given optimal allocation (Actual CV), and the sensitivity at 10% for each domain and each variable. Sensitivity can be a useful tool for finding the best allocation, as it provides a hint of the expected sample size variation for a 10% change in the planned CVs.

Together with the previous outputs, the function beat.2st also produces:

1. iterations: a dataframe that, for each iteration of the Bethel algorithm, provides a summary with the number of PSUs (PSU Total), distinguished between SR (PSU SR) and NSR (PSU NSR), plus the number of SSUs;
2. planned: a dataframe with the planned coefficients of variation for each variable in each domain;
3. expected: a dataframe with a summary of the expected coefficients of variation under the given optimal allocation for each target variable in each domain;
4. deft_c: a dataframe with the design effect for each variable in each domain at each iteration. Note that DEFT1_0 - DEFTn_0 are always equal to 1 if deft_start is NULL; otherwise they are equal to deft_start, while DEFT1 - DEFTn are the final design effects related to the given allocation;
5. param_alloc: a vector with a summary of all the parameters given for the allocation.
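The stratum-level allocation that beat.1st generalises can be illustrated with the univariate Tschuprow-Neyman rule, in which the optimal stratum sample size is proportional to the product of stratum size and stratum standard deviation. The following base-R sketch uses hypothetical stratum values (N_h, S_h and n are illustrative, not taken from the package):

```r
# Univariate Neyman allocation: n_h proportional to N_h * S_h,
# with the default minimum of 2 units per stratum enforced.
N_h <- c(5000, 12000, 3000)   # stratum population sizes (assumed)
S_h <- c(1.2, 0.8, 2.5)       # stratum standard deviations (assumed)
n   <- 600                    # overall sample size (assumed)

n_h <- n * (N_h * S_h) / sum(N_h * S_h)   # Neyman allocation
n_h <- pmax(round(n_h), 2)                # minimum 2 units per stratum
n_h
```

The Bethel extension implemented in the package solves the same problem simultaneously for several target variables and domains, which no closed-form rule like the one above can do.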

Sample units selection
Once the allocation for the primary and secondary sampling stage units has been defined, it is possible to use two functions for the selection of the final sampling units.
The function select_PSU allows the users to select the PSUs allocated in each stratum, using the Sampford method, as implemented by the UPsampford function of the R package sampling (Tillé and Matei, 2021).
The input of this function is the output of the beat.2st function.
The output of the function is a list containing the following items:

1. universe_PSU: a dataframe that reports the whole universe of PSUs, with the inner strata formed for the selection;
2. sample_PSU: a dataframe containing the selected PSUs, with the indication, for each of them, of how many SSUs must be selected;
3. PSU_stats: a table containing summary information on the selected PSUs.
In the last step, the selection of a sample of SSUs has to be carried out. The function select_SSU allows selecting a sample of SSUs from the population frame, based on the SSUs allocated to each selected PSU.
The input datasets are two:

1. df: the dataframe containing the final sampling units;
2. PSU_sampled: the dataframe containing the selected PSUs, corresponding to the second item of the output of the select_PSU function.
The function select_SSU returns a dataframe containing the selected subset of the df dataframe, enriched with information about the first-stage inclusion probability, the second-stage inclusion probability, the final inclusion probability (the product of the first-stage and second-stage inclusion probabilities) and the design weights.
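The composition of the final inclusion probabilities and design weights can be sketched in base R as follows (the probability values are toy numbers, not package output):

```r
# Two-stage inclusion probabilities and design weights:
# the final probability is the product of the two stages,
# and the design weight is its reciprocal.
pi_1 <- c(0.10, 0.25, 0.05)   # first-stage (PSU) inclusion probabilities (assumed)
pi_2 <- c(0.20, 0.08, 0.40)   # second-stage (SSU within PSU) probabilities (assumed)

pi_final <- pi_1 * pi_2       # final inclusion probability
w        <- 1 / pi_final      # design weight
```

In a self-weighting design such as this toy example, all final probabilities coincide, so all design weights are equal.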

Illustrative examples
To illustrate how to implement workflows making use of R2BEAT functions, we will consider two scenarios, depending on the initial setting:

1. only the sampling frame is available, and no previous rounds of the survey have been carried out;
2. together with a sampling frame containing the units of the population of reference, a previous round of the sampling survey to be planned is also available.

In both cases, we assume that the sampling frame contains information on the final sampling units, together with the indication of the PSU to which each unit belongs. In the first scenario, a stricter condition on the information content of the sampling frame must hold: values of the survey target variables (or of proxy variables correlated with them) must be available for each unit in the frame. This can be accomplished by considering a previous census, or by imputing values using predictive models. In the following paragraphs, we will show only the relevant subset of the code necessary to produce the final results.

Scenario 1 workflow
In this scenario, it is assumed that a sampling frame is available. We consider a frame (pop.RData) containing 2,258,507 units, with the following variables:

• region: the NUTS2 identifier;
• province: the NUTS3 identifier;
• municipality: identifier of the municipality, which plays the role of the PSU identifier;
• id_hh: the household identifier;
• id_ind: the individual identifier;
• stratum and stratum_label: identifiers of the initial strata (provinces);
• sex and cl_age: demographic information on individuals;

together with information related to the sample survey we want to design:

• active, inactive, unemployed: binary variables indicating the occupation status of the individual;
• income_hh: household income.
We suppose that the values of these variables have been made available by a different source (for instance, a census) or by predicting them with a model-based approach. In any case, the uncertainty related to these values should be taken into account, by correctly evaluating the anticipated variance related to the models used for the predictions when producing the strata dataset (Baillargeon and Rivest, 2011, p. 59).
In the following, however, we will not consider this issue, as we only want to illustrate how all the inputs required by the next steps can be derived automatically.

Step 1: preparation of the inputs for the optimal sample design
The function prepareInputToAllocation1 allows preparing all the inputs required by the optimal allocation step under this first scenario. This function requires the attribution of values to the following parameters:

• samp_frame
• delta (average dimension of the SSU in terms of elementary survey units)
• minimum (minimum number of SSUs to be interviewed in each selected PSU)

The choice of the values of these parameters is almost always driven by the content and structure of the sampling frame, except for minimum. To orient the choice of a suitable value for this parameter, the function sensitivity_min_SSU performs a sensitivity analysis, showing how the first- and second-stage sample sizes vary for different values:

> sens_min_SSU <- sensitivity_min_SSU(
+   samp_frame = pop,
+   id_PSU = "municipality",
+   id_SSU = "id_ind",
+   strata_var = "stratum",
+   target_vars = c("income_hh", "active", "inactive", "unemployed"),
+   deff_var = "stratum",
+   domain_var = "region",
+   minimum = 50,
+   delta = 1,
+   deff_sugg = 1.5,
+   min = 30,
+   max = 80)

This function calculates 10 different couples of values for the number of PSUs and SSUs resulting from the allocation step, starting with the value 30 assigned to the parameter minimum and ending with the value 80. The results are reported in Figure 1.
Here we report the content of the rho dataframe, which has been calculated using equation (11).

Step 2: optimization of PSUs and SSUs allocation
It is now possible to execute the optimization step of the sample design.
First of all, we define the set of precision constraints on the target variables:

   DOM  CV1  CV2  CV3  CV4
1 DOM1 0.02 0.03 0.03 0.05
2 DOM2 0.03 0.06 0.06 0.08

We interpret the values of the CVs in this way: the maximum expected coefficient of variation for the first target variable (household income) is 2% at the national level and 3% at the regional level; for active and inactive the expected maximum value of the CV is 3% at the national level and 6% at the regional level; finally, for unemployed it is 5% at the national level and 8% at the regional level.
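A dataframe with this structure can be built directly in base R; this sketch reproduces the printout above (the exact set of columns required by the errors input may include further fields, so treat it as illustrative):

```r
# Precision constraints: one row per domain level, one CV column
# per target variable, mirroring the printout in the text.
cv <- data.frame(
  DOM = c("DOM1", "DOM2"),        # national and regional domain levels
  CV1 = c(0.02, 0.03),            # income_hh
  CV2 = c(0.03, 0.06),            # active
  CV3 = c(0.03, 0.06),            # inactive
  CV4 = c(0.05, 0.08))            # unemployed
cv
```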
The optimization step is performed by executing the beat.2st function. The resulting design is characterized by 142 PSUs (of which 38 self-representing, SR, and 104 non-self-representing, NSR) and 8,320 SSUs.

Step 3: selection of PSUs and SSUs
We can now proceed to select the PSUs. A discrepancy can be noted between the number of SSUs determined by the allocation step and the one produced by the selection of the PSUs. This is because the PSU selection checks that the number of SSUs allocated in each selected PSU complies with the minimum, in our case equal to 50: if not, this minimum is assigned. This is why the total number of SSUs increases from 8,320 to 9,421.
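The mechanism behind the increase can be sketched in base R (the per-PSU allocations below are illustrative numbers, not the survey's):

```r
# Enforcing the per-PSU minimum: allocations below the minimum
# are raised to it, so the SSU total can only grow.
alloc_ssu <- c(62, 41, 55, 30, 48)     # SSUs allocated per selected PSU (assumed)
minimum   <- 50
final_ssu <- pmax(alloc_ssu, minimum)  # raise sub-minimum allocations
sum(alloc_ssu)                         # total before the adjustment
sum(final_ssu)                         # total after the adjustment
```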
The selected PSUs are contained in the sample_PSU element of the output list. With this input, we can proceed to select the sample of final units. In Figure 3 the distribution of weights is reported.
It can be seen that the distribution of weights is variable at the national, regional and provincial levels, while inside each stratum the variability is low, as desired, except for those strata in which, for some PSUs, the minimum number of SSUs (50) had to be assigned instead of the optimal allocation.

Step 4: verify compliance with precision constraints
The function eval_2stage allows verifying the compliance of the two-stage sample design with the set of precision constraints, by selecting a given number of different samples (in our case, 500) from the sampling frame, producing the estimates for each sample, and calculating over them the coefficients of variation for each target estimate.
We apply the function twice, first for the national level:

> # Domain level = national
> domain_var <- "one"
> set.seed(1234)

We recall that the precision constraints had been set equal to 2% for the first variable, 3% for the second and third, and 5% for the fourth at the national level, and respectively to 3%, 6% and 8% at the regional level. We can see that the computed CVs are all compliant.
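The simulation logic behind this check can be sketched in base R; for brevity the sketch draws simple random samples from a synthetic variable standing in for income_hh (both the population and the sample size are assumptions), whereas eval_2stage replicates the actual two-stage design:

```r
# Monte Carlo check of a CV constraint: draw many samples, estimate
# the mean each time, and compare the empirical CV of the estimator
# with the planned one.
set.seed(1234)
y <- rgamma(10000, shape = 2, scale = 1000)  # stand-in for income_hh (assumed)
n <- 800                                     # per-sample size (assumed)

est <- replicate(500, mean(sample(y, n)))    # 500 replicated estimates
cv_empirical <- sd(est) / mean(est)          # empirical CV of the estimator
cv_empirical                                 # to be compared with the planned CV
```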

Scenario 2 workflow
Together with the availability of a sampling frame containing the same information presented in the previous scenario, we also assume the availability of at least one previous round of the survey. For the sake of simplicity, we assume that the previous round sample is the same selected in scenario 1. We also assume that the values of the four target variables are the ones observed after data collection. Having set the above conditions, the main difference with scenario 1 is that, instead of choosing in a somewhat arbitrary way the values of the inputs required by the optimal allocation step, we can derive them directly from the collected survey data.

Step 1: processing and analysis of survey data
In this step, we carry out the usual phases of calibration and production of the estimates. In doing so, we make use of the R package ReGenesees.
First we describe the sample design. The configuration of the output is just the same as already seen in scenario 1 for the function prepareInputToAllocation1.
Here we report the content of the effst dataframe:

Step 3: optimization of PSUs and SSUs allocation
The optimal allocation of PSUs and SSUs is the same as the one already seen in the first scenario. We can now observe the distribution of weights in the selected sample (see Figure 5). Also in this case, their variability is lower inside the strata, where it is almost null, except in those strata in which, for some PSUs, the minimum number of SSUs (50) had to be assigned instead of the optimal allocation.

Step 5: verify compliance with precision constraints
As in the previous scenario, the final check consists in verifying the compliance of the optimized design with the precision constraints.
We therefore apply the function eval_2stage, first for the national level:

> # Domain level = national
> domain_var <- "one"

In running the function, we indicated that the optimization step was to be carried out with a target CV of 2% for the variable income_hh. As there is no way to directly indicate a desired minimum number of SSUs per PSU, we managed to obtain the desired value of 50 by indicating the values 130 and 1, respectively, for C1 and C2. As a result, the number of PSUs is 141 and the number of SSUs is 7,127.

R package samplesize4surveys
This package offers two functions to compute a grid of possible sample sizes for estimating single means (ss2s4m) or single proportions (ss2s4p) under two-stage sampling designs.
The required parameters are the following:

• N: the population size;
• mu: the estimated mean of the variable of interest;
• sigma: the estimated standard deviation of the variable of interest;
• conf: the statistical confidence;
• delta: the maximum relative margin of error that can be allowed for the estimation;
• M: the number of clusters in the population;
• to: (integer) the maximum number of final units to be selected per cluster;
• rho: the intraclass correlation coefficient.

Here is the code we used in the case of the target variable income_hh:

> load("pop.RData")
> PSU <- length(unique(pop$municipality))
> pop_strata <- as.numeric(table(pop$stratum))
> # rho: dataframe with the intra-class correlation coefficients
> # taken from the scenario 1 analysis (weighted value: 0.04875369)
> ss2s4m(N = nrow(pop),
+   mu = mean(pop$income_hh),
+   sigma = sd(pop$income_hh),
+   delta = 0.02 * 1.96,
+   M = PSU,
+   to = 50,
+   rho = sum(rho$RHO_NAR1 * pop_strata) / sum(pop_strata))

The last row of the resulting grid is:

50 3.388931 142 50 7061

We obtain a design characterized by a total sample size of 7,061, with 142 PSUs. Concerning the way we indicated the value of the parameter rho, we made use of the value of the intra-class correlation coefficient computed in scenario 1 by R2BEAT, without considering domains and strata.
As the package requires the margin of error, in order to express the 2% precision constraint (given in terms of coefficient of variation) we multiply the value of the CV by a z-value equal to 1.96, obtaining the ratio between the semi-width of the confidence interval and the estimate of the mean of the parameter.
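The conversion is a one-line computation, shown here for the 2% constraint used above:

```r
# Converting a CV constraint into the relative margin of error
# expected by the delta parameter of ss2s4m: delta = CV * z.
cv    <- 0.02        # 2% target coefficient of variation
z     <- 1.96        # normal quantile for 95% confidence
delta <- cv * z      # relative margin of error passed to ss2s4m
delta
```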
The use of the function ss2s4p, applicable for the other three variables, is practically the same.

Comparison of results
As already said, we refer to the scenario 1 setting. We consider the same precision levels for the four variables for the unique domain, set respectively equal to 2%, 3%, 3% and 5%.
We apply a further constraint for all three software packages: we want to select a minimum number of final units in each PSU, set equal to 50.
There is no problem in doing that for the package samplesize4surveys, by setting the parameter to equal to 50: the last value of the final grid is the result we want. Moreover, there is no loss in the optimality of the solution in doing so, because the sample sizes obtained for higher values of the parameter are increasingly larger.
As for PracTools, it is more complicated because, as already said, there is no direct way to set this constraint. In any case, we managed to do so by varying the value of C1 (leaving C2 equal to 1) until we found the solution with the value of n.opt nearest to 50.
A final consideration regarding the application of R2BEAT: in this setting, to be comparable with the other packages (which are univariate and mono-domain), it has been applied in a simplified way, that is, one variable at a time (univariate), with no different domains or strata in the sampling frame. By doing so, R2BEAT obviously yields different results from those seen in scenario 1. Table 1 and Figure 6 report the results obtained by the three packages. To validate the results of R2BEAT, simulations have been carried out, and the resulting CVs are always below the precision threshold.
Analyzing the table, it is evident that R2BEAT is always the best performer in terms of sample size, considering both the number of PSUs and the number of SSUs.

Concluding remarks
To conclude, we would like to focus on the main strength of R2BEAT, which is its completeness with respect to all the phases of the statistical data production process. The package deals with the design, stratification, allocation among strata and, finally, the selection of the sample units.
Furthermore, the facilities provided by R2BEAT are extremely flexible and generalizable: for instance, R2BEAT is the first R package on public repositories to provide the optimal allocation for both one-stage and two-stage sampling designs. These features make the package particularly helpful and valuable both for NSIs and for private statistical institutions, such as market researchers, universities or national government organizations.
Moreover, those who deal with statistics often have large amounts of data available, coming from registers, previous surveys or other sources: since the package requires auxiliary variables for designing and allocating the sample, it makes this auxiliary information useful and profitable during the sampling planning process.
The output provided to users has been designed to be as clear as possible and to help them carry out analyses and checks on the obtained allocations and on the sample on which the survey will be based.
Last, but not least, R2BEAT can be considered more efficient than the other available software and packages dealing with sample design: in fact, for equal errors, the allocated sample size is lower, both in terms of Primary and Secondary Stage Units (PSUs and SSUs), when all unit costs are set equal to 1 and c0 = 0, so that the global cost amounts to the sample size (C = n).

Algorithm 1: R2BEAT optimal allocation of PSUs and SSUs in sampling strata

Input:
a. precision constraints in terms of CV;
b. information on sampling strata (mean and stdev of target variables, N, ...);
c. information on previous design: deff, effst, rho;
d. information on PSUs in sampling strata (measure of size);
e. minimum number of SSUs per PSU.

Output:
a. for each stratum: number of PSUs and SSUs to be selected;
b. expected CVs for target estimates;
c. sensitivity of the solution.

Figure 1: Sensitivity analysis for the minimum parameter.
obtaining the sample.cal object. These two objects are all that is needed to obtain, in an automated way, the inputs required by the optimization step.

Step 2: preparation of the inputs for the optimal sample design

The preparation of all the inputs required by the optimization step is a straightforward operation using the prepareInputToAllocation2 function. The selection of the first- and second-stage units proceeds in exactly the same way as in scenario 1, first selecting the PSUs, and then the SSUs.

Table 1: Two-stage sample designs obtained by the different packages.