A Framework for Producing Small Area Estimates Based on Area-Level Models in R

The R package emdi facilitates the estimation of regionally disaggregated indicators using small area estimation methods and provides tools for model building, diagnostics, presenting, and exporting the results. The package version 1.1.7 includes unit-level small area models that rely on access to micro data. The area-level model by Fay and Herriot (1979) and various extensions have been added to the package since the release of version 2.0.0. These extensions include (a) area-level models with back-transformations, (b) spatial and robust extensions, (c) adjusted variance estimation methods, and (d) area-level models that account for measurement errors. Corresponding mean squared error estimators are implemented for assessing the uncertainty. User-friendly tools like a stepwise variable selection, model diagnostics, benchmarking options, high quality maps and results exportation options enable a complete analysis procedure. The functionality of the package is illustrated by examples based on synthetic data for Austrian districts.


Introduction
Small area estimation (SAE) enables better insight at smaller scales, for which it has gained importance in both academic and applied research.Among others, SAE is used for estimating socio-economic measures like income, poverty and health or indicators for agriculture (Datta et al., 1991;Tzavidis et al., 2012;Zhang et al., 2015;Pratesi, 2016).Economic or political decision-makers and official statistics practitioners especially benefit from reliable estimation of disaggregated indicators and thus SAE methods.Existing surveys were often not planned for analysis at disaggregated levels and show only small sample sizes, which leads to a low precision of the estimates.SAE methods can be employed to avoid expensive and time-consuming enlargements of the sample size of surveys.The idea is to combine data sources with model-based approaches.Existing survey data will be enriched by auxiliary information, e.g., from census or register data, to improve the accuracy of the indicator estimation on an area-or domain-level.The terms area and domain are used interchangeably and refer either to a geographic area or to any subpopulation of a population of interest, like socio-demographic groups.Among others, Pfeffermann (2013), Rao and Molina (2015), Tzavidis et al. (2018) and Jiang and Rao (2020) give comprehensive overviews of SAE methods.
The main goal of the package emdi is the simplification of estimating these regionally disaggregated indicators.The package version 1.1.7contains direct estimation based exclusively on survey data and model-based estimation using the unit-level empirical best predictor (EBP) method (Molina and Rao, 2010).The EBP approach is powerful since it enables the simultaneous estimation of various indicators.For this, it relies on unit-level information, i.e. information about each unit in each domain.Though survey data often provides unit-level information, access to census or register data at unit-level is less likely.Hence, area-level models provide a valuable alternative, with the following benefits: First, only area-level aggregates are needed to estimate the regional indicators.Second, area-level models can consider the survey design by integrating the sampling weights.Third, the computation is faster compared to the computational intensive EBP approach.
Various R packages that employ different area-level models are available on the Comprehensive R Archive Network (CRAN).The package smallarea (Nandy, 2015) offers several variance estimation methods for the standard Fay-Herriot (FH) model: maximum likelihood (ML), residual maximum likelihood (REML), and both Prasad-Rao and Fay-Herriot method-of-moment.Estimation of unknown sampling variances is also offered."The ability to estimate unit-and area-level models under heteroscedasticity is implemented by the JoSAE package (Breidenbach, 2018).Robust estimation of area-level models with spatial and/or temporal structures in the random effects is supported by package saeRobust (Warnholz, 2022).The mcmcsae package (Boonstra, 2021) also takes spatial and temporal correlation of the random effects into account, but fits unit-and area-level models via Markov Chain Monte Carlo simulation.Estimation of univariate and multivariate FH models is possible with package msae (Permatasari and Ubaidillah, 2022).The package hbsae (Boonstra, 2022) allows for the fitting of unit-and area-level models by frequentist or hierarchical Bayesian approaches.The possibility of estimating FH models and some of its extensions in a Bayesian framework is also given by the BayesSAE package (Developer, 2018).The tipsae package (De Nicolò and Gardini, 2022) provides estimation and mapping tools within a Bayesian setting for proportions that are defined The R Journal Vol.15/1, March 2023 ISSN 2073-4859 on the unit interval.The mme package (Lopez-Vizcaino et al., 2019) implements Gaussian area-level multinomial mixed-effects models in the SAE context.The saeME package (Mubarak and Ubaidillah, 2022) can fit an area-level model when the auxiliary variables are measured with error.The NSAE package (Chandra et al., 2022) can fit stationary and nonstationary FH models.One of the commonly used packages is the sae package (Molina and Marhuenda, 2015).It includes a wide range of area-level models (the standard FH model with REML, ML and FH method-of-moment model fitting and a spatial and a spatio-temporal extension of the FH model) and unit-level models (the nested error linear regression model of Battese et al. (1988) and the EBP approach).Table 1 gives an overview of selected packages and the implemented methodology.
Besides packages that include particular area-level models, the packages saeMSPE (Xiao et al., 2022) and SAEval (Fasulo, 2022) offer different analytical-and resampling-based MSE estimators and tools for diagnostics and graphical evaluation of SAE models, respectively.
The latest version of package emdi 2.1.3combines a wide range of SAE models with several tools that enable a complete analysis, and therefore adds to the space of useful packages, for the following reasons: • None of the existing packages contains such a variety of different area-level models.
• In addition to models that are already available in existing R packages, emdi includes: adjusted variance estimation methods and transformation options for the standard FH model.Adjusted variance estimation methods are of particular importance when working in a non-Bayesian framework.In a Bayesian context, the variance will always be estimated as strictly positive, so packages providing a Bayesian approach do not need adjusted variance estimation methods.
• Package emdi offers user-friendly tools that go beyond model estimation: diagnostic tools with both summary and graphical results, benchmarking options, high-quality geographical visualization of results, and export of results to Excel and OpenDocument Spreadsheet formats.
• Plus a stepwise variable selection algorithm for area-level models is included in emdi to allow the user to build a model based on information criteria.
Thus, since package version 2.0.0, version 1.1.7 has been extended by various area-level models, but stays in line with the user-friendly orientation of the existing version.
The structure of the paper can be described as follows.The section Statistical methodology introduces the statistical methods implemented in the package.The example data sets included in the package are presented in the section Data sets.The section Functionality and case studies provides an illustrative description of the functions using the example data sets.The section Estimation procedure for the standard Fay-Herriot model guides the reader from model-building to diagnostics of a standard FH model and creating maps of the results.The section Estimation of the extended area-level models follows with short descriptions of how to build the different extended area-level models.Finally, the section Conclusion and outlook summarizes our contributions and gives an outlook.

Statistical methodology
Area-level models for the estimation of indicators like means, totals or shares have been added to the package since the release of version 2.0.0.These comprise the area-level model by Fay and Herriot (1979) and several extensions of this standard model which account for issues that may come up in real data applications.To measure the precision of those models, MSE estimators have been integrated following the literature.

Standard Fay-Herriot model
Throughout the paper, a finite population U is assumed that consists of N units that are subdivided into D domains or areas of specific sizes N 1 , ..., N D .Then a random sample of size n can be drawn from U and partitioned into D areas with n 1 , ..., n D observations per domain.
The FH model links area-level direct estimators based on survey data to covariates aggregated on an area level that may stem from administrative data (e.g.register or census) or alternative data sources (e.g.satellite, social media or mobile phone data).The FH model is composed of two levels.The first level is the sampling model θDir i is an unbiased direct estimator for a population indicator of interest θ i , for instance, a mean or a ratio.e i represents independent and normally distributed sampling errors with e i ind ∼ N 0, σ 2 e i .Though the model assumes known sampling variances, in practical applications σ 2 e i are usually unknown, and have to be estimated from the unit-level sample data (Rivest and Vandal, 2003;Wang and Fuller, 2003;You and Chapman, 2006).Package emdi provides a non-parametric bootstrap for estimating the variances of the direct estimator (Alfons and Templ, 2013).To allow for complex survey designs, sampling weights (w) can be considered in the direct estimation (Horvitz and Thompson, 1952).For example, an estimator for the population mean θ i of a continuous variable of interest y for each area i is estimated by , where the index j indicates an individual with j = 1, ..., n i in the i-th area.The second FH level links the target indicator θ i linearly to area-specific covariates x i , where β is a vector of unknown fixed-effect parameters, and u i is an independent and identically normally distributed random effect with u i iid ∼ N 0, σ 2 u .The combination of the sampling and the linking model leads to a special linear mixed model The empirical best linear unbiased estimators β are computed by weighted least square theory.
The empirical best linear unbiased predictor (EBLUP) of θ i is obtained by substituting the variance parameter σ 2 u with an estimate.The resulting estimator can then be written as The EBLUP/FH estimator can be understood as a weighted average of the direct estimator θDir puts more weight on the direct estimator when the sampling variance is small and vice versa.Areas for which no direct estimation results exist, because the sample size is zero or the results may not be published, are called out-of-sample domains.For those domains, the prediction reduces to the regression-synthetic component θFH i,out = x ⊤ i β (Rao and Molina, 2015).

Estimation methods for σ 2 u
The variance of the random effects has to be estimated.Commonly used approaches are the FH method-of-moment estimator (Fay and Herriot, 1979), the ML, and the REML estimators (Rao and Molina, 2015).The likelihood methods are known to perform more efficiently than the methods of moments (Rao and Molina, 2015).The commonly used methods can produce negative variance The R Journal Vol.15/1, March 2023 ISSN 2073-4859 estimates that should be strictly positive.In the estimation methods mentioned above, negative variance estimates are set to zero ( σ2 u = max σ2 u , 0 ) resulting in zero estimates of the shrinkage factor γ i .Therefore, no weight is put on the direct estimator, ignoring its possible reliability.This poses a problem, especially when the number of areas is small.To avoid this so-called over-shrinkage problem, Li and Lahiri (2010) and Yoshimori and Lahiri (2014) proposed methods that adjust the respective likelihoods of the standard ML and REML approaches by some factor: where A denotes the adjustment factor and L(σ 2 u ) the given likelihood function.The proposed adjustment factors are: • by Li and Lahiri (2010) • by Yoshimori and Lahiri (2014): Simulation studies conducted by Yoshimori and Lahiri (2014) showed that the adjusted Yoshimori-Lahiri methods are preferable when the variance of the random effect is small relative to the sampling variance.Otherwise, the adjusted Li-Lahiri methods are recommended.Package emdi offers six different variance estimation methods: standard ML (ml) and REML (reml), and adjusted ML and REML following either Li and Lahiri (2010) (amrl, ampl) or Yoshimori and Lahiri (2014) (amrl_yl, ampl_yl).

Extended area-level models
In real data applications, problems might occur that were not theoretically expected.There may also be some violation of the assumptions of the standard FH model, e.g., normality and independence of the error terms.The following section outlines the extensions of the standard FH model that are implemented in the package emdi to address these issues.
Transformations When working with right-skewed data like income, wealth or business data, the assumptions of a linear relation between the response and the explanatory variables and normality of both error terms (u i and e i ) of the FH model may be violated.Applying a log-transformation could be a reasonable solution to meet these model assumptions (Neves et al., 2013;Kreutzmann et al., 2022).In the emdi package, the direct estimates and their variances are transformed following Neves et al. (2013) ) serves as estimate for the sampling variances (σ 2 e i ) in Equation 2. Since the logarithm is a nonlinear transformation, the final FH estimates on the original scale require a bias-corrected back-transformation (Slud and Maiti, 2006;Sugawasa and Kubokawa, 2017).The emdi package provides two options: 1.A crude method (bc_crude) that takes the properties of the log-normal distribution into account: ) stands for a MSE estimator on the transformed scale, e.g., the Prasad-Rao or Datta-Lahiri MSE (cf.following subsection).The Slud-Maiti back-transformation is derived for the ML variance estimation of the random effect and is implemented for in-sample domains in emdi.In the presence of out-of-sample domains, the crude method can be applied, which allows to use also other variance estimation methods.
The R Journal Vol.15/1, March 2023 ISSN 2073-4859 Another transformation provided by the emdi package is the arcsin transformation, which is widely used when the direct estimator of the FH model is a ratio (Casas-Cordero et al., 2016;Schmid et al., 2017).The emdi package automatically transforms the direct estimates and the sampling variances as suggested by Jiang et al. (2001): where the *arcsin denotes the arcsin transformed scale, and ñi the effective sample size: the sample size adjusted by the sampling design (Jiang et al., 2001).The FH model is estimated using Equation 2and, if necessary, the results are truncated to the interval [0, π/2] to ensure final results between 0 and 1.To obtain final estimates on the original scale, the final estimation results must be subjected to a back-transformation.Two different back-transformations are available in emdi: 2. A back-transformation with bias-correction (bc) following Sugawasa and Kubokawa (2017) and Hadam et al. (2020):

Spatial FH model
The standard FH model assumes independence of the random effects.However, when working with geographical areas, assuming correlated random effects to incorporate a certain neighbouring structure can be valuable.The emdi package contains the spatial FH model introduced by Petrucci and Salvati (2006) that considers a simultaneously autoregressive process of order one, SAR(1).Compared to the standard model, the estimation differs mainly by discarding the assumptions of independent random effects and estimating a spatial autoregressive coefficient (ρ) which takes values between −1 and 1. Greater absolute values of (ρ) indicate a stronger relationship with the neighboring areas.The random effect u i in Equation 1 is replaced by with W being the D × D row standardized proximity matrix that describes the neighbourhood structure of the areas, 0 D a vector of zeros and I D the D × D identity matrix.The random effects u of Equation 3 follow a SAR(1).When normality of the random effects is assumed, the model can be fitted by ML and REML.The application of spatial FH models should be considered when no geographic auxiliary variables are available to capture the spatial relation, or when ρ 1 is larger than 0.5 (Bertarelli et al., 2021).Even before estimating the model, the emdi package enables testing for spatial correlation by Moran's I and Geary's C statistics (Cliff and Ord, 1981;Pratesi and Salvati, 2008).While Moran's I mimics a typical correlation coefficient whose values range from −1 and 1, Geary's C takes values between 0 and 2 (0: positive, 1: no, 2: negative spatial autocorrelation).The two statistics behave inversely to each other.

Robust area-level models
In the case of influential outlying observations, the emdi package allows for robust versions of the standard and the spatial FH model.The theory is extensively studied in Warnholz (2016), wherein the robust estimation procedure for linear mixed models suggested by Sinha and Rao (2009) was extended to area-level models.The model fitting can be understood as a robustified ML version that also contains an influence function with a tuning constant k. 1.345 is recommended as an initial value for the tuning constant (Sinha and Rao, 2009).When non-symmetric outliers are expected to influence the robust estimation, a bias correction should be involved.This correction can be controlled by a multiplier constant (mult_constant).For further details, we also refer to Chambers et al. (2014) and Schmid et al. (2016).

Measurement error model
The standard FH model is based on the assumption that the covariates are measured without error (Fay and Herriot, 1979).This characteristic is typically assumed because census or register data are used as auxiliary information.However, when the covariate information stems from larger The R Journal Vol.surveys or alternative data sources, this assumption can be violated.The emdi package includes an implementation of the measurement error (ME) model developed by Ybarra and Lohr (2008).To account for the ME in the covariates x i , they modified the shrinkage factor as follows: , where the C i stands for the variance-covariance matrix of the covariates, which is a required prerequisite for the model.The modified shrinkage factor pulls more weight on the direct estimator when the variances of the covariates are large.A modified weighted least squares method and a moment estimator were used to estimate βs and the σ 2 u , respectively.Additional details are available in Ybarra and Lohr (2008).

Mean squared error estimation
To evaluate the accuracy of the EBLUP estimates, the MSE is the most common measure used in SAE (Rao and Molina, 2015).The emdi package offers a variety of MSE estimators stemming from both analytical determination and resampling strategies, like the bootstrap and jackknife methods.Table 2 gives an overview of the included MSE approaches.For each area-level model presented in the previous sections, the provided MSE types are shown.The quoted references detail extensive formulas and derivations.As an additional measure of variability of the direct and FH estimates, within various functions and methods of the emdi package, the coefficient of variation (CV) is provided: CV = MSE( θi )/ θi , where θi either stands for θDir i or θFH i .

Data sets
The emdi package version 1.   et al., 2010) serves as basis for our data sets.The lowest regional level in the eusilcP data set consists of the nine Austrian states.Based on certain population size and income criteria, households were allocated to 94 Austrian districts resulting in the synthetic population data set 'eusilcA_pop'.For the 'eusilcA_smp' data set, a sample was drawn following a stratified random sampling process using the districts as strata.To show the usage of the FH model and its extensions, area-level data is required.
The area-level survey and population data sets, 'eusilcA_smpAgg' and 'eusilcA_popAgg', are obtained by aggregation on the district level with the help of the direct function defined by the emdi package.The direct estimates in 'eusilcA_smpAgg' are the weighted mean equivalized household income Mean, the ratio of households that earn more than the national median income (MTMED) and their variances.These are based on the equivalized household income eqIncome in 'eusilcA_smp', defined as the total income of a household divided by the size of the household, with household size equalized by the modified equivalence scale of the Organisation for Economic Co-operation and Development (OECD) (Hagenaars et al., 1994).Additionally, the mean of the variable cash, its variance and the sample sizes are included in 'eusilcA_smpAgg', being required by the model extensions.The population data set 'eusilcA_popAgg' contains a variety of variables that describe different income sources of households, and a variable ratio_n that describes the ratios of the population sizes per area and the total population size.The variable Domain exists in both data sets and identifies the different districts.Both data sets have an observation for each of the 94 Austrian districts, with the sample data set 'eusilcA_smpAgg' containing eight variables and the population data set 'eusilcA_popAgg' containing fifteen.Table 3 The R Journal Vol.15/1, March 2023 ISSN 2073-4859 provides an overview of all included variables of the sample and population data sets.For the creation of the proximity matrix used in the spatial FH model and the usage of the map_plot function, a shape file is needed.A shape file 'shape_austria_dis' (.rda format, "SpatialPolygonsDataFrame") for the 94 districts of Austria is provided.This file was sourced from the SynerGIS website (Bundesamt für Eichund Vermessungswesen, 2017).The data set 'eusilcA_prox', an example proximity matrix, has also been added to the emdi package.The creation of 'eusilcA_prox' is described in the following section.

Functionality and case studies
While the theoretical background of the implemented area-level models has been introduced in the section Statistical methodology, the focus of this section lies on the functionality and the workflow of their usage in R. All of the contained area-level models can be applied by one function: fh. on the area-level model, different arguments have to be determined (see Table 6 in Appendix A).
Figure 1 demonstrates the estimation possibilities of a standard FH model (for the extended area-level models see Figure 6 in Appendix A).In line with the direct and ebp functions of package version 1.1.7,the S3 object system is used for function fh (Chambers and Hastie, 1992).All three return objects of class "emdi".The application of function direct leads to a "direct" object, and of functions ebp and fh to objects of classes "ebp" and "fh", respectively.Though all of the returned objects contain ten components, not every component is available for each estimation method.In these cases they are indicated as NULL (see Table 5).Furthermore, the model component differs for the two classes "ebp" and "fh".The components of objects of class "fh" are provided in  the components are available for every area-level model, e.g., the shrinkage factors per domain are not provided for the spatial and robust model extensions, as they do not have an intuitive interpretation in those cases.Due to the consistent structure, all functions and methods of emdi version 1.1.7can be applied to objects of class "fh".Additionally, new functions and methods are available for the area-level models.Furthermore, a variety of methods that are available in base R and used by other model fitting R packages are included in the latest package version 2.1.3for the different "emdi" objects.Two examples of the new generic functions used are coef and logLik.Figure 2 demonstrates the steps of a full data analysis procedure and the respective functions, from model building and diagnostics to presenting the results.The section Estimation procedure for the standard Fay-Herriot model demonstrates the procedure shown in Figure 2 by applying the standard FH model to the Austrian EU-SILC data described in the section Data sets.To demonstrate how the different extended area-level models are fitted with function fh, the section Estimation of the extended area-level models follows.

√ √
Table 5: The ten "emdi" object components distinguished in "direct", "ebp" and "fh".More detailed information is provided by the package documentation.

Estimation procedure for the standard Fay-Herriot model
The aim of this example is to estimate the equivalized income for the 94 Austrian districts.The package and the example data sets are loaded as follows: > library("emdi") > data("eusilcA_popAgg") > data("eusilcA_smpAgg") The

Combine input data
The function fh requires one data set (argument combined_data) that comprises the sample and population data.Thus, the data set must contain all variables of the formula object fixed, the variances of the direct estimates and, optionally, a domain identifier.For cases where the sample and population data are only available separately, a merging function combine_data is provided.The necessary arguments are the two data sets and characters specifying the domain indicator for the respective data sets.

Identify spatial structures
With the help of a proximity matrix, Moran's I and Geary's C test statistics can be computed to identify spatial structures by the spatialcor.testscommand.For the creation of the proximity matrix, the shapefile must be loaded.We load the Austrian shapefile that is provided and merge it to the sample data set by using the respective domain identifiers with the help of the merge method from the sp package (Pebesma and Bivand, 2005).Before merging, we sort the Austrian shapefile by the domains in the sample data.
> library("spdep") > rel <-poly2nb(austria_shape, + row.names= austria_shape$PB) > eusilcA_prox <-nb2mat(rel, style = "W", + zero.policy= TRUE) Thus, a row standardized proximity matrix is generated that initially had weights of one if an area shares a boundary with another area and zero if not.Function spatialcor.testsmakes use of the moran.testand geary.testfunctions with their respective default settings, from the spdep package.The input arguments are the created matrix and the direct estimates.(Casas-Cordero et al., 2016;Schmid et al., 2017).Instead, the emdi package provides the AIC, BIC, Kullback information criterion (KIC) and their bootstrap and bias corrected versions (AICc, AICb1, AICb2, KICc, KICb1, KICb2) especially developed for FH models by Marhuenda et al. (2014).These criteria are also included in the sae package, but the emdi package enables a stepwise variable selection procedure based on the chosen information criteria, comparable to the step function for lm models of package stats.The most important input arguments are an object of class "fh" and the direction of the stepwise search (both, backward, forward).In this example, the default setting backward and the KICb2 information criterion is used.In the fixed argument of the fh function, the variables employee cash (cash), cash benefits from self-employment (self_empl) and unemployment benefits (unempl_ben) are included.For a valid comparison of models based on information criteria, the model fitting method must be ml.To activate the estimation of the information criteria by Marhuenda et al. (2014), we set the number of bootstrap iterations to 50.The output shows the stepwise removal of variables until the lowest KICb2 is reached, the function call and an overview of the estimated coefficients of the final recommended model.

Estimate EBLUPs and MSEs
The standard FH model is built.In addition to the fixed part, required arguments are vardir and combined_data.We specify the domains (if the domains argument is set to NULL, the domains are numbered consecutively) and activate the estimation of the MSE and of the information criteria by Marhuenda et al. (2014).
> fh_std <-fh(fixed = Mean ~cash + self_empl, vardir = "Var_Mean", combined_data = + combined_data, domains = "Domain", method = "ml", MSE = TRUE, B = c(0,50)) The R Journal Vol.15/1, March 2023 ISSN 2073-4859 Assess the estimated model In many publications using FH models, model diagnostics are discussed only briefly, if at all.One reason for this might be the lack of an existing implementation of desired diagnostic measures in R or other statistical software.The summary method of emdi provides additional information about the data and model components, in particular the chosen estimation methods, the number of domains, the log-likelihood, the information criteria by Marhuenda et al. (2014), the adjusted R 2 of a standard linear model and the adjusted R 2 especially for FH models proposed by Lahiri and Suntornchost (2015).
Additionally, measures to validate model assumptions about the standardized realized residuals and the random effects are provided: skewness and kurtosis (skewness and kurtosis of package moments, Komsta and Novomestky, 2015) of the standardized realized residuals and the random effects and the test statistics with corresponding p value of the Shapiro-Wilks-test for normality of both error terms.
As the introduced area-level models do not assume a homoscedastic sampling distribution, the realized residuals ( êi ) are standardized by êstd i = êi /σ e i for the summary and plot methods.The summary output differs slightly for the different implemented area-level models.For example, log-likehoods and thus information criteria are not available in theory for the robust and the ME model.

> summary(fh_std)
Call: fh(fixed = Mean ~cash + self_empl, vardir = "Var_Mean", combined_data = combined_data, domains = "Domain", method = "ml", MSE = TRUE, B = c(0, 50)) > plot(fh_std) Figure 3 shows normal quantile-quantile (Q-Q) plots of the standardized realized residuals and random effects (Figure 3a) as well as plots of the kernel densities of the distribution of both error terms and, for comparison, a standard normal distribution (Figure 3b and 3c).Like in emdi version 1.1.7,the user is free to modify the interface of the plots.The label and color arguments are easy to edit.Additionally, the overall appearance of the plots are changeable by the gg_theme argument, as the plots are built with the ggplot2 package (Wickham, 2016).We refer to the package documentation The R Journal Vol.15/1, March 2023 ISSN 2073-4859 for a detailed description of how to customize the plot arguments.Figure 3 supports the results of the normality tests provided in the summary output, the distribution of the standardized random effects may be slightly skewed (Figure 3c).If one is not be satisfied with the results, applying a log-transformation could improve the distribution of the error terms.

Compare results with direct estimates
The FH results should be consistent with the direct estimates for domains with a small direct MSE and/or large sample sizes.Further, the precision of the direct estimates should be improved by using auxiliary information.The comparison of the direct and model-based (FH) estimates can be done graphically by the generic function compare_plot.For the fh method the required input argument is an object of class "fh".When the default settings of the command are used, the output consists of two plots: a scatter plot proposed by Brown et al. (2001) and a line plot.Besides the direct and FH estimates, the plot contains the fitted regression and the identity line.These two lines should not differ too much.Preferably, the model-based (FH) estimates should track the direct estimates within the line plot especially for domains with a large sample size or small MSE of the direct estimator.The points are ordered by decreasing MSE of the direct estimates.In addition, the input arguments MSE and CV can be set to TRUE leading to two extra plots, respectively.The MSE/CV estimates of the direct and model-based (FH) estimates are compared first via boxplots and second via ordered scatter plots (ordered by increasing CV of the direct estimates).Like for the plot command, a variety of customization options are offered, e.g., the label options (label), the format of the points (shape) and the style of the line (line_type).
> compare_plot(fh_std, CV = TRUE, label = "no_title") Except one high value, the fitted regression and identity line of the scatter plot (Figure 4a) are relatively close.Note that the high value corresponds to the domain Eisenstadt (Stadt) with a very small sample size of 10 and the highest MSE of the direct estimates, so the direct estimator is very uncertain.Also the direct estimates are well tracked by the model-based (FH) estimates within the line plot (Figure 4b).
The R Journal Vol.15/1, March 2023 ISSN 2073-4859 The boxplot (Figure 4c) and the ordered scatter plot (Figure 4d) show that the precision of the direct estimates could be improved by the usage of the FH model in terms of CVs.Additionally, all of the CV values are less than 20% which is a common rule of the UK Office for National Statistics in order to determine whether estimation results should be published (Miltiadou, 2020).Further on, the function compare enables the user to compute a goodness of fit diagnostic (Brown et al., 2001) and a correlation coefficient of the direct estimates and the estimates of the regression-synthetic part of the FH model (Chandra et al., 2015).Following Brown et al. (2001), the difference between the model-based estimates and the direct estimates should not be significant (null hypothesis).The Wald test statistic is specified as Correlation between synthetic part and direct estimator: 0.94 The results of the goodness of fit statistic and the correlation coefficient confirm what the scatter and the line plot already indicated.In the example the null hypothesis is not rejected and the correlation coefficient indicates a strong positive correlation (0.94) between the direct and model-based (FH) estimates.

Benchmarking for consistent estimates
The idea of benchmarking is that the aggregated FH estimates should sum up to estimates of a higher regional level (τ): where ξ i stands for the share of the population size of each area in the total population size (N i /N).In our example, the EBLUP estimates could get aggregated on a national level and then compared to or benchmarked with the Austrian mean equivalized income.The emdi package contains a benchmark function that allows the user to select three different options suggested by Datta et al. (2011).A general estimator of the three options can be written as follows: θFH, bench i Depending on the weight ϕ i , the formula leads to different benchmarking options.If ϕ i equals ξ i , all FH estimates are adjusted by the same value (raking).A ratio adjustment (ratio) is being conducted if ϕ i = ξ i / θFH i .For the last option (MSE_adj), While the first option is a relatively naive approach, the latter two conduct larger adjustments for the areas with larger FH and MSE estimates, respectively.Thus, for the benchmark function the following arguments have to be specified: an object of class "fh", a benchmark value, a vector containing the ξ i s (share) and the type of benchmarking.The output is a data frame with an extra column FH_Bench for the benchmarked EBLUP values.If the optional argument overwrite is set to TRUE, the benchmarked results are added to the "fh" object and the MSE estimates of the non benchmarked FH estimates are set to NULL.For the used example, the benchmark value is calculated by taking the mean of the variable eqIncome of the 'eusilcA_smp' data frame.The ξ i s can be found in 'eusilcA_popAgg' as ratio_n.

Extract and visualize the results
To gain an overview of the point, MSE and CV results of the direct estimates compared to the model-based (FH) results the generic function estimators (Kreutzmann et al., 2019) can be used, but differences among areas or hotspots of special interest are usually easier to detect on maps.With function map_plot, the emdi package offers a user-friendly way to produce maps since creating maps can often become a time consuming task.The input arguments mainly consist of an object of class "emdi" and a spatial polygon of a shape file.The only issue that might come up is if domain identifiers in the data do not match to the respective identifiers of the shape file.In those cases, the input argument map_tab is required, which is a data frame that contains the matching of the domain indentifiers of the population and the shape file data sets.For detailed instructions, we refer to Kreutzmann et al. (2019) and to the help page of function map_plot.
For producing maps of the 94 Austrian districts, the Austrian shape file has to be loaded.In addition to the "emdi" object, the "SpatialPolygonsDataFrame" object (map_obj) and a domain indicator (map_dom_id) have to be specified.The map_tab argument is not necessary since the identifiers match in our example.To allow for an easier comparison of the results, we adjust the scales of the maps  5d).The MSE of the direct and the FH estimates are quite high.Probably a single wealthy household raised the mean income and also the variance.Figure 5b contains some districts with MSEs larger than the customized scaling (gray areas).Without the scaling it would have been hard to identify any differences in Figure 5d.

Export the results
Some users might have an interest to store the results separately or to use them for presentations.Excel and OpenDocument Spreadsheets provide many opportunities for that.In contrast to some existing R packages, the emdi functions write.excel/write.odsdo not only export the estimation results, but also the output of summary.Usage of the functions is comprehensively described in Kreutzmann et al. (2019).

Estimation of the extended area-level models FH model with transformation
If the indicator of interest needs a transformation, either log or arcsin, in addition to the function used in the previous subsection, the arguments transformation and backtransformation must be specified.If, for example, the share of households per area that earn more than the national median income (MTMED) is the indicator of interest, an arcsin transformation can be used.The bias-corrected back-transformation bc is chosen in the example.Two more arguments are needed when using an arcsin transformation: the name of the variable describing the effective sample sizes (eff_smpsize) which needs to be contained in the combined_data frame.Because of having chosen the bias-corrected back-transformation, the only possible mse_type is boot, if the MSE estimation is activated.

Spatial FH model
If the spatial correlation tests indicated a spatial correlation of the domains, a spatial FH model for incorporating the spatial structure in the model could be used.For that, the correlation has to be The R Journal Vol.15/1, March 2023 ISSN 2073-4859 set to spatial and the example proximity matrix has to be given to the model within the corMatrix argument.The possible variance estimation methods are ml and reml.

Robust FH model
If extreme values could influence the estimation, the application of a robust model might be appropriate.Within the robust framework, package emdi allows the user to choose between a standard and a spatial model (defaults to correlation = "no").The estimation method must be reblup or reblupbc which includes a bias correction that can be modified by the argument mult_constant.Further, the tuning constant k defaults to 1.345 as proposed by Sinha and Rao (2009) and Warnholz (2016) and can be changed if desired.The functions of the package saeRobust are utilized for the robust extensions.

Measurement error model
If other data sources than register data, e.g., data from larger surveys or big data sources are used as auxiliary information, the ME model should be applied.For the estimation of the ME model, the model fitting method must be set to me and the only possible MSE estimation method is jackknife.
The most complex input argument consists of the creation of the MSE array Ci.The variability of the auxiliary variables that is taken into account by the ME model is expressed by the variance-covariance matrices per domain (Ci).For example, for three covariates a, b and c the array should look like The first row and column contain zeros, because the intercept is considered.The variances and covariances can be computed by standard approaches like, for example, the Horvitz-Thompson estimator.
For the Austrian EUSILC data example, the equalized income can also be explained by a variable of the sample data set.The code below demonstrates how the MSE array Ci is created for one covariate (variable Cash and its variance Var_Cash) and how the final ME model is built.

Conclusion and outlook
In this paper, we have presented how the emdi package version 1.1.7 has been extended with various area-level models.Along with the well-known FH model, adjusted variance estimation methods and transformation options are offered to the user.In addition, spatial, robust, and ME model extensions of the standard model allow the user to address various issues that arise in practical data applications.All of these methods can be estimated conveniently by using a single function that provides EBLUP and the respective MSE estimates to measure their precision.Especially in the section Functionality and case studies, it is clear that the package does not only contain tools for estimation of the different SAE models.Instead, it additionally provides user-friendly tools to enable a whole data analysis procedure: 1. starting with model building and estimation, moving on to 2. model assessment and diagnostics, 3. presentation of the results, and finishing with 4. exporting the results to Excel or OpenDocument Spreadsheet.
For future package versions, it is planned to expand the options in the field of area-level models.In some practical applications, the incorporation of random effects is redundant.Therefore, an area-level estimator that considers a preliminary testing for the random effects following Molina et al. (2015) The R Journal Vol.15/1, March 2023 ISSN 2073-4859 will be included.Since version 2.0.0 emdi accounts for spatial structures of the random effects.Future developments may also account for out-of-sample EBLUP and MSE estimation for the spatial model proposed by Saei and Chambers (2005) and for temporal and spatio-temporal extensions (Rao and Yu, 1994;Marhuenda et al., 2013).For the existing ME model, a bootstrap MSE estimation option may be added to the package since the Jackknife MSE estimator may produce negative MSE estimates (Marchetti et al., 2015).Furthermore, cross-validation options additional to the model assessment via information criteria and the R 2 will be investigated. Y

Figure 1 :
Figure 1: Overview of the standard FH model and adjusted variance estimation methods.

Figure 2 :
Figure 2: Estimation procedure for area-level models.
indicates only a weak positive spatial autocorrelation, the following estimation procedure does not consider the integration of a correlation structure for the random effects.The R Journal Vol.15/1, March 2023 ISSN 2073-4859 Perform model selection Besides theoretical considerations on which auxiliary variables should be part of the model, the decision for the best model should be based on information criteria like the Akaike or Bayesian information criterion (AIC, BIC).Many applications use selection techniques based on linear regression

Figure 3 :
Figure 3: Output of plot(fh_std): (a) normal quantile-quantile (Q-Q) plots of the standardized realized residuals and random effects, (b) and (c): kernel densities of the distribution of the standardized realized residuals and random effects (blue) in comparison to a standard normal distribution (black).

Figure 4 :
Figure 4: Output of compare_plot(fh_std): (a) and (b) scatter and line plots of direct and model-based point estimates, (c) and (d) boxplot and scatter plots of the CV estimates of the direct and model-based (FH) estimates.

i
and is approximately χ 2 -distributed with D degrees of freedom.When working with out-of-sample domains, those are not taken into account, because the direct estimates and their variances are missing.The input argument of function compare is an "fh" object.>compare(fh_std) Brown test Null hypothesis: EBLUP estimates do not differ significantly from the direct estimates W.value Df p.value 46.97181 94 0.9999874 The R Journal Vol.15/1, March 2023 ISSN 2073-4859

Table 1 :
Overview of selected implemented area-level models in R packages available on CRAN.

Table 2 :
Overview of the MSE estimation options of the fh function.

Table 3 :
Variables of the aggregated data sets.The Domain variables are factors, the rest of the variables are numeric.Except for the variables Domain and ratio_n, the observations of all variables of the population data set consist of the mean values per district.
area-level version of the data sets.The Austrian European Union Statistics on Income and Living Conditions (EU-SILC) synthetic 2006 data set (eusilcP) sourced from the simFrame package (Alfons

Table 4 :
Table4gives an overview of the 20 input arguments of function fh, together with a short description and any default settings.Not every argument needs a specification for every estimated model.Depending Input arguments of function fh.
The output of the example shows that all domains have survey information and the variance of σ 2 u amounts to 1371195.Further, all of the included auxiliary variables are quite significant and their explanatory power is large with an adjusted R 2 (for FH models) of around 0.95.The results of the Shapiro-Wilk-test indicate that normality is not rejected for both errors.Graphical residual diagnostics are possible by the plot method.
It is recognizable that for the first six Austrian districts the original estimates are slightly modified by the benchmarking.

Table 8 :
Installed packages for the computation of the results in this paper.