msae: An R Package of Multivariate Fay-Herriot Models for Small Area Estimation

The paper introduces an R Package of multivariate Fay-Herriot models for small area estimation named msae. This package implements four types of Fay-Herriot models, including univariate Fay-Herriot model (model 0), multivariate Fay-Herriot model (model 1), autoregressive multivariate Fay-Herriot model (model 2), and heteroskedastic autoregressive multivariate Fay-Herriot model (model 3). It also contains some datasets generated based on multivariate Fay-Herriot models. We describe and implement functions through various practical examples. Multivariate Fay-Herriot models produce a more efficient parameter estimation than direct estimation and univariate model.

Novia Permatasari (Politeknik Statistika STIS) , Azka Ubaidillah (Politeknik Statistika STIS)
2021-11-15

1 Introduction

Survey sampling is a data collection method by observing several units of observation to obtain information from the entire population. Compared to other data collection methods, survey sampling has advantages in cost, time, and human resources. Survey sampling is designed for a certain size of the domain, commonly a large area. However, data demand for small areas is increasing and has become high issue (Ghosh and Rao 1994). The inadequate sample size causes a large standard error of parameter estimates. This problem is overcome by indirect estimation, namely Small Area Estimation (SAE).

Rao and Molina (2015) said that SAE increases the effectiveness of sample size using the strength of neighboring areas and information on other variables that are related to the variable of interest. There are some estimation methods in SAE, namely Best Linear Unbiased Predictors (BLUP), Empirical Best Linear Unbiased Predictors (EBLUP), Hierarchical Bayes (HB), and Empirical Bayes (EB). The most common SAE estimator is the EBLUP (Krieg et al. 2015). EBLUP has advantages over EB and HB methods. It is a development of the BLUP method that minimizes the MSE among other unbiased linear estimators (Ghosh and Rao 1994). Area level of EBLUP application for the continuous response variable, called Fay-Herriot model, was firstly employed for estimating log per-capita income (PCI) in small places in the US (Rao and Molina 2015).

The Fay-Herriot model has extended into a multivariate Fay-Herriot model, which is a model with several correlated response variables. Datta et al. (1991) firstly applied the multivariate model to estimate the median income of four-person families in the US states. Benavent and Morales (2016) developed multivariate Fay-Herriot models with the EBLUP method and introduced four estimation models based on the estimated variance matrix structure. Ubaidillah et al. (2019) also implemented the multivariate Fay-Herriot model and indicated that the multivariate Fay-Herriot model produces a more efficient parameter estimation than the univariate model.

On the Comprehensive R Archive Network (CRAN), there are several packages implementing small area estimation. Some of them are included in the Small Area Estimation subsection of The CRAN Task View: Official Statistics & Survey Methodology (Templ 2014), including sae (Molina and Marhuenda 2018), rsae (Schoch 2014), nlme (Pinheiro et al. 2020), hbsae (Boonstra 2012), JoSAE (Breidenbach 2018), and BayesSAE (Chengchun Shi Developer 2018). Other popular SAE packages not included in that subsection are mme (Lopez-Vizcaino et al. 2019) and saery (Lefler et al. 2014).

In this paper, we introduce our R package of multivariate Fay-Herriot models for small area estimation, named msae. This package and its details are available on CRAN at http://CRAN.R-project.org/package=msae. Functions in this package implement four Fay-Herriot Models, namely Model 0, Model 1, Model 2, and Model 3, as proposed by Benavent and Morales (2016).

The paper is structured as follows. First, we explain multivariate Fay-Herriot models in Section 2. Then, we describe msae package and illustrate the use of this package for SAE estimation by employing simulation studies and applying it to a real dataset in the next Sections 3 and 4. Finally, we provide a conclusion in Section 8.

2 Multivariate Fay-Herriot model

The multivariate Fay-Herriot model is an extension of the Fay-Herriot model, which utilizes some correlated responses. Fay-Herriot is a combination of two model components. The first component is called the sampling model, and the second component is called the linking model.

Suppose we want to estimate characteristics of R variables in D areas, \(\mu_{d} = (\mu_{1d}, \dotsc ,\mu_{Rd})^T\), with \(d = 1, \dotsc ,D\). Let \(y_d= (y_{1d}, \dotsc ,y_{Rd})^T\), be a direct estimator of \(\mu_d\). The first component, i.e., sampling model is \[y_d=\mu_d+e_d, \quad e_d \sim N(0,V_{e_d}), \quad d=1,\dotsc,D\] , where \(e_d\) is sampling error with a covariance matrix, \(V_{e_d}\) , that is assumed to be known. In the second component, we assume that \(\mu_d\) is linearly related with \(p_d\) area-specific auxiliary variables \(X_d= (X_1,\dotsc,X_{p_d})^T\) as follows: \[\mu_d=X_d \beta_d + u_d, \quad u_d \sim N(0,V_{u_d}), \quad d=1,\dotsc,D\] , where \(u_d\) is area random effects, and \(\beta_d\) is a vector of regression coefficient corresponding with \(X_d\). This second component is called the linking model. The combination of the two components forms a multivariate linear mixed model as follows: \[y_d=X_d\beta_d + u_d + e_d,\quad e_d \sim N(0,V_{e_d}),\quad d=1,\dotsc,D\] , where \(u\) and \(e\) are independent.

Benavent and Morales (2016) proposed Fay-Herriot models using four different variance matrices, Model 0, Model 1, Model 2, and Model 3. Model 0 is a univariate Fay-Herriot Model, of which the sampling error and the random effect of target variables are independent. Sampling error and random effect variance matrix are written as follows: \[V_{ud} = diag_{a \leq r \leq R}(\sigma_{ur}^2)\]

\[V_{ed} = diag_{a \leq r \leq R}(\sigma_{edr}^2)\] , \[\text{where} \: d=1,\dotsc,D\] .

Model 1, Model 2, and Model 3 are multivariate Fay-Herriot models, of which the variance matrices are no longer diagonal matrices. Model 1 is a multivariate form of Model 0, where the random effect variance of Model 1 is still a diagonal matrix. Model 2 is called the autoregressive multivariate Fay-Herriot model (AR(1)), in which the random variance matrix is written as follows : \[V_{ud} = \sigma_{ur}^2 \Omega_d (\rho)\]

\[\Omega_d (\rho) = \frac{1}{1- \rho^2} \begin{bmatrix} 1 & \rho & \cdots & \rho^{R-1}\\ \rho & 1 & & \rho^{R-2}\\ \vdots & & \ddots & \vdots \\ \rho^{R-1} & \rho^{R-2} & \cdots & 1 \end{bmatrix}\] Model 3 is called heteroskedastic autoregressive multivariate Fay-Herriot model (HAR(1)), which the element of random error is written as follows:

\[u_{dr} = \rho u_{dr-1} + a_{dr}\]

\[u_{d0} \sim N(0, \sigma_{u0}^2) \quad \text{and} \quad a_{dr} \sim N(0, \sigma_{r}^2)\] , where \(\sigma_{u0}^2=1 , a_{dr}, \text{ and } u_{d0}\) are independent. The element of random variance matrix is written as follow: \[\sigma_{drii} = \sum_{k=1}^{i} \rho^{2k} \sigma_{i-k}^2\]

\[\sigma_{drij} = \sum_{k=0}^{|i-j|} \rho^{2k+|i-j|} \sigma_{|i-j|-k}^2\] .

BLUP and EBLUP

The best linear unbiased prediction (BLUP) of \(\mu\) is \[\hat{\mu} = X \hat{\beta} + Z^T \hat{V_u} Z \Omega^{-1} (y-X \hat{\beta)}\] , where \(\hat{\beta} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y\) is the best linear unbiased estimator (BLUE) of \(\beta\) with the covariance matrix \(cov(\hat{\beta})=(X^T \Omega^{-1} X)^{-1}\). BLUP estimator depends on the random effect variance that is usually unknown. Using Restricted Maximum Likelihood (REML), we estimate and substitute random effect variance estimator for obtaining the multivariate EBLUB estimator. The estimation formula with EBLUP is written as follows: \[\hat{\mu} = X \hat{\beta} + Z^T \hat{V_u} Z \hat{\Omega}^{-1} (y-X\hat{\beta})\]

\[\hat{\Omega} = Z^T \hat{V_u} Z+ V_e\] , where \(\hat{\beta}=(X^T \hat{\Omega}^{-1} X)^{-1} X^T \hat{\Omega}^{-1} y\) is the best linear unbiased estimator (BLUE) of \(\beta\) with the covariance matrix \(cov(\hat{\beta})=(X^T \hat{\Omega}^{-1} X)^{-1}\).

MSE

Benavent and Morales (2016) also proposed MSE estimation for the multivariate Fay-Herriot models as follows: \[mse(\hat{\mu}) = g_{1i}(\hat{\sigma}_u^2) + g_{2i}(\hat{\sigma}_u^2) +g_{3i}(\hat{\sigma}_u^2)\] , where each component can be described as follows: \[g_{1i}(\hat{\sigma}_u^2) = \Gamma V_e\] , \[g_{2i}(\hat{\sigma}_u^2) = (1-\Gamma)X(X^T \Omega^{-1} X)^{-1} X^T (I - \Gamma)^T\] , \[g_{3i}(\hat{\sigma}_u^3) \approx \Sigma\Sigma cov(\hat{\sigma}_{uk}^2 , \hat{\sigma}_{ul}^2) \Gamma_{(k)} \Omega \Gamma^T_{(k)} , k,l = 1,2,\dotsc, q\] , where \(\Gamma = Z^T\hat{V_u} Z\), \(\Gamma_{(k)} = \frac{\delta \Gamma}{\delta \sigma_u^2}\), and \(cov(\hat{\sigma}_{uk}^2 , \hat{\sigma}_{ul}^2)\) is the inverse of the Fisher information matrix in the estimation of REML.

3 Overview of R package msae

The R Package msae implements multivariate Fay-Herriot models for small area estimation. Here are some functions and the descriptions at a glance:

Those functions return a list of five elements:

This package also provides datasets generated for each multivariate model. The datasets are generated based on Model 1, Model 2, and Model 3 following steps:

  1. Generate sampling error \(e\) and auxiliary variables « \(X1, X2\) ». Set the parameter as follows:

    • For sampling error \(e\) in Model 1, we set \(e_d \sim N_3(0,V_{ed})\), where \(V_{ed} = (\sigma_{dij})_{i,j=1,2,3}\) with \(\sigma_{e11} \sim InvGamma(11,1), \: \sigma_{e22} \sim InvGamma(11,2), \: \sigma_{e33} \sim InvGamma(11,3), \text{ and } \rho_e = 0.5\). We generate different MSE of direct estimates for each area.
    • For sampling error \(e\) in Model 2 and Model 3, we set \(e \sim N_3(0,V_e)\), where \(V_e = (\sigma_{ij})_{i,j=1,2,3}\) with \(\sigma_{e11} = 0.1, \: \sigma_{e22} = 0.2, \: \sigma_{e33} = 0.3 , \text{ and } \rho_e = 0.5\). It is shown that all the areas have the same MSE of direct estimates.
    • For auxiliary variables « \(X1 \: X2\) », we set \(X1 \sim N(5,0.1) \text{ and } X2 \sim N(10,0.2)\)
  2. Generate random effect \(u\), where \(u \sim N_3(0,V_u)\). For each dataset, parameter for generating random effect \(u\) are as follows:

    • For Model 1, \(\sigma_{u11} = 0.2, \: \sigma_{u22} = 0.4, \text{ and } \sigma_{u33} = 1.2\)
    • For Model 2, \(\sigma_{u} = 0.4, \text{ and } \rho_{u} = 0.8\)
    • For Model 3, \(\sigma_{u11} = 0.2, \: \sigma_{u22} = 0.4, \: \sigma_{u33} = 1.2, \text{ and } \rho_{u} = 0.8\)
  3. Set \(\beta_1=5\) and \(\beta_2=10\) to calculate direct estimation « \(Y1,\: Y2, \text{ and } Y3\) », where \(Y_i = X\beta + u_i + e_i\).

  4. Auxiliary variables « \(X1\) and \(X2\) », direct estimates « \(Y1, Y2,\) and \(Y3\) », and sampling variance-covariance « \(v1, v2, v3, v12, v13,\) and \(v23\) » are combined into a data frame called datasae1 for Model 1, datasae2 for Model 2, and datasae3 for Model 3.

4 Example 1. The multivariate Fay-Herriot model (Model 1)

datasae1, which is generated based on Model 1, contains 50 observations on the following 11 variables: 3 dependent variables « \(Y1, Y2,\) and \(Y3\) », 2 auxiliary variables « \(X1\) and \(X2\) », and 6 variance-covariance of direct estimation « \(v1, v2, v3, v12, v13,\) and \(v23\) ». The procedures for generating such datasets are provided in the previous section. The following R commands are run to obtain EBLUPs of the univariate Fay-Herriot model (Model 0) and the multivariate Fay-Herriot model (Model 1), plot the EBLUPs of the univariate and multivarate model, and plot the MSEs of EBLUPs in the univariate and multivarate model:

data('datasae1')

# model specifications
Fo <- list(f1=Y1~X1+X2,
            f2=Y2~X1+X2,
            f3=Y3~X1+X2)
vardir <- c("v1", "v2", "v3", "v12", "v13", "v23")

# EBLUP based on Model 0 and Model 1
u <- eblupUFH(Fo, vardir, data=datasae1)  # Model 0
m1 <- eblupMFH1(Fo, vardir, data=datasae1)  #Model 1

# Figure 1: EBLUPs under Model 0 and Model 1
par(mfrow=c(1,3))

plot(u$eblup$Y1, type = "o", col = "blue", pch = 15, xlab = "area", ylab = "Y1",
     cex.axis = 1.5, cex.lab = 1.5, xaxt = "n")
points(m1$eblup$Y1, type = "o", col = "red", pch = 18) 
axis(1, at=1:50, labels = 1:50)
legend("topleft",legend=c("Model 0","Model 1"),ncol=2 , col=c("blue","red"), 
    pch=c(15,18), inset=c(0.617,-0.1), xpd=TRUE)
plot(u$eblup$Y2, type = "o", col = "blue", pch = 15, xlab = "area", ylab = "Y2",
     cex.axis = 1.5, cex.lab = 1.5, xaxt = "n")
points(m1$eblup$Y2, type = "o", col = "red", pch = 18)
axis(1, at=1:50, labels = 1:50)
legend("topleft",legend=c("Model 0","Model 1"),ncol=2 , col=c("blue","red"), 
    pch=c(15,18), inset=c(0.617,-0.1), xpd=TRUE)
plot(u$eblup$Y3, type = "o", col = "blue", pch = 15, xlab = "area", ylab = "Y3",
     cex.axis = 1.5, cex.lab = 1.5, xaxt = "n")
points(m1$eblup$Y3, type = "o", col = "red", pch = 18)
axis(1, at=1:50, labels = 1:50)
legend("topleft",legend=c("Model 0","Model 1"),ncol=2 , col=c("blue","red"), 
    pch=c(15,18), inset=c(0.617,-0.1), xpd=TRUE)


# Figure 2: MSE of Model 0 and Model 1
par(mfrow=c(1,3))

plot(u$MSE$Y1, type = "o", col = "blue", pch = 15, xlab = "area", ylab = "MSE of Y1",
     cex.axis = 1.5, cex.lab = 1.5, xaxt = "n", ylim=c(0.038,0.12))
points(m1$MSE$Y1, type = "o", col = "red", pch = 18)
axis(1, at=1:50, labels = 1:50)
legend("topleft",legend=c("Model 0","Model 1"),ncol=2 , col=c("blue","red"),
       pch=c(15,18), inset=c(0.617,-0.1), xpd=TRUE)
plot(u$MSE$Y2, type = "o", col = "blue", pch = 15, xlab = "area", ylab = "MSE of Y2",
     cex.axis = 1.5, cex.lab = 1.5, xaxt = "n", ylim=c(0.05,0.28))
points(m1$MSE$Y2, type = "o", col = "red", pch = 18)
axis(1, at=1:50, labels = 1:50)
legend("topleft",legend=c("Model 0","Model 1"),ncol=2 , col=c("blue","red"),
       pch=c(15,18), inset=c(0.617,-0.1), xpd=TRUE)
plot(u$MSE$Y3, type = "o", col = "blue", pch = 15, xlab = "area", ylab = "MSE of Y3",
     cex.axis = 1.5, cex.lab = 1.5, xaxt = "n", ylim=c(0.1,0.42))
points(m1$MSE$Y3, type = "o", col = "red", pch = 18)
axis(1, at=1:50, labels = 1:50)
legend("topleft",legend=c("Model 0","Model 1"),ncol=2 , col=c("blue","red"),
       pch=c(15,18), inset=c(0.617,-0.1), xpd=TRUE)

Figure 1 illustrates the EBLUPs based on Model 0 (univariate model) and Model 1 (multivariate model) for all variables of interest. Figure 2 shows the MSEs of Model 1 compared with the MSEs of Model 0. It can be seen that the estimates of both methods show a similar pattern. Meanwhile, EBLUPs based on Model 1 has a lower MSE than Model 0. From this example, we can conclude that the multivariate Fay-Herriot model (Model 1) is more efficient than the univariate Fay-Herriot model (Model 0).

graphic without alt text
Figure 1: EBLUPs under Model 0 and Model 1.
graphic without alt text
Figure 2: MSE of EBLUPs under Model 0 and Model 1.

5 Example 2. The autoregressive multivariate Fay-Herriot model (Model 2)

datasae2, which was generated based on Model 2, contains 50 observations on the following 11 variables: 3 dependent variables « \(Y1, Y2\) and \(Y3\) », 2 auxiliary variables « \(X1\) and \(X2\) », and 6 variance-covariance of direct estimation « \(v1, v2, v3, v12, v13,\) and \(v23\) ». We compare the effectiveness of the univariate model (Model 0) and autoregressive multivariate Fay-Herriot Model (Model 2) by their MSE. We use eblupMFH2() to estimate parameters based on Model 2. Then, we plot the EBLUP and MSE of these methods to compare them.

data('datasae2')

# model specifications
Fo <- list(f1=Y1~X1+X2,
            f2=Y2~X1+X2,
            f3=Y3~X1+X2)
vardir <- c("v1", "v2", "v3", "v12", "v13", "v23")

# EBLUP based on Model 0 and Model 2
u <- eblupUFH(Fo, vardir, data=datasae2)  # Model 0
m2 <- eblupMFH2(Fo, vardir, data=datasae2)  # Model 2

The EBLUPs based on Model 0 (univariate model) and Model 1 (autoregressive multivariate model) are shown in Figure 3. As it can be seen, both methods show an almost similar result. However, EBLUPs based on Model 2 has a lower MSE than the EBLUPs based on Model 0, as shown in Figure 4. In this example, the autoregressive multivariate Fay-Herriot model (Model 2) seems to be more efficient than the univariate Fay-Herriot model (Model 0).

graphic without alt text
Figure 3: EBLUPs under Model 0 and Model 2.
graphic without alt text
Figure 4: MSE of EBLUPs under Model 0 and Model 2.

6 Example 3. Heteroskedastic autoregressive multivariate Fay-Herriot model (Model 3)

datasae3, which was generated based on Model 3, is structured the same as datasae1 and datasae2. We compare the effectiveness of the univariate model (Model 0) and heteroskedastic autoregressive multivariate Fay-Herriot Model (Model 3) by their MSE. We use eblupMFH3() to estimate parameters based on Model 3. Then, we plot the EBLUP and MSE of these methods to compare them.

data('datasae3')

# model specifications 
Fo <- list(f1=Y1~X1+X2,
            f2=Y2~X1+X2,
            f3=Y3~X1+X2)
vardir <- c("v1", "v2", "v3", "v12", "v13", "v23")

# EBLUP based on Model 0 and Model 3
u <- eblupUFH(Fo, vardir, data=datasae3)  # Model 0
m3 <- eblupMFH3(Fo, vardir, data=datasae3)  # Model 3

Figure 5 shows EBLUPs based on Model 3 compared to Model 0. The MSEs of both methods are shown in Figure 6. It can be seen that the multivariate EBLUPs will follow the pattern of univariate EBLUPs with smaller MSE values. It can be concluded that the heteroskedastic autoregressive multivariate Fay-Herriot model (Model 3) is more efficient than the univariate model (Model 0).

graphic without alt text
Figure 5: EBLUPs under Model 0 and Model 3.
graphic without alt text
Figure 6: MSE of EBLUPs under Model 0 and Model 3.

7 Real data example: Poverty index

In this section, we use incomedata dataset, which is provided in library sae. The dataset contains unit-level data on income and other related variables in Spain. We will demonstrate how to estimate the EBLUP of Foster-Greer-Thorbecke (FGT) poverty index for each province using the multivariate Fay-Herriot Models. The index consists of three indicators, i.e., poverty proportion (p0), poverty gap (p1), and poverty severity (p2).

library(sae)
data("incomedata")

Firstly, we obtain area-level data by calculating poverty indicators for each unit and aggregating it by province based on Benavent and Morales (2016). We use poverty line z=6557.143 (Molina and Marhuenda 2015). These following R commands are run to obtain \(p0\), \(p1\), and \(p2\) as variables of interest.

library(tidyverse)

pov.line <- rep(6557.143, dim(incomedata)[1]) # poverty line

# calculate unit indictators
incomedata$y <- (pov.line - incomedata$income)/pov.line
incomedata = incomedata %>% mutate(poverty = ifelse(incomedata$y > 0, TRUE, FALSE),
                                   y0 = ifelse(incomedata$y > 0, incomedata$y^0, 0),
                                   y1 = ifelse(incomedata$y > 0, incomedata$y^1, 0),
                                   y2 = ifelse(incomedata$y > 0, incomedata$y^2, 0))

# estimated domain size
est.Nd <- aggregate(incomedata$weight, list(incomedata$prov), sum)[,2]

## estimate P0 P1 dan P2
prov.est = incomedata %>% group_by(prov) %>%
  summarise(p0.prov = sum(weight*y0),
            p1.prov = sum(weight*y1),
            p2.prov = sum(weight*y2)) %>%
  mutate(p0.prov = p0.prov / est.Nd,
         p1.prov = p1.prov / est.Nd,
         p2.prov = p2.prov / est.Nd)
incomedata <- incomedata %>% left_join(prov.est, by = c("prov" = "prov"))

We also need variance and covariance of variables of interest to estimate using the multivariate Fay-Herriot model. The following R commands are run to obtain variance and covariance of direct estimation based on Benavent and Morales (2016).

# estimate  direct estimation variance-covariance
prov.variance = incomedata %>%
  mutate(v11 = ifelse(incomedata$poverty,
                      (weight*(weight-1))*(y0-p0.prov)*(y0-p0.prov), 0),
         v12 = ifelse(incomedata$poverty,
                      (weight*(weight-1))*(y0-p0.prov)*(y1-p1.prov), 0),
         v13 = ifelse(incomedata$poverty,
                      (weight*(weight-1))*(y0-p0.prov)*(y2-p2.prov), 0),
         v22 = ifelse(incomedata$poverty,
                      (weight*(weight-1))*(y1-p1.prov)*(y1-p1.prov), 0),
         v23 = ifelse(incomedata$poverty,
                      (weight*(weight-1))*(y1-p1.prov)*(y2-p2.prov), 0),
         v33 = ifelse(incomedata$poverty,
                      (weight*(weight-1))*(y2-p2.prov)*(y2-p2.prov), 0)) %>%
  group_by(prov) %>%
  summarise_at(c("v11","v12", "v13", 'v22',"v23","v33"), sum) %>%
  mutate_at(c("v11","v12", "v13", 'v22',"v23","v33"),
            function(x){x/est.Nd^2})

We use six explanatory variables selected by stepwise method, i.e., an indicator of age group 50-64 (\(age4\)), an indicator of age group >=65 (\(age5\)), an indicator of education level 1 (\(educ1\)), an indicator of education level 2 (\(educ2\)), an indicator of education level 3 (\(educ3\)), and an indicator of Spanish nationality (\(nat1\)). The model specifications are written as follow:

formula <- list(f1=p0.prov~age4+age5+educ1+educ2+educ3+nat1,
            f2=p1.prov~age4+age5+educ1+educ2+educ3+nat1,
            f3=p2.prov~age4+age5+educ1+educ2+educ3+nat1)
vardir <- c("v11", "v22", "v33", "v12", "v13", "v23")

Next, we select the most suitable multivariate Fay-Herriot model using variance homogeneity test and random effect \(\rho\) parameter test. In the variance homogeneity test, we test \(H0 : \hat{\sigma}_{ui}^2 = \hat{\sigma}_{uj}^2 \: ; \: i,j=1,2,3\) using model 3. We obtain a non-convergent model, the p-values are 0.98674 and 0.98996. It shows that the difference between variance of random effects is statistically not significant. After that, we test \(H0 : \rho = 0\) using model 2. We get the t-statistics value of 19.26 with p-value of 0.00. It shows that there is a correlation between random effects. These results indicate the model that fits the data is Model 2.

The following codes are run to obtain the EBLUPs (under Model 2), to plot the direct estimates and the EBLUPs, and to plot the MSEs of direct estimates and the MSEs of EBLUPs ordered by sample size:


# EBLUP based on Model 2
eblup.pov <- eblupMFH2(formula, vardir, data=prov.data)

# Dataframe of Result 
result_eblup = data.frame(prov = prov.data$prov,
                          est.Nd = est.Nd,
                          p0 = prov.data$p0.prov, p0.eblup = eblup.pov$eblup$p0.prov,
                          p1 = prov.data$p1.prov, p1.eblup = eblup.pov$eblup$p1.prov,
                          p2 = prov.data$p2.prov, p2.eblup = eblup.pov$eblup$p2.prov,
                          v11 = prov.data$v11, p0.mse = eblup.pov$MSE$p0.prov,
                          v22 = prov.data$v22, p1.mse = eblup.pov$MSE$p1.prov,
                          v33 = prov.data$v33, p2.mse = eblup.pov$MSE$p2.prov)
result_eblup = result_eblup %>% arrange(est.Nd)
result_eblup$prov = as.factor(result_eblup$prov)
result_eblup$id = 1:nrow(result_eblup)

# Figure 7: Direct estimates estimates and EBLUPs (under Model 2) ordered by sample size.
par(mfrow=c(1,3))

plot(result_eblup$id, result_eblup$p0,  type = "o", col = "blue", pch = 15, xlab = "area", 
    ylab = "p0", cex.axis = 1.5, cex.lab = 1.5, xaxt = "n")
points(result_eblup$p0.eblup, type = "o", col = "red", pch = 18)
axis(1, at=result_eblup$id, labels = result_eblup$prov)
legend("topright",legend=c("Direct estimates","Model 2"),col=c("blue","red"), pch=c(15,18))
plot(result_eblup$id, result_eblup$p1, type = "o", col = "blue", pch = 15, xlab = "area", 
    ylab = "p1", cex.axis = 1.5, cex.lab = 1.5, xaxt = "n")
points(result_eblup$p1.eblup, type = "o", col = "red", pch = 18)
axis(1, at=result_eblup$id, labels = result_eblup$prov)
legend("topright",legend=c("Direct estimates","Model 2"),col=c("blue","red"), pch=c(15,18))
plot(result_eblup$id, result_eblup$p2, type = "o", col = "blue", pch = 15, xlab = "area", 
    ylab = "p2", cex.axis = 1.5, cex.lab = 1.5, xaxt = "n")
points(result_eblup$p2.eblup, type = "o", col = "red", pch = 18)
axis(1, at=result_eblup$id, labels = result_eblup$prov)
legend("topright",legend=c("Direct estimates","Model 2"),col=c("blue","red"), pch=c(15,18))

# Figure 8: MSE of Direct estimates and MSE of EBLUPs (under Model 2) ordered by sample size.
par(mfrow=c(1,3))

plot(result_eblup$id, result_eblup$v11, type = "o", col = "blue", pch = 15, xlab = "area", 
    ylab = "p0", cex.axis = 1.5, cex.lab = 1.5, xaxt = "n")
points(result_eblup$p0.mse, type = "o", col = "red", pch = 18)
axis(1, at=result_eblup$id, labels = result_eblup$prov)
legend("topright",legend=c("Direct estimates","Model 2"),col=c("blue","red"), pch=c(15,18))
plot(result_eblup$id, result_eblup$v22, type = "o", col = "blue", pch = 15, xlab = "area", 
    ylab = "p1", cex.axis = 1.5, cex.lab = 1.5, xaxt = "n")
points(result_eblup$p1.mse, type = "o", col = "red", pch = 18)
axis(1, at=result_eblup$id, labels = result_eblup$prov)
legend("topright",legend=c("Direct estimates","Model 2"),col=c("blue","red"), pch=c(15,18))
plot(result_eblup$id, result_eblup$v33, type = "o", col = "blue", pch = 15, xlab = "area", 
    ylab = "p2", cex.axis = 1.5, cex.lab = 1.5, xaxt = "n")
points(result_eblup$p2.mse, type = "o", col = "red", pch = 18)
axis(1, at=result_eblup$id, labels = result_eblup$prov)
legend("topright",legend=c("Direct estimates","Model 2"),col=c("blue","red"), pch=c(15,18))
Table 1: Statistics of direct estimation and Model 2
Variable of Interest Statistic Direct Estimation Model 2
p0 Min 0.05244 0.04601
Quartil 1 0.17296 0.18947
Median 0.21850 0.21937
Mean 0.22659 0.22020
Quartil 3 0.27753 0.25468
Max 0.43588 0.35011
Standard Deviation 0.08173 0.056597
p1 Min 0.01891 0.02551
Quartil 1 0.05336 0.05969
Median 0.06810 0.06866
Mean 0.07567 0.07407
Quartil 3 0.09208 0.08866
Max 0.15973 0.14023
Standard Deviation 0.03227 0.02506
p2 Min 0.005449 0.008481
Quartil 1 0.025819 0.026867
Median 0.032344 0.033600
Mean 0.039042 0.038179
Quartil 3 0.051379 0.049755
Max 0.099308 0.087194
Standard Deviation 0.02083 0.017137

We will compare EBLUPs based on Model 2 with the direct estimates. Both of the estimation results can be seen in Table 1. On the median value, the result of estimated poverty indicators are relatively similar, ranging from 0.218-0.219 for p0, 0.0681-0.0687 for p1, and 0.0323-0.0336 for p2. Model 2 has a lower range and smaller standard deviation than the direct estimation. It means that, in general, the multivariate Fay-Herriot model has lower variability of small area estimates than direct estimation.

graphic without alt text
Figure 7: Direct estimates and EBLUPs (under Model 2) of p0, p1, and p2 ordered by sample size.
graphic without alt text
Figure 8: MSE of direct estimates and MSE of EBLUPs (under Model 2) of p0, p1, and p2 ordered by sample size.

The results of direct estimates and EBLUPs under Model 2 ordered by sample size are shown in Figure 7. The patterns of small area estimates for both methods are almost the same for all areas. The MSEs of the direct estimates and the EBLUP estimates ordered by sample size are shown in Figure 8. These plots show that the multivariate Fay-Herriot model has lower MSE than the direct estimation. Thus, we can conclude that the multivariate Fay-Herriot model is more efficient than direct estimation.

8 Conclusion

This paper introduces the first R package of multivariate Fay-Herriot model for small area estimation named msae. The package is available on Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=msae. This package contains a number of functions for estimating the EBLUP and MSE of EBLUP of each Fay-Herriot Model. This package accommodates the univariate Fay-Herriot model (model 0), multivariate Fay-Herriot model (model 1), autoregressive multivariate Fay-Herriot model (model 2), and heteroskedastic autoregressive multivariate Fay-Herriot model (model 3). The functions are described and implemented using three examples of datasets provided in msae package and a real dataset provided in sae package, incomedata. By these examples, we show that the multivariate Fay-herriot models produce more efficient parameter estimates than direct estimation and univariate model.

CRAN packages used

sae, rsae, nlme, hbsae, JoSAE, BayesSAE, mme, saery, msae

CRAN Task Views implied by cited packages

Bayesian, ChemPhys, Econometrics, Environmetrics, Finance, MixedModels, OfficialStatistics, Psychometrics, Spatial, SpatioTemporal

Note

This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.

R. Benavent and D. Morales. Multivariate fay–herriot models for small area estimation. Computational Statistics and Data Analysis, 372–390, 2016. URL https://doi.org/10.1016/j.csda.2015.07.013.
H. J. Boonstra. Hbsae: Hierarchical bayesian small area estimation. 2012. URL https://CRAN.R-project.org/package=hbsae. R package version 1.0.
J. Breidenbach. JoSAE: Unit-level and area-level small area estimation. 2018. URL https://CRAN.R-project.org/package=JoSAE. R package version 0.3.0.
Chengchun Shi Developer. BayesSAE: Bayesian analysis of small area estimation. 2018. URL https://CRAN.R-project.org/package=BayesSAE. R package version 1.0-2.
G. S. Datta, R. E. Fay and M. Ghosh. Hierarchical and empirical multivariate bayes analysis in small area estimation. In Proc. Of the bureau of the census annual research conference, pages. 63–79 1991. Bureau of the Census, Washington D.C.
M. Ghosh and J. N. K. Rao. Small area estimation: An appraisal. Statistical Science, 55–93, 1994.
S. Krieg, H. J. Boonstra and M. Smeets. Small area estimation with zero-inflated data - a simulation study. Statistics Netherland, 1–45, 2015.
M. D. E. Lefler, D. M. Gonzalez and A. P. Martin. Saery: Small area estimation for rao and yu model. 2014. URL https://CRAN.R-project.org/package=saery. R package version 1.0.
E. Lopez-Vizcaino, M. J. Lombardia and D. Morales. Mme: Multinomial mixed effects models. 2019. URL https://CRAN.R-project.org/package=mme. R package version 0.1-6.
I. Molina and Y. Marhuenda. Sae: An r package for small area estimation. The R Journal, 81–98, 2015. URL https://doi.org/10.32614/RJ-2015-007.
I. Molina and Y. Marhuenda. Sae: Small area estimation. 2018. URL https://CRAN.R-project.org/package=sae. R package version 1.2.
J. Pinheiro, D. Bates and R-core. Nlme: Linear and nonlinear mixed effects models. 2020. URL https://CRAN.R-project.org/package=nlme. R package version 3.1-145.
J. N. K. Rao and I. Molina. Small area estimation 2nd edition. Hoboken, New Jersey: John Wiley; Sons Inc., 2015.
T. Schoch. Rsae: Robust small area estimation. 2014. URL https://CRAN.R-project.org/package=rsae. R package version 0.1-5.
M. Templ. CRAN task view: Official statistics & survey methodology. 2014. URL https://CRAN.R-project.org/view=OfficialStatistics. Version 2014-08-18.
A. Ubaidillah, K. A. Notodiputro, A. Kurnia and I. W. Mangku. Multivariate fay-herriot models for small area estimation with application to household consumption per capita expenditure in indonesia. Journal of Applied Statistics, 2845–2861, 2019. URL https://doi.org/10.1080/02664763.2019.1615420.

References

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Permatasari & Ubaidillah, "msae: An R Package of Multivariate Fay-Herriot Models for Small Area Estimation", The R Journal, 2021

BibTeX citation

@article{RJ-2021-096,
  author = {Permatasari, Novia and Ubaidillah, Azka},
  title = {msae: An R Package of Multivariate Fay-Herriot Models for Small Area Estimation},
  journal = {The R Journal},
  year = {2021},
  note = {https://rjournal.github.io/},
  volume = {13},
  issue = {2},
  issn = {2073-4859},
  pages = {111-122}
}