The paper presents a new R package qape for prediction, accuracy estimation of various predictors and Monte Carlo simulation studies of properties of both predictors and estimators of accuracy measures. It allows to predict any population and subpopulation characteristics of the response variable based on the Linear Mixed Model (LMM). The response variable can be transformed, e.g. to logarithm and the data can be in the cross-sectional or longitudinal framework. Three bootstrap algorithms are developed: parametric, residual and double, allowing to estimate the prediction accuracy. Analyses can also include Monte Carlo simulation studies of properties of the methods used. Unlike other packages, in the prediction process the user can flexibly define the predictor, the model, the transformation function of the response variable, the predicted characteristics and the method of accuracy estimation.
One of the tasks in application of mixed models in the real-life problems is the prediction of random effects. Then, the predicted values give the possibility for further prediction, e.g. characteristics of interest such as sum, mean or quantiles or the future value of the response variable for cross-sectional or longitudinal data.
Three main predictors of these characteristics are proposed in the literature: Empirical Best Linear Unbiased Predictors - EBLUPs (see e.g. (Henderson 1950) and (Royall 1976)), PLUG-IN predictors (see e.g. (Boubeta et al. 2016), (Chwila and Żądło 2019), (Hobza and Morales 2016)) and Empirical Best Predictors - EBPs (see e.g. (Molina and Rao 2010)). Each assumes the LMM to model the response variable.
The numerous successful applications of these three predictors for cross-sectional and longitudinal data can be found in the model approach in survey sampling, including the small area estimation. In paper (Fay III and Herriot 1979) the Authors introduce the prediction of the mean income for small places based on the special case of the LMM model called Fay-Herriot model and the EBLUP. The analysis of poverty is extended in many works, e.g. in (Molina and Rao 2010) and (Christiaensen et al. 2012). In turn, in (Battese et al. 1988) the Authors analyse the total crop areas based on survey and satellite data using EBLUPs. The proposed LMM model is known as the Battese-Harter-Fuller model. The predictors are also exploited in the subject of experience rating in non-life insurance, see (Frees et al. 1999) and (Bühlmann and Gisler 2005), where the longitudinal data are under consideration. The insurance premium for the next period for every policy in the insurance portfolio is predicted.
A major challenge in this type of prediction is the estimation of the prediction accuracy measure. Most often it is the Root Mean Squared Error (RMSE), which is given in analytical form or can be e.g. estimated using bootstrap. A feature of the distribution of the squared prediction error is usually a very strong positive asymmetry. Because the mean is not recommended as the appropriate measure of the central tendency in such distributions, the alternative prediction accuracy measure called the Quantile of Absolute Prediction Errors (QAPE), proposed by (Żądło 2013) and (Wolny-Dominiak and Żądło 2020), can be applied.
There is a variety of R packages to calculate the considered predictors together with the accuracy measure of prediction, usually the RMSE. The package sae, see (Molina and Marhuenda 2015), provides EBLUPs based on Fay-Herriot and Battese-Harter-Fuller models. In turn, the multivariate EBLUP for Fay-Herriot models is implemented in msae, see (Permatasari and Ubaidillah 2021). Several EBLUPs introduced in (Rao and Yu 1994) are implemented in package saery introduced by (Lefler et al. 2014), likewise in JoSAE, see (Breidenbach 2018), but with additional heteroscedasticity analysis. The EBP is provided in the package emdi described in (Kreutzmann et al. 2019).
A new package in this area is our proposed package qape. It allows the prediction of flexibly defined characteristics of the response variable using the above three predictors, assuming an appropriate LMM. A novel feature of the package qape, compared to those already in place, is the ability of bootstrap estimation of the prediction accuracy measures, both the RMSE and QAPE. Three types of bootstrap procedures are provided: parametric, residual and double.
There are three groups of functions in this package: predictors values
calculation, bootstrap estimation of RMSE and QAPE measures, and Monte
Carlo (MC) analysis of properties of predictors and prediction accuracy
estimators. The prediction is based on a LMM model defined by the user
and allows to predict the population characteristics of the response
variable, which can be defined by a linear combination (in the case of
EBLUP), by any R function (e.g. sum
) or any function defined by the
user (in the case of the EBP and PLUG-IN predictors). The package allows
for full flexibility in defining: the model, the predicted
characteristic, and the transformation of the response variable.
This paper is organized as follows. Firstly, the background of the LMM
is presented together with the theoretical foundations of the prediction
including prediction accuracy measures. Then, the package functionality
in the area of prediction is presented and illustrated. A short
application based on radon
data, a cross-sectional dataset available
in HLMdiag package, to
predict three subpopulation characteristics is shown. Subsequently, the
theoretical background of the prediction accuracy measures estimation
based on bootstrap is presented. Implementations of bootstrap algorithms
in qape are briefly
introduced. Finally, the procedure of the model-based Monte Carlo
simulation study is discussed. The paper ends with a conclusion.
We consider the problem of prediction of any given function of the
population vector
To assess the accuracy of the particular predictor
The above described accuracy prediction measures RMSE and QAPE can be estimated using the bootstrap techniques. Their estimators as well as the bootstrap distributions of the prediction errors based on any (assumed or misspecified) model are provided in qape package, including algorithms where the parallel computing is used.
In the qape package, the whole prediction process has its own specific procedure, which can be presented in the following steps.
Procedure 1. The process of prediction, accuracy measures estimation and Monte Carlo simulation analyses in qape
Define the characteristics of the response variable to predict,
provide the information on sample and population values,
define the LMM,
estimate parameters of the LMM,
predict the random variable
estimate the prediction accuracy measures RMSE and QAPE using one of the developed bootstrap algorithms,
conduct simulation analyses of properties of predictors and accuracy measures estimators under any (also misspecified) LMM model.
The main functions of the qape package provide the bootstrap estimation of prediction accuracy measures. However, it must be preceded by the prediction process, including the choice of the LMM and the predictor.
Let
Then, let
In the residual bootstrap implemented in
qape, there is a need to
re-write the LMM model to take account of the specific structure of
data, i.e. the grouping variables taken into account in the random part
of the model. In this case, without a loss of the generality, the LMM
model can be written as follows:
In the qape package, in
the general case the predicted characteristic is given by any function
of response variables:
Empirical Best Linear Unbiased Predictor (EBLUP),
Empirical Best Predictor (EBP) under nested error LMM,
PLUG-IN predictor under the LMM.
The first predictor (EBLUP) allows to predict the linear combination of
the response variables:
To introduce the second predictor, called EBP, considered e.g. by
(Molina and Rao 2010), firstly, the Best Predictor (BP)
The last predictor is the PLUG-IN predictor defined as (e.g.
(Chwila and Żądło 2019)):
To deal with the LMM model, the
qape package uses the
lmer()
function from the
lme4 package, see
(Bates et al. 2015). Assuming ((4)) and based on
In order to obtain the predictor of EBLUP()
, ebpLMMne()
or plugInLMM()
. Firstly, the
characteristic of response variables of interest has to be defined. It
is actually obvious for EBLUP, which can be used only to predict the
population/subpopulation linear combination (e.g. the sum) by using the
argument gamma
equivalent to the population vector of weights
thetaFun
has to be
given (see thetaFun
could define one characteristic or a vector of characteristics, for
example:
> thetaFun1 <- function(x) median(x)
> thetaFun2 <- function(x) c(sum(x), mean(x), sd(x))
Secondly, two groups of input arguments, common to all three predictors, has to be provided:
group 1 - arguments defining the sample and the population
YS
- values of the dependent variable in the sample
(
reg
- the population matrix of auxiliary variables named in
fixed.part
, random.part
and division
,
con
- the population
group 2 - arguments defining the model
fixed.part
- fixed-effects terms declared as in lm4::lmer
function,
random.part
- random-effects terms declared as in lm4::lmer
function,
weights
- the population vector of weights.
The weights make it possible to include heteroscedasticity of random components in the LMM.
In EBLUP()
and plugInLMM()
the random-effects terms of the LMM have
to be declared as the input argument random.part
. The form of the
ebpLMMne
predictor, in turn, requires defining in the ebpLMMne()
function the so-called division
argument instead of random.part
.
This input represents the variable dividing the population dataset into
subsets, which are taken into account in the nested error linear mixed
model with ‘division
’-specific random components (presented in
supplementary document for this paper).
In the process of prediction, it is often necessary to perform data
transformation before estimating the model parameters. An example is the
logarithmic scaling of the variable of interest. The
qape package offers the
possibility for declaring the argument backTrans
to conduct the data
back-transformation. Hence, a very flexible solution is used which
allows to use any transformation of the response variable such that the
back-transformation can be defined. This argument (available in R or
defined by the user function) should be the back-transformation function
of the already transformed dependent variable used to define the model,
e.g. for log-transformed YS
used as the response variable:
> backTrans <- function(x) exp(x)
The main output is the value of predictor thetaP
. For each class of
predictors, there are two S3 methods registered for existing generic
functions print
and summary
. The full list of output arguments is
presented in detail in the qape-manual
file, cf. (Wolny-Dominiak and Żądło 2023).
In order to demonstrate the functionality of the package’s main
functions, in the following examples the radon
dataset available in
HLMdiag package
((Loy and Hofmann 2014)) is analyzed. It contains the results of a survey measuring
radon concentrations in 919 owner-occupied homes in 85 counties of
Minnesota (see Figure 1). A study was conducted in
1987-1988 by the Minnesota Department of Health, showing that indoor
radon levels are higher in Minnesota compared to typical levels in the
U.S. In the data, the response variable log.radon
(denoted in
((10)) by uranium
(basement
(county
(denoted by subscript
Figure 1: The maps of characteristics of radon concentration in counties in picoCurie per liter. The gray colour means that the value is NA (Not Available)
In all considered examples, the prediction for the county no. 26
(county == 26
) is conducted and it is assumed that the observations in
this county from the first floor (basement == 1
) are not available
(see Figure 2).
Figure 2: The distributions of radon concentration in picoCurie per liter in counties. The red line indicates county no. 26
The radon
dataset is widely discussed in the literature. In the paper
(Nero et al. 1994), the Authors used an ordinary regression model
to predict county geometric means of radon concentration using surficial
soil radium data from the National Uranium Resource Evaluation. In turn,
the paper (Price et al. 1996) focuses on the prediction of the
geometric mean of radon for each county, but using a Bayesian approach.
For the radon
data we use the following model
county
-specific random effects. Its syntax written using the package
lme4 notation is as
follows:
radon.model <- lmer(log.radon ~ basement + uranium + (basement | county), data = radon)
This and similar LMMs are considered, analyzed, and used for the
considered dataset in many publications, with a good overview presented
in (Gelman and Hill 2006). In (Gelman and Pardoe 2006), based on their
preceding research (Price et al. 1996), (Lin et al. 1999),
(Price and Gelman 2005), a very similar model but with additional
multivariate normality assumptions is studied, verified and chosen as
fitting well to the data within a Bayesian framework. The same model as
in (Gelman and Pardoe 2006) with its special cases is considered in
(Cantoni et al. 2021) but within the frequentist approach. Based on 25
measures of explained variation and model selection, the Authors
conclude that the same model as considered in our paper (with additional
normality assumption, however, which is not used in all cases considered
in that paper), "seems the best" (Cantoni et al. 2021 10) for the
radon
data. Further tests of the model are presented by
(Loy 2013), (Loy and Hofmann 2015) and (Loy et al. 2017) (see also
(Cook et al. 2007) for the introduction of the methodology) showing
among others: the normality and homescedasticity of random components,
the normality of the distribution of the random slope but – what is
important for our further considerations – the lack of the normality of
the random intercept. Since the problem of choosing and verifying a
model for the considered dataset is widely discussed in the literature,
we will focus on the issues that are new in this case, namely the
problem of prediction and estimation of the prediction accuracy as well
as the Monte Carlo analysis of predictors’ properties.
This example shows the prediction procedure in the package qape. In the first step, it is needed to define all the input arguments that will then be passed to the prediction functions.
> Ypop <- radon$log.radon # the population vector of the dependent variable
> # It is assumed that observations from the first floor
> # in county no. 26 are not available:
> con <- rep(1, nrow(radon))
> con[radon$county == 26 & radon$basement == 1] <- 0
> YS <- Ypop[con == 1] # sample vector of the dependent variable
> reg <- dplyr::select(radon, -log.radon) # the population matrix of auxiliary variables
> fixed.part <- 'basement + uranium' # the fixed part of the considered model
> random.part <- '(basement|county)' # the random part of the considered model
> # The vector of weights to define
> # the predicted linear combination - the mean for county == 26:
> gamma <-
+ (1 / sum((radon$county == 26))) * ifelse((radon$county == 26), 1, 0)
> estMSE <- TRUE # to include the naive MSE estimator of the EBLUP in the output
Then the functions corresponding to each predictor can be used. First,
the EBLUP prediction in the package
qape is presented. As the
EBLUP is limited to the linear combination of random variables, the
predicted characteristic is simply the arithmetic mean. To be precise,
it is the mean of logarithms of measurements (instead of the mean of
measurements), because the EBLUP can be used only under the linear
(linearized) models. As in the LMM the homescedasticity of random
components is assumed, the input argument weights = NULL
is set up.
> myeblup <- EBLUP(YS, fixed.part, random.part, reg, con, gamma, weights = NULL, estMSE)
> # the value of the predictor of the arithmetic mean
> # of logarithms of radon measurements:
> myeblup$thetaP
[1] 1.306916
> myeblup$neMSE # the value of the naive MSE estimator
[1] 0.002292732
Hence, the predicted value of the arithmetic mean of logarithms of radon
measurements equals
The second part of this example shows the prediction of the arithmetic
mean, geometric mean and median of radon measurements (not logarithm of
radon measurements) in county no. 26 with the use of the PLUG-IN
predictor. It requires the setting of two input arguments: thetaFun
and backTrans
.
> thetaFun <- function(x) {
+ c(mean(x[radon$county == 26]), psych::geometric.mean(x[radon$county == 26]),
+ median(x[radon$county == 26]))
+ }
> backTransExp <- function(x) exp(x) # back-transformation
> myplugin <- plugInLMM(YS, fixed.part, random.part, reg, con, weights = NULL,
+ backTrans = backTransExp, thetaFun)
> # values of the predictor of arithmetic mean, geometric mean
> # and median of radon measurements:
> myplugin$thetaP
[1] 3.694761 4.553745 3.900000
In this case we can conclude that the predicted values of the
aritmethmic mean, geometric mean and median in county no. 26 equal:
The qape package allows
to use the Empirical Best Predictor (EBP) (see the supplementary
document for this paper) as well. It provides predicted values of any
function of the variable of interest, as the PLUG-IN predictor. However,
this requires stronger assumptions to be met. The EBP procedure
available in qape package
is prepared under the assumption of the normality of the variable of
interest after any transformation. However, in the case of the
considered model for logarithms of radon measurements, the assumption is
not met as we mentioned before based on the results presented in the
literature. It can also be verified using normCholTest
function
(available in qape
package) as follows:
> normCholTest(radon.model, shapiro.test)$p.value
[1] 2.589407e-08
Moreover, due to the fact of very time-consuming iterative procedure
used to compute the EBP for the general case, in the
qape package the function
ebpLMMne
uses a very fast procedure working only for nested error
Linear Mixed Models (see (Molina and Rao 2010)).
The prediction of any function of the random variables based on
cross-sectional data has been considered. Its special case, not
presented above but widely discussed in the econometric literature, is
the prediction of one random variable, in this case a radon measurement
for one non-observed owner-occupied home. Furthermore, the
qape package is also
designed for prediction based on longitudinal data for current or future
periods as shown in examples for the EBLUP
, plugInLMM
and ebpLMMne
functions in the qape-manual
file, cf. (Wolny-Dominiak and Żądło 2023).
The qape package provides three main types of bootstrap algorithms: the parametric bootstrap, the residual bootstrap and the double-bootstrap.
The parametric bootstrap procedure is implemented according to (González-Manteiga et al. 2007) and (González-Manteiga et al. 2008) and could be described in the following steps:
based on
generate
decompose the vector
in the
compute the bootstrap realization
obtain the vector of estimates
compute bootstrap realizations of prediction error
compute the parametric bootstrap estimators of prediction accuracy
measures: RMSE and QAPE replacing prediction errors
Another possible method to estimate the prediction accuracy measures is
the residual bootstrap. In what follows, we use the notation
To obtain the algorithm of the residual bootstrap, it is enough to replace step 2 of the parametric bootstrap procedure presented above with the following procedure of the population data generation based on ((5)):
The next 3–5 steps in this procedure are analogous to steps in the parametric bootstrap procedure.
In the above-described step, it can be seen that if more than one vector
of random effect is assumed at the
The residual bootstrap algorithm can also be performed with so-called "correction procedure". This procedure, which can improve the properties of the residual bootstrap estimators due to the underdispersion of the uncorrected residual bootstrap distributions, is presented in the supplementary document for this paper.
Two bootstrap procedures are implemented in separate functions:
bootPar()
(the parametric bootstrap) and bootRes()
(the residual
bootstrap). According to the general Procedure 1, the step
preceding the bootstrap procedure in both functions is the definition of
the predictor object. It must be one of the following: EBLUP
,
ebpLMMne
or plugInLMM
. This object has to be passed to bootPar()
or bootRes()
as the input parameter predictor
. The other input
parameters are intuitive: B
- the number of bootstrap iterations and
p
- order of quantiles in the estimated QAPEs.
The additional input parameter in bootRes()
is a logical condition
called correction
, which makes it possible to include an additional
correction term for both random effects and random components, presented
in the supplementary document for this paper, to avoid the problem of
underdispersion of residual bootstrap distributions.
The main output values in both functions are basically the measures:
estRMSE
and estQAPE
computed based on ((2)) and
((3)), respectively, where prediction errors are replaced by
their bootstrap realizations. There is also the output error
being the
vector of bootstrap realizations of prediction errors, which is useful
e.g. in in-depth analysis of the prediction accuracy and for graphical
presentation of results. To estimate these accuracy measures, we use
below the residual bootstrap with the correction procedure.
As previously stated, our package utilizes the lmer()
function from
the lme4 package for
estimating model parameters. However, this function has been known to
generate convergence warnings in certain situations, listed for example
by (Bates et al. 2015) p. 25, when the estimated variances of random effects are
close to zero. Such scenarios may occur when models are estimated for
smaller or medium-sized datasets, when complex variance-covariance
structures are assumed, or when the grouping variable considered for
random effects has only a few levels. Although we have not observed such
issues estimating model parameters based on the original dataset
required to compute values of the predictors in previous sections,
bootstrapping or Monte Carlo simulations are more complex cases. This is
because, based on the estimates of model parameters, the values of the
dependent variables are generated
The analyses presented in Example 1 are continued. We extend the previous results to include the issue of estimating the prediction accuracy of the considered predictors. The use of functions for this estimation primarily requires an object of class predictor, here "myplugin".
> class(myplugin)
[1] "plugInLMM"
The short chunk of the R code presents the residual bootstrap estimators
of the RMSE (estRMSE
) and the QAPE (estQAPE
) of the PLUG-IN
predictors (plugin
) of previously analyzed three characteristics of
radon measurements in county no. 26: the arithmetic mean, geometric mean
and median. In this and subsequent examples we make the computations for
relatively high number of iterations allowing, in our opinion, to get
reliable results. These results are also used to prepare Figure
3. However, the computations are time-consuming. The
supplementary R file contains the same chunks of the code but the number
of iterations applied is smaller in order to execute the code swiftly.
> # accuracy measures estimates based on
> # the residual bootstrap with the correction:
> B <- 500 # number of bootstrap iterations
> p <- c(0.75, 0.9) # orders of Quantiles of Absolute Prediction Error
> set.seed(1056)
> residBoot <- bootRes(myplugin, B, p, correction = TRUE)
> # values of estimated RMSEs of the predictor of three characteristics:
> # the arithmetic mean, geometric mean and median of radon measurements, respectively:
> residBoot$estRMSE
[1] 0.1848028 0.2003681 0.2824359
> # values of estimated QAPEs
> # (of order 0.75 in the first row, and of order 0.9 in the second row)
> # of the predictor of three characteristics:
> # the arithmetic mean, geometric mean and median of radon measurements,
> # in the 1st, 2nd and 3rd column, respectively:
> residBoot$estQAPE
[,1] [,2] [,3]
75% 0.1533405 0.2135476 0.2908988
90% 0.2813886 0.3397411 0.4374534
Let us concentrate on interpretations of estimators of accuracy measures
for the predictor of the geometric mean, i.e. the second value of
residBoot$estRMSE
, and values in the second column of
residBoot$estQAPE
. It is estimated that the average difference between
predicted values of the geometric mean and their unknown realizations
equals
Figure 3: The histograms of bootstrap absolute prediction errors for myplugin (for PLUG-IN predictors of the arithmetic mean, geometric mean and median) for B=500
Since the assumption of normality is not met, the parametric bootstrap
should not be used in this case. For this reason, we do not present the
results for this method below, although – but for illustrative purposes
only – they are presented in the supplementary R file. Moreover, these
analyses can also be conducted using bootParFuture()
and
bootResFuture()
functions where parallel computing algorithms are
applied. The input arguments and the output of these functions are the
same as in bootPar()
and bootRes()
. Examples based on these
functions are also included in the supplementary R file.
The qape package also
allows to use predictors under a model different from the assumed one
(e.g. a simpler or more robust model), but estimate its accuracy under
the assumed model. In this case, the parametric and residual bootstrap
procedures are implemented in bootParMis()
and bootResMis()
functions. These functions allow to estimate the accuracy of two
predictors under the model correctly specified for the first of them. Of
course, it is expected that the estimated accuracy of the first
predictor will be better than of the second one, but the key issue can
be the difference between estimates of accuracy measures. A small
difference, even to the second predictor’s disadvantage, may be treated
by the user as an argument for using the second predictor due to its
properties, such as robustness or simplicity.
The considered functions allow to estimate the accuracy of two
predictors, which belong to the class plugInLMM
, under the model used
to define the first of them. The remaining arguments are the same as in
bootPar()
and bootRes()
functions: B
- the number of bootstrap
iterations, and p
- orders of QAPE estimates to be taken into account.
The output results of bootParMis()
and bootResMis()
include –
similarly to bootPar()
and bootRes()
functions – estimates of the
RMSEs and QAPEs of both predictors (denoted here by: estRMSElmm
,
estRMSElmmMis
, estQAPElmm
and estQAPElmmMis
), and boostrap
realizations of their prediction errors (errorLMM
and errorLMMmis
).
In this example, we study the same accuracy measures as in Example 2,
but the aim is to compare the predictor myplugin
and other predictor
defined under the misspecified LMM. First, the misspecified model has to
be defined, and a relevant predictor has to be computed.
> fixed.part.mis <- '1'
> random.part.mis <- '(1|county)'
> myplugin.mis <- plugInLMM(YS, fixed.part.mis, random.part.mis, reg, con,
+ weights = NULL, backTrans = backTransExp, thetaFun)
Having two objects: myplugin
and myplugin.mis
, one can proceed to a
comparison by estimating bootstrap prediction accuracy performed using
the residual bootstrap with correction procedure. In this case, we
estimate the prediction accuracy of these two predictors under the model
used to define the first of them.
> set.seed(1056)
> residBootMis <- bootResMis(myplugin, myplugin.mis, B, p, correction = TRUE)
> # residual bootstrap with the correction RMSE estimators
> # of 'plugin' of: arithmetic mean, geometric mean and median
> # of radon measurements in county 26:
> residBootMis$estRMSElmm
[1] 0.1848028 0.2003681 0.2824359
> # residual bootstrap with the correction RMSE estimators
> # of 'plugin.mis' of: arithmetic mean, geometric mean and median
> # of radon measurements in county 26:
> residBootMis$estRMSElmmMis
[1] 0.1919184 0.3192304 0.2762137
> # residual bootstrap with the correction QAPE estimators of order 0.75 and 0.9
> # of 'plugin' of: arithmetic mean, geometric mean and median
> # of radon measurements in county 26:
> residBootMis$estQAPElmm
[,1] [,2] [,3]
75% 0.1533405 0.2135476 0.2908988
90% 0.2813886 0.3397411 0.4374534
> # residual bootstrap with the correction QAPE estimators of order 0.75 and 0.9
> # of 'plugin.mis' of: arithmetic mean, geometric mean and median
> # of radon measurements in county 26:
> residBootMis$estQAPElmmMis
[,1] [,2] [,3]
75% 0.2267062 0.3802836 0.3255197
90% 0.2813787 0.4970726 0.4489399
The results, presented above, were obtained for the same number of
bootstrap iterations as in Example 2 (plugin
, estimated RMSEs of plugin
and
plugin.mis
predictors of the geometric mean given by plugin.mis
of the geometric mean
equals to plugin
for the same
prediction problem. Hence, in this case, the accuracy comparison based
both on the RMSE and QAPE leads to the same finding.
In the previous paragraph, we have focused on the results for the case
of prediction of the geometric mean. If the comparison is made for the
case of prediction of the arithmetic mean (the first column of output
results) or the median (the third column of output results), we will
come to the same conclusion regarding the estimated accuracy of plugin
and plugin.mis
as in the case of prediction of the geometric mean.
Similarly to the residual bootstrap, the parametric bootstrap procedure
paramBootMis
available in
qape package can be
performed. However, in the considered case the normality assumption is
not met (as discussed above) and the procedure is not recommended. The
appropriate chunk of the R code is presented in the supplementary R
file, but it is solely intended for illustrative purposes.
In the previous section, our aim was to estimate the prediction accuracy under correctly specified or misspecified model. In this section, we do not estimate the accuracy, but we approximate the true prediction accuracy under the specified model in the Monte Carlo simulation study. The crucial difference is that in this case, the model parameters used are obtained based on the whole population dataset, not the sample. If the number of iterations is large enough, we can treat the computed values of the measures as their true values, which are unknown in practice.
The last step of the analysis in qape package presented in Procedure 1 is the Monte Carlo (MC) simulation analysis of:
properties of predictors
and properties of parametric, residual and double bootstrap estimators of accuracy measures.
The whole Monte Carlo procedure is as follows.
Procedure 2. Model-based Monte Carlo simulation analyses in qape
define the population vector of the dependent variable and the population matrix of auxiliary variables,
provide the information on the division of the population into the sampled and non-sampled part,
define
define the predictors
define the model to be used to generate realizations of the values of the dependent variable and estimate its parameters based on population data,
For k=1, 2, ..., K
generate the population vector of the response variable based on the assumed model,
based on population data, compute the characteristics
based on sample data, estimate the parameters of the LMM,
based on sample data, compute values of predictors
based on sample data, estimate the accuracy of
End For
compute accuracy measures of predictors using
compute accuracy measures of estimators of prediction accuracy measures.
In order to perform a Monte Carlo (MC) analysis on the properties of
predictors, it is necessary to have access to the entire population data
for both dependent and independent variables. The function mcLMMmis()
can be used with the following arguments. Firstly, the population values
of the dependent variable (after a necessary transformation) should be
declared as Ypop
. By using the Ypop
values, we can estimate the
model parameters based on the entire population data (assuming that they
are known). This allows us to generate values of the dependent variable
in the simulation study that can mimic its distribution in the entire
population, not just in the sample. This approach ensures that our
simulation study can be an accurate representation of the random process
in the entire population, resembling the real-world scenario. Secondly,
three predictors: predictorLMMmis
, predictorLMM
, predictorLMM2
,
which belong to the class plugInLMM
, are to be defined. The first one
is used only to define the (possibly misspecified) model used to
generate population values of the response variables. Accuracy of
predictorLMM
and predictorLMM2
is assessed in the simulation study.
The next two arguments include the number of MC iterations K
and
orders p
of QAPEs used to assess the prediction accuracy. Finally, it
should be noted that it is possible to modify covariance matrices of
random components and random effects based on the model defined in
predictorLMMmis
, which are used tThiso generate values of the
dependent variable. It is possible by declaring values of ratioR
and
ratioG
arguments, which the diagonal elements of covariance matrices
of random components and random effects, respectively, are divided by.
The output of this function covers the following statistics of both
predictors computed in the simulation study: relative biases (rBlmm
and rBlmm2
), relative RMSEs (rRMSElmm
and rRMSElmm2
) and QAPEs
(QAPElmm
and QAPElmm2
). Simulation-based prediction errors of both
predictors (errorLMM
and errorLMM2
) are also taken into account.
In the example, an MC simulation is carried out assuming the myplugin
predictor. The goal is to approximate the true accuracy of the
prediction assuming model ((10)). Hence, in the package
qape, all input predictor
objects in the function mcLMMmis
have to be defined as myplugin
.
> # input arguments:
predictorLMMmis <- myplugin # to define the model
predictorLMM <- myplugin # which properties are assessed in the simulation study
predictorLMM2 <- myplugin # which properties are assessed in the sim. study
Except that no modification of covariance matrices has to be used.
# diag. elements of the covariance matrix of random components are divided by:
ratioR <- 1
# diag. elements of the covariance matrix of random effects are divided by:
ratioG <- 1
We specify the number of Monte Carlo iterations.
K <- 500 # the number of MC iterations
The analysis is conducted in the object MC
.
> set.seed(1086)
> MC <- mcLMMmis(Ypop, predictorLMMmis, predictorLMM, predictorLMM2,
+ K, p, ratioR, ratioG)
> # relative bias of 'predictorLMM'
> # of the arithmetic mean, geometric mean and median in county 26 (in %):
> MC$rBlmm
[1] -1.73208393 -0.04053178 -5.22355236
Results of the relative biases are obtained. It is seen, that under the
assumed model the values of the considered predictor of the geometric
mean (the second value of MC$rBlmm
) are smaller than possible
realizations of the geometric mean on average by
> # relative RMSE of 'predictorLMM'
> # of the arithmetic mean, geometric mean and median in county 26 (in %):
> MC$rRMSElmm
[1] 3.429465 4.665810 7.146678
In the considered case, the average difference between predicted values
of the geometric mean and its possible realizations (the second value of
MC$rRMSElmm
) equals
Finally, QAPEs of orders 0.75 and 0.9 are considered.
> # QAPE of order 0.75 and 0.9 of 'predictorLMM'
> # of the arithmetic mean, geometric mean and median in county 26:
> MC$QAPElmm
[,1] [,2] [,3]
75% 0.1491262 0.1989504 0.2919221
90% 0.2895684 0.2959457 0.4728064
Let us interpret the results presented in the second column of
MC$QAPElmm
. At least
In Example 4, the accuracy of one predictor under the model used to
define this predictor was presented. A more complex version of the
simulation study, where the properties of two predictors are studied
under the model defined by the third predictor, is presented in the
supplementary R file. What is more, the
qape package also allows
to use mcBootMis()
function to conduct MC analyses of properties of
accuracy measure estimators (estimators of MSEs and QAPEs) of two
predictors (which belong to the class plugInLMM
) declared as
arguments. The model used in the simulation study is declared in the
first predictor, but the properties of accuracy measures estimators of
both predictors are studied. Output results of mcBootMis()
covers
simulation results on properties of different accuracy measures
estimators, including the relative biases and relative RMSEs of the
parametric bootstrap MSE estimators of both predictors. The same
simulation-based statistics but for parametric bootstrap QAPE estimators
are also included. Other bootstrap methods, including the residual
bootstrap with and without the correction procedure, are also taken into
account. The full list of output arguments of mcBootMis()
function are
presented in qape-manual
file, cf. (Wolny-Dominiak and Żądło 2023).
The package enables R users to make predictions and assess the accuracy under linear mixed models based on different methods in a fast and intuitive manner – not only based on the RMSE but also based on Quantiles of Absolute Prediction Errors. It also covers functions which allow to conduct Monte Carlo simulation analyses of properties of the methods of users interest. Its main advantage, compared to other packages, is the considerable flexibility in terms of defining the model (as in the lme4 package) and the predicted characteristic, but also the transformation of the response variable.
In our opinion, the package is useful for scientists, practitioners and decision-makers in all areas of research where accurate estimates and forecasts for different types of data (including cross-sectional and longitudinal data) and for different characteristics play the crucial role. We believe that it will be of special interest to survey statisticians interested in the prediction for subpopulations with small or even zero sample sizes, called small areas.
Supplementary materials are available in addition to this article. It can be downloaded at RJ-2024-004.zip
qape, sae, msae, saery, JoSAE, emdi, HLMdiag, lme4
Econometrics, Environmetrics, MixedModels, OfficialStatistics, Psychometrics, SpatioTemporal
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Wolny--Dominiak & Ża̧dło, "Prediction, Bootstrapping and Monte Carlo Analyses Based on Linear Mixed Models with QAPE 2.0 Package", The R Journal, 2025
BibTeX citation
@article{RJ-2024-004, author = {Wolny--Dominiak, Alicja and Ża̧dło, Tomasz}, title = {Prediction, Bootstrapping and Monte Carlo Analyses Based on Linear Mixed Models with QAPE 2.0 Package}, journal = {The R Journal}, year = {2025}, note = {https://doi.org/10.32614/RJ-2024-004}, doi = {10.32614/RJ-2024-004}, volume = {16}, issue = {1}, issn = {2073-4859}, pages = {67-82} }