There is a lack of robust statistical analyses for random effects linear models. In practice, statistical analyses, including estimation, prediction and inference, are not reliable when data are unbalanced, of small size, contain outliers, or not normally distributed. It is fortunate that rank-based regression analysis is a robust nonparametric alternative to likelihood and least squares analysis. We propose an R package that calculates rank-based statistical analyses for two- and three-level random effects nested designs. In this package, a new algorithm which recursively obtains robust predictions for both scale and random effects is used, along with three rank-based fitting methods.
Rank-based procedures retain distribution-free estimation and testing properties. These procedures are much less sensitive to outliers than the traditional analyses when random errors are not normally distributed. Alternative robust score functions can be accommodated with the rank-based methods to protect analyses from influential observations in factor and response spaces. Also, the choice of these score functions could depend on the prior knowledge on error distributions. The Wilcoxon score function is fairly efficient for moderate to heavy-tailed error distributions. For example, rank-based procedures with Wilcoxon scores achieve up to 95% efficiency relative to least squares methods when the data are normal and are much more efficient than least squares methods for heavy tailed error distributions. These properties make the rank-based methods appealing. However, to our knowledge, statistical analyses for random effects models using the rank-based methodology have not yet been considered in any statistical package. This article proposes an R package with three rank-based fitting methods that estimate fixed effects and predict random effects in two- and three-level random effects nested models.
The rank-based norm, analogous to the least squares norm, is briefly defined as
where the scores are generated as
The rank-based estimate of
with
Random effects nested models are frequently utilized in many research areas such as: education, survey sampling, meta-analysis, agriculture, and health. Survey sampling might happen within organizational units, communities, clusters, or hospitals. The experimental design of interest is expressed in terms of fixed effects but, for these designs, nested factors are a natural part of the experiment. These nested effects are generally considered random and must be taken into account in the statistical analysis. For example, repeated measures design, randomized block design, and cluster correlated data consider each subject as a cluster having a correlated structure so the observations are nested within the subject. This subject would be a block, a cluster, an institution or a region. These designs are examples of two-level nested structures. Compound symmetric error structure is a natural variance-covariance form of these designs. It implies that all distinct members of a cluster or a subcluster are equally correlated with each other. When there is one more correlated level (subclusters) within clusters, this design could be called a three-level nested structure. Hierarchical linear models can also deal with these nested designs.
To illustrate the issues involved in data analysis, consider a simple
example of a two-level nested design where students are nested within
schools (see Model (3)), and a three-level nested design
where students are nested within sections in schools (see Model
(4)). Students are measured on a continuous univariate
response variable of interest,
with
with
The main interest with these models would be to estimate the regression
parameters
This nested analog could be adopted for other organizational studies and hierarchical data. These designs often address questions related to
the examination of differences within and across clusters or contexts such as classrooms, schools, neighborhoods, or groups on individual outcomes;
the investigation of the degree to which individuals within a group or cluster are similar as measured through the ICC;
the study of the factors that explain institutional/cluster differences;
the effects of clusters and treatment on individual scores, e.g., student’s academic achievement — both random and fixed effects are addressed in these interests.
In this article, three rank-based fitting methods and a new prediction algorithm are briefly introduced. A data analysis example using our package, rlme, is also presented.
This section introduces three rank-based fitting methods to obtain fixed effects estimations: Joint Ranking (JR), Generalized Rank Estimate (GR) and Generalized Estimating Equation (GEER). The algorithm for robust variance estimates and random effects predictions in random effects nested models, called Rank Prediction Procedure (RPP), is then introduced. To sketch the calculation algorithms in these methods, Models ((3)) and ((4)) can be rewritten in the general mixed model matrix and vector notations as follows:
where
Alternatively, the model can be written in vectors of the observations
obtained from independent
This rank-based method for nested random effects models uses asymptotic
results of the study by Kloke et al. (2009) in estimating fixed effects and
standard errors. Kloke et al. (2009) developed the asymptotic theory for the
rank-based estimates of fixed effects in the general mixed model using
the general rank theory of Brunner and Denker (1994). The estimation of fixed effects in
the JR method uses the dispersion function as in the independent linear
model. However, the asymptotic distribution of
where
After estimating the fixed effects in Model ((5)), we
predict the nested random effects and estimate the variance components
using the random prediction procedure explained in the next section. In
this method, for each cluster in Model ((6)), simple moment
estimators of
The generalized rank-based fitting for the general mixed model is an
iteratively reweighted rank method based on the Newton-type
approximation. Hettmansperger and McKean (2011) developed the asymptotic properties of linearized
rank estimators for use in the linear model with the
Consider Model ((5)) where
Set
Obtain
where
Use
Use
Use
If
The estimators of the asymptotic variance-covariance matrix of
Considering an alternative representation of generalized linear models for correlated data in estimates, Abebe et al. (2013) extended the general estimating equations (GEE) method of Liang and Zeger (1986) for the general mixed models in the rank-based norm, and derived the asymptotic normality of the rank estimators. Briefly, as Abebe et al. (2013) describe, we can rewrite the general estimating equations expression proposed by Liang and Zeger in terms of the Euclidean norm in Model ((9)), and thus, the rank-based norm in Model ((10)) as follows:
Abebe et al. (2013) developed a class of nonlinear robust estimators minimizing
the rank-based norm of the residuals defined in the rank-based general
estimating equations utilizing a ‘working’ covariance structure in the
rank-based fitting as an analogue of the GEE. Thus, the estimate of
The asymptotic derivations for the proposed estimators are discussed in several papers; Kloke et al. (2009) for the JR method; Bilgic et al. (2013) for the GR method; and Abebe et al. (2013) for the GEER method. In these studies, the rank-based estimators are competitive with the traditional methods such as maximum likelihood (ML), restricted maximum likelihood (REML) and least squares in the normal case and outperform when random errors are contaminated and exhibit better efficiency properties of the estimates when outliers exist. Among the three methods, the JR method is unweighted so its empirical validity and efficiency for the fixed effects is reported to be poorer than the other two methods in the Monte Carlo study performed by Bilgic (2012). The empirical validity and efficiency of GR and GEER methods are reported to be very similar. The GR estimates and their standard errors are obtained from the rank-based norm properties, whereas the GEER combines the rank-based norm and least squares properties. For highly correlated data, the GR or GEER method would be preferred.
The rlme package uses the suite of R functions ww developed by Terpstra and McKean (2005) that computes fixed estimates for the rank analysis based on Wilcoxon scores when needed for independent linear models and initial fits. We plan to use the subroutines of the Rfit package for the next version of our package because it allows the user to choose the estimation algorithm for general scores.
So far, fixed effects estimations are calculated with the JR, GR and GEER methods described in the previous section. This section introduces robust predictions for random effects and variance components for two- and three-level nested designs. Robust predictions of random effects have been discussed in several papers, including Groggel (1983), Groggel et al. (1988), Dubnicka (2004), and Kloke et al. (2009). These predictions based on clusters use robust scale estimators. However, these papers only handle two-level random nested designs.
To illustrate how our recursive algorithm works, let us consider a two-level nested structure. The random effects model is defined as
for
For a vector
The estimator in Equation ((14)) is a consistent estimator of
a scale parameter when the errors have a normal distribution (Hettmansperger and McKean 2011).
Kloke et al. (2009) suggest that the
For a three-level nested structure, consider Model ((4)).
The residuals
The RPP algorithm can handle k-level nestings in a hierarchical
structure in the same manner. The algorithm needs only residuals from
the model fitting for predictions of the random effects. In
The rlme package uses linear model syntax in the
lme4 package for two- and
three-level models. To illustrate how our package does nested structured
data analysis, a data set was obtained from the OECD Programme for
International Student Assessment (PISA) conducted in 2009 (OECD, 2010).
The data set includes 334 observations of metacognitive scores in 11
private schools in four geographic regions in USA. Metacognitive score
is an index measure of the metacognitive aspect of learning. The
research questions to be answered are how metacognitive scores depend on
gender and age, and how the variability of the scores are explained by
regional differences and school differences. Student scores are nested
in the private schools which are nested within the regions. Data are
correlated within region and school, hence, regions and schools are
random effects on observations. This design would be considered
hierarchical with two- or three- levels, such as students nested in
regions, or students in schools nested within regions. In the package,
this data set is called schools
.
In our package, a 3-level nested design data analysis is done using the following syntax:
> library(rlme)
> data(schools)
> model = y ~ 1 + sex + age + (1 | region) + (1 | region:school)
> rlme.fit = rlme(model, schools, method="gr")
# For robust predictions, include rprpair="med-mad"
The formula syntax, the same as in the lmer
function, expects the
random effect terms to be enclosed with parenthesis, with nesting of
variables denoted by a colon. In this example, region and school are two
random effects, with school nested within region. The method is set to
the rank-based method gr
along with the prediction method hl-disp
and Wilcoxon scores wil
(the default). Alternatively the other
rank-based fitting methods, jr
or geer
, and the maximum likelihood
methods, reml
or ml
, may be called from method
.
The calculated fit can be examined using the summary
function:
> summary(rlme.fit)
Linear mixed model fit by GR
Formula: y ~ 1 + sex + age + (1 | region) + (1 | region:school)
Random effects:
Groups Name Variance
region:school (Intercept) 0.14703510
region (Intercept) 0.01497416
Residual 0.81229310
Number of obs:
334 observations, 4 clusters, 11 subclusters
Fixed effects:
Estimate Std. Error .. p value
(Intercept) 0.1586624 0.2686822 .. 0.554841676
sex -0.2953611 0.1071201 .. 0.005828259
age 0.2260327 0.1621609 .. 0.163354085
Intra-class correlation coefficients
Estimates
intra-cluster 0.1509132
intra-subcluster 0.1662823
cov-var (fixed effects)
sex age
7.219013e-02 -0.0002000644 -1.672882e-05
sex -2.000644e-04 0.0114747092 9.802962e-04
age -1.672882e-05 0.0009802962 2.629616e-02
Here, intra-cluster, reml
method, this result is calculated around
The plot function can be used to generate the standardized residuals vs. the fitted response and a normal Q-Q plot as shown in Figure 1:
> plot(rlme.fit)
The residuals and random effects can be extracted from the fit. For
example, we can extract the raw residuals through the list element
ehat
:
# Raw residuals
> rlme.fit$ehat
[,1]
1 -1.186129739
2 -1.641494973
3 -1.147704177
...
334 -0.327186795
Several other elements of interest would include the effects
estimates/predictions of the fixed effects, errors, clusters, and
subclusters, which are obtained from rlme.fit$fixed.effects
,
..$effect.err
, ..$effect.cluster
, and ..$effect.subcluster
,
respectively. A full list can be found in the help(rlme)
command or
the str
command. The same model can be also evaluated with the
likelihood methods, reml
or ml
:
> rlme.fit = rlme(model, schools, method="reml")
> summary(rlme.fit)
Linear mixed model fit by REML
...
Fixed effects:
Estimate Std. Error .. p value
(Intercept) -0.08756901 0.2681410 .. 0.7439869
sex -0.26286182 0.1092493 .. 0.0161250
age 0.24712016 0.1720753 .. 0.1509693
Intra-class correlation coefficients
Estimates
intra-cluster 0.2433171
intra-subcluster 0.2443941
...
The REML and GR results are slightly different, but do coincide in the inference at the 5% level. A 2-level random effects nested data analysis that students are nested within regions can be done in a similar syntax:
> data(schools)
> model = y ~ 1 + sex + age + (1 | region)
> rlme.fit = rlme(model, data = schools, method = "gr")
> summary(rlme.fit)
Linear mixed model fit by GR
Formula: y ~ 1 + sex + age + (1 | region)
...
Fixed effects:
Estimate Std. Error .. p value
(Intercept) -0.09298845 0.22061464 .. 6.733921e-01
sex -0.35939453 0.08977332 .. 6.245027e-05
age 0.11882249 0.13665390 .. 3.845660e-01
Intra-class correlation coefficients
Estimates
intra-cluster 0.1452482
intra-subcluster 1.0000000
...
Diagnostics, TDBETAS
and CFITS
, to detect differences in fits for
various methods can be obtained with the function fitdvcov
. To compare
the fixed effects estimates from any two fits, the covariance matrix
from one rank-based method is required. Here we compare the REML and GR
methods for the model:
> data(schools)
> model = y ~ 1 + sex + age + (1 | region) + (1 | region:school)
# Extract covariates into matrix
> cov = as.matrix(data.frame(rep(1, length(schools[,"y"])),
schools[,"sex"], schools[,"age"]))
# Fit the models using each method
> reml.fit = rlme(model, schools, method="reml")
> gr.fit = rlme(model, schools, method="gr")
# Extract fixed effects estimates
> reml.beta = reml.fit$fixed.effects$Estimate
> gr.beta = gr.fit$fixed.effects$Estimate
# Extract the covariance matrix of the fixed effects estimates
> var.b = reml.fit$var.b
> fitdvcov(cov, reml.beta, gr.beta, var.b)$tdbeta
[1] 0.9406339
# The following gets CFITS. For more info about diagnostics and
# benchmarks, see help(fitdvcov) and help(compare.fits)
> fitdvcov(cov, reml.beta, gr.beta, var.b)$cfits
# Graphing alone and getting standard residuals
> getgrstplot(gr.fit)
> getgrstplot(gr.fit)$sresid
> getlmestplot(reml.fit)
> getlmestplot(reml.fit)$sresid
In our package, in case of potential outliers in factor space, high
breakdown weights (hbr), weight="hbr"
, are specified along with
method="geer"
. We use the routines of the ww R codes for independent
models:
> data(schools)
> rlme.fit = rlme(y ~ 1 + sex + age, data = schools)
You have entered an independent linear model.
Continuing using the ww package.
Wald Test of H0: BETA1=BETA2=0
TS: 38.4414 PVAL: 0
Drop Test of H0: BETA1=BETA2=0
TS: 7.819 PVAL: 5e-04
EST SE TVAL PVAL
BETA0 0.2333 0.0000 9105.8041 0
BETA1 -0.4192 0.0478 -8.7664 0
BETA2 0.0000 0.0772 0.0000 1
The rlme package analyzes random effects nested models with respect to estimation, inference and prediction for two- and three-level designs. These designs are a class of mixed models. Hierarchical linear model users may be interested in our package for robust analysis as well.
In the rlme package, estimations and inferences are obtained from three different rank-based methods along with the likelihood methods obtained from the nlme package (Pinheiro et al. 2013). These three methods include the joint ranking (JR), iteratively reweighted generalized rank (GR) and rank-based generalized estimating equations (GEER). Variance estimates and random effects predictions are obtained along with a robust algorithm, called Rank Prediction Procedure (RPP). New rank-based estimators are employed in these algorithm and methods. Diagnostics are included in our package as well.
We are planning to extend statistical analysis of iteratively generalized rank-based methods using rank-norm properties to the general mixed models with various error structures. We would welcome any feedback and/or collaborations.
We would like to thank J. W. McKean for his invaluable guidance in theory and C. Leary for reviewing a preliminary version of this article.
Econometrics, Environmetrics, MixedModels, Psychometrics, Robust, SpatioTemporal
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Bilgic & Susmann, "rlme: An R Package for Rank-Based Estimation and Prediction in Random Effects Nested Models", The R Journal, 2013
BibTeX citation
@article{RJ-2013-027, author = {Bilgic, Yusuf K. and Susmann, Herbert}, title = {rlme: An R Package for Rank-Based Estimation and Prediction in Random Effects Nested Models}, journal = {The R Journal}, year = {2013}, note = {https://rjournal.github.io/}, volume = {5}, issue = {2}, issn = {2073-4859}, pages = {71-79} }