In the nineteen seventies, Jurečková and Jaeckel proposed rank estimation for linear models. Since that time, several authors have developed inference and diagnostic methods for these estimators. These rank-based estimators and their associated inference are highly efficient and are robust to outliers in response space. The methods include estimation of standard errors, tests of general linear hypotheses, confidence intervals, diagnostic procedures including studentized residuals, and measures of influential cases. We have developed an R package, Rfit, for computing of these robust procedures. In this paper we highlight the main features of the package. The package uses standard linear model syntax and includes many of the main inference and diagnostic functions.
Rank-based estimators were developed as a robust, nonparametric
alternative to traditional likelihood or least squares estimators.
Rank-based regression was first introduced by Jurečková (1971) and
Jaeckel (1972). McKean and T. Hettmansperger (1978) developed a Newton step algorithm that led to feasible
computation of these rank-based estimates. Since then a complete
rank-based inference for linear models has been developed that is based
on rank-based estimation analogous to the way that traditional analysis
is based on least squares (LS) estimation; see Chapters 3-5 of the
monograph by Hettmansperger and McKean (2011) and Chapter 9 of Hollander and D. A. Wolfe (1999). Furthermore, robust
diagnostic procedures have been developed with which to ascertain
quality of fit and to locate outliers in the data; see McKean and S. Sheather (2009) for a
recent discussion. Kloke, J. McKean, and M. Rashid (2009) extended this rank-based inference to mixed
models. Thus rank-based analysis is a complete analysis analogous to the
traditional LS analysis for general linear models. This rank-based
analysis generalizes Wilcoxon procedures for simple location models and,
further, it inherits the same high efficiency that these simple
nonparametric procedures possess. In addition, weighted versions can
have high breakdown (up to 50%) in factor space (Chang et al. 1999). In this
paper, we discuss the Rfit
package that we have developed for rank-based (R) estimation and
inference for linear models. We illustrate its use on examples from
simple regression to
The geometry of rank-based estimation is similar to that of LS. In rank-based regression, however, we replace Euclidean distance with another measure of distance which we refer to as Jaeckel’s dispersion function; see Hettmansperger and McKean (2011) for details. For a brief overview see McKean (2004).
Jaeckel’s dispersion function depends on the choice of a score function. As discussed in Hettmansperger and McKean (2011), the rank-based fit and associated analysis can be optimized by a prudent choice of scores. If the form of the error distribution is known and the associated scores are used, then the the analysis is fully efficient. In Rfit we have included a library of score functions. The default option is to use Wilcoxon (linear) scores, however it is straightforward to create user-defined score functions. We discuss score functions further in a later section.
Others have developed software for rank-based estimation. Kapenga et al. (1995) developed a Fortran package and Crimin et al. (2008) developed a web interface (cgi) with Perl for this Fortran program. Terpstra and McKean (2005) developed a set of R functions to compute weighted Wilcoxon (WW) estimates including the high breakdown point rank-based (HBR) estimate proposed by Chang et al. (1999). See McKean, J. Terpstra, and J. Kloke (2009) for a recent review. Rfit differs from the WW estimates in that its estimation algorithms are available for general scores and it uses a standard linear models interface.
The package Rfit allows the user to implement rank-based estimation
and inference described in Chapters 3-5 of Hettmansperger and McKean (2011) and Chapter 9 of Hollander and D. A. Wolfe (1999).
There are other robust packages in R. For example, the R function rlm
of the R package MASS
(Venables and B. D. Ripley 2002) computes M estimates for linear models based on the
Rfit uses standard linear model syntax so that those familiar with traditional parametric analysis can easily begin running robust analyses. In this paper, discussion of the assumptions are kept to a minimum and we refer the interested reader to the literature. All data sets used in demonstrating Rfit are included in the package.
The rest of the paper is organized as follows. The next section discusses the general linear model and rank-based fitting and associated inference. The following section provides examples illustrating the computation of Rfit for linear regression. Later sections discuss Rfit’s computation of one-way and multi-way ANOVA as well as general scores and writing user-defined score functions for computation in Rfit. The final section discusses the future implementation in Rfit of rank-based procedures for models beyond the general linear model.
As with least squares, the goal of rank-based regression is to estimate
the vector of coefficients,
where
where
where
The R estimator of
This estimator is a highly efficient estimator which is robust in the
Under assumptions outlined in the previous section, it can be shown that
the solution to ((4)) is consistent and asymptotically
normal (Hettmansperger and McKean 2011). We summarize this result as follows:
where
It can be shown that Jaeckel’s dispersion function
((3)) is a convex function of optim
with option ‘BFGS’
to minimize the
dispersion function. We investigated other minimization methods
(e.g. Nelder-Mead or CG), however the quasi-Newton method works well in
terms of speed and convergence. An orthonormal basis matrix, for the
space spanned by the columns of qr
which leads to better performance in examples. The default
initial fit is based on an
Computations by Rfit of rank-based estimation and associated inference are illustrated in the examples of the next section.
For the general linear model (1) the package Rfit obtains the rank-based estimates and inference, as described in the previous section. In this section we illustrate this computation for two examples. The first is for a simple linear model while the second is for a multiple regression model.
We begin with a simple linear regression example using the telephone
data discussed in Rousseuw and Leroy (1987) These data represent the number of
telephone calls (in tens of millions) placed in Belgium over the years
1950–1973. The data are plotted in Figure 1. There
are several noticeable outliers which are due to a mistake in the
recording units for the years 1964–1969. This is a simple dataset,
containing only one explanatory variable, however it allows us to easily
highlight the package and also demonstrate the robustness to outliers of
the procedure. The main function of the package Rfit is rfit
which,
as the following code segment illustrates, uses syntax similar to lm
.
> library(Rfit)
> data(telephone)
> fit <- rfit(calls ~ year, data = telephone)
> summary(fit)
Call:
rfit(formula = calls ~ year, data = telephone)
Coefficients:
Estimate Std. Error t.value p.value
-284.313842 152.687751 -1.8621 0.07665 .
year 0.145861 0.077842 1.8738 0.07494 .
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared (Robust): 0.3543158
Reduction in Dispersion Test:
12.07238 p-value: 0.00215
> plot(telephone)
> abline(fit)
> abline(lm(calls ~ year, data = telephone),
+ col = 2, lty = 2)
> legend("topleft", legend = c("R", "LS"),
+ col = 1:2, lty = 1:2)
Further, the output is similar to that of lm
and can be interpreted in
the same way. The estimate of slope is 0.146 (tens of millions of calls
per year) with a standard error of 0.078. The
The overlaid fitted regression lines in the scatter plot in Figure 1 demonstrate the robustness of the Wilcoxon fit and the lack of robustness of the least squares fit.
This is a data set from Applied linear statistical models (1987 64) (c.f. Example 3.9.4 of Hettmansperger and McKean (2011)). The response variable is level of free fatty acid in a sample of prepubescent boys. The explanatory variables are age (in months), weight (in lbs), and skin fold thickness. For this discussion, we chose the Wilcoxon (default) scores for Rfit. Based on the residual and Q-Q plots below, however, the underlying error distribution appears to be right-skewed. In a later section we analyze this data set using more appropriate (bent) scores for a right-skewed distribution.
To begin with we demonstrate the reduction in dispersion test discussed in the previous section.
> fitF <- rfit(ffa ~ age + weight + skin,
+ data = ffa)
> fitR <- rfit(ffa ~ skin, data = ffa)
> drop.test(fitF, fitR)
Drop in Dispersion Test
F-Statistic p-value
1.0754e+01 2.0811e-04
As the code segment shows, the syntax is similar to that of the anova
function used for reduced model testing in many of the parametric
packages.
Studentized residuals for rank-based fits are calculated in a way
similar to the LS studentized residuals (see Chapter 3 of Hettmansperger and McKean 2011). We
have implemented these residuals in Rfit and demonstrate their use
next. These are the residuals plotted in the residual and Q-Q plots in
Figure 2 and Figure 3 respectively.
The code is similar to that of least squares analysis. The function
fitted.values
returns the fitted values and residuals
returns the
residuals from the full model fit. The function rstudent
calculates
the studentized residuals.
Common diagnostic tools are the residual plot (Figure 2)
> plot(fitted.values(fitF), rstudent(fitF))
> abline(h = c(-2, 2))
and normal probability plot of the studentized residuals (Figure 3).
> qqnorm(residuals(fitF))
As is shown in the plots, there are several outliers and perhaps the errors are from a right skewed distribution. We revist this example in a later section.
Suppose we want to determine the effect that a single factor
Under these assumptions, the full model can be written as
where the
Observational studies can also be modeled this way. Suppose
The analysis for this design is usually a test of the hypothesis that all the effects are 0, followed by individual comparisons of levels. The hypothesis can be written as
Confidence intervals for the simple contrasts oneway.rfit
is
illustrated in the following example.
Hettmansperger and McKean (2011 295) discuss a study that investigated the effect of four drug
compounds on low density lipid (LDL) cholesterol in quail. The drug
compounds are labeled as I, II, III, and IV. The sample size for each of
the first three levels is 10 while 9 quail received compound IV. The
boxplots shown in Figure 4 attest to a difference in
the LDL levels.
Using Wilcoxon scores, we fit the full model. The summary of the test of
hypotheses ((6)) as computed by the Rfit function
oneway.rfit
follows. The resulting Q-Q plot (Figure 5)
of the studentized residuals indicates that the random errors
> data(quail)
> oneway.rfit(quail$ldl, quail$treat)
Call:
oneway.rfit(y = quail$ldl, g = quail$treat)
Overall Test of All Locations Equal
Drop in Dispersion Test
F-Statistic p-value
3.916404 0.016403
Pairwise comparisons using Rfit
data: quail$ldl and quail$treat
2 3 4
1 - - -
2 1.00 - -
3 0.68 0.99 -
4 0.72 0.99 0.55
P value adjustment method: none
Robust fits based on scores more appropriate than the Wilcoxon for
skewed errors are discussed later. Note that the results from a call to
oneway.rfit
include the results from the call to rfit
.
> anovafit <- oneway.rfit(quail$ldl, quail$treat)
Which may then be used for diagnostic procedures, such as the Q-Q plot of the studentized residuals in Figure 5.
> qqnorm(rstudent(anovafit$fit))
With a
The second stage of an analysis of a one-way design usually consists of
pairwise comparisons of the treatments. The robust
Often there are many comparisons of interest. For example, in the case
of all pairwise comparisons there are
For the quail data, we selected the Tukey-Kramer procedure for all six pairwise comparisons. Use of the code and example output is given below. The multiple comparison part of the output is:
> summary(oneway.rfit(quail$ldl, quail$treat),
+ method = "tukey")
Multiple Comparisons
Method Used tukey
I J Estimate St Err Lower CI Upper CI
1 1 2 -25.00720 8.26813 -47.30553 -2.70886
2 1 3 -3.99983 8.26813 -26.29816 18.29851
3 1 4 -5.00027 8.49469 -27.90963 17.90909
4 2 3 -21.00737 8.26813 -43.30571 1.29096
5 2 4 -20.00693 8.49469 -42.91629 2.90243
6 3 4 1.00044 8.49469 -21.90892 23.90981
The Tukey-Kramer procedure declares that the Drug Compounds I and II differ significantly.
In this section, we consider a raov
computes the
rank-based analysis for all
Let
where
For the two-way model, the three hypotheses of immediate interest are
the main effects hypotheses and the interaction hypothesis. We have
chosen Type III hypotheses which are easy to interpret even for severely
unbalanced designs. Following Hocking (1985), the hypothesis matrices
Consider the data set discussed by Box and D. Cox (1964). The data are the results of a
The output below displays the Wilcoxon ANOVA table, which indicates that
interaction is highly significant,
> data(BoxCox)
> attach(BoxCox)
> fit <- raov(logSurv ~ Treatment + Poison)
> fit
Robust ANOVA Table
DF RD Mean RD F p-value
T 3 2.9814770 0.9938257 21.263421 4.246022e-08
P 2 3.6987828 1.8493914 39.568699 8.157360e-10
T:P 6 0.8773742 0.1462290 3.128647 1.428425e-02
As discussed earlier, we must choose a score function for rank-based
fitting. For most datasets the default option of Wilcoxon scores works
quite well, however, occasionally choosing a different score function
can lead to a more efficient analysis. In this section we first discuss
score functions in general and then illustrate how the user may create
his own score function. We have placed the score functions in an object
of class "scores"
. A "scores"
object consists of two objects of type
function and an optional numeric object. The functions are the score
function phi
and it’s derivative Dphi
. The derivative is necessary
in estimating
> wscores
An object of class "scores"
Slot "phi":
function(u) sqrt(12)*(u-0.5)
Slot "Dphi":
function(u) rep(sqrt(12),length(u))
Slot "param":
NULL
Other score functions included in Rfit are listed in
Table 1. A plot of the bent score functions is provided
in Figure 7. Other score functions can be plotted by
getting the scores using the method getScores
. For example the
commands u<-seq(0.01,0.99,by=0.01)
plot(u,getScores(nscores,u))
graphs the normal scores.
Score | Keyword | Recommended usage |
---|---|---|
Wilcoxon | wscores |
moderate tailed |
Normal | nscores |
light-moderate tailed |
Bent1 | bentscores1 |
highly right skewed |
Bent2 | bentscores2 |
light tailed |
Bent3 | bentscores3 |
highly left skewed |
Bent4 | bentscores4 |
moderately heavy |
tailed |
Next we illustrate how to create the score function for the bent
scores. Bent scores are recommended when the errors come from a skewed
distribution. An appropriate bent score function for skewed distribution
with a right heavy tail is
> bent.phi <- function(u, ...)
+ ifelse(u < 0.5, 8/3 * u - 1, 1/3)
> bent.Dphi <- function(u, ...)
+ ifelse(u < 0.5, 8/3, 0)
> bentscores <- new("scores", phi = bent.phi,
+ Dphi = bent.Dphi)
They are displayed graphically in the top left quadrant of Figure 7.
Below we implement the newly defined score functions using the free
fatty acid data previously analysed using Wilcoxon scores. One could
also use the scores provided by Rfit with option scores=bentscores1
to obtain the same result.
> summary(rfit(ffa ~ age + weight + skin,
+ scores = bentscores, data = ffa))
Call:
rfit.default(formula = ffa ~ age + weight + skin,
scores = bentscores, data = ffa)
Coefficients:
Estimate Std. Error t.value p.value
1.35957548 0.18882744 7.2001 1.797e-08 ***
age -0.00048157 0.00178449 -0.2699 0.7888044
weight -0.01539487 0.00260504 -5.9097 9.176e-07 ***
skin 0.35619596 0.09090132 3.9185 0.0003822 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared (Robust): 0.4757599
Reduction in Dispersion Test:
11.19278 p-value: 2e-05
The results are similar to those presented in Hettmansperger and McKean (2011).
This paper illustrates the usage of a new R package, Rfit, for rank-based estimation and inference. Rank-based methods are robust to outliers and offer the data analyst an alternative to least squares. Rfit includes algorithms for general scores and a library of score functions is included. Functions for regression as well as one-way and multi-way anova are included. We illustrated the use of Rfit on several real data sets.
We are in the process of extending Rfit to include other robust rank-based procedures which are discussed in Chapters 3 and 5 of Hettmansperger and McKean (2011). These include autoregressive timeseries models, cluster correlated data (mixed models), and nonlinear models. We are also developing weighted versions of rank-based estimation that can be used in mixed effects modeling as discussed in Kloke, J. McKean, and M. Rashid (2009) as well as the computation of high breakdown rank-based estimation discussed in Chang et al. (1999).
Distributions, Econometrics, Environmetrics, MixedModels, NumericalMathematics, Optimization, Psychometrics, ReproducibleResearch, Robust, Survival, TeachingStatistics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
D. Kloke & W. McKean, "Rfit: Rank-based Estimation for Linear Models", The R Journal, 2012
BibTeX citation
@article{RJ-2012-014, author = {D. Kloke, John and W. McKean, Joseph}, title = {Rfit: Rank-based Estimation for Linear Models}, journal = {The R Journal}, year = {2012}, note = {https://rjournal.github.io/}, volume = {4}, issue = {2}, issn = {2073-4859}, pages = {57-64} }