When a linear model is rank-deficient, predictions based on that model become questionable, because not all predictions are uniquely estimable. However, some of them are, and the estimability package provides tools that package developers can use to tell which is which. With these tools, a model object's predict method could return estimable predictions as-is while flagging non-estimable ones in some way, so that the user knows which predictions to believe. As a demonstration, the estimability package also provides an estimability-enhanced epredict method to use in place of predict for models fitted using the stats package.
Consider a linear regression or mixed model whose fixed component has the matrix form E(y) = Xβ, where X is the model matrix and β is the vector of regression coefficients.
An example will help illustrate the issues. In the following commands,
we create four predictors x1
–x4
and a response variable y
:
> x1 <- -4 : 4
> x2 <- c(-2, 1, -1, 2, 0, 2, -1, 1,-2)
> x3 <- 3 * x1 - 2 * x2
> x4 <- x2 - x1 + 4
> y <- 1 + x1 + x2 + x3 + x4 + c(-.5, .5, .5, -.5, 0, .5, -.5, -.5, .5)
Clearly, x3 and x4 depend linearly on x1, x2, and the intercept. Let us fit two versions of the same model to these data, entering the predictors in different orders, and compare the regression coefficients:
> mod1234 <- lm(y ~ x1 + x2 + x3 + x4)
> mod4321 <- lm(y ~ x4 + x3 + x2 + x1)
> zapsmall(rbind(b1234 = coef(mod1234), b4321 = coef(mod4321)[c(1, 5:2)]))
(Intercept) x1 x2 x3 x4
b1234 5 3 0 NA NA
b4321 -19 NA NA 3 6
Note that in each model, two regression coefficients are NA. This indicates that the associated predictors were excluded due to their linear dependence on the others.
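We can confirm this dependence directly by checking the numerical rank of the full model matrix. (This quick base-R check is added here for illustration; it is not part of the original example.)

```r
# Recreate the example data and inspect the rank of the model matrix
x1 <- -4:4
x2 <- c(-2, 1, -1, 2, 0, 2, -1, 1, -2)
x3 <- 3 * x1 - 2 * x2      # linearly dependent on x1 and x2
x4 <- x2 - x1 + 4          # linearly dependent on x1, x2, and the intercept
X <- cbind(1, x1, x2, x3, x4)
qr(X)$rank                 # 3, not 5: lm() must set two coefficients to NA
```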
The problem that concerns us is making predictions on new values of the predictors. Here are some predictor combinations to try:
> testset <- data.frame(
+ x1 = c(3, 6, 6, 0, 0, 1),
+ x2 = c(1, 2, 2, 0, 0, 2),
+ x3 = c(7, 14, 14, 0, 0, 3),
+ x4 = c(2, 4, 0, 4, 0, 4))
And here is what happens when we make the predictions:
> cbind(testset,
+ pred1234 = predict(mod1234, newdata = testset),
+ pred4321 = suppressWarnings(predict(mod4321, newdata = testset)))
x1 x2 x3 x4 pred1234 pred4321
1 3 1 7 2 14 14
2 6 2 14 4 23 47
3 6 2 14 0 23 23
4 0 0 0 4 5 5
5 0 0 0 0 5 -19
6 1 2 3 4 8 14
Warning message:
In predict.lm(mod1234, new = testset) :
prediction from a rank-deficient fit may be misleading
Note that the two models produce identical predictions in some cases (rows 1, 3, and 4), and different ones in the others. There is also a warning message (we suppressed this the second time) telling us that this could happen.
It happens that cases 1, 3, and 4 are estimable, and the others are not.
It would be helpful to know which predictions to trust, rather than receiving only a vague warning. The epredict function in estimability (Lenth 2015) accomplishes this:
> require("estimability")
> rbind(epred1234 = epredict(mod1234, newdata = testset),
+ epred4321 = epredict(mod4321, newdata = testset))
1 2 3 4 5 6
epred1234 14 NA 23 5 NA NA
epred4321 14 NA 23 5 NA NA
The results for non-estimable cases are indicated by NAs. Note that in both models, the same new-data cases are identified as estimable, and they are the ones that yield the same predictions. Note also that estimability was determined separately for each model: we do not need to compare two different fits to determine estimability.
It is a simple matter to add estimability checking to the predict method(s) in a new or existing package:
1. The package should import the estimability package.
2. Use estimability::nonest.basis to obtain the basis for non-estimable functions of the regression coefficients, preferably in the model-fitting code.
3. In the predict method, call estimability::is.estble on the model matrix for new data. (It is not necessary to check estimability of predictions on the original data.)
This is implemented in code in a manner something like this:

fitmodel <- function(...) {
    ...
    # code to create:
    #   'X'      (the fixed-effects model matrix)
    #   'QR'     (QR decomposition of 'X')
    #   'object' (the fitted model object)
    ...
    object$nonest <- estimability::nonest.basis(QR)
    object
}

predict.mymodel <- function(object, newdata, tol = 1e-8, ...) {
    ...
    if (!missing(newdata)) {
        ...
        # code to create:
        #   'newX' (the model matrix for 'newdata')
        ...
        estble <- estimability::is.estble(newX, object$nonest, tol)
        ...
        # code to flag non-estimable predictions
        ...
    }
    ...
}
The nonest.basis function returns a matrix whose columns span the null space of the model matrix. It is a generic function, with methods that can be called with objects of several classes, including 'qr' objects and matrices. The is.estble function tests each row of its first argument for estimability against the null basis provided in its second argument, and returns a logical vector. A third argument may be added to adjust the tolerance used in this test. It is important that all columns of the model matrix be included. In particular, do not exclude the columns corresponding to NAs in the coefficient estimates, as they are needed for the estimability testing. Typically, the results of is.estble would be used to skip any non-estimable predictions and replace them with NA.
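To make these semantics concrete, here is a small base-R analogue of the two functions, applied to the example data. (This is only an illustration of the underlying idea, with hypothetical names null_basis and is_est; it is not the package's implementation.)

```r
# Base-R analogue of nonest.basis()/is.estble() (illustration only)
null_basis <- function(X, tol = 1e-8) {
  s <- svd(X, nu = 0)
  r <- sum(s$d > tol * s$d[1])                    # numerical rank of X
  s$v[, seq_len(ncol(X) - r) + r, drop = FALSE]   # basis for the null space of X
}
is_est <- function(newX, nb, tol = 1e-8) {
  # a row lambda is estimable iff it is orthogonal to the null space of X
  apply(newX, 1, function(lam) sum(crossprod(nb, lam)^2) < tol * sum(lam^2))
}
x1 <- -4:4
x2 <- c(-2, 1, -1, 2, 0, 2, -1, 1, -2)
X  <- cbind(1, x1, x2, 3 * x1 - 2 * x2, x2 - x1 + 4)  # all columns included
newX <- cbind(1, x1 = c(3, 6, 6, 0, 0, 1), x2 = c(1, 2, 2, 0, 0, 2),
              x3 = c(7, 14, 14, 0, 0, 3), x4 = c(2, 4, 0, 4, 0, 4))
is_est(newX, null_basis(X))   # TRUE FALSE TRUE TRUE FALSE FALSE
```

Note that the intercept column and the dependent columns are all kept in both matrices, as the text above requires.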
If there is no rank deficiency in the model, nonest.basis returns estimability::all.estble, which is a trivial null basis (still of 'matrix' class) with which is.estble will return all TRUE. The nonest.basis function, when called with a 'qr' object, can immediately detect whether there is a rank deficiency, and returns all.estble if there is none. If your model-fitting algorithm does not use the QR decomposition, it may be worth including additional code to check for non-singular models and, in that case, set the null basis to estimability::all.estble directly, rather than have nonest.basis perform the additional computation to figure this out.
The sample code above is typical for the S3 object model. If S4 objects are being used, you may want instead to include a slot of class 'matrix' for nonest, or to incorporate nonest as part of some other slot.
To illustrate briefly, consider the previous example. The null basis for
mod1234
is obtained as follows:
> (nb <- nonest.basis(mod1234$qr))
[,1] [,2]
[1,] 0.29555933 -0.9176629
[2,] 0.76845426 0.2294157
[3,] -0.48767290 -0.2294157
[4,] -0.28078136 0.0000000
[5,] -0.07388983 0.2294157
and we can test estimability for the data in testset
by converting it
to a matrix and appending a column for the intercept:
> newX <- cbind(1, as.matrix(testset))
> is.estble(newX, nb)
[1] TRUE FALSE TRUE TRUE FALSE FALSE
The theory of estimability in linear models is well established and can be found in almost any text on linear models, such as the classics by Searle (1997) and Seber and Lee (2003). In this discussion, I will make specific references to portions of a more recent text, Monahan (2008).
Before delving in, though, it is worth noting (based on what is said in some contributed packages' documentation and code) that there seems to be some confusion in the R community between non-estimability of a linear function of the coefficients and the presence of NA values in the coefficient estimates. In fact, while NA usually refers to unknown values, that is not the case here. An NA regression coefficient signals a coefficient that is constrained to zero in order to obtain an estimate; and when a linear function is estimable, there is nothing wrong with a prediction that involves some of the NAs in the estimates, treated as the zeros they represent.
A linear function of the coefficients, λ'β, is estimable if and only if λ lies in the row space of the model matrix X; equivalently, if and only if λ is orthogonal to every vector in the null space of X. When λ'β is non-estimable, different ways of resolving the rank deficiency produce different values for its estimate, which is exactly what we observed with mod1234 and mod4321. From the above result, we can establish estimability of λ'β by obtaining a basis N for the null space of X and testing whether N'λ = 0, up to a numerical tolerance. SAS applies a comparable criterion, for example in its ESTIMATE statement and its SINGULAR option.

In the estimability package, we obtain the null basis from the QR decomposition of X, in part because that decomposition is already available in a fitted 'lm' object. Suppose that X has p columns and rank r, and that its pivoted QR decomposition is XP = QR, where P is a permutation matrix and only the first r rows of R, partitioned as [R1 | R2] with R1 an r × r nonsingular upper triangular matrix, are nonzero. Now, let N be the p × (p − r) matrix obtained by stacking −R1⁻¹R2 on top of the (p − r) × (p − r) identity matrix and permuting the rows according to P. Then XN = 0, so the columns of N span the null space of X. To test estimability of λ'β, we simply test whether N'λ = 0, within a reasonable tolerance.
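A standard way to obtain such a null basis from a pivoted QR decomposition can be sketched in base R as follows. (This mirrors the textbook construction and is offered only as an illustration; it is not the package's actual code.)

```r
# Null basis from a pivoted QR decomposition: if X[, pivot] = QR with
# R = [R1 R2; 0 0] (R1 r-by-r, nonsingular), then the matrix N with
# N[pivot, ] = rbind(-solve(R1) %*% R2, I) satisfies X %*% N = 0.
x1 <- -4:4
x2 <- c(-2, 1, -1, 2, 0, 2, -1, 1, -2)
X <- cbind(1, x1, x2, 3 * x1 - 2 * x2, x2 - x1 + 4)
qrX <- qr(X)                      # base R's qr() pivots rank-deficient matrices
r <- qrX$rank
p <- ncol(X)
R1 <- qr.R(qrX)[1:r, 1:r]
R2 <- qr.R(qrX)[1:r, (r + 1):p, drop = FALSE]
N <- matrix(0, p, p - r)
N[qrX$pivot, ] <- rbind(-backsolve(R1, R2), diag(p - r))
max(abs(X %*% N))                 # effectively zero: columns of N lie in the null space
```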
The estimability package's epredict function serves as a demonstration of adding estimability checking to the predict methods in the stats package. It works for 'lm', 'aov', 'glm', and 'mlm' objects. It also provides additional options for the type argument of predict. When newdata is provided, type = "estimability" returns a logical vector showing which rows of newdata are estimable, and type = "matrix" returns the model matrix for newdata.
An accompanying enhancement is eupdate, which runs update on a model object and also adds a nonest member. Subsequent epredict calls on this object will use that as the null basis instead of having to reconstruct it. For example,
> mod123 <- eupdate(mod1234, . ~ . - x4)
> mod123$nonest
[,1]
[1,] 0.0000000
[2,] -0.8017837
[3,] 0.5345225
[4,] 0.2672612
Now let’s find out which predictions in testset
are estimable with
this model:
> epredict(mod123, newdata = testset, type = "estimability")
1 2 3 4 5 6
TRUE TRUE TRUE TRUE TRUE FALSE
Thus, more of these test cases are estimable when we drop x4 from the model. Looking at testset, this makes sense, because the two previously non-estimable rows that are now estimable (rows 2 and 5) differ from estimable rows only in their values of x4.
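As a base-R cross-check (an addition, not part of the article's code), the null basis printed above is just (0, −3, 2, 1)/√14, reflecting the single dependency x3 = 3·x1 − 2·x2 in the reduced model, and testing the rows of testset against it reproduces the same pattern:

```r
# The reduced model y ~ x1 + x2 + x3 has one dependency: x3 = 3*x1 - 2*x2,
# so its null basis is proportional to (0, -3, 2, 1)
nb <- c(0, -3, 2, 1) / sqrt(14)
newX <- cbind(1,
              x1 = c(3, 6, 6, 0, 0, 1),
              x2 = c(1, 2, 2, 0, 0, 2),
              x3 = c(7, 14, 14, 0, 0, 3))
as.vector(abs(newX %*% nb) < 1e-8)   # TRUE TRUE TRUE TRUE TRUE FALSE
```

A row is estimable exactly when it is orthogonal to this single null vector, which is why only row 6 fails.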
In rank-deficient linear models used for predictions on new predictor sets, it is important for users to know which predictions are to be trusted, and which of them would change if the same model had been specified in a different way. The estimability package provides an easy way to add this capability to new and existing model-fitting packages.
For attribution, please cite this work as

Lenth, R. V., "Estimability Tools for Package Developers", The R Journal, 2015

BibTeX citation

@article{RJ-2015-016,
  author  = {Lenth, Russell V.},
  title   = {Estimability Tools for Package Developers},
  journal = {The R Journal},
  year    = {2015},
  note    = {https://rjournal.github.io/},
  volume  = {7},
  issue   = {1},
  issn    = {2073-4859},
  pages   = {195-199}
}