Residual diagnostics is an important topic in the classroom, but it is less often used in practice when the response is binary or ordinal. Part of the reason for this is that generalized models for discrete data, like cumulative link models and logistic regression, do not produce standard residuals that are easily interpreted as those in ordinary linear regression. In this paper, we introduce the R package sure, which implements a recently developed idea of SUrrogate REsiduals. We demonstrate the utility of the package in detection of cumulative link model misspecification with respect to mean structures, link functions, heteroscedasticity, proportionality, and interaction effects.
Categorical outcomes are encountered frequently in practice across
different fields. For example, in medical studies, the outcome of
interest is often binary (e.g., presence or absence of a particular
disease after applying a treatment). In other studies, the outcome may
be an ordinal variable; that is, a categorical outcome having a natural
ordering. For instance, in an opinion poll, the response may be
satisfaction with categories low, medium, and high. In this case, the
response is ordered: low
Logistic and probit regression are popular choices for modeling a binary
outcome. Although this paper focuses on models for ordinal responses,
the surrogate approach to constructing residuals actually applies to a
wide class of general models of the form
The cumulative link model is a natural choice for modeling a binary or
ordinal outcome. Consider an ordinal categorical outcome
Another way to interpret the cumulative link model is through a latent
continuous random variable
Link | Distribution of |
||
---|---|---|---|
logit | logistic | ||
probit | standard normal | ||
log-log | Gumbel (max) | ||
complementary log-log | Gumbel (min) | ||
cauchit | Cauchy |
There are a number of R packages that can be used to fit cumulative link
models (1) and (2). The recommended package
MASS (Venables and Ripley 2002)
contains the function polr
(proportional odds logistic regression)
which, despite the name, can be used with all of the link functions
described in Table 1. The
VGAM package
(Yee 2017) has the vglm
function for fitting vector generalized
linear models, which includes the broad class of cumulative link models.
By default, vglm
uses the same parameterization as in
Equation (1), but provides the option of using the
parameterization seen in Equation (2); this will result in
the estimated coefficients having the opposite sign. Package
ordinal
(Christensen 2015) has the clm
function for fitting cumulative link
models. The popular rms
package (Harrell 2017) has two functions: lrm
for fitting logistic
regression and cumulative link models using the logit link, and orm
for fitting ordinal regression models. Both of these functions use the
parameterization seen in Equation (2).
The remainder of the paper is organized as follows. In the next section, we discuss the idea of surrogate residuals (Liu and Zhang 2017) and talk about some important properties. Next, we briefly discuss jittering, bootstrapping, and how they apply to the surrogate approach. The Section “3” introduces the sure package and discusses the various modeling packages it supports. The sections following demonstrate how sure can be used to detect misspecified mean structures, heteroscedasticity, misspecified link functions, and interaction effects, as well as check the proportionality assumption. The Section “4” provides a real data analysis example. We end with a closing summary.
For a continuous outcome
Liu and Zhang (2017) propose a new type of residual that is based on a
continuous variable
Furthermore, it can be shown (Liu and Zhang 2017) that if the
hypothesized model agrees with the true model, then
symmetry around zero
homogeneity
reference distribution the empirical distribution of
According to property (a), if
The latent method discussed in Section “2” applies to
cumulative link models for ordinal outcomes. For more general models, we
can define a surrogate using a technique called jittering. Suppose the
true model for an ordinal outcome
Liu and Zhang (2017) suggest defining the surrogate
jittering on the outcome scale:
jittering on the probability scale:
Once a surrogate is obtained, we define the surrogate residuals in the
same way as Equation (4). In either case, if the
hypothesized model is correct, then symmetry around zero still holds;
that is
Since the surrogate residuals are based on random sampling, additional variability is introduced. One way to account for this sample variability and help stabilize any patterns in diagnostic plots is to use the bootstrap (Efron 1979).
The procedure for bootstrapping surrogate residuals is similar to the
model-based bootstrap algorithm used in linear regression. To obtain the
Perform a standard case-wise bootstrap of the original data to
obtain the bootstrap sample
Using the procedure outlined in the previous section, obtain a
sample of surrogate residuals
This procedure is repeated a total of
The sure package supports a variety of R packages for fitting cumulative link and other types of models. The supported packages and their corresponding functions are described in Table 2.
Package | Function(s) | Model | Parameterization |
---|---|---|---|
stats | glm |
binary regression | NA |
MASS | polr |
cumulative link | |
rms | lrm |
cumulative link | |
lrm |
logistic regression | NA | |
orm |
cumulative link | ||
ordinal | clm |
cumulative link | |
VGAM | vglm |
cumulative link | |
vgam |
cumulative link |
The sure package currently exports four functions:
resids
—for constructing surrogate residuals;
surrogate
—for generating the surrogate response values used in
the residuals;
autoplot
—for producing various diagnostic plots using
ggplot2 graphics
(Wickham 2009);
gof
—for simulating
In addition, the package also includes five simulated data sets: df1
,
df2
, df3
, df4
, and df5
. These data sets are used throughout the
paper to demonstrate how the surrogate residual can be useful as a
diagnostic tool for cumulative link models. The R code used to generate
these data sets is available on the projects GitHub page:
https://github.com/koalaverse/sure/blob/master/data-raw/data.R.
For illustration, the data frame df1
contains df1
data frame from the
sure package and are loaded automatically with the package; see
?df1
for details. Below, we fit a (correctly specified) probit model
using the polr
function from the MASS package.
# Fit a cumulative link model with probit link
library(sure) # for residual function and sample data sets
library(MASS) # for polr function
fit.polr <- polr(y ~ x + I(x ^ 2), data = df1, method = "probit")
The code chunk below obtains the SBS residuals (3) from
the previously fitted probit model fit.polr
using the PResiduals
package and constructs a couple of diagnostic plots. The results are
displayed in Figure 1.
# Obtain the SBS/probability-scale residuals
library(PResiduals)
pres <- presid(fit.polr)
# Residual-vs-covariate plot and Q-Q plot
library(ggplot2) # for plotting
p1 <- ggplot(data.frame(x = df1$x, y = pres), aes(x, y)) +
geom_point(color = "#444444", shape = 19, size = 2, alpha = 0.5) +
geom_smooth(color = "red", se = FALSE) +
ylab("Probability-scale residual")
p2 <- ggplot(data.frame(y = pres), aes(sample = y)) +
stat_qq(distribution = qunif, dparams = list(min = -1, max = 1), alpha = 0.5) +
xlab("Sample quantile") +
ylab("Theoretical quantile")
grid.arrange(p1, p2, ncol = 2) # Figure 1
(Note: the reference distribution for the SBS residual is the
df1
data set. Left: Residual-vs-covariate plot with
a nonparametric smooth (red curve). Right: Q-Q plot of the
residuals.Similarly, we can use the resids
function in package sure to
obtain the surrogate residuals discussed in
Section “2.” This is illustrated in the following code
chunk; the results are displayed in
Figure 2. (Note: since the surrogate
residuals are based on random sampling, we specify the seed via the
set.seed
function throughout this paper for reproducibility.)
# Obtain surrogate residuals
library(sure)
set.seed(101) # for reproducibility
sres <- resids(fit.polr)
# Residual-vs-covariate plot and Q-Q plot
library(ggplot2) # needed for autoplot function
p1 <- autoplot(sres, what = "covariate", x = df1$x, xlab = "x")
p2 <- autoplot(sres, what = "qq", distribution = qnorm)
grid.arrange(p1, p2, ncol = 2) # Figure 2
df1
data set. Left: Residual-vs-covariate
plot with a nonparametric smooth (red curve). Right: Q-Q plot of the
residuals.The sure package also includes autoplot
methods for the various
classes of models listed in Table 2, so that the user can
give autoplot
the fitted model directly. The benefit of this approach
is that the fitted values and reference distribution (used in Q-Q plots)
are automatically extracted. For example, to reproduce the Q-Q plot in
Figure 2, we could have used the
following:
set.seed(101) # for reproducibility
autoplot(fit.polr, what = "qq") # same as top right of Figure 1
Suppose that we did not include the quadratic term in our fitted model.
We would expect a residual-vs-
fit.polr <- update(fit.polr, y ~ x) # remove quadratic term
The SBS residuals gives some indication of a misspecified mean
structure, but this only becomes more clear with increasing
One issue that often raises concerns in statistical inference is that of
heteroscedasticity; that is, when the error term has non constant
variance. Heteroscedasticity can bias the statistical inference and lead
to improper standard errors, confidence intervals, and
As discussed in Section “2,” one of the properties of
the surrogate residual
For this example, we generated df2
data frame loaded
with the sure package; see ?df2
for details.
The following block of code uses the orm
function from the popular
rms package to fit a probit model to the simulated data. Note: we
set x = TRUE
in the call to orm
in order to use the presid
function later.
# Fit a cumulative link model with probit link
library(rms) # for orm function
fit.orm <- orm(y ~ x, data = df2, family = "probit", x = TRUE)
If heteroscedasticity is present, we would expect this to show up in
various diagnostic plots, such as a residual-vs-covariate plot. Below we
obtain the SBS and surrogate residuals as before and plot them against
set.seed(102) # for reproducibility
p1 <- autoplot(resids(fit.orm), what = "covariate", x = df2$x, xlab = "x")
p2 <- ggplot(data.frame(x = df2$x, y = presid(fit.orm)), aes(x, y)) +
geom_point(color = "#444444", shape = 19, size = 2, alpha = 0.25) +
geom_smooth(col = "red", se = FALSE) +
ylab("Probability scale residual")
grid.arrange(p1, p2, ncol = 2) # Figure 4
In Figure 4, it is clear from the plot of the
surrogate residuals (left side of Figure 4)
that the variance increases with
As outlined in Section “2.1,” the jittering technique is
broadly applicable to virtually all parametric and nonparametric models
for ordinal responses. To illustrate, the code chunk below uses the
VGAM package to fit a vector generalized additive model to the
same data using a nonparametric smooth for
library(VGAM) # for vgam and vglm functions
fit.vgam <- vgam(y ~ s(x), family = cumulative(link = probit, parallel = TRUE),
data = df2)
To obtain a surrogate residual using the jittering technique, we can set
method = "jitter"
in the call to resids
or autoplot
. There is also
the option jitter.scale
which can be set to either "probability"
,
for jittering on the probability scale (the default), or "response"
,
for jittering on the response scale. In the code chunk below, we use the
autoplot
function to obtain residual-by-covariate plots using both
types of jittering. The results (Figure 5)
indicate that the variance increases with increasing
set.seed(103) # for reproducibility
p1 <- autoplot(fit.vgam, what = "covariate", x = df2$x, method = "jitter",
xlab = "x")
p2 <- autoplot(fit.vgam, what = "covariate", x = df2$x, method = "jitter",
jitter.scale = "response", xlab = "x")
grid.arrange(p1, p2, ncol = 2) # Figure 5
For this example, we simulated df3
within the package; see ?df3
for details.
We fit a model with various link functions, where the true link function
is log-log. From these output, we construct Q-Q plots of the residuals
using
# Fit models with various link functions to the simulated data
fit.probit <- polr(y ~ x + I(x ^ 2), data = df3, method = "probit")
fit.logistic <- polr(y ~ x + I(x ^ 2), data = df3, method = "logistic")
fit.loglog <- polr(y ~ x + I(x ^ 2), data = df3, method = "loglog") # correct link
fit.cloglog <- polr(y ~ x + I(x ^ 2), data = df3, method = "cloglog")
# Construct Q-Q plots of the surrogate residuals for each model
set.seed(1056) # for reproducibility
p1 <- autoplot(fit.probit, nsim = 100, what = "qq")
p2 <- autoplot(fit.logistic, nsim = 100, what = "qq")
p3 <- autoplot(fit.loglog, nsim = 100, what = "qq")
p4 <- autoplot(fit.cloglog, nsim = 100, what = "qq")
# Figure 6
grid.arrange(p1, p2, p3, p4, ncol = 2) # bottom left plot is correct model
From the Q-Q plots in Figure 6, it is clear the the model with the log-log link (which corresponds to Gumbel (max) errors in the latent variable formulation) is the most appropriate, while the other plots indicate deviations from the hypothesized model.
Alternatively, we could also use the surrogate residuals to make use of
existing distance-based goodness-of-fit (GOF) tests; for example, the
Kolmogorov-Smirnov distance. The gof
function in sure can be
used to produce simulated
Currently, the gof
function supports three goodness-of-fit tests: the
Kolmogorov-Smirnov test (test = "ks"
), the Anderson-Darling test
(test = "ad"
), and the Cramer-Von Mises test (test = "cvm"
). Below,
we use the gof
function to simulate nsim
to
100 to produce smoother plots and reduce the sampling error induced by
the surrogate procedure. The plot
method is then used to display the
empirical distribution function (EDF) of the simulated "gof"
objects uses base R graphics; hence, we
can use the par
function to set various graphical parameters.)
# Figure 7
par(mfrow = c(2, 2), mar = c(2, 4, 2, 2) + 0.1)
set.seed(8491) # for reproducibility
plot(gof(fit.probit, nsim = 100, test = "ad"), main = "")
plot(gof(fit.logistic, nsim = 100, test = "ad"), main = "")
plot(gof(fit.loglog, nsim = 100, test = "ad"), main = "")
plot(gof(fit.cloglog, nsim = 100, test = "ad"), main = "")
An important feature of the cumulative link model (1) is
the proportional odds assumption, which assumes that the mean structure,
To illustrate, we generated 2000 observations from each of the following
probit models
df4
within the package; see ?df4
for details.
Checking the proportionality assumption here amounts to checking whether
or not surrogate
function to generate the surrogate response
values directly (as opposed to the residuals) and generate the
# Fit separate models (VGAM should already be loaded)
fit1 <- vglm(y ~ x, data = df4[1:2000, ],
cumulative(link = probit, parallel = TRUE))
fit2 <- update(fit1, data = df4[2001:4000, ])
# Generate surrogate response values
set.seed(8671) # for reproducibility
s1 <- surrogate(fit1)
s2 <- surrogate(fit2)
# Figure 8
ggplot(data.frame(D = s1 - s2, x = df4[1:2000, ]$x) , aes(x = x, y = D)) +
geom_point(color = "#444444", shape = 19, size = 2) +
geom_smooth(se = FALSE, size = 1.2, color = "red")
From Figure 8, it is clear that
A common challenge in model building is determining whether or not there are important interactions between the predictors in the data. Using the surrogate residuals, it is rather straightforward to determine if such an interaction effect is missing from the assumed model.
For illustration, we generated Treatment
and Control
. The simulated data are available in
the df5
data frame loaded with the sure package; see ?df5
for
details. Below, we fit two probit models using the clm
function from
the ordinal package. The first model corresponds to the simulated
control group, while the second model corresponds to the treatment
group.
library(ordinal) # for clm function
fit1 <- clm(y ~ x1, data = df5[df5$x2 == "Control", ], link = "probit")
fit2 <- clm(y ~ x1, data = df5[df5$x2 == "Treatment", ], link = "probit")
If the true model contains an interaction term ggplot
along with the sure
function to construct such
a plot in Figure 9. The plot indicates a negative
association between
# Figure 9
set.seed(1105) # for reproducibility
df5$s <- c(surrogate(fit1), surrogate(fit2)) # surrogate response values
ggplot(df5, aes(x = x1, y = s)) +
geom_point(color = "#444444", shape = 19, size = 2, alpha = 0.5) +
geom_smooth(se = FALSE, size = 1.2, color = "red2") +
facet_wrap( ~ x2) +
ylab("Surrogate response") +
xlab(expression(x[1]))
Randall (1989) performed an experiment on factors determining the
bitterness of wine. Two binary treatment factors, temperature and
contact (between juice and skin), were controlled while crushing the
grapes during wine production. Nine judges each assessed wine from two
bottles from each of the four treatment conditions; for a total of
ordinal::wine
for details.
data(wine, package = "ordinal") # load wine data set
wine.clm <- clm(rating ~ temp + contact, data = wine, link = "probit")
Since both of the covariates in this model are binary factors,
scatterplots are not appropriate for displaying the
residual-by-covariate relationships. Instead, the autoplot
function in
sure
uses boxplots; a future release is likely to include the
additional option for producing nonparametric densities for each level
of a factor. The code chunk below uses autoplot
along with
grid.arrange
to produce some standard residual diagnostic plots. The
results are displayed in Figure 10. The Q-Q plot and
residual-vs-fitted value plot do not indicate any serious model
misspecifications. Furthermore, the boxplots reveal that the medians of
the surrogate residuals are very close to zero, and the distribution of
the residuals within each level appear to be symmetric and have
approximately the same variability (with the exception of a few
outliers).
set.seed(1225) # for reproducibility
grid.arrange( # Figure 10
autoplot(wine.clm, nsim = 10, what = "qq"),
autoplot(wine.clm, nsim = 10, what = "fitted", alpha = 0.5),
autoplot(wine.clm, nsim = 10, what = "covariate", x = wine$temp,
xlab = "Temperature"),
autoplot(wine.clm, nsim = 10, what = "covariate", x = wine$contact,
xlab = "Contact"),
ncol = 2
)
In this paper, we introduce the sure package, which implements the surrogate approach to residuals for ordinal regression models described in Liu and Zhang (2017). Using simulated data sets, we illustrate the various properties of these residuals and how they can be used to check assumptions in the ordinal regression model (e.g., heteroscedasticity, misspecified link functions, etc.) using continuous residuals which produce diagnostic plots not unlike those seen in ordinary linear regression. This offers a new way of performing typical diagnostic checks for ordinal regression models that are easily interpreted by the analyst. Furthermore, this new approach to constructing residuals for ordinal regression models is still under active development and new functionality will be added to sure in the future.
The authors would like to thank the anonymous reviewers and the Editor for their helpful comments and suggestions.
MASS, VGAM, ordinal, rms, PResiduals, sure, ggplot2
Distributions, Econometrics, Environmetrics, ExtremeValue, MixedModels, NumericalMathematics, Phylogenetics, Psychometrics, ReproducibleResearch, Robust, Spatial, Survival, TeachingStatistics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Greenwell, et al., "Residuals and Diagnostics for Binary and Ordinal Regression Models: An Introduction to the sure Package", The R Journal, 2018
BibTeX citation
@article{RJ-2018-004, author = {Greenwell, Brandon M. and McCarthy, Andrew and Boehmke, Bradley C. and Liu, Dungang}, title = {Residuals and Diagnostics for Binary and Ordinal Regression Models: An Introduction to the sure Package}, journal = {The R Journal}, year = {2018}, note = {https://rjournal.github.io/}, volume = {10}, issue = {1}, issn = {2073-4859}, pages = {381-394} }