Weighted generalized ridge regression offers unique advantages in correlated high-dimensional problems. Such estimators can be efficiently computed using Bayesian spike and slab models and are effective for prediction. For sparse variable selection, a generalization of the elastic net can be used in tandem with these Bayesian estimates. In this article, we describe the R-software package spikeslab for implementing this new spike and slab prediction and variable selection methodology.
The expression spike and slab, originally coined by Mitchell and Beauchamp (1988), refers to a type of prior used for the regression coefficients in linear regression models (see also Lempers 1971). In Mitchell and Beauchamp (1988), this prior assumed that the regression coefficients were mutually independent with a two-point mixture distribution made up of a uniform flat distribution (the slab) and a degenerate distribution at zero (the spike). George and McCulloch (1993) used a different prior for the regression coefficients, involving a scale (variance) mixture of two normal distributions. In particular, the use of a normal prior was instrumental in facilitating efficient Gibbs sampling of the posterior. This made spike and slab variable selection computationally attractive and heavily popularized the method.
As pointed out in Ishwaran and Rao (2005), normal scale-mixture priors, such as those used by George and McCulloch (1993), constitute a wide class of models termed spike and slab models. Ishwaran and Rao (2005) extended spike and slab models to the class of rescaled spike and slab models. Rescaling was shown to induce a non-vanishing penalization effect which, when used in tandem with a continuous bimodal prior, confers useful model selection properties on the posterior mean of the regression coefficients (Ishwaran and Rao 2005, 2010).
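To make the normal scale-mixture idea concrete, here is a toy base-R sketch of drawing coefficients from a two-component mixture prior. The mixing weight and the two variances are illustrative values chosen for this sketch, not the package's hyperparameters:

```r
# Draw n coefficients from a two-point normal scale mixture:
# a near-degenerate "spike" at zero and a diffuse "slab".
# w, v.spike and v.slab are made-up illustrative values.
r.spike.slab <- function(n, w = 0.5, v.spike = 0.001, v.slab = 10) {
  in.slab <- rbinom(n, size = 1, prob = w)   # 1 = slab component
  rnorm(n, mean = 0, sd = sqrt(ifelse(in.slab == 1, v.slab, v.spike)))
}

set.seed(1)
b <- r.spike.slab(10000)
# Roughly half the draws cluster tightly near zero (the spike);
# the remainder are widely dispersed (the slab).
```

The bimodal shape of such draws is what allows the posterior to separate near-zero coefficients from genuinely nonzero ones.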
Recently, Ishwaran and Rao (2010) considered the geometry of generalized ridge regression (GRR), a method introduced by Hoerl and Kennard (1970a,b) to overcome ill-conditioned regression settings. This analysis showed that GRR possesses unique advantages in high-dimensional correlated settings, and that weighted GRR (WGRR), a generalization of GRR, is potentially even more effective. Noting that the posterior mean of the regression coefficients from a rescaled spike and slab model is a type of WGRR estimator, they showed that this WGRR estimator, referred to as the Bayesian model averaged (BMA) estimator, when coupled with dimension reduction, yields low test-set mean-squared error compared to the elastic net (Zou and Hastie 2005).
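To fix ideas, the classical GRR estimate solves (X'X + Λ)β = X'y, where Λ is a diagonal matrix of ridge parameters, one per coefficient (ordinary ridge regression takes them all equal). The following is a minimal base-R sketch of this textbook formula, not the package's implementation:

```r
# Generalized ridge regression (Hoerl-Kennard): lambda may be a
# scalar (ordinary ridge) or a length-p vector giving each
# coefficient its own ridge parameter.
grr <- function(X, y, lambda) {
  solve(crossprod(X) + diag(lambda, nrow = ncol(X)), crossprod(X, y))
}

X <- diag(2)           # toy orthogonal design
y <- c(1, 2)
grr(X, y, lambda = 0)  # reduces to ordinary least squares: 1, 2
grr(X, y, lambda = 1)  # shrunk toward zero: 0.5, 1
```

Allowing the ridge parameters to differ across coefficients is what lets GRR adapt to correlated designs, shrinking some directions far more aggressively than others.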
Additionally, Ishwaran and Rao (2010) introduced a generalization of the elastic net, which they coined the gnet (short for generalized elastic net). The gnet is the solution to a least-squares penalization problem in which the penalization involves an overall regularization parameter together with individual regularization parameters, one for each coefficient, in the style of WGRR.
The gnet and BMA estimators represent attractive solutions for modern day high-dimensional data settings. These problems often involve correlated variables, in part due to the nature of the data, and in part as an artifact of the dimensionality [see Cai and Lv (2007) and Fan and Lv (2008) for a detailed discussion of high-dimensional correlation]. The BMA is attractive because it addresses the issue of correlation by drawing upon the properties of WGRR estimation, a strength of the Bayesian approach, while the gnet achieves sparse variable selection by drawing upon the principle of soft-thresholding, a powerful frequentist regularization concept.
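The soft-thresholding principle mentioned above can be stated in a few lines of base R (the generic textbook operator, not code from the package):

```r
# Soft-thresholding operator: shrinks z toward zero by lambda and
# sets values within lambda of zero exactly to zero -- the mechanism
# behind sparse, lasso-type variable selection.
soft.threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

soft.threshold(c(-3, -0.5, 0.2, 2), lambda = 1)  # -2  0  0  1
```

Coefficients whose least-squares estimates fall inside the threshold are set exactly to zero, which is how a continuous penalty yields a sparse model.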
Because high-dimensional data are becoming increasingly common, it is valuable to have user-friendly software for computing the gnet and BMA estimators. With this in mind, we have developed an R package, spikeslab, implementing this methodology (Ishwaran, Kogalur and Rao 2010).
The main purpose of this article is to describe this package. Because this new spike and slab approach may be unfamiliar to users in the R community, we start by giving a brief high-level description of the algorithm [readers should consult Ishwaran and Rao (2010) for further details]. We then highlight some of the package's key features, illustrating its use in both low- and high-dimensional settings.
The spikeslab R package implements the rescaled spike and slab algorithm described in Ishwaran and Rao (2010). This algorithm involves three key steps:

1. Filtering (dimension reduction).
2. Model Averaging (BMA).
3. Variable Selection (gnet).
Step 1 filters all but the top n × bigp.smalln.factor variables, where n is the sample size and bigp.smalln.factor is a user-specified option, by default set to 1.
Step 2 fits a rescaled spike and slab model using only those variables that are not filtered in Step 1. Model fitting is implemented using a Gibbs sampler. Computational times are generally rapid, as the number of variables at this point is a fraction of the original dimension.
Step 3 calculates the gnet. In the optimization, the gnet's lars wrapper is called with type="lar" to produce the full LAR path solution (Efron, Hastie, Johnstone and Tibshirani 2004). The gnet is defined as the model in this path solution minimizing the AIC criterion. Note, importantly, that cross-validation is not used to optimize the regularization parameter.
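The AIC-based selection of Step 3 can be sketched in a few lines of base R. This is a generic illustration with made-up inputs; rss and df here stand for the residual sum of squares and degrees of freedom of each model along the path, which the package derives from the LAR solution:

```r
# Pick the model along a path solution minimizing an AIC-type
# criterion: AIC = n * log(RSS / n) + 2 * df.
aic.select <- function(rss, df, n) {
  aic <- n * log(rss / n) + 2 * df
  which.min(aic)
}

# Toy example: five nested models of increasing size.
aic.select(rss = c(100, 60, 40, 38, 37.5), df = 1:5, n = 50)  # -> 4
```

Because this requires only one pass over the path, it is far cheaper than cross-validating the regularization parameter.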
Although the spike and slab algorithm is especially adept in high-dimensional settings, it can be used effectively in classical settings as well. In these low-dimensional scenarios, when p < n, the algorithm is implemented without the filtering step. As illustration, we consider the benchmark diabetes data (n = 442; Efron et al. 2004), included in the package as diabetesI, which comprises the baseline variables together with interactions and squared terms (p = 64 in total):
data(diabetesI, package = "spikeslab")
set.seed(103608)
obj <- spikeslab(Y ~ . , diabetesI)
print(obj)
The print call outputs a basic summary of the analysis, including a list of the selected variables and their parameter estimates (variables selected are those having nonzero gnet coefficient estimates):
bma gnet bma.scale gnet.scale
bmi 24.076 23.959 506.163 503.700
ltg 23.004 22.592 483.641 474.965
map 14.235 12.894 299.279 271.089
hdl -11.495 -10.003 -241.660 -210.306
sex -7.789 -6.731 -163.761 -141.520
age.sex 6.523 5.913 137.143 124.322
bmi.map 3.363 4.359 70.694 91.640
glu.2 2.185 3.598 45.938 75.654
age.ltg 1.254 0.976 26.354 20.528
bmi.2 1.225 1.837 25.754 38.622
age.map 0.586 0.928 12.322 19.515
age.2 0.553 0.572 11.635 12.016
sex.map 0.540 0.254 11.349 5.344
glu 0.522 0.628 10.982 13.195
age.glu 0.417 0.222 8.757 4.677
In interpreting the table, we note the following:
The first column with the heading bma lists the coefficient estimates for the BMA estimator obtained from Step 2 of the algorithm. These values are given in terms of the standardized covariates (mean of 0 and variance of 1).
The second column with the heading gnet lists the coefficient estimates for gnet obtained from Step 3 of the algorithm. These values are also given in terms of the standardized covariates.
The last two columns are the BMA and gnet estimators given in terms of the original scale of the variables. These columns are used for prediction, while the first two columns are useful for assessing the relative importance of variables.
Note that all columns can be extracted from the spike and slab object, obj, if desired.
Even though the gnet accomplishes the goal of variable selection, it is always useful to have a measure of the stability of a variable. The wrapper cv.spikeslab can be used for this purpose.
The call to this wrapper is very simple. Here we illustrate its usage on the diabetes data:
y <- diabetesI[, 1]
x <- diabetesI[, -1]
cv.obj <- cv.spikeslab(x = x, y = y, K = 20)
This implements 20-fold cross-validation (the number of folds is set using the option K). The gnet estimator is fit on the training data and its test-set mean-squared error (MSE) is determined over its entire solution path. As well, for each fold, the optimal gnet model is determined by minimizing test-set error. The average number of times a variable is selected in this manner across folds defines its stability (recorded as a percentage from 0% to 100%). Averaging the gnet's test-set MSE provides an estimate of its MSE as a function of the number of variables.
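The stability computation just described can be sketched with a toy base-R example (the fold-by-variable selection indicators below are made-up data, not package internals):

```r
# Stability: the percentage of folds in which each variable entered
# the fold-optimal model.  `selected` is a logical folds-by-variables
# matrix; here 3 hypothetical variables over 20 folds.
stability.pct <- function(selected) 100 * colMeans(selected)

selected <- cbind(x1 = rep(TRUE, 20),            # chosen in every fold
                  x2 = rep(c(TRUE, FALSE), 10),  # chosen in half the folds
                  x3 = rep(FALSE, 20))           # never chosen
stability.pct(selected)  # x1 = 100, x2 = 50, x3 = 0
```

Variables with high stability are selected consistently regardless of which observations are held out, which is the sense in which stability complements the single full-data gnet fit.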
The gnet's coefficient values (estimated using the full data) and its stability values can be obtained from cv.obj using the following commands:
cv.stb <- as.data.frame(cv.obj$stability)
gnet <- cv.stb$gnet
stability <- cv.stb$stability
Figure 1 (top) plots the gnet’s cross-validated MSE curve as a function of the model size. The plot was produced with the command
plot(cv.obj, plot.type = "cv")
Close inspection (confirmed by examining the object cv.obj) shows that the optimal model size is somewhere between 9 and 17, agreeing closely with our previous analysis. The bottom plot shows how gnet coefficient estimates vary in terms of their stability values (obtained by plotting gnet versus stability). There are 10 variables having stability values greater than 80%.
To analyze high-dimensional data, users should set bigp.smalln=TRUE in the call to spikeslab. This invokes the full spike and slab algorithm, including the filtering step (Step 1), which is crucial to success in high-dimensional settings. Two points should be kept in mind when using this option.
First, users should avoid using the formula and data-frame call to spikeslab when p is large; instead, x and y should be passed directly, as R's handling of formulae and data frames becomes inefficient for very wide data.
Second, the final model size of the BMA and gnet estimators is controlled by two key options; these must be set properly to keep computations manageable. These options are bigp.smalln.factor and max.var. The first option restricts the number of filtered variables in Step 1 of the algorithm to be no larger than n × bigp.smalln.factor, where n is the sample size.
The second option, max.var, restricts the number of selected variables in both Steps 1 and 3 of the algorithm. Its function is similar to bigp.smalln.factor, although unlike bigp.smalln.factor, it directly controls the size of the gnet. The default value is max.var=500. In most examples, it will suffice to work with bigp.smalln.factor.
Thus, if f and m are the desired settings for bigp.smalln.factor and max.var, then a generic call in high-dimensional settings would look like:
obj <- spikeslab(x = x, y = y, bigp.smalln = TRUE,
                 bigp.smalln.factor = f, max.var = m)
Although spikeslab has several other options, most users will not need these and the above call should suffice for most examples. However, if computational times are still a concern, two further options can be tuned: n.iter1 and n.iter2. The first controls the number of burn-in iterations used by the Gibbs sampler, and the second controls the number of Gibbs sampled values following burn-in (these latter values are used for inference and parameter estimation). The default setting is 500 in both cases. Decreasing these values will decrease computational times, but accuracy will suffer. Note that if computational times are not a concern, then both values could be increased to 1000 (but not much more is needed) to improve accuracy.
As illustration, we used a high-dimensional simulation with p much larger than n.
The top plot in Figure 2 displays the path solution for the gnet. Such a plot can be produced by calling the lars wrapper plot.lars on the gnet.obj obtained from the spikeslab call. As gnet.obj is a lars-type object, it is fully interpretable by the lars package and can thus be parsed by that package's various wrappers. For convenience, the path solution can also be produced by a direct call to plot; a typical call being:
obj <- spikeslab(x=x, y=y, bigp.smalln = TRUE)
plot(obj, plot.type = "path")
Actually, Figure 2 was not produced by a call to plot but was instead obtained by slightly modifying the plot.lars wrapper so as to display the path of each variable color coded by its true coefficient value (blue for truly zero and red for truly nonzero). We did this in order to facilitate comparison to the lasso. The lasso path (obtained using the LAR solution) is displayed in the bottom plot of Figure 2. Notice how, in contrast to the gnet, the path solution for the lasso has a wiggly "spaghetti"-like shape, and that many of the truly nonzero coefficients are hard to identify because of this. This is a direct consequence of the high correlation among the variables in this simulation.
The gnet incorporates the strength of Bayesian WGRR estimation with that of frequentist soft thresholding. These combined strengths make it an effective tool for prediction and variable selection in correlated high-dimensional settings. If variable selection is not of concern, and the key issue is accurate prediction, then the BMA may be preferred. Both the gnet and the BMA can be computed using the spikeslab R package. This package is computationally efficient, and scales effectively even to massively large p.
As one example of this scalability, we added 100,000 noise variables to the diabetes data set and then made a call to cv.spikeslab with the added options bigp.smalln = TRUE, max.var = 100 and parallel = TRUE (as before, we used K = 20 folds). The parallel option invokes parallel processing, implemented via the package snow (Tierney, Rossini, Li and Sevcikova 2008); sending in an integer for parallel sets the number of socket clusters on the local machine on which the session is being initiated (in our example we actually used parallel = 8). The snow package should be loaded prior to making the cv.spikeslab call.
Figure 3 displays the gnet’s path solution (obtained using the full data). While only 4 variables have path-profiles that clearly stand out, impressively these variables are the top 4 from our previous analysis. The gnet estimates (scaled to standardized covariates), its averaged cv-estimates, and the stability values for the top 15 variables were:
gnet gnet.cv stability
ltg 20.1865498 20.14414604 100
bmi 18.7600433 22.79892835 100
map 4.7111022 4.83179363 95
hdl -2.5520177 -2.66839785 95
bmi.2 3.6204750 0.46308305 40
x.61746 0.0000000 -0.74646210 35
x.42036 4.8342736 0.41993669 30
x.99041 0.0000000 -0.70183515 30
x.82308 5.2728011 0.75420320 25
glu 1.3105751 0.16714059 25
x.46903 0.0000000 -0.65188451 25
x.57061 0.0000000 0.73203633 25
x.99367 -2.7695621 -0.22110463 20
tch 0.2542299 0.14837708 20
x.51837 0.0000000 -0.09707276 20
Importantly, note that the top 4 variables have greater than or equal to 95% stability (variables starting with “x.” are noise variables). It is also interesting that 3 other non-noise variables, "bmi.2", "glu", and "tch" were in the top 15 variables. In fact, when we inspected the 100 variables that passed the filtering step of the algorithm (applied to the full data), we found that 10 were from the original 64 variables, and 6 were from the top 15 variables from our earlier analysis. This demonstrates stability of the filtering algorithm even in ultra-high dimensional problems.
Finally, we remark that in illustrating the spikeslab package in this article, we focused primarily on the spikeslab wrapper, which is the main entry point to the package. Other available wrappers include predict.spikeslab for prediction on test data, and sparsePC.spikeslab. The latter implements variable selection for multiclass gene expression data (Ishwaran and Rao 2010).
In future work we plan to extend the rescaled spike and slab methodology to high-dimensional generalized linear models. At that time we will introduce a corresponding wrapper.
This work was partially funded by the National Science Foundation grant DMS-0705037. We thank Vincent Carey and two anonymous referees for helping to substantially improve this manuscript.
For attribution, please cite this work as
Ishwaran, et al., "spikeslab: Prediction and Variable Selection Using Spike and Slab Regression", The R Journal, 2010
BibTeX citation
@article{RJ-2010-018,
  author  = {Ishwaran, Hemant and Kogalur, Udaya B. and Rao, J. Sunil},
  title   = {spikeslab: Prediction and Variable Selection Using Spike and Slab Regression},
  journal = {The R Journal},
  year    = {2010},
  note    = {https://rjournal.github.io/},
  volume  = {2},
  issue   = {2},
  issn    = {2073-4859},
  pages   = {68-73}
}