Dimension reduction is one of the biggest challenges in high-dimensional regression models. We recently introduced a new methodology based on variable clustering as a means to reduce dimensionality. We present here the R package clere, which implements some refinements of this methodology. An overview of the package's functionality is given, along with examples of how to run an analysis. Numerical experiments on real data illustrate the good predictive performance of our parsimonious method compared to standard dimension reduction approaches.
High dimensionality is increasingly ubiquitous in numerous scientific fields, including genetics, economics and physics. Reducing the dimensionality is a challenge that most statistical methodologies must meet, not only to remain interpretable but also to achieve reliable predictions. In linear regression models, dimension reduction techniques often correspond to variable selection methods. Approaches for variable selection are already implemented in publicly available, open-source software, e.g., the well-known R packages glmnet (Friedman et al. 2010) and spikeslab (Ishwaran et al. 2013). The R package glmnet implements the Elastic net methodology (Zou and Hastie 2005), which is a generalization of both the LASSO (Tibshirani 1996) and ridge regression [RR; Hoerl and Kennard (1970)]. The R package spikeslab, in turn, implements the Spike and Slab methodology (Ishwaran and Rao 2005), which is a Bayesian approach for variable selection.
Dimension reduction cannot, however, be restricted to variable
selection. Indeed, the field can be extended to include approaches which
aim at creating surrogate covariates that summarize the information
contained in the initial covariates. Since the emblematic principal
component regression [PCR; Jolliffe (1982)], many other methods have appeared in the
recent literature. As specific examples, we may refer to the OSCAR
methodology (Bondell and Reich 2008), or the PACS methodology (Sharma et al. 2013), which is
a generalization of the latter approach. These methods mainly propose
variable clustering within a regression model as a way to reduce the
dimensionality. Despite their theoretical and practical appeal,
implementations of those methods were often provided only as MATLAB
(The MathWorks Inc. 2014) or R scripts, thus limiting the flexibility and the
computational efficiency of their use. The CLusterwise Effect REgression
(CLERE) methodology (Yengo et al. 2014) was recently introduced as a novel
methodology for simultaneous variable clustering and regression. The
CLERE methodology is based on the assumption that each regression
coefficient is an unobserved random variable sampled from a mixture of
Gaussian distributions with an arbitrary number of components.
In this paper, we propose two new features for the CLERE model. First,
the stochastic EM (SEM) algorithm is proposed as a more computationally
efficient alternative to the Monte Carlo EM (MCEM) algorithm previously
introduced in Yengo et al. (2014). Second, the CLERE model is enhanced with the
possibility of constraining the first component to have its mean equal
to 0, i.e., of forcing one group of covariates to have no effect on the response.
The outline of the present paper is as follows. In the next section the definition of the model is recalled and the strategy to estimate the model parameters is explained. Afterwards, the main functionalities of the R package clere are presented. Real data analyses are then provided, illustrating the good predictive performance of CLERE, with noticeable parsimony, compared to standard dimension reduction methods. Finally, perspectives and further potential improvements of the package are discussed in the last section.
Our model is defined by the following hierarchical relationships:

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \qquad (1)$$

$$\beta_j \mid z_j \sim \mathcal{N}\left(\sum_{k=1}^{g} b_k z_{jk},\, \gamma^2\right), \qquad z_j = (z_{j1}, \ldots, z_{jg}) \sim \mathcal{M}(1; \pi_1, \ldots, \pi_g),$$

where $y_i$ is the response for observation $i$, $x_{ij}$ the value of covariate $j$ for that observation, and $z_j$ the latent cluster membership of coefficient $\beta_j$. Let $\theta = (\beta_0, b_1, \ldots, b_g, \pi_1, \ldots, \pi_g, \sigma^2, \gamma^2)$ denote the vector of model parameters. Since it requires integrating over the latent variables $\beta$ and $Z$, the observed-data log-likelihood is analytically intractable. Nonetheless, maximum likelihood estimation is still achievable using the
expectation maximization (EM) algorithm (Dempster et al. 1977). The latter
algorithm is an iterative method which starts with an initial estimate
of the parameter and updates this estimate until convergence. Each
iteration of the algorithm consists of two steps, denoted as the E and
the M steps. At each iteration, the E step computes the conditional expectation of the complete-data log-likelihood given the observed data and the current parameter value, while the M step updates the parameter by maximizing this expectation. In model (1), the E step is analytically intractable. A broad literature devoted to intractable E steps recommends the use of a stochastic approximation of the conditional expectation, obtained by sampling from the distribution of the latent variables.
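To make the generative assumptions above concrete, the model can be sketched by simulating from it. The following is a minimal illustration in base R; the parameter values are arbitrary choices for the sketch, not taken from the paper.

```r
# Sketch: simulating data under the CLERE assumptions described above.
# Each coefficient beta_j belongs to one of g latent groups; within group k
# it is drawn from N(b_k, gamma^2). Parameter values are arbitrary.
set.seed(1)
n <- 100; p <- 20; g <- 2
b      <- c(0, 3)        # group means of the regression coefficients
pi_k   <- c(0.7, 0.3)    # group membership probabilities
gamma2 <- 0.1            # within-group variance of the coefficients
sigma2 <- 1              # residual variance
beta0  <- 0              # intercept

z    <- sample(1:g, p, replace = TRUE, prob = pi_k)  # latent group labels
beta <- rnorm(p, mean = b[z], sd = sqrt(gamma2))     # regression coefficients
X    <- matrix(rnorm(n * p), n, p)
y    <- as.numeric(beta0 + X %*% beta + rnorm(n, sd = sqrt(sigma2)))
```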
In this section, two algorithms for model inference are presented: the
Monte-Carlo Expectation Maximization (MCEM) algorithm and the Stochastic
Expectation Maximization (SEM) algorithm. The section starts with the
initialization strategy common to both algorithms and continues with the
detailed description of each algorithm. Then, model selection (for
choosing the number $g$ of groups) is discussed.
The two algorithms presented in this section are initialized using a
primary estimate of the regression coefficients.
Suppose at iteration
Sampling from
and, noting that
In Equation (4),
Using the
In most situations, the SEM algorithm can be considered as a special
case of the MCEM algorithm (Celeux et al. 1996), obtained by setting the number of Monte Carlo draws per iteration to one.
Let the SVD decomposition of matrix
To sample from
where
To improve the mixing of the generated Markov chain, we start the
simulation step at each iteration by creating a random permutation of the indices of the variables,
Given a non-negative user-specified threshold
Absorbing states. The SEM algorithm described above defines a
Markov chain whose stationary distribution is concentrated
around the maximum likelihood estimate.
Attracting states. We empirically observed that attraction around such states can be reduced by increasing the number of draws used in the stochastic step (argument nsamp in the function fitClere()).
Once the MLE $\hat{\theta}$ is calculated, the maximum log-likelihood can be approximated and used to compute standard model selection criteria such as AIC and BIC. An additional criterion for model selection, namely the ICL criterion (Biernacki et al. 2000), is also implemented in the R package clere. The latter criterion can be viewed as the BIC penalized by the entropy of the estimated partition of the variables. The constraint setting the first component mean to 0 yields the sparse version of the model described in the introduction.
The R package clere mainly implements a function for parameter
estimation and model selection: the function fitClere(). Four
additional methods are also implemented in the package: for graphical
representation, plot(); for summarizing the results, summary(); for
getting the predicted clusters of variables, clusters(); and for
making predictions from new design matrices, predict(). Examples of
calls to the functions presented in this section are given in the next
section.
fitClere()
The main function fitClere() has only three mandatory arguments: the
vector of response y (of size n), the matrix of covariates x (of size
n × p) and the number g of groups. The optional argument analysis, when
it takes the value "aic", "bic" or "icl", allows testing all the
possible numbers of groups between 1 and g. The argument algorithm
selects the estimation algorithm,
but we encourage the users to use the default value, the SEM algorithm,
which generally outperforms the MCEM algorithm (see the first
experiment of the next section).
Several other parameters allow tuning the numbers of iterations of the estimation algorithm. In general, the larger these values, the better the quality of the estimation, but the heavier the computational burden. We advise using the default values and graphically checking the quality of the estimation with plots as in Figure 1: if the model parameter values are stable over a sufficiently large part of the iterations, the results can be considered reliable. If stability is not reached sufficiently early before the end of the iterations, a larger number of iterations should be chosen.
Finally, among the remaining parameters (the complete list can
be obtained with help("fitClere")), two are especially important:
parallel enables parallel computation (when supported by the
user's machine), and sparse imposes that one of the
regression parameters equals 0, thereby introducing a cluster of
non-significant explanatory variables.
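For instance, a call combining these options could look as follows. This is a sketch on simulated data; the argument values are illustrative, and the clere package must be installed.

```r
# Sketch: a call to fitClere() using the options discussed above.
# The data are simulated for illustration only.
library(clere)

set.seed(1234)
n <- 50; p <- 30
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x[, 1:5] %*% rep(2, 5) + rnorm(n))  # 5 influential covariates

mod <- fitClere(y = y, x = x, g = 3,
                analysis = "aic",  # compare 1 to 3 groups using AIC
                sparse   = TRUE,   # force one group of coefficients to 0
                parallel = FALSE,  # set TRUE to parallelize, if supported
                nItEM    = 500)    # number of SEM iterations
summary(mod)
```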
summary(), plot(), clusters() and predict()
The summary() method for an object returned by fitClere() prints an
overview of the estimated parameters and returns the estimated
likelihood and information-based model selection criteria (AIC, BIC and
ICL). The corresponding plot() method produces graphs such as the ones
presented in Figure 1.
Figure 1: Estimated model parameter values along the iterations; the first nBurn iterations are discarded before calculating maximum likelihood estimates.

The clusters() method takes one argument of class “Clere” as returned
by fitClere() and a threshold argument. This function assigns each
variable to the group for which the associated conditional probability of
membership is larger than the given threshold. If the conditional
probabilities of membership are larger than the specified threshold for
more than one group, then the group having the largest probability is
returned and a warning is printed. If, moreover, several groups are tied
for that largest probability, then the group with the smallest
index is returned. When threshold = NULL, the
maximum a posteriori (MAP) strategy is used to infer the clusters.
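The assignment rule can be illustrated on a toy matrix of conditional membership probabilities (the values below are hypothetical): under the MAP strategy each variable goes to its most probable group, with ties resolved in favor of the smallest group index.

```r
# Sketch: MAP assignment on a toy membership-probability matrix.
# Rows are variables, columns are g = 2 groups; values are hypothetical.
P <- rbind(v1 = c(0.1, 0.9),
           v2 = c(0.6, 0.4),
           v3 = c(0.5, 0.5))    # tie: the smallest group index wins
map <- apply(P, 1, which.max)   # which.max returns the first maximum on ties
map
#  v1 v2 v3
#   2  1  1
```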
The predict() method has two arguments: a “Clere” object and a design
matrix of new data. Using the estimated parameter values, the predict()
method returns an approximation of the expected response for each row of
the new design matrix.
This section presents two sets of numerical experiments. The first set of experiments aims at comparing the MCEM and SEM algorithms in terms of computational time and estimation or prediction accuracy. The second set of experiments is aimed at comparing CLERE to standard dimension reduction techniques. The latter comparison is performed on both simulated and real data.
In this section, the SEM algorithm is compared with the MCEM algorithm using the four following performance indicators:
Computational time (CT) to run a pre-defined number of SEM/MCEM iterations. This number was set to 2,000 in this simulation study.
Mean squared estimation error (MSEE) defined as
Mean squared prediction error (MSPE) defined as
Maximum log-likelihood (ML) reached. This quantity was approximated
using 1,000 Monte Carlo samples from the distribution of the latent variables.
Three versions of the MCEM algorithm were compared with
the SEM algorithm, depending on the number (argument nsamp) of Gibbs
iterations used to approximate the E step. That number was set to 5, 25
and 125. We chose these numbers of iterations so as to cover
different situations, from one in which the number of iterations
is too small to one in which that number seems sufficient to
expect having reached convergence of the simulated Markov
chain. Those versions are denoted MCEM5, MCEM25 and MCEM125,
respectively.
All algorithms were run using 10 different random starting points. Estimates yielding the largest likelihood were then used for the comparison.
Table 1 summarizes the results of the comparison
between the algorithms. The MCEM5 version was systematically the
fastest, but the SEM algorithm most often achieved the best estimation
accuracy (MSEE), prediction accuracy (MSPE) and maximum log-likelihood.
The results of this simulation study were made available as an internal
data set named algoComp
in the R package clere. More details can be
obtained using the command help("algoComp")
.
Table 1: Comparison of the SEM and MCEM algorithms over the four performance indicators.

| Performance indicator | Algorithm | % of times the algorithm was best | Median (Std. Err.) |
|---|---|---|---|
| CT (seconds) | SEM | 0 | 2.5 (0.053) |
| | MCEM5 | 100 | 1.9 (0.016) |
| | MCEM25 | 0 | 7.1 (0.027) |
| | MCEM125 | 0 | 32.8 (0.121) |
| MSEE | SEM | 58 | 0.31 (10.4) |
| | MCEM5 | 12 | 20.14 (2843.3) |
| | MCEM25 | 16.5 | 8.86 (3107.5) |
| | MCEM125 | 13.5 | 4.02 (5664.2) |
| MSPE | SEM | 51.5 | 1.3 (46.1) |
| | MCEM5 | 12 | 75.7 (64.3) |
| | MCEM25 | 15.5 | 58.7 (55.2) |
| | MCEM125 | 21 | 51.6 (51.1) |
| | True parameter | — | 1.1 (0.013) |
| ML | SEM | 66.5 | |
| | MCEM5 | 11.5 | |
| | MCEM25 | 14.5 | |
| | MCEM125 | 7.5 | |
| | True parameter | — | |
In this section, we compare our model to standard dimension reduction
approaches in terms of MSPE. Although a more detailed comparison was
suggested in Yengo et al. (2014), we propose here a quick illustration of the
relative predictive performance of our model. The comparison is achieved
using data simulated according to the scenario described above in
the section “SEM algorithm versus MCEM algorithm”. The methods selected for comparison are ridge regression
(Hoerl and Kennard 1970), the Elastic net (Zou and Hastie 2005), the LASSO (Tibshirani 1996),
PACS (Sharma et al. 2013), the method of Park and colleagues (Park et al. 2007; subsequently denoted AVG) and the Spike and Slab model (Ishwaran and Rao 2005; subsequently denoted SS). The first three methods are implemented in the
freely available R package glmnet. With the latter package, the tuning
parameter lambda was selected using the function cv.glmnet (with 5
folds), aiming at minimizing the mean squared error (option
type.measure = "mse"). For the Elastic net in particular, the second tuning
parameter alpha (measuring the amount of mixture between the L1 and L2
penalties) was also tuned to minimize the mean squared error. The R
package glmnet proposes a procedure for automatically selecting values
for lambda; we therefore used this default procedure, while we selected
alpha over a grid of values between 0 and 1. Fitting PACS requires
selecting two tuning parameters, lambda and betawt
. Our reimplementation of Bondell’s
script was included in the R package clere in the function
fitPacs(). In Sharma et al. (2013), the authors suggest setting betawt
to the coefficients obtained from a ridge regression model after the tuning
parameter is selected using AIC. In this simulation study we used the
same strategy; however, the ridge parameter was selected via 5-fold cross-validation.
5-fold CV was preferred to AIC since selecting the ridge
parameter using AIC always led to estimated coefficients equal to zero.
Once betawt
was selected, lambda
was chosen via 5-fold cross
validation among the following values: 0.01, 0.02, 0.05, 0.1, 0.2, 0.5,
1, 2, 5, 10, 20, 50, 100, 200 and 500. All other default parameters of
their script were unchanged. The AVG method is a two-step approach. The
first step uses hierarchical clustering of covariates to create
surrogate covariates by averaging the variables within each group. Those
new predictors are afterwards included in a linear regression model,
replacing the primary variables. A variable selection algorithm is then
applied to select the most predictive groups of covariates. To implement
this method, we followed the algorithm described in Park et al. (2007) and
programmed it in R. The Spike and Slab model is a Bayesian approach for
variable selection. It is based on the assumption that the regression
coefficients are distributed according to a mixture of two centered
Gaussian distributions with different variances. One component of the
mixture (the spike) is chosen to have a small variance, while the other
component (the slab) is allowed to have a large variance. Variables
assigned to the spike are dropped from the model. We used the R package
spikeslab to run the Spike and Slab models. In particular, we used the
function spikeslab from that package to detect influential variables.
The number of iterations used to run the function spikeslab was 2,000
(of which 1,000 were discarded as burn-in).
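The glmnet tuning described above can be sketched as follows. The data are simulated for illustration, and the alpha grid shown here is an assumption, not the exact grid used in the study; cv.glmnet is the package's cross-validation helper.

```r
# Sketch: 5-fold CV selection of lambda (and alpha for the Elastic net),
# following the strategy described above. Data are simulated.
library(glmnet)

set.seed(42)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x[, 1:3] %*% c(1, -1, 2) + rnorm(n))

# LASSO: alpha = 1; lambda minimizing the CV mean squared error
cv.lasso    <- cv.glmnet(x, y, alpha = 1, nfolds = 5, type.measure = "mse")
best.lambda <- cv.lasso$lambda.min

# Elastic net: grid search on alpha, keeping the smallest CV error
alphas     <- seq(0.1, 0.9, by = 0.1)
cv.errors  <- sapply(alphas, function(a)
  min(cv.glmnet(x, y, alpha = a, nfolds = 5, type.measure = "mse")$cvm))
best.alpha <- alphas[which.min(cv.errors)]
```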
When running fitClere()
, the number nItEM
of SEM iterations was set
to 2,000. The number g
of groups for CLERE was chosen between 1 and 5
using AIC (option analysis = "aic"
). Two versions of CLERE were
considered: one with all parameters estimated, and one with the first
component mean constrained to 0 (option sparse = TRUE).
Figure 2 summarizes the comparison between the methods. In
this simulated scenario, CLERE outperformed the other methods in terms
of prediction error. These good performances were further improved when
the first component mean was constrained to 0 (sparse version).
The results of this simulation study were made available as an internal
data set in the R package clere. The object is called numExpSimData
and more details can be obtained using the command
help("numExpSimData")
.
We used in this section the real data sets Prostate
and eyedata
from
the R packages lasso2
(Lokhorst et al. 2014) and flare
(Li et al. 2014) respectively. The Prostate
data set comes from a study that
examined the correlation between the level of prostate-specific antigen
and a number of clinical measures in men who were about to receive a
radical prostatectomy. We used the log of the prostate-specific antigen
(lpsa) as response variable and the remaining clinical measures as covariates.
The eyedata
data set is extracted from the published study of
Scheetz et al. (2006). This data set consists of gene expression levels
measured for 200 gene probes in 120 laboratory rats; the response is the
expression level of the TRIM32 gene.
Those two data sets were utilized to compare CLERE to the same methods used in the previous section where the simulation study was presented. All methods were compared in terms of out-of-sample prediction error estimated using 5-fold cross validation (CV). Those CV statistics were then averaged and compared across the methods in Table 2.
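The out-of-sample comparison just described can be sketched generically in base R. Here fit_fun and predict_fun are placeholders standing for any of the compared methods; the example uses ordinary least squares on simulated data.

```r
# Sketch: out-of-sample prediction error estimated by 5-fold CV, as used
# for the comparison in Table 2. fit_fun/predict_fun are placeholders.
cv_mspe <- function(x, y, fit_fun, predict_fun, K = 5, seed = 1) {
  set.seed(seed)
  folds <- sample(rep(1:K, length.out = nrow(x)))    # random fold labels
  errs <- sapply(1:K, function(k) {
    train <- folds != k
    fit  <- fit_fun(x[train, , drop = FALSE], y[train])
    pred <- predict_fun(fit, x[!train, , drop = FALSE])
    mean((y[!train] - pred)^2)                       # fold-wise MSPE
  })
  mean(errs)                                         # averaged over folds
}

# Example with ordinary least squares as the method under evaluation:
set.seed(2)
x <- matrix(rnorm(60 * 5), 60, 5)
y <- as.numeric(x %*% c(1, 2, 0, 0, 0) + rnorm(60))
err <- cv_mspe(x, y,
               fit_fun     = function(x, y) lm.fit(cbind(1, x), y),
               predict_fun = function(f, x)
                 as.numeric(cbind(1, x) %*% f$coefficients))
```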
Before presenting the results of the comparison between CLERE and its
competitors, we illustrate the commands to run the analysis of the
Prostate
data set. The data set is loaded by typing:
R> data("Prostate", package = "lasso2")
R> y <- Prostate[, "lpsa"]
R> x <- as.matrix(Prostate[, -which(colnames(Prostate) == "lpsa")])
Possible training (xt
and yt
) and validation (xv
and yv
) sets
are generated as follows:
R> itraining <- 1:(0.8*nrow(x))
R> xt <- x[ itraining,]; yt <- y[ itraining]
R> xv <- x[-itraining,]; yv <- y[-itraining]
The fitClere()
function is run using the AIC to select the number of
groups between 1 and 5. To lessen the impact of local minima in the
model selection, 5 random starting points are used. This can be
implemented by:
R> Seed <- 1234
R> mod <- fitClere(y = yt, x = xt, g = 5, analysis = "aic", parallel = TRUE,
+ nstart = 5, sparse = TRUE, nItEM = 2000, nBurn = 1000,
+ nItMC = 10, dp = 5, nsamp = 1000, seed = Seed)
R> summary(mod)
-------------------------------
| CLERE | Yengo et al. (2013) |
-------------------------------
Model object 2 groups of variables ( Selected using AIC criterion )
---
Estimated parameters using SEM algorithm are
intercept = -0.1339
b = 0.0000 0.4722
pi = 0.7153 0.2848
sigma2 = 0.395
gamma2 = 4.065e-08
---
Log-likelihood = -78.31
Entropy = 0.5464
AIC = 168.63
BIC = 182.69
ICL = 183.23
R> plot(mod)
Running the command plot(mod)
generates the plot given in
Figure 1. We can also access the cluster memberships by
running the command clusters()
. For example, running the command
clusters(mod, threshold = 0.7)
yields
R> clusters(mod, threshold = 0.7)
lcavol lweight age lbph svi lcp gleason pgg45
2 2 1 1 1 1 1 1
In the example above, two variables, the cancer volume (lcavol) and
the prostate weight (lweight), were assigned to group 2. The conditional
probabilities of cluster membership are stored in the slot P of the
object of class “Clere”:
R> mod@P
Group 1 Group 2
lcavol 0.000 1.000
lweight 0.000 1.000
age 1.000 0.000
lbph 1.000 0.000
svi 0.764 0.236
lcp 1.000 0.000
gleason 1.000 0.000
pgg45 1.000 0.000
Each covariate was thus assigned to its group with a
probability larger than 0.7. The out-of-sample prediction error can be
computed with the method predict() as follows:
R> error <- mean((yv - predict(mod, xv))^2)
R> error
[1] 1.543122
Table 2 summarizes the prediction errors and the numbers
of parameters obtained for all the methods. CLERE had the lowest
prediction error in the Prostate data set and the
second best performance for the eyedata data set. The AVG method was
also very competitive compared to the variable selection approaches,
stressing the relevance of creating groups of variables to reduce
the dimensionality (especially in the eyedata data set). It is worth
noting that in both data sets, imposing the sparsity constraint improved
the predictive performance of CLERE while reducing the number of parameters.

In the Prostate data set, CLERE robustly identified two groups of
variables, separating influential from non-influential covariates.
In the eyedata data set in turn, AIC led to selecting only
one group of variables. However, this did not lessen the predictive
performance of the model. Finally, the computational time of CLERE,
around 20 s for the eyedata data set and around 3 s for
the Prostate data set, remains affordable in these examples.
The results of this analysis on real data were made available as an
internal data set named numExpRealData
in the R package clere. More details can be obtained using the command
help("numExpRealData").
Table 2: Averaged 5-fold cross-validation statistics (prediction error, number of parameters and computational time) for each method on the two real data sets.

| | 100 × Averaged CV-statistic (Std. Error) | Number of parameters (Std. Error) | CT (seconds) (Std. Error) |
|---|---|---|---|
| **Prostate data set** | | | |
| LASSO | 90.2 (29) | 5.6 (0.7) | 0.064 (0.007) |
| RIDGE | 86.8 (24) | 8.0 (0) | 0.065 (0.002) |
| Elastic net | 90.3 (24) | 8.0 (0) | 0.065 (0.002) |
| STEP | 442.4 (137) | 8.0 (0) | 0.004 (0.001) |
| CLERE | 82.4 (25) | 6.0 (0) | 1.1 (0.1) |
| CLERE (sparse) | 74.5 (26) | 5.0 (0) | 2.7 (0.8) |
| Spike and Slab | 85.6 (26) | 5.6 (0.7) | 4.2 (0.03) |
| AVG | 90.2 (27) | 6.2 (0.4) | 0.44 (0.06) |
| PACS | 90.6 (34) | 5.6 (0.4) | 0.053 (0.002) |
| **eyedata data set** | | | |
| LASSO | 0.73 (0.1) | 21.2 (2) | 0.18 (0.01) |
| RIDGE | 0.74 (0.1) | 200.0 (0) | 0.24 (0.004) |
| Elastic net | 0.74 (0.1) | 200.0 (0) | 0.23 (0.003) |
| STEP | 1142.6 (736) | 95.0 (0) | 0.083 (0.002) |
| CLERE | 0.73 (0.1) | 4.0 (0) | 21.5 (0.2) |
| CLERE (sparse) | 0.72 (0.1) | 3.0 (0) | 21.1 (0.1) |
| Spike and Slab | 0.81 (0.2) | 12.4 (0.9) | 10.3 (0.1) |
| AVG | 0.70 (0.04) | 15.6 (2) | 10.6 (0.4) |
| PACS | 2.0 (0.9) | 3.0 (0.3) | 108.9 (28) |
We presented in this paper the R package clere. This package
implements two efficient algorithms for fitting the CLusterwise Effect
REgression model: the MCEM and the SEM algorithms. The MCEM algorithm is
to be preferred when
Our model can easily be extended to the analysis of binary responses. This extension will be made available in a forthcoming version of the package. Another direction for future research will be to develop an efficient stopping rule for the proposed SEM algorithm, specific to our context. Such a criterion is expected to improve the computational performance of our estimation algorithm.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Yengo, et al., "Variable Clustering in High-Dimensional Linear Regression: The R Package clere", The R Journal, 2016
BibTeX citation
@article{RJ-2016-006, author = {Yengo, Loïc and Jacques, Julien and Biernacki, Christophe and Canouil, Mickael}, title = {Variable Clustering in High-Dimensional Linear Regression: The R Package clere}, journal = {The R Journal}, year = {2016}, note = {https://rjournal.github.io/}, volume = {8}, issue = {1}, issn = {2073-4859}, pages = {92-106} }