An R package, BSGS, is provided to integrate the Bayesian variable selection and sparse group selection approaches separately proposed by Chen et al. (2011) and Chen et al. (2014), handling variable selection problems even in cases where the number of variables is much larger than the number of observations.
Variable selection is a fundamental problem in regression analysis, and one that has become even more relevant in current applications where the number of variables can be very large, but it is commonly assumed that only a small number of variables are important for explaining the response variable. This sparsity assumption enables us to select the important variables, even in situations where the number of candidate variables is much greater than the number of observations.
BSGS is an R package
designed to carry out a variety of Markov chain Monte Carlo (MCMC)
sampling approaches for variable selection problems in regression models
based on a Bayesian framework. In this package, we consider two
structures of variables and create functions for the corresponding MCMC
sampling procedures. In the first case where the variables are treated
individually without a grouping structure, two functions, CompWiseGibbsSimple and CompWiseGibbsSMP, are provided to generate
the samples from the corresponding posterior distribution. In the second
case, it is assumed that the variables form certain group structures or
patterns, and thus the variables can be partitioned into different
disjoint groups. However, only a small number of groups are assumed to be important for explaining the response variable (the condition of group sparsity), and the sparsity assumption is also assumed to hold for the variables within each group. This problem is thus termed a
sparse group selection problem (Simon et al. 2013; Chen et al. 2014), and the goal is
to select the important groups and also identify the active variables
within these important groups simultaneously. There are two functions to
handle the sparse group selection problems, BSGS.Simple and BSGS.Sample, which are used to generate the corresponding posterior
samples. Once the posterior samples are available, we then can determine
the active groups and variables, estimate the parameters of interest and
make other statistical inferences.
This paper is organized as follows. We first briefly introduce statistical models that are used to deal with the problems of variable selection in the BSGS package. We then describe the tuning parameters in the functions in the BSGS package. Two simulations are used to illustrate the details of the implementations of the functions. Finally we present a real economic example to demonstrate the BSGS package.
We start with the introduction of individual variable selection problems, and then turn our attention to sparse group selection. For completeness, we describe the model and priors so that one may easily change the inputs of functions in the BSGS package for any purpose.
Consider a linear regression model given by
$$y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I_n),$$
where $y$ is the $n \times 1$ vector of responses, $X$ is the $n \times p$ design matrix, $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of regression coefficients, and $\sigma^2$ is the error variance.
In this Bayesian framework we basically follow the prior assumptions in Chen et al. (2011). We assume the prior distribution of each coefficient $\beta_i$, given its inclusion indicator $r_i$, is the spike-and-slab mixture
$$\beta_i \mid r_i \sim (1 - r_i)\,\delta_0 + r_i\, N(0, \tau_i^2), \quad (1)$$
where $\delta_0$ is a point mass at zero and $\tau_i^2$ controls the prior variance of an active coefficient, and the indicators $r_i$ are independent Bernoulli variables with prior inclusion probabilities $\rho_i$. Moreover, the error variance is assigned the inverse-gamma prior
$$\sigma^2 \sim IG(\nu_0/2, \, \nu_0\lambda_0/2). \quad (2)$$
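To make the prior concrete, here is a minimal sketch of a single draw from (1) and (2); the values $p = 5$, $\rho = 0.5$, $\tau^2 = 10$, $\nu_0 = 10$, and $\lambda_0 = 1$ are illustrative choices, not package defaults.
set.seed(1)
p <- 5
rho <- 0.5; tau2 <- 10; nu0 <- 10; lambda0 <- 1
r <- rbinom(p, 1, rho) ## inclusion indicators
beta <- r * rnorm(p, 0, sqrt(tau2)) ## spike-and-slab: exactly zero when r = 0
sigma2 <- 1/rgamma(1, shape = nu0/2, rate = nu0*lambda0/2) ## inverse-gamma draw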
Based on the model setting given above, there are two Gibbs samplers for this variable selection problem. The first procedure is the componentwise Gibbs sampler (CGS), which was introduced by Geweke (1996) and also discussed in Chen et al. (2011). The other is the stochastic matching pursuit (SMP) algorithm (Chen et al. 2011), in which the variables compete to be selected.
In traditional variable selection problems, each variable is treated individually. In many applications, however, the variables form natural groups.
Here the goal is not only to select the influential groups, but also to
identify the important variables within these. To this end, a Bayesian
group variable selection approach, the Group-wise Gibbs sampler (GWGS)
(Chen et al. 2014), is applied. Suppose that, in terms of expert or prior knowledge, the explanatory variables can be partitioned into $g$ disjoint groups, so that the model can be written as
$$y = \sum_{l=1}^{g} X_l \beta_l + \epsilon,$$
where $X_l$ is the design matrix of the variables in the $l$-th group and $\beta_l$ is the corresponding coefficient vector. We assume the group indicator, $\eta_l$, follows a Bernoulli distribution with prior group inclusion probability $\theta_l$, where $\eta_l = 0$ excludes the entire group from the model. Given $\eta_l = 1$, each within-group indicator $r_{li}$ is an independent Bernoulli variable with prior inclusion probability $\rho_{li}$, and each coefficient $\beta_{li}$ follows a spike-and-slab prior of the form (1) with variance parameter $\tau_{li}^2$.
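The two-layer indicator structure can be illustrated with a single draw from the prior; the setting of 10 groups of 10 variables with $\theta_l = \rho_{li} = 0.5$ mirrors Simulation II below and is purely illustrative.
set.seed(1)
Group.Index <- rep(1:10, each = 10)
eta <- rbinom(10, 1, 0.5) ## group-level indicators
r <- rbinom(100, 1, 0.5) * eta[Group.Index] ## a variable can be active only within an active group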
Two sampling procedures are proposed in Chen et al. (2014) for Bayesian sparse group selection. The first is the GWGS. In the GWGS, simulating the indicator variable of a group involves evaluating the marginal likelihoods of the possible configurations of variables within that group, so the computational cost grows quickly with the group size. The second is a sample-based version of the GWGS that replaces this exhaustive evaluation with an approximation based on sampled configurations, which is considerably faster for large groups.
In this section, we describe the default tuning parameters and some details in the implementation of the functions in BSGS.
The tuning parameters $\tau^2$, $\nu_0$, $\lambda_0$ (or $\nu$, $\lambda$), $\rho$, and $\theta$ must be assigned before the samplers are run. The parameters in the prior distribution of $\sigma^2$, $\nu_0$ and $\lambda_0$, should reflect how informative the prior is; in the examples below we take $\nu_0 = 10$ and $\lambda_0 = sd(y)$ for individual variable selection, and the essentially noninformative choice $\nu = 0$ and $\lambda = 1$ for sparse group selection. The parameter $\tau^2$ is the prior variance of an active coefficient, with larger values corresponding to weaker prior information on its magnitude. Now we consider the assignment of the prior inclusion probabilities. The default of the prior inclusion probabilities of groups and variables is set equal to 0.5. In the case when prior knowledge suggests stronger sparsity, smaller values can be assigned to favor more parsimonious models.
The posterior distribution is not available in explicit form so we use
the MCMC method, and specifically Gibbs sampling to simulate the
parameters from this distribution (Brooks et al. 2011). To implement the Gibbs
sampler, the full conditional distributions of all parameters must be
determined. A derivation of the full conditional distributions is
provided in Chen et al. (2014). When these have been obtained, the parameters
are then updated individually using a Gibbs sampler (where available),
or a Metropolis-Hastings sampling algorithm. An MCMC sample will
converge to the stationary distribution, i.e. the posterior
distribution. We use the batch mean method to estimate the Monte Carlo standard error (MCSE) of the estimate of $\sigma^2$; the sampler stops once this MCSE falls below a pre-specified value, given by the argument MCSE.Sigma2.Given in the sampling functions.
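For reference, the following is a minimal sketch of a batch mean estimate of the MCSE, assuming sigma2.samples is a plain numeric vector of posterior draws; the package computes this quantity internally.
batch.mcse <- function(samples, num.batches = 30) {
  batch.size <- floor(length(samples)/num.batches)
  used <- samples[seq_len(num.batches * batch.size)] ## drop the remainder
  batch.means <- colMeans(matrix(used, nrow = batch.size)) ## one column per batch
  sd(batch.means)/sqrt(num.batches) ## MCSE of the overall mean
}
## e.g. batch.mcse(sigma2.samples)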
Variable and group selection criteria:
In this package, we adopt the median probability criterion proposed by Barbieri and Berger (2004) for group and variable selection. Specifically, for the variable selection problem, we estimate the posterior inclusion probability of each variable by the proportion of posterior samples in which the variable is included, and select the variable if this estimate exceeds 0.5; groups are selected analogously through the posterior inclusion probabilities of the group indicators.
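A minimal sketch of this criterion, assuming r.samples is an iterations-by-$p$ matrix of sampled indicator values rather than an actual BSGS output object:
post.incl.prob <- colMeans(r.samples) ## estimated posterior inclusion probabilities
selected <- which(post.incl.prob > 0.5) ## the median probability model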
Posterior estimates of regression coefficients:
We use the Rao-Blackwell method to estimate the regression coefficients,
$$\hat{\beta}_i = \frac{1}{N} \sum_{t=1}^{N} E\left(\beta_i \mid r^{(t)}, \sigma^{2(t)}, y\right),$$
where $N$ is the number of posterior samples and the conditional expectation at iteration $t$ is available in closed form under the conjugate prior (1).
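As an illustration of the idea (not the package's internal code), the sketch below averages the closed-form conditional posterior means of the active coefficients across iterations; r.samples, sigma2.samples, and the vector tau2 are hypothetical inputs.
rao.blackwell.beta <- function(X, y, r.samples, sigma2.samples, tau2) {
  p <- ncol(X)
  est <- matrix(0, nrow(r.samples), p)
  for (t in seq_len(nrow(r.samples))) {
    active <- which(r.samples[t, ] == 1)
    if (length(active) > 0) {
      Xa <- X[, active, drop = FALSE]
      ## conditional posterior mean of the active coefficients given r and sigma^2
      A <- crossprod(Xa)/sigma2.samples[t] + diag(1/tau2[active], length(active))
      est[t, active] <- solve(A, crossprod(Xa, y)/sigma2.samples[t])
    }
  }
  colMeans(est) ## average of the conditional means over iterations
}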
Regarding the stability of the estimation, we compare the accuracy of the variable selection by the following measures in the simulation studies: the true classification rate (TCR), the true positive rate (TPR), and the false positive rate (FPR). These are defined as
$$\text{TCR} = \frac{\text{TP} + \text{TN}}{p}, \qquad \text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \qquad \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}},$$
where TP, TN, FP, and FN count the true positives, true negatives, false positives, and false negatives among the $p$ selection decisions. TCR is an overall evaluation of accuracy in the identification of the active and inactive variables. TPR is the average rate of active variables correctly identified, and is used to measure the power of the method. FPR is the average rate of inactive variables included in the regression, and can be considered the type I error rate of the selection approach. For these three criteria, larger values of TCR and TPR and smaller values of FPR are preferred. We also report the mean squared error,
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$
where $\hat{y}_i$ is the fitted value of the $i$-th observation under the selected model.
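A minimal sketch of these four measures, assuming r.true and r.est are 0/1 vectors of true and selected indicators and y.hat holds the fitted values (hypothetical inputs; the package provides its own TCR.TPR.FPR.* and MSE.* functions for its output objects):
selection.measures <- function(r.true, r.est, y, y.hat) {
  list(TCR = mean(r.est == r.true), ## overall classification rate
       TPR = mean(r.est[r.true == 1] == 1), ## power
       FPR = mean(r.est[r.true == 0] == 1), ## type I error rate of selection
       MSE = mean((y - y.hat)^2))
}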
Two simulations and a real example are provided to demonstrate the use of functions in the BSGS package.
The traditional variable selection problem is illustrated in this simulation. We use an example to illustrate the functions CompWiseGibbsSimple and CompWiseGibbsSMP, corresponding to the CGS and SMP sampling procedures, to simulate samples from the posterior distribution. Based on the samples, we can decide which variables are important in the regression model. In this simulation, the data are generated from the model above with $n = 50$ observations and $p = 100$ uncorrelated covariates, of which only the first five have nonzero coefficients.
require(BSGS)
set.seed(1)
## Generate data
num.of.obs <- 50
num.of.covariates <- 100
beta.g <- matrix(c(3, -3.5, 4, -2.8, 3.2, rep(0, num.of.covariates-5)), ncol = 1)
r.true <- (beta.g != 0) * 1
pair.corr <- 0.0 ## pair correlations between covariates
Sigma <- matrix(pair.corr, num.of.covariates, num.of.covariates)
diag(Sigma) <- rep(1, num.of.covariates)
x <- mvrnorm(n = num.of.obs, rep(0, num.of.covariates), Sigma) ## mvrnorm() is from MASS
sigma2 <- 1
mu <- x %*% beta.g
y <- rnorm(num.of.obs, mu, sqrt(sigma2)) ## rnorm() takes the standard deviation
Regarding the hyperparameters, we simply set $\tau^2 = 10$, $\nu_0 = 10$, and $\lambda_0 = sd(y)$, with prior inclusion probabilities $\rho_i = 0.5$. The initial value of $\beta$ is a ridge-type least squares estimate, the initial value of $\sigma^2$ is 1, and all variables are initially included in the model.
## Specify the values of hyperparameters and initial values of parameters
tau2 <- 10 ## hyperparameter in Eq. (1)
rho <- rep(0.5, num.of.covariates) ## prior inclusion probabilities
nu0 <- 10 ## hyperparameter in Eq. (2)
lambda0 <- sd(y) ## hyperparameter in Eq. (2)
## Initial values for parameters
beta.initial <- t(solve(t(x) %*% x + diag(1/5, num.of.covariates)) %*% t(x) %*% y)
sigma2.initial <- 1
r.initial <- rbinom(num.of.covariates, 1, 1) ## start with all variables included
Now two functions, CompWiseGibbsSimple and CompWiseGibbsSMP, are applied to simulate the samples from the posterior distribution. The minimum number of iterations is 1000, and the simulation stops when the MCSE of the estimate of $\sigma^2$ is less than MCSE.Sigma2.Given = 0.1. Within each iteration, the componentwise update is repeated num.of.inner.iter.default = 10 times.
num.of.iteration <- 1000
num.of.inner.iter.default <- 10
MCSE.Sigma2.Given <- 0.1
## Apply two sampling functions to generate the samples from the
## posterior distribution.
outputCGS <- CompWiseGibbsSimple(y, x, beta.initial, r.initial, tau2,
rho, sigma2.initial, nu0, lambda0, num.of.inner.iter.default,
num.of.iteration, MCSE.Sigma2.Given)
outputSMP <- CompWiseGibbsSMP(y, x, beta.initial, r.initial, tau2,
rho, sigma2.initial, nu0, lambda0, num.of.inner.iter.default,
num.of.iteration, MCSE.Sigma2.Given)
Once the simulation stops, the posterior samples are used to estimate the posterior quantities of interest. One can then check the number of iterations and computational times for both approaches.
## Output from the component-wise Gibbs sampling procedure
outputCGS$Iteration
[1] 1000
outputCGS$TimeElapsed
user system elapsed
61.558 4.815 66.813
## Output from the stochastic matching pursuit procedure
outputSMP$Iteration
[1] 1000
outputSMP$TimeElapsed
user system elapsed
45.970 5.907 52.383
One can then use the function CGS.SMP.PE to identify the important variables and to estimate the parameters. Due to limitations of space, we do not include the estimates here. The functions TCR.TPR.FPR.CGS.SMP and MSE.CGS.SMP are used to evaluate the performance of the model estimation in terms of TCR, TPR, FPR, and MSE.
## Output from the component-wise Gibbs sampling procedure
CGS.SMP.PE(outputCGS)
MSE.CGS.SMP(outputCGS, Y=y, X=x)
[1] 2.921087
TCR.TPR.FPR.CGS.SMP(outputCGS, r.true, 0.5)
$TCR
[1] 1
$TPR
[1] 1
$FPR
[1] 0
## Output from the stochastic matching pursuit procedure
CGS.SMP.PE(outputSMP)
MSE.CGS.SMP(outputSMP, Y=y, X=x)
[1] 3.751427
TCR.TPR.FPR.CGS.SMP(outputSMP, r.true, 0.5)
$TCR
[1] 1
$TPR
[1] 1
$FPR
[1] 0
We provide another simulation to illustrate the use of the functions for sparse group selection. In the following simulation, the data are generated from the model with $n = 50$ observations and $p = 100$ covariates partitioned into 10 groups of 10 variables each; the seven nonzero coefficients fall in groups 1, 2, 5, and 8.
require(BSGS)
set.seed(1)
Num.Of.Iteration <- 1000
Num.of.Iter.Inside.CompWise <- 10
num.of.obs <- 50
num.of.covariates <- 100
num.of.group.var <- 10
Group.Index <- rep(1:10, each = 10)
nu <- 0
lambda <- 1
pair.corr <- 0 ## pair correlations between covariates
Sigma <- matrix(pair.corr, num.of.covariates, num.of.covariates)
diag(Sigma) <- rep(1,num.of.covariates)
X <- mvrnorm(n = num.of.obs, rep(0, num.of.covariates), Sigma)
beta.true <- rep(0, num.of.covariates)
beta.true[c(7, 8, 9, 11, 12, 43, 77)] <- c(3.2, 3.2, 3.2, 1.5, 1.5, -1.5, -2)
beta.true <- cbind(beta.true)
r.true <- (beta.true != 0) * 1
sigma2.true <- 1
Y <- rnorm(num.of.obs, X %*% beta.true, sqrt(sigma2.true)) ## rnorm() takes the standard deviation
Here we suppose that we have no prior information on the parameters. We let $\nu = 0$ and $\lambda = 1$, so the prior of $\sigma^2$ is essentially noninformative, set $\tau_i^2 = 1$ for all coefficients, and assign prior inclusion probabilities of 0.5 to both groups and variables.
## hyperparameters
tau2.value <- rep(1, num.of.covariates)
rho.value <- rep(0.5, num.of.covariates)
theta.value <- rep(0.5, num.of.group.var)
With a reasonable assignment of the initial values of the parameters, we apply two functions, BSGS.Simple and BSGS.Sample, to estimate the posterior quantities of interest and in turn identify the important groups and variables. For illustration, we stop the simulation when the MCSE of the estimate of $\sigma^2$ is less than 0.5.
## Initial values and stopping point
r.value <- rbinom(num.of.covariates, 1, 0.5)
eta.value <- rbinom(num.of.group.var, 1, 0.5)
beta.value <- cbind(c(t(solve(t(X) %*% X +
diag(1/5, num.of.covariates)) %*% t(X) %*% Y))) ## ridge-type initial value
sigma2.value <- 1
MCSE.Sigma2.Given <- 0.5
## Apply two sampling approaches to generate samples
outputSimple <- BSGS.Simple(Y, X, Group.Index, r.value, eta.value, beta.value,
tau2.value, rho.value, theta.value, sigma2.value, nu, lambda,
Num.of.Iter.Inside.CompWise, Num.Of.Iteration, MCSE.Sigma2.Given)
outputSample <- BSGS.Sample(Y, X, Group.Index, r.value, eta.value, beta.value,
tau2.value, rho.value, theta.value, sigma2.value, nu, lambda,
Num.of.Iter.Inside.CompWise, Num.Of.Iteration, MCSE.Sigma2.Given)
One can easily use the function BSGS.PE to estimate the posterior inclusion probabilities of the groups and variables, and the function MSE.BSGS to compute the MSE of the fitted model. Furthermore, the function TCR.TPR.FPR.BSGS is used to evaluate the performance in terms of the accuracy of variable selection. All functions are illustrated as follows. We take 0.5 as the cut-off value for the posterior inclusion probabilities.
## The posterior quantities estimated by two sampling approaches respectively.
## Output from the simple version of BSGS
outputSimple$Iteration
[1] 1000
outputSimple$TimeElapsed
user system elapsed
238.755 0.007 239.037
BSGS.PE(outputSimple)$eta.est
1 2 3 4 5 6 7 8 9 10
1.000 1.000 0.115 0.200 0.926 0.099 0.108 0.991 0.053 0.171
MSE.BSGS(outputSimple, Y=Y, X=X)
[1] 0.574
TCR.TPR.FPR.BSGS(outputSimple, r.true, 0.5)
$TCR
[1] 0.97
$TPR
[1] 1
$FPR
[1] 0.0323
## Output from the sample version of BSGS
outputSample$Iteration
[1] 3700
outputSample$TimeElapsed
user system elapsed
105.535 0.163 105.822
BSGS.PE(outputSample)$eta.est
1 2 3 4 5 6 7 8 9 10
1.00000 0.90514 0.04189 0.02459 0.88000 0.01000 0.00486 0.91135 0.00486 0.00730
MSE.BSGS(outputSample, Y=Y, X=X)
[1] 2.16
TCR.TPR.FPR.BSGS(outputSample, r.true, 0.5)
$TCR
[1] 0.97
$TPR
[1] 1
$FPR
[1] 0.0323
One may be interested in how different stopping points would affect the computational effort and model estimation. We thus compare the computational times and the accuracy of parameter estimation in terms of TCR, TPR, FPR, and MSE for the two sampling approaches, using different MCSE values to stop the simulation; the results are shown in Tables 1 and 2. It has been found that the simple version already meets the stricter stopping rules at close to the minimum number of iterations, whereas the number of iterations required by the sample version grows rapidly as the target MCSE decreases, with its accuracy then becoming comparable to that of the simple version.
Table 1: Results of the simple version (BSGS.Simple) for different MCSE stopping values.

| MCSE | # of iterations | runtime (in sec) | MSE | TCR | TPR | FPR | $\hat{\sigma}^2$ |
|---|---|---|---|---|---|---|---|
| 0.5 | 1000 | 239 | 0.58 | 0.97 | 1 | 0.03 | 1.143 |
| 0.25 | 1000 | 239 | 0.58 | 0.97 | 1 | 0.03 | 1.143 |
| 0.1 | 1100 | 263 | 0.55 | 0.98 | 1 | 0.02 | 1.112 |
Table 2: Results of the sample version (BSGS.Sample) for different MCSE stopping values.

| MCSE | # of iterations | runtime (in sec) | MSE | TCR | TPR | FPR | $\hat{\sigma}^2$ |
|---|---|---|---|---|---|---|---|
| 0.5 | 3700 | 105 | 2.16 | 0.97 | 1 | 0.03 | 1.418 |
| 0.25 | 9500 | 226 | 0.58 | 0.97 | 1 | 0.03 | 1.418 |
| 0.1 | 31200 | 4148 | 0.48 | 0.97 | 1 | 0.03 | 1.047 |
We further investigate the accuracy of parameter estimation and the selection of the variables in terms of MSE, TCR, TPR, and FPR for the two sampling approaches when the number of covariates increases while each group still contains 10 variables. We consider $p$ = 300, 500, and 1000 covariates; the results for the simple and sample versions are shown in Tables 3 and 4, respectively.
Table 3: Results of the simple version (BSGS.Simple) as the number of covariates increases.

| # of covariates | # of iterations | runtime (in sec) | MSE | TCR | TPR | FPR | $\hat{\sigma}^2$ |
|---|---|---|---|---|---|---|---|
| 300 | 1000 | 212 | 0.56 | 0.95 | 0.86 | 0.05 | 1.94 |
| 500 | 1000 | 273 | 0.99 | 0.98 | 1 | 0.02 | 1.803 |
| 1000 | 1000 | 263 | 0.55 | 0.98 | 1 | 0.02 | 1.112 |
Table 4: Results of the sample version (BSGS.Sample) as the number of covariates increases.

| # of covariates | # of iterations | runtime (in sec) | MSE | TCR | TPR | FPR | $\hat{\sigma}^2$ |
|---|---|---|---|---|---|---|---|
| 300 | 2100 | 127 | 1.63 | 0.99 | 1 | 0.01 | 1.819 |
| 500 | 14600 | 2258 | 4.73 | 0.97 | 1 | 0.03 | 1.601 |
| 1000 | 39900 | 39567 | 8.90 | 0.99 | 1 | <0.01 | 2.158 |
Next, we compare the computational time when the number of variables within a group increases. We perform an experiment in which there are seven groups, each containing 15 variables. The assignments of the hyperparameters are the same as those in Simulation II. Table 5 shows the computational times together with the model estimates. It can be seen that the sample version is strongly recommended when the number of variables within a group is greater than 15.
Table 5: Comparison of the two sampling versions with seven groups of 15 variables each.

| Sampling version | # of iterations | runtime (in sec) | MSE | TCR | TPR | FPR | $\hat{\sigma}^2$ |
|---|---|---|---|---|---|---|---|
| Simple | 1000 | 8330 | 0.11 | 0.93 | 1 | 0.08 | 0.90 |
| Sample | 2400 | 71 | 0.75 | 0.99 | 1 | 0.01 | 1.06 |
This subsection further illustrates the functions in BSGS based on an economic dataset from Rose and Spiegel (2010, 2011, 2012), which is available at http://faculty.haas.berkeley.edu/arose. The response variable is the 2008-2009 growth rate, which serves as the crisis measure. Rose and Spiegel originally consider 119 explanatory factors for the crisis for as many as 107 countries, but there are missing data for a number of these.
require(BSGS)
## the whole data set
data(Crisis2008)
To maintain a balanced data set, we use 51 variables for a sample of 72 countries. For more information about the balanced data, please see the description of Crisis2008 in the BSGS package. The balanced data are then analyzed to illustrate the main sampling function BSGS.Simple, which simulates the sample from the posterior distribution. In the analysis, we demean the response so that it is not necessary to include an intercept in the design matrix. All variables except the dummy variables are standardized.
set.seed(1)
data(Crisis2008BalancedData)
var.names <- colnames(Crisis2008BalancedData)[-1]
country.all <- rownames(Crisis2008BalancedData)
cov.of.interest <- colnames(Crisis2008BalancedData)[-1]
Y <- Crisis2008BalancedData[, 1]
Y <- Y - mean(Y)
X <- Crisis2008BalancedData[, -1]
NORMALIZATION <- TRUE ## standardize all covariates except the dummy variables
if (NORMALIZATION) {
dummy.variable <- cov.of.interest[lapply(apply(X, 2, unique), length) == 2]
non.dummy.X <- X[, !(colnames(X) %in% dummy.variable)]
X.normalized <- apply(non.dummy.X, 2, function(XX) (XX - mean(XX))/sd(XX))
X[, !(colnames(X) %in% dummy.variable)] <- X.normalized
}
As discussed in Ho (2014), these variables can be classified into the nine theoretical groups of the crisis’ origin (the number in parentheses indicates the number of variables considered in the group): principal factors (10), financial policies (three), financial conditions (four), asset price appreciation (two), macroeconomic policies (four), institutions (11), geography (four), financial linkages (one), and trade linkages (12). Based on this information, we assign a group index to each variable.
Group.Index <- rep(1:9, c(10, 3, 4, 2, 4, 11, 4, 1, 12))
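As a quick check, the implied group sizes can be compared against the counts listed above:
table(Group.Index)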
Since the number of covariates within each group is moderate, it is recommended that the simple version of the GWGS be applied to generate the samples. In this example, we have tested different values of Num.of.Iter.Inside.CompWise and, for stability, update the componentwise Gibbs sampler Num.of.Iter.Inside.CompWise = 100 times within a group.
Num.Of.Iteration <- 1000
Num.of.Iter.Inside.CompWise <- 100
num.of.obs <- nrow(X)
num.of.covariates <- ncol(X)
num.of.groups <- length(unique(Group.Index))
nu <- 0
lambda <- 1
beta.est <- lm(Y ~ X - 1)$coef
beta.est[is.na(beta.est)] <- 0 ## replace inestimable coefficients with 0
beta.value <- beta.est
sigma2.value <- 1
r.value <- rep(0, num.of.covariates)
eta.value <- rep(0, num.of.groups)
tau2.value <- rep(10, num.of.covariates)
rho.value <- rep(0.5, num.of.covariates)
theta.value <- rep(0.5, num.of.groups)
MCSE.Sigma2.Given <- 0.5
outputCrisis2008 <- BSGS.Simple(Y, X, Group.Index, r.value, eta.value, beta.value,
tau2.value, rho.value, theta.value, sigma2.value, nu, lambda,
Num.of.Iter.Inside.CompWise, Num.Of.Iteration, MCSE.Sigma2.Given)
The posterior probabilities of the group and variable indicators can then be estimated from the output; groups and variables whose estimated inclusion probabilities exceed 0.5 are identified as important.
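A sketch of this post-processing step, assuming the accessors behave as in the simulations above (eta.est appears there; an analogous accessor for the variable indicators, here called r.est, is a hypothetical name):
est <- BSGS.PE(outputCrisis2008)
est$eta.est ## posterior inclusion probabilities of the nine groups
## hypothetical accessor: variables with inclusion probability above 0.5
## selected.vars <- var.names[est$r.est > 0.5]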
This paper illustrated the usage of a new R package, BSGS, for identifying the important groups of variables and the important variables within them in linear regression models. Furthermore, BSGS can easily be applied to problems with a large number of candidate variables, even when this number exceeds the number of observations.
We are confident that this package can be applied to many important real-world problems by keeping flexibility with regard to selecting variables within a group based on the hierarchical assignment of two layers of indicator variables. For instance, in the gene-set selection problem, a biological pathway may be related to a certain biological process, but it may not necessarily mean all the genes in the pathway are all related to the biological process. We may want not only to remove unimportant pathways effectively, but also identify important genes within important pathways.
This work is partially supported by the Ministry of Science and Technology under grants MOST 103-2633-M-006-002 (Lee) and MOST 103-2118-M-006-002-MY2 (Chen), and by the Mathematics Division of the National Center for Theoretical Sciences in Taiwan.