In multiple regression models with a large number of explanatory variables, it is usual that only a subset of them has a real effect on the response. In such a framework, the question of how to choose the best subset of predictors of a given size q arises naturally.
Model selection (and variable selection in regression, in particular) is a trade-off between bias and variance. Inference based on models with few variables can be biased; however, models that take too many variables into account may lack precision or identify false effects. These considerations call for a balance between under- and over-fitted models, the so-called model-selection problem (Forster 2000).
To solve this problem, several strategies have been proposed. One common
option is to use iterative procedures, such as the leaps and bounds
algorithm (Furnival and Wilson 1974) through which the best
subset selection is obtained. This is a full information criteria-based
approach, which compares all possible models and ranks them
(Calcagno and Mazancourt 2010). Nevertheless, the problem
of selecting the best model from among all possible combinations of predictors quickly becomes computationally infeasible as the number of variables grows.
Several R packages have been developed to carry out automatic variable
selection or model selection. For instance, the
meifly package (Wickham 2014)
can be used to search through all the different models. This type of
exhaustive search can be also addressed using some other algorithm, such
as the branch-and-bound algorithm in the
leaps package (Lumley and Miller 2009) or
the leaps-and-bounds algorithm in the
subselect package
(Orestes Cerdeira et al. 2015). Both packages also implement other selection methods
(heuristics). The leaps
package includes forward and backward stepwise selection as well as sequential
replacement, while the
subselect package
provides a simulated annealing-type search algorithm, a genetic
algorithm, and a restricted local improvement algorithm. To use the
Lasso method, the user can apply, for example, the lars
function
implemented in the lars
package (Hastie and Efron 2013) or the glmnet
function, which fits a
generalized linear model via penalized maximum likelihood, implemented
in the glmnet package
(Friedman et al. 2015). Additionally, another procedure widely used by the R
community is the model-selection-oriented step
function
(Hastie and Pregibon 1992) built into the stats package. When it comes to model
selection with generalized linear models, one option could be to use the
glmulti package
(Calcagno 2013) or
bestglm package
(McLeod and Xu 2014). Finally, within the class of generalized
additive models, other algorithms have also been introduced to achieve
component selection; see Lin and Zhang (2006) and references therein, the boosting
technique of Tutz and Binder (2006), or the approach of
Belitz and Lang (2008). More recently, and widely applied by R users, the gam
function of the mgcv
package (Wood 2006, 2011)
includes an argument (select = TRUE
) for model selection and fitting
in a single step by adding a second penalty term in the estimation
scheme of continuous covariates (Marra and Wood 2011).
The FWDselect package introduces an alternative to existing approaches in the form of a methodology whereby R users can select the best variables in different regression contexts, both parametric and nonparametric. Unlike other existing packages, the procedure implemented in it can be applied equally to linear models, generalized linear models and generalized additive models, with Gaussian, binary or Poisson response. The forward selection algorithm proposed is based on a greedy procedure which changes one variable at a time in the model, keeping the others fixed, and does this repeatedly until none of the selected variables can be exchanged to improve the model fit. Being greedy, the algorithm may not find the actual best solution, but it is less greedy than other methods such as step. In addition, in contrast with other packages in which users must decide, either before or after selection, the number of variables to be included, the bootstrap test introduced in our procedure allows them to determine this number at a given significance level.
The remainder of this paper is organised as follows. First we describe
the algorithm used to select the best subset of a given size q and the bootstrap-based test used to decide how many variables should enter the model. We then describe the structure of the package and illustrate its use through an application to real data.
This section introduces the developed methodology and gives a description of the variable selection algorithm. The implemented package can be used with Gaussian, binary or Poisson responses; however, in line with the application data, we explain the procedure using a nonparametric regression model with Gaussian response.
Let $Y$ be the response variable and $\boldsymbol{X} = (X_1, \ldots, X_p)$ the vector of covariates. We consider the additive regression model
$$Y = \alpha + \sum_{j=1}^{p} m_j(X_j) + \varepsilon, \quad (1)$$
where the $m_j$ are smooth and unknown partial functions and $\varepsilon$ is the error, assumed to have zero mean and finite variance. This implies that $E[Y \mid \boldsymbol{X}] = \alpha + \sum_{j=1}^{p} m_j(X_j)$.
To date, several approaches to estimating the model in (1) have been suggested in the statistical literature, e.g., Buja et al. (1989), Härdle and Hall (1993), and Mammen et al. (1999). In this package, penalized regression splines, as implemented in the mgcv package, are used.
It is important to highlight that, in situations involving a large number of variables, correct estimation of the response will be obtained on the basis of selecting the appropriate predictors. If we have a priori information about which of the initial set of variables are relevant, it would be possible to apply a likelihood ratio test (Neyman and Pearson 1928) or an F-test (Seber and Wild 1989; Seber 1997) in a parametric framework, or a generalized likelihood ratio test (Fan et al. 2001; Fan and Jiang 2005, 2007) in a nonparametric one. However, in situations where we do not have information in advance, it will be necessary to select the model according to a selection algorithm.
According to this, we propose a procedure that comprises two steps: i)
selecting the best combination of q variables, and ii) determining the minimum number of variables to be included in the model.
The first step of our procedure is, given a number q, to select the best subset of q variables by means of the following algorithm.
Let $\{j_1, \ldots, j_q\} \subseteq \{1, \ldots, p\}$ be a set of q candidate indices. The corresponding submodel is
$$Y = \alpha + m_{j_1}(X_{j_1}) + \cdots + m_{j_q}(X_{j_q}) + \varepsilon,$$
where the $m_{j_k}$ are again smooth and unknown functions.
Algorithm 1 Modified forward stepwise selection method
1. Given a number q, initialize the subset of selected variables as empty.
2. The elements of the vector of indices are obtained sequentially:
Firstly, determine the variable of the first position: fit all possible single-variable models and keep the covariate that minimizes the information criterion.
Note that all possible models of one variable must be estimated.
Fix the first variable obtained previously and, fitting all two-variable models that contain it, determine the second one.
Fix the first and second variables and obtain the third one analogously.
Fix the previously chosen variables and continue in the same way until q variables have been selected.
3. Once the q variables have been selected, check whether exchanging any of them for a covariate not yet included improves the information criterion; if so, perform the exchange.
4. Step 3 is repeated until there is no change in the selected
covariates, i.e., the algorithm stops when it has gone through a
complete cycle without changing any of the q positions.
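A minimal sketch of this search in R may clarify the mechanics. It uses lm fits and in-sample deviance for brevity (the package itself uses cross-validation and also supports glm and gam); the function name and internals here are illustrative, not the package's implementation:

```r
# Illustrative sketch of the modified forward stepwise search (Algorithm 1).
# 'x' is a data frame of covariates, 'y' the response, 'q' the subset size.
fwd_sketch <- function(x, y, q) {
  crit <- function(vars) {                 # deviance of the model with 'vars'
    deviance(lm(y ~ ., data = data.frame(x[, vars, drop = FALSE], y = y)))
  }
  sel <- integer(0)
  for (k in seq_len(q)) {                  # step 2: grow the subset one by one
    cand <- setdiff(seq_len(ncol(x)), sel)
    scores <- sapply(cand, function(j) crit(c(sel, j)))
    sel <- c(sel, cand[which.min(scores)])
  }
  repeat {                                 # steps 3-4: exchange until stable
    changed <- FALSE
    for (pos in seq_len(q)) {
      for (j in setdiff(seq_len(ncol(x)), sel)) {
        trial <- sel; trial[pos] <- j
        if (crit(trial) < crit(sel)) { sel <- trial; changed <- TRUE }
      }
    }
    if (!changed) break
  }
  sort(sel)
}
```

Because each cycle only ever replaces one variable at a time, the cost grows linearly rather than combinatorially in the number of candidate covariates.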
Previously, the best subset of q variables was selected for a fixed q. In practice, however, the number of variables that should enter the model is unknown.
Accordingly, we propose a procedure to test the null hypothesis that this number is q, i.e., that the covariates not included in the best subset of size q have no effect on the response. For a subset of size q, the null hypothesis
$$H_0(q): E[Y \mid \boldsymbol{X}] = \alpha + \sum_{k=1}^{q} m_{j_k}(X_{j_k})$$
is tested versus the general hypothesis that at least one of the remaining covariates has a nonnull partial effect,
where $\{j_1, \ldots, j_q\}$ denotes the best subset of size q.
Given an i.i.d. sample $\{(\boldsymbol{X}_i, Y_i)\}_{i=1}^{n}$, the testing procedure is as follows:
Obtain the best subset of q variables by applying Algorithm 1.
Obtain the nonparametric estimates of the null model, $\hat{Y}_i$, for $i = 1, \ldots, n$.
Compute the residuals as $\hat{\varepsilon}_i = Y_i - \hat{Y}_i$ and fit a nonparametric model of these residuals on one of the covariates not included in the null model,
where the resulting fit measures the structure remaining in the residuals.
The purpose of this step is to assess whether there is enough structure left in the residuals that could be modeled by the predictors not included in the null model. Note that, among these possible predictors, we select only one in order to reduce the computational cost of the algorithm. However, the ideal solution would be to estimate the model in (4) by determining the best subset of predictors among the remaining variables, instead of selecting only one of them.
Both options are implemented in the package by means of the
speedup
argument of the test
function. If speedup = TRUE
is
specified a unique predictor for the residuals is used. If
speedup = FALSE
is specified the user can choose more than one
predictor. With this latter option, when the number of variables is
large, the selection of the best subset of predictors for the
residuals entails a high computational burden. In this case the qmin
argument must be filled in by the user. This argument corresponds to the size of
the best subset of predictors for the residuals. In order to help the user select it,
it is recommended to visualize the graphical output of the plot
function and to choose the number q that minimizes the curve.
Finally, we propose a test statistic based on the
estimates of the null model and of the residual fit obtained in the previous step.
It is important to stress that, if the null hypothesis holds, the residual fit, and therefore the test statistic, should be close to zero.
The distribution of the test statistic under the null hypothesis is approximated by bootstrap resampling as follows.
Step 1: Obtain the test statistic $T$ from the original sample.
Step 2: For $b = 1, \ldots, B$, generate bootstrap samples under the null hypothesis using the wild bootstrap,
where the residuals are resampled so as to preserve their conditional variance.
Step 3: For each bootstrap sample, compute the bootstrap test statistic $T^{*b}$.
The test rule based on $T$ consists of rejecting the null hypothesis if $T$ is larger than the $(1-\alpha)$-quantile of $T^{*1}, \ldots, T^{*B}$.
Applying this test sequentially to $q = 1, 2, \ldots$, the number of variables to be included in the model is the smallest q for which the null hypothesis is not rejected.
The validation of the approach relying on the bootstrap-based test can be consulted in Sestelo (2013), where the type I error and power have been calculated for different test statistics. The performance of the test under different levels of correlation between covariates has also been analyzed. All the test statistics perform reasonably well, with the level coming relatively close to the nominal size and the probability of rejection rising as we move away from the null hypothesis, especially with large sample sizes. Furthermore, several simulation studies have been carried out in order to compare the methodology proposed in this paper with other procedures reported in the literature that perform automatic variable selection.
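The logic of the residual-based bootstrap test can be sketched as follows, here with linear fits and a simple sign-flipping wild bootstrap for brevity. This is a schematic illustration under our own simplifying assumptions, not the package's implementation, and all names are ours:

```r
# Schematic wild-bootstrap test of H_0: the 'null_vars' covariates suffice.
# The alternative adds one extra candidate covariate to the residual fit
# (in the spirit of speedup = TRUE).
boot_test_sketch <- function(x, y, null_vars, extra_var, nboot = 200) {
  d0 <- data.frame(x[, null_vars, drop = FALSE], y = y)
  res <- residuals(lm(y ~ ., data = d0))           # null-model residuals
  stat <- function(r) {                            # structure left in residuals
    d1 <- data.frame(e = x[[extra_var]], r = r)
    sum(residuals(lm(r ~ 1, d1))^2) - sum(residuals(lm(r ~ e, d1))^2)
  }
  t_obs <- stat(res)
  t_boot <- replicate(nboot, {                     # resample under H_0
    r_star <- res * sample(c(1, -1), length(res), replace = TRUE)
    stat(r_star)
  })
  mean(t_boot >= t_obs)                            # bootstrap p-value
}
```

A large statistic relative to its bootstrap distribution (a small p-value) indicates that the omitted covariate still explains part of the residuals, so the null subset is too small.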
This section introduces an overview of how the package is structured. FWDselect is a shortcut for “Forward selection” and this is its major functionality: to provide a forward stepwise-based selection procedure. This software helps the user select relevant variables and evaluate how many of these need to be included in a regression model. In addition, it enables both numerical and graphical outputs to be displayed.
Our package includes several functions that enable users to select the variables to be included in linear models, generalized linear models or generalized additive models. The functions within FWDselect are briefly described in Table 1.
Users can obtain the best combination of q variables by applying the
selection function. Additionally, if one wants to obtain
the results for more than one subset size, it is possible to apply the
qselection
function, which returns a summary table showing the
different subsets, selected variables and information criterion values.
These values are obtained by cross-validation so that the resulting
models, which include different numbers of variables, can be compared
correctly. The object obtained with this last function is the
argument required for plot
, which provides a graphical output.
Finally, to determine the number of variables that should be introduced
in the model, only the test
function needs to be applied. Table
2 provides a summary of the arguments of the
selection
, qselection
and test
functions. The most computationally
demanding parts of the code, namely those that involve the estimation of
the models, the cross-validation and the bootstrap, have been
parallelized by means of the parallel package via forking on
Unix-alike platforms or creating a PSOCK cluster on Windows systems.
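A portable way to set up such a backend with the parallel package, analogous to the fork/PSOCK split described above, might look like this (the workload shown is a placeholder, not package code):

```r
library(parallel)

heavy_task <- function(i) { mean(rnorm(1e4)); i^2 }  # placeholder workload

# Leave one core free; fall back to 1 if the core count cannot be detected
ncores <- max(1L, detectCores() - 1L, na.rm = TRUE)
if (.Platform$OS.type == "unix") {
  # Forking: cheap process creation, copy-on-write memory (Unix-alikes only)
  out <- mclapply(1:8, heavy_task, mc.cores = ncores)
} else {
  # PSOCK cluster: separate R processes, works everywhere including Windows
  cl <- makePSOCKcluster(ncores)
  out <- parLapply(cl, 1:8, heavy_task)
  stopCluster(cl)
}
unlist(out)
```

Note the trade-off mentioned in the cluster argument description below: for small workloads the cost of launching workers and gathering results can exceed the computation itself.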
Function | Description |
---|---|
selection |
Main function for selecting the best subset of q variables in lm , glm or gam models. |
print.selection |
Method of the generic print function for "selection" objects, which returns a short summary. |
qselection |
Function that enables users to obtain the selected variables for more than one size of subset. |
print.qselection |
Method of the generic print function for "qselection" objects. Returns a table showing the chosen covariates to be introduced into the models and their information criteria obtained by cross-validation. |
plot.qselection |
Visualisation of "qselection" objects. Plots the cross-validation information criterion for the several subset sizes considered. |
test |
Function that applies a bootstrap-based test for covariate selection. Helps determine the precise number of variables to be included in the model. |
selection() arguments |
|
---|---|
x |
A data frame containing all the covariates. |
y |
A vector with the response values. |
q |
An integer specifying the size of the subset of variables to be selected. |
prevar |
A vector containing the indices of the variables selected previously, which are kept fixed during the search. |
criterion |
The information criterion to be used. Default is the "deviance" . Other functions provided are the coefficient of determination ("R2" ), the residual variance ("variance" ), the Akaike information criterion ("aic" ), AIC with a correction for finite sample sizes ("aicc" ) and the Bayesian information criterion ("bic" ). The deviance, coefficient of determination and variance are calculated by cross-validation. |
method |
A character string specifying which regression method is used, "lm" , "glm" or "gam" . |
family |
This is a family object specifying the distribution and link to use in fitting. |
seconds |
A logical value. If TRUE then, rather than returning the single best model only, the function returns a few of the best models. |
nmodels |
Number of secondary models to be returned. |
nfolds |
Number of folds for the cross-validation procedure, for deviance , R2 or variance criterion. |
cluster |
A logical value. If TRUE (default) the code is parallelized. Note that in cases without enough repetitions (e.g., a low number of initial variables) R may gain in performance through serial computation: R takes time to distribute tasks across the processors and to bind the results together later on. Therefore, if the time for distributing and gathering pieces is greater than the time needed for single-thread computing, it could be better not to parallelize. |
ncores |
An integer value specifying the number of cores to be used in the parallelized procedure. If NULL , the number of cores to be used is equal to the number of cores of the machine. |
qselection() arguments |
|
x |
A data frame containing all the covariates. |
y |
A vector with the response values. |
qvector |
A vector with more than one variable-subset size to be selected. |
criterion |
The information criterion to be used. Default is the "deviance" . Other functions provided are the coefficient of determination ("R2" ), the residual variance ("variance" ), the Akaike information criterion ("aic" ), AIC with a correction for finite sample sizes ("aicc" ) and the Bayesian information criterion ("bic" ). The deviance, coefficient of determination and variance are calculated by cross-validation. |
method |
A character string specifying which regression method is used, "lm" , "glm" or "gam" . |
family |
This is a family object specifying the distribution and link to use in fitting. |
nfolds |
Number of folds for the cross-validation procedure, for deviance , R2 or variance criterion. |
cluster |
A logical value. If TRUE (default) the code is parallelized. |
ncores |
An integer value specifying the number of cores to be used in the parallelized procedure. If NULL , the number of cores to be used is equal to the number of cores of the machine. |
test() arguments |
|
x |
A data frame containing all the covariates. |
y |
A vector with the response values. |
method |
A character string specifying which regression method is used, "lm" , "glm" or "gam" . |
family |
This is a family object specifying the distribution and link to use in fitting. |
nboot |
Number of bootstrap repeats. |
speedup |
A logical value. If TRUE (default), the testing procedure is computationally efficient since it considers one more variable to fit the alternative model than the number of variables used to fit the null. If FALSE , the fit of the alternative model is based on considering the best subset of variables of size greater than qmin . |
qmin |
By default NULL . If speedup is FALSE , qmin is an integer number selected by the user. To help the selection of this argument, it is recommended to visualize the graphical output of the plot function and choose the number q which minimizes the curve. |
unique |
A logical value. By default FALSE . If TRUE , the test is performed only for one null hypothesis, given by the argument q . |
q |
By default NULL . If unique is TRUE , q is the integer size of the single subset under the null hypothesis to be tested. |
bootseed |
Seed to be used in the bootstrap procedure. |
cluster |
A logical value. If TRUE (default), the testing procedure is parallelized. |
ncores |
An integer value specifying the number of cores to be used in the parallelized procedure. If NULL , the number of cores to be used is equal to the number of cores of the machine. |
In this section we illustrate the use of the FWDselect package using a
real data set, the pollution data (included in the package). The
software is applied to the prediction of atmospheric SO2 pollution incidents.
In previous studies
(Prada-Sánchez and Febrero-Bande 1997; Prada-Sánchez et al. 2000; Roca-Pardiñas et al. 2004; García-Jurado et al.),
semiparametric, partially linear models and generalized additive models
with unknown link functions were applied to the prediction of
atmospheric SO2 pollution incidents.
Let the covariates be the SO2 values registered at 5-minute intervals before the current instant, and the response the future value to be predicted. The data look as follows:
> library(FWDselect)
> data(pollution)
> head(pollution)[1:2, ]
In17 In16 In15 In14 In13 In12 In11 In10 In9 In8
1 3.02 3.01 3.01 3.01 3.01 3.03 3.03 3.03 3.03 3.03
2 16.49 16.55 16.42 16.35 16.56 16.75 16.74 16.72 16.63 16.53
In7 In6 In5 In4 In3 In2 In1 In0 InY
1 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 10.78
2 16.32 16.08 15.77 15.47 14.81 14.30 13.70 13.35 10.65
The variables from In17
to In0
correspond to the registered values of SO2 at different temporal instants: In0
denotes the zero (current) instant, In1
corresponds to the 5-min temporal instant before, In2
is the 10-min temporal instant before, and so on. InY
refers to the response variable, the SO2 value to be predicted ahead in time.
It may often be of interest to determine the best subset of variables of
a given size. For a subset of size one, we use the selection
function with the argument q = 1
. Also, based on the model
that we want to estimate here (an additive model), we have to use "gam
"
in the method
argument.
> x <- pollution[, -19]
> y <- pollution[, 19]
> obj1 <- selection(x, y, q = 1, method = "gam", criterion = "deviance")
> obj1
****************************************************
Best subset of size q = 1 : In0
Information Criterion Value - deviance : 278663
****************************************************
For more than one subset size, the qselection
function returns a table
for the different subset sizes, with the selected variables and the
information criterion value.
> obj2 <- qselection(x, y, qvector = c(1:6), method = "gam", criterion = "deviance")
[1] "Selecting subset of size 1 ..."
[1] "Selecting subset of size 2 ..."
[1] "Selecting subset of size 3 ..."
[1] "Selecting subset of size 4 ..."
[1] "Selecting subset of size 5 ..."
[1] "Selecting subset of size 6 ..."
> obj2
q deviance selection
1 1 278662.959 In0
2 2 201474.673 In0, In2
3 3 232164.509 In0, In2, In1
4 4 219941.426 In0, In3, In1, In5
5 5 184293.934 In0, In3, In1, In7, In6
6 6 200877.902 In0, In3, In1, In7, In6, In5
The above function output is a useful display that greatly helps determine the most relevant variables. A plot of this object can easily be obtained by using the following input command:
> plot(obj2)
Figure 1 shows the deviance values (obtained by
cross-validation) corresponding to the different subsets. For each
subset size, the selected variables are displayed.
The performance of the proposed predictors was then evaluated in a new
real pollution episode. We estimate firstly each of the proposed models
using the gam
function of the mgcv
package with the training data
set (pollution
data). Then, we apply the predict.gam
function to
each model using, in this case, the test data set. These data are found
in the episode
data, also included in this package. The corresponding
data frame is illustrated as follows:
> data(episode)
> head(episode)[1:2, ]
In17 In16 In15 In14 In13 In12 In11 In10 In9 In8 In7 In6 In5
1 3.02 3.02 3.03 3.10 3.10 3.10 3.10 3.22 3.27 3.33 3.36 3.38 3.47
2 3.02 3.03 3.10 3.10 3.10 3.10 3.22 3.27 3.33 3.36 3.38 3.47 3.50
In4 In3 In2 In1 In0 InY time
1 3.50 3.56 3.61 4.28 4.60 5.45 00:00
2 3.56 3.61 4.28 4.60 4.68 6.20 00:05
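The fit-then-predict workflow just described can be sketched with mgcv alone. Since the verifiable analysis requires the pollution and episode data frames from FWDselect, the sketch below uses simulated stand-in data with the same column roles; it illustrates the gam/predict.gam pattern, not the paper's exact models:

```r
library(mgcv)

# Simulated stand-ins for the training (pollution) and test (episode) data
set.seed(1)
train <- data.frame(In0 = runif(200), In2 = runif(200))
train$InY <- sin(2 * pi * train$In0) + 2 * train$In2 + rnorm(200, 0, 0.1)
test <- data.frame(In0 = runif(50), In2 = runif(50))
test$InY <- sin(2 * pi * test$In0) + 2 * test$In2 + rnorm(50, 0, 0.1)

# Fit the additive model on the training data...
fit <- gam(InY ~ s(In0) + s(In2), data = train)
# ...then evaluate it on the new episode via predict.gam
pred <- predict(fit, newdata = test)
mse <- mean((test$InY - pred)^2)
```

In the real application, train and test would simply be replaced by the pollution and episode data frames, with the formula built from the selected In variables.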
The course of the incident is depicted in Figure 2.
Temporal instants are plotted on the horizontal axis and the real 2-hour
mean SO2 concentrations on the vertical axis.
The predictions obtained with the inclusion of one, two and five
variables in the model are compared in terms of mean squared error (MSE) in the following table:
Model | MSE |
---|---|
q = 1 | 682.14 |
q = 2 | 366.44 |
q = 5 | 556.49 |
The question that now arises is what is the minimum number of variables
that must be used in order to obtain the best prediction. It is possible
to deduce that there is an optimal intermediate point between the number
of variables that enters the model (preferably low) and the deviance
value (preferably also low). To find this point, the test described in
the previous section is applied for successive null hypotheses. Note
that the test runs with cluster = TRUE
(default). This should considerably increase
the performance on multi-core / multi-threading machines.
> test(x, y, nboot = 100, method = "gam", bootseed = 0413)
[1] "Processing IC bootstrap for H_0 ( 1 )..."
[1] "Processing IC bootstrap for H_0 ( 2 )..."
*************************************
Hypothesis Statistic pvalue Decision
1 H_0 (1) 5779.03 0 Rejected
2 H_0 (2) 959.21 0.78 Not Rejected
The deduction to be drawn is that, for a 5% significance level, the null
hypothesis is rejected for q = 1 but not for q = 2, so two variables should be included in the model.
Lastly, as mentioned before, there are other alternatives for variable
selection in additive models. One of the best-known and most widely used procedures
is the argument select
of the gam
function from the mgcv package
(Marra and Wood 2011). To illustrate and compare its usage
with our procedure, we have estimated the model in
((5)) by means of the cited function using the
pollution
data. Then, its performance was evaluated using again the
episode
data. The prediction obtained using this double penalty GAM is
far from what it should be (see Figure 3); in fact, the mean
squared error obtained (5 024.29) is the worst of all so far (see the
code in the Appendix). It seems that, in situations with a large number of
variables, selecting the best subset could be a better approach.
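The double-penalty approach can be reproduced in miniature on simulated data. The sketch below shows the mgcv usage pattern only (it is not the appendix code): with select = TRUE, a second penalty on each smooth's null space lets irrelevant terms be shrunk effectively to zero:

```r
library(mgcv)

set.seed(1)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
dat$y <- sin(2 * pi * dat$x1) + rnorm(n, 0, 0.2)   # only x1 matters

# select = TRUE adds the extra penalty that allows full removal of terms
fit <- gam(y ~ s(x1) + s(x2) + s(x3), data = dat, select = TRUE)
edf <- summary(fit)$edf                            # effective df per smooth
```

The effective degrees of freedom of the irrelevant smooths are driven toward zero, which is how this approach performs selection and fitting in a single step.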
This paper discusses implementation in R of a new algorithm for the problem of variable selection in a regression framework. The FWDselect package provides R users a simple method for ascertaining the relevant variables for prediction purposes and how many of these should be included in the model. The proposed method is a new forward stepwise-based selection procedure that selects a model containing a subset of variables according to an information criterion, and also takes into account the computational cost. Bootstrap techniques have been used to determine the minimum number of variables needed to obtain an appropriate prediction.
In some situations, several statistically equivalent optimal models of
a given size may exist; the user can retrieve these via the argument
seconds = TRUE
of the selection
function. In addition, the software provides the user with a way of
easily obtaining the best subset of variables using different types of
data in different frameworks, by applying the lm
, glm
and gam
functions already implemented in R. The use of these classical R
functions nevertheless entails a high computational cost. Hence, a
further interesting extension would be the implementation of this
package using C, C++ or Fortran as the programming language. R users
could benefit from this in a future version of the package.
Insofar as the validity of the method is concerned, we think that the results obtained with simulated data are correct, and the results with the diabetes data are in accordance with other methodologies. This suggests that the behavior of the procedure in a nonparametric framework will also be adequate.
The results in this paper were obtained using R.
This work was supported by research grant SFRH/BPD/93928/2013 of “Fundação para a Ciência e a Tecnologia” (FCT) and by FEDER Funds through “Programa Operacional Factores de Competitividade - COMPETE”, by Portuguese Funds through FCT, in the form of grant PEst-OE/MAT/UI0013/2014, by grant MTM2011-23204 (FEDER support included) of the Spanish Ministry of Science and Innovation and by grant 10PXIB300068PR from the Galician Regional Authority (Xunta de Galicia).
Here we illustrate the use of FWDselect using simulated data. The use of simulated data allows us to ascertain the behavior of our software and to compare it with other available tools designed for the same purpose.
Consider a vector of 10 covariates, uniformly distributed on [-1, 1], where the response depends only on the first and fifth of them, Y = 2 X1 + 4 X5 + e with e ~ N(0, 1).
The following code simulates 100 observations (n = 100) from this model:
> library(glmulti)
Loading required package: rJava
> library(leaps)
> rm(list = ls())
> set.seed(0413)
> n <- 100
> x <- matrix(runif(10 * n, -1, 1), ncol = 10, nrow = n, byrow = FALSE)
> e <- rnorm(n, 0, 1)
> y <- 2 * x[, 1] + 4 * x[, 5] + e
> data <- data.frame(x, y)
Now we compare our method with other existing methodologies developed
to perform automated variable selection. We choose the leaps package
(regsubsets
function), which selects the best variables for each
subset size; the step
function from the stats
package, which selects a formula-based model using the AIC; and
glmulti, which compares all possible models through an exhaustive
screening of the candidates, or a genetic algorithm, or a very fast
exhaustive branch-and-bound algorithm.
> res1 <- regsubsets(x, y)
> summary(res1)$outmat[2, ]
a b c d e f g h i j
"*" " " " " " " "*" " " " " " " " " " "
> res2 <- step(lm(y ~ ., data = data), trace = FALSE)
> res2
Call:
lm(formula = y ~ X1 + X4 + X5 + X9, data = data)
Coefficients:
(Intercept) X1 X4 X5 X9
0.1538 1.9691 0.4285 3.5774 0.2875
> res3 <- glmulti(y ~ ., data = data, level = 1, plotty = F, report = FALSE)
> summary(res3)$bestmodel
[1] "y ~ 1 + X1 + X4 + X5 + X9"
> res4aux <- test(x, y, nboot = 100)
[1] "Processing IC bootstrap for H_0 ( 1 )..."
[1] "Processing IC bootstrap for H_0 ( 2 )..."
*************************************
Hypothesis Statistic pvalue Decision
1 H_0 (1) 90.92 0 Rejected
2 H_0 (2) 21.11 0.06 Not Rejected
> res4 <- selection(x = x, y = y, q = res4aux$nvar, cluster = FALSE)
> res4
****************************************************
Best subset of size q = 2 : 5 1
Information Criterion Value - deviance : 16.70488
****************************************************
The regsubsets
function is based on all subsets or, in other words,
exhaustive variable selection. The method identifies the best subsets of
linear predictors using a branch-and-bound algorithm (Miller, 2002).
Since this function returns separate best models of all sizes, we
consider only the results obtained for a subset of size two. In this
case, the procedure works properly, returning the two correct variables. The step
function is a widely used methodology for jointly determining the
number and choice of variables. In this case, this procedure fails,
returning a model which includes the effects of four covariates, as
does glmulti. Finally, in order to ascertain the performance of FWDselect,
we firstly apply the test
function with the purpose of determining the
number of variables that have to be included in the model. Then, once
this number is obtained (saved in the returned list as $nvar
), the
selection
function correctly determines the best subset of this size.
Regarding the computation time of these four methods, the fastest
procedure is the one implemented in the leaps package, taking only 0.001
secs. The second is the step
function, which runs in 0.037 secs.
Next comes the glmulti
function, which takes 3.149 secs. Lastly,
the most computationally demanding code is that implemented in the
FWDselect package, which requires 9.181 secs. All the results have been
obtained using R
's system.time
command on a 2.4 GHz Intel Core
i5, with 4 cores and 4 GB of RAM.
The previous results have been obtained using simulated data with a
linear effect of the covariates. However, in practice, the user does not
know the dependence structure, i.e., how the response variable depends
on the covariates. With this in mind, we have applied the four
procedures again in another scenario where the response variable
depends on the same two covariates, but in this case their effects are
nonlinear. In particular, the response is generated as
Y = 2 X1^2 + 2 sin(2 pi X5) + e, with both partial effects being nonlinear:
> y <- 2 * x[, 1]**2 + 2 * sin(2 * pi * x[, 5]) + e
> data <- data.frame(x, y)
> res1 <- regsubsets(x, y)
> summary(res1)$outmat[2, ]
a b c d e f g h i j
" " " " " " " " "*" "*" " " " " " " " "
> res2 <- step(lm(y ~ ., data = data), trace = 0)
> res2
Call:
lm(formula = y ~ X5, data = data)
Coefficients:
(Intercept) X5
0.4764 -1.2377
> res3 <- glmulti(y ~ ., data = data, level = 1, plotty = F, report = FALSE)
> summary(res3)$bestmodel
[1] "y ~ 1 + X5"
> res4aux <- test(x, y, nboot = 100, method = "gam")
[1] "Processing IC bootstrap for H_0 ( 1 )..."
[1] "Processing IC bootstrap for H_0 ( 2 )..."
*************************************
Hypothesis Statistic pvalue Decision
1 H_0 (1) 46.04 0 Rejected
2 H_0 (2) 20.89 0.15 Not Rejected
> res4 <- selection(x, y, q = res4aux$nvar, cluster = FALSE, method = "gam")
> res4
****************************************************
Best subset of size q = 2 : 5 1
Information Criterion Value - deviance : 18.41203
****************************************************
In this case, the performance of the methods changes: excluding
FWDselect, all the procedures fail to select the correct model. The
leaps package returns the fifth and sixth covariates, while both step
and glmulti retain only X5, missing the nonlinear effect of X1.
FWDselect, in contrast, correctly identifies X1 and X5.
The results presented in this appendix have been obtained with one
simulated sample of n = 100 observations.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Sestelo, et al., "FWDselect: An R Package for Variable Selection in Regression Models", The R Journal, 2016
BibTeX citation
@article{RJ-2016-009,
  author = {Sestelo, Marta and Villanueva, Nora M. and Meira-Machado, Luis and Roca-Pardiñas, Javier},
  title = {FWDselect: An R Package for Variable Selection in Regression Models},
  journal = {The R Journal},
  year = {2016},
  note = {https://rjournal.github.io/},
  volume = {8},
  issue = {1},
  issn = {2073-4859},
  pages = {132-148}
}