FWDselect : An R Package for Variable Selection in Regression Models

In multiple regression models, when there are a large number (p) of explanatory variables which may or may not be relevant for predicting the response, it is useful to be able to reduce the model. To this end, it is necessary to determine the best subset of q (q ≤ p) predictors which will establish the model with the best prediction capacity. FWDselect package introduces a new forward stepwisebased selection procedure to select the best model in different regression frameworks (parametric or nonparametric). The developed methodology, which can be equally applied to linear models, generalized linear models or generalized additive models, aims to introduce solutions to the following two topics: i) selection of the best combination of q variables by using a step-by-step method; and, perhaps, most importantly, ii) search for the number of covariates to be included in the model based on bootstrap resampling techniques. The software is illustrated using real and simulated data.


Introduction
In a multivariate regression framework, the target response Y can depend on a set of p initial covariates X 1 , X 2 , . . ., X p but in practical situations we often would like to determine which covariates are "relevant" to describe this response.
The question of how to choose a subset of predictors of size q (q ≤ p) has not totally been satisfactorily solved yet.This problem is particularly important for large p and/or when there are redundant predictors.As a general rule, an increase in the number of variables to be included in a model provides an "apparently" better fit of the observed data; however, these estimates are not always satisfying for different reasons.On the one hand, the inclusion of such irrelevant variables would increase the variance of the estimates, resulting in a partial loss of the predictive capability of the model.On the other hand, the inclusion of too many variables may lead to unnecessary complexity in the resulting model, conducing to a difficult interpretation.
Model selection (and variable selection in regression, in particular) is a trade-off between bias and variance.Inference based on models with few variables can be biased, however, models that take into account too many variables may result in a lack of precision or false effects.These considerations call for a balance between under-and over-fitted models, the so-called model-selection problem (Forster, 2000).
To solve this problem, several strategies have been proposed.One common option is to use iterative procedures, such as the leaps and bounds algorithm (Furnival and Wilson, 1974) through which the best subset selection is obtained.This is a full information criteria-based approach, which compares all possible models and ranks them (Calcagno and de Mazancourt, 2010).Nevertheless, the problem of selecting the best model from among all possible combinations of p predictors is not trivial.In the presence of a large number of variables this selection procedure may require an excessively high computational cost and thus, in some cases, the problem becomes intractable.In order to relax this exhaustive search, heuristic iterative procedures such as forward-and backward-stepwise (Hocking, 1976) have been developed.This greedy algorithm produces a nested sequence of models based on the use of some information criteria which compares the models obtained in the course of the simplification or complexification scheme.Several criteria have been used for this purpose (Venables and Ripley, 1997;Miller, 2002), including Mallow's Cp (Mallows, 1973) or the Akaike Information Criteria or AIC (Akaike, 1973).Apart from the iterative procedures, other strategies applied in the variable selection problem are, e.g.shrinkage regression methods -such as ridge regression or the Lasso (least absolute shrinkage and selection operator) (Tibshirani, 1996;Hastie et al., 2003)-or the Bayesian approach (Green, 1995;Kuo and Mallick, 1998;Park and Casella, 2008;Hans, 2009).
Several R packages have been developed to carry out automatic variable selection or model selection.For instance, the meifly package (Wickham, 2014) can be used to search through all the different models.This type of exhaustive search can be also addressed using some other algorithm, such as the branch-and-bound algorithm in the leaps package (Lumley and Miller, 2009) or the leaps-and-bounds algorithm in the subselect package (Orestes Cerdeira et al., 2015).Both packages also implement other selection methods (heuristics).The leaps package includes the forward or backward stepwise, or sequential replacement while the subselect package provides a simulated annealing-type search algorithm, a genetic algorithm, and a restricted local improvement algorithm.To use the Lasso method, the user can apply, for example, the lars function implemented in the lars package (Hastie and Efron, 2013)  via penalized maximum likelihood, implemented in the glmnet package (Friedman et al., 2015).Additionally, another procedure used by the R community seems to be the model-selection-oriented step function (Hastie and Pregibon, 1992) built into the stats package.When it comes to model selection with generalized linear models, one option could be to use the glmulti package (Calcagno, 2013) or bestglm package (Mcleod and Xu, 2014).Finally, within the class of generalized additive models, other algorithms have also been introduced to achieve component selection, see Lin and Zhang (2006) and references therein, the boosting technique of Tutz and Binder (2006) or the generalization of the approach of Belitz and Lang (2008).More recently, and widely applied by R users, the gam function of the mgcv package (Wood, 2006(Wood, , 2011) ) includes an argument (select = TRUE) for model selection and fitting in a single step by adding a second penalty term in the estimation scheme of continuous covariates (Marra and Wood, 2011).
The FWDselect package introduces an alternative to existing approaches in the form of a methodology whereby R users can select the best variables in different regression contexts, both parametric and nonparametric.Unlike other existing packages, the procedure implemented in it can be equally applied to linear models, generalized linear models or/and generalized additive models, with Gaussian, binary or Poisson response.The forward selection algorithm proposed is based on a greedy procedure which changes one variable at the time in the model -keeping the others fixed -and does this repeatedly until none of the selected variables can be exchanged to improve model fit.This is a greedy algorithm, which may not find the actual best solution, but is less greedy than other methods such as step.In addition, in contrast with other packages in which the users must decide -either previous or post selection -the number of variables that have to be included, the bootstrap test introduced in our procedure allows them to determine the number with a significance level.
The remainder of this paper is organised as follows.First we describe the algorithm used to select the best subset of size q, along with the bootstrap techniques that are used to determine the number of variables to be included in the model.Then a detailed description of the package is presented, and its usage is illustrated through the analysis of three data sets.Finally, the last section contains the main conclusions of this work.

Methodology background
This section introduces the developed methodology and gives a description of the variable selection algorithm.The implemented package can be used with Gaussian, binary or Poisson response, however and based on the application data, we will explain the procedure with a nonparametric regression model with Gaussian response.
Let X = (X 1 , X 2 , . . ., X p ) be a vector of p initial variables and Y the response.An additive regression model can be expressed as where where m j (j = 1, . . ., p) are smooth and unknown functions and ε is the zero-mean error.Additionally, to guarantee the identification of the above model, a constant α is introduced in the model and it is required that the partial functions satisfy E m j X j = 0, j = 1, . . ., p.

This implies that E [Y] = α.
To date, several approaches to estimating the model in (1) have been suggested in the statistical literature, e.g., Buja et al. (1989), Härdle and Hall (1993), Mammen et al. (1999).In this package penalized regression splines, as implemented in the mgcv package, are used.
It is important to highlight that, in situations involving a large number of variables, correct estimation of the response will be obtained on the basis of selecting the appropriate predictors.In the case that we have information a priori about which of the initial set of variables are relevant, it would be possible to apply a likelihood ratio test (Neyman and Pearson, 1928) or a F-test type (Seber and Wild, 1989;Seber, 1997) in a parametric framework, or a generalized likelihood ratio test (Fan et al., 2001;Fan andJiang, 2005, 2007) in a nonparametric one.However, in situations where we do not have information in advance, it will be necessary to select the model according to a selection algorithm.
The R Journal Vol.8/1, Aug. 2016ISSN 2073-4859 According to this, we propose a procedure that includes two topics: i) selecting the best combination of q variables by using a new forward stepwise-based selection procedure; and ii) determining the minimum number of covariates to be included in the model.Both topics are explained as below.

Selecting the best variables
The first topic of our procedure is, given a number q (q ≤ p), to select the best combination of q variables.For this purpose, one option is to use a complete subset selection method as Roca-Pardiñas et al. (2009), which requires all possible models to be considered.When confronted with a large number of variables, however, the computational cost of the procedure can be very high or even prohibitive.In view of this, we provide a new method that speeds up the process based on a heuristic search which aims to approximate the optimal solution.There is no guarantee however that the procedure finds the best subset of covariates -this could only be achieved based on searching through all the possible subsets -but it has the advantage of requiring a smaller number of computations to reach the optimal solution or, at least, close to the optimal one.Let X j 1 , . . ., X j k be a subset of variables of size k (k ≤ q).We define IC j 1 ,...,j k as one possible information criterion (such as AIC, deviance, residual variance, etc.) of the nonparametric model where ε is the zero-mean error.Based on this information criterion, IC, the proposed automatic forward stepwise selection method is given in Algorithm 1.Note that any criterion can be used without correcting it to take account of the number of variables.This is possible because the models which are compared have always the same number of variables.

Testing the number of significant variables
Previously, the best subset of q variables is selected according to an information criterion.However, the question that arises in this procedure is to know the optimal number q.Thus, the second topic in our methodology is to decide the number of covariates that should be included in the model, i.e, determining the number of significant variables.
Accordingly, we propose a procedure to test the null hypothesis of q significant variables in the model versus the alternative in which the model contains more than q variables.Based on the additive model the following strategy is considered: for a subset of size q, considerations will be given to test the null hypothesis where I is the indicator function and considering that the m j 's are not equal to zero on a set of positive probability.
. ., X p , to test the above null hypothesis we propose the following strategy: (i) Obtain the best subset of q predictor variables.To this end we use the selection algorithm described in Algorithm 1.Without loss of generality, we assume that the q variables selected are in the first q positions of the X vector.
(ii) Obtain the nonparametric estimates of the null model as m0 (iii) Compute the residuals as r i = Y i − m0 (X i ) and obtain the nonparametric estimates of g (X i ) according to the model The R Journal Vol.8/1, Aug. 2016 ISSN 2073-4859 Algorithm 1: Modified forward stepwise selection method 1.Given a number q, selects the q variables X l 1 , . . ., X l q which minimises the following expression l 1 , l 2 , . . . ,l q = arg min (j1,...,jq) IC j 1 ,...,j q . (3) 2. The elements of the vector of indices (l 1 , l 2 , . . ., l q ) are selected consecutively in the following manner: (a) Firstly, determine the variable of the first position X l 1 where Note that all possible models of one variable must be estimated.
(b) Fix the first variable obtained previously, X l 1 , and obtain the second one, X l 2 , with (c) Fix X l 1 and X l 2 , and obtain the third one, X l 3 , where IC l 1 ,l 2 ,j 3 .
(d) Fix X l 1 , X l 2 , . . ., X l q−1 , and repeat the procedure analogously until the q-th variable, X l q , with l q = arg min j q 1≤j q ≤p, j q / ∈{l 1 ,...,l q−1 } IC l 1 ,...,j q 3. Once variables X l 1 , X l 2 , . . ., X l q have been selected, run through positions j = 1, . . ., q and replace each l j element as follows, only if the obtained IC is less than the minimum criterion obtained with the previous l j , l j = arg min j j j j / ∈{l 1 ,...,l j−1 ,l j+1 ,...,l q } IC l 1 ,...,l j−1 ,j j ,l j+1 ,...,l q .4. Step 3 is repeated until there is no change in the selected covariates, i.e., the algorithm stops when it has gone through a complete cycle without changing any of the q positions.The R Journal Vol.8/1, Aug. 2016 ISSN 2073-4859 where g is an unknown and smooth function which is applied to a unique covariate.This covariate will be chosen from X q+1 , . . ., X p applying the selection algorithm exposed in Algorithm 1.Without loss of generality, we assume that g (X) = α + g q+1 X q+1 .The purpose of this step is to assess whether there is enough structure left in the residuals that could be modeled by the predictors not included in the null model.Note that, within these possible predictors, we only select one of them in order to reduce the computational cost of the algorithm.However, the ideal solution would be to estimate the model in (4) determining the best subset of predictors within the remainder variables, instead of selecting only one of them.
Both options are implemented in the package by means of the speedup argument of the test function.If speedup = TRUE is specified a unique predictor for the residuals is used.If speedup = FALSE is specified the user can choose more than one predictor.With this latter option, when the number of variables is large, the selection of the best subset of predictors for the residuals requires a high computational burden1 .Therefore, in practice, we propose a solution by using the qmin argument which must be filled by the user.This argument corresponds to the size of the best subset of predictors.In order to help the user select it, it is recommended to visualize the graphical output of the plot function and to choose the number q which minimizes the curve.
(iv) Finally, we propose the following test statistic, based on the estimations of g It is important to stress that, if the null hypothesis holds, T should be close to zero.Thus, the test rule for checking H 0 (q) with a significance level of α is that the null hypothesis is rejected if T is larger than its (1 − α)-percentile.To approximate the distributions of the test statistic resampling methods such as the bootstrap introduced by Efron (1979) (see also Efron and Tibshirani, 1993;Härdle and Mammen, 1993;Kauermann and Opsomer, 2003) can be applied.Here we use the wild bootstrap (Wu, 1986;Liu, 1988;Mammen, 1993) because this method is valid both for homocedastic and heteroscedastic models where the variance of the error is a function of the covariate.The testing procedure consists of the following steps: Step 1: Obtain T from the sample data, as explained above.
Step 2: For i = 1, . . ., n, obtain m0 (X i ) and the bootstrap residuals as where εi = Y i − m0 (X i ) are the residuals of the null model and V 1 , . . . ,V n is an i.i.d.random variable with mass (5+ Step 3: The test rule based on T is given by rejecting the null hypothesis if T > T 1−α , where Applying this test to q = 1, . . ., p − 1 could be an important issue in a covariate selection procedure.If H 0 (q) is not rejected, only the subset of the covariates X j 1 , . . . ,X j q will be retained, and the remaining variables will be eliminated from the model.In all other cases, the test is repeated with q + 1 variables until the null hypothesis is not rejected.For example, if H 0 (1) is not rejected just one variable should be included into the model.If this hypothesis is rejected it will be required to test H 0 (2).If this new hypothesis is again rejected, H 0 (3) should be tested and so on until a certain H 0 (q) is accepted.
The validation of the approach relying on the bootstrap-based test can be consulted in Sestelo (2013) where type I error and power have been calculated for different test statistics.Also, the performance of the test for different levels of correlation between covariates have been analyzed.All the test statistics perform reasonably well, with the level coming relatively close to the nominal size and the probability of rejection rising as we separate from the null hypothesis, specially with large sample sizes.Furthermore, several simulation studies have been considered in order to compare the methodology proposed in this paper with other procedures reported in the literature that carry out automatic variable selection.

FWDselect in practice
This section introduces an overview of how the package is structured.FWDselect is a shortcut for "Forward selection" and this is its major functionality: to provide a forward stepwise-based selection procedure.This software helps the user select relevant variables and evaluate how many of these need to be included in a regression model.In addition, it enables both numerical and graphical outputs to be displayed.
Our package includes several functions that enable users to select the variables to be included in linear models, generalized linear models or generalized additive models.The functions within FWDselect are briefly described in Table 1.
Users can obtain the best combinations of q variables by means of the main function which is selection.Additionally, if one wants to obtain the results for more than one subset size, it is possible to apply the qselection function, which returns a summary table showing the different subsets, selected variables and information criterion values.These values are obtained by cross-validation with the purpose of comparing correctly the resulting models which include a different number of variables.The object obtained with this last function is the argument required for plot, which provides a graphical output.Finally, to determine the number of variables that should be introduced in the model, only the test function needs to be applied.Table 2 provides a summary of the arguments of the selection, qselection and test functions.The most computationally demanding parts of the code, namely those that involve the estimation of the models, the cross-validation and the bootstrap, have been parallelized by means of the parallel package via forking on Unix-alike platforms or creating a PSOCK cluster on Windows systems.

Function Description selection
Main function for selecting a subset of q variables.Note that the selection procedure can be used with lm, glm or gam functions.

print.selection
Method of the generic print function for "selection" objects, which returns a short summary.qselection Function that enables users to obtain the selected variables for more than one size of subset.

print.qselection
Method of the generic print function for "qselection" objects.Returns a table showing the chosen covariates to be introduced into the models and their information criteria obtained by cross-validation.

plot.qselection
Visualisation of "qselection" objects.Plots the cross-validation information criterion for several subsets with size q chosen by users.test Function that applies a bootstrap-based test for covariate selection.
Helps determine the precise number of variables to be included in the model.

Example of application
In this section we illustrate the use of FWDselect package using a real data set, the pollution data (included in the package).The software is applied to the prediction of atmospheric SO 2 pollution incidents by means of additive models.Combustion of fuel oil or coal releases sulphur dioxide into the atmosphere in different quantities.Current Spanish legislation governing environmetrical pollution controls the vicinity of potential point sources of pollution, such as coal-fired power stations.It places a limit on the mean of 24 successive determinations of SO 2 concentration taken at 5-minute intervals.An emission episode is said to occur when the series of bi-hourly means of SO 2 is greater than a specific level, r.In this framework, it is of interest for a plant, both economically and environmentally, to be able to predict, when the legal limit will be exceeded with sufficient time for effective countermeasures to be taken.
In previous studies (García-Jurado et al.;Prada-Sánchez et al., 2000;Prada-Sánchez and Febrero-Bande, 1997;Roca-Pardiñas et al., 2004), semiparametric, partially linear models and generalized additive models with unknown link functions were applied to the prediction of atmospheric SO 2 pollution incidents in the vicinity of a coal/oil-fired power station.Here, we present a new approach to this problem, whereby we try to predict a new emission episode, focusing our attention on the importance of ascertaining the best combinations of time instants for the purpose of obtaining the best prediction.Bearing this in mind, the selection of the optimal subset of variables could be a good approach to this issue.
The R Journal Vol.8/1, Aug. 2016 ISSN 2073-4859 selection() arguments x A data frame containing all the covariates.y A vector with the response values.q An integer specifying the size of the subset of variables to be selected.prevar A vector containing the number of the q − 1 selected variables in the previous step.By default it is NULL.criterion The information criterion to be used.Default is the "deviance".Other functions provided are the coefficient of determination ("R2"), the residual variance ("variance"), the Akaike information criterion ("aic"), AIC with a correction for finite sample sizes ("aicc") and the Bayesian information criterion ("bic").The deviance, coefficient of determination and variance are calculated by cross-validation.method A character string specifying which regression method is used, "lm", "glm" or "gam".family This is a family object specifying the distribution and link to use in fitting.seconds A logical value.If TRUE then, rather than returning the single best model only, the function returns a few of the best models.nmodels Number of secondary models to be returned.nfolds Number of folds for the cross-validation procedure, for deviance, R2 or variance criterion.cluster A logical value.If TRUE (default) the code is parallelized.Note that there are cases without enough repetitions (e.g., a low number of initial variables) that R will gain in performance through serial computation.R takes time to distribute tasks across the processors also it will need time for binding them all together later on.Therefore, if the time for distributing and gathering pieces together is greater than the time needed for single-thread computing, it could be better not to parallelize.ncores An integer value specifying the number of cores to be used in the parallelized procedure.If NULL, the number of cores to be used is equal to the number of cores of the machine −1.

qselection() arguments
x A data frame containing all the covariates.y A vector with the response values.qvector A vector with more than one variable-subset size to be selected.criterion The information criterion to be used.Default is the "deviance".Other functions provided are the coefficient of determination ("R2"), the residual variance ("variance"), the Akaike information criterion ("aic"), AIC with a correction for finite sample sizes ("aicc") and the Bayesian information criterion ("bic").The deviance, coefficient of determination and variance are calculated by cross-validation.method A character string specifying which regression method is used, "lm", "glm" or "gam".family This is a family object specifying the distribution and link to use in fitting.nfolds Number of folds for the cross-validation procedure, for deviance, R2 or variance criterion.cluster A logical value.If TRUE (default) the code is parallelized.ncores An integer value specifying the number of cores to be used in the parallelized procedure.If NULL, the number of cores to be used is equal to the number of cores of the machine −1.

test() arguments
x A data frame containing all the covariates.y A vector with the response values.method A character string specifying which regression method is used, "lm", "glm" or "gam".family This is a family object specifying the distribution and link to use in fitting.nboot Number of bootstrap repeats.speedup A logical value.If TRUE (default), the testing procedure is computationally efficient since it considers one more variable to fit the alternative model than the number of variables used to fit the null.If FALSE, the fit of the alternative model is based on considering the best subset of variables of size greater than q, the one that minimizes an information criterion.The size of this subset must be given by the user filling the argument qmin.qmin By default NULL.If speedup is FALSE, qmin is an integer number selected by the user.To help the selection of this argument, it is recommended to visualize the graphical output of the plot function and choose the number q which minimizes the curve.unique A logical value.By default FALSE.If TRUE, the test is performed only for one null hypothesis, given by the argument q. q By default NULL.If unique is TRUE, q is the integer number q of H 0 (q) to be tested.bootseed Seed to be used in the bootstrap procedure.cluster A logical value.If TRUE (default), the testing procedure is parallelized.ncores An integer value specifying the number of cores to be used in the parallelized procedure.If NULL, the number of cores to be used is equal to the number of cores of the machine − 1.Let t be the present time, and X t the value obtained by the series of bi-hourly means for SO 2 at instant t (5-minute temporal instants).Setting r = 150 µg/m 3 N as the maximum value permitted for the SO 2 concentration, and half-an-hour (6 instants) as the prediction horizon, it is of interest to predict Y = X t+6 , with the best vector of X l = (X t , X t−1 , X t−2 , . . . ,X t−17 ).Note that one of the problems that arises is to decide which temporal instants (X t , X t−1 , X t−2 , . . . ,X t−17 ) are relevant for prediction purposes, since inclusion of all the times X l may well degrade the overall performance of the prediction model.Based on this, we demonstrate the package capabilities using these data.An excerpt of the data frame included in the package is shown below: > library(FWDselect) > data(pollution) > head(pollution)[1:2, ] In17 In16 In15 In14 In13 In12 In11 In10 In9 In8 1 3.02 3.01 3.01 3.01 3.01 3.03 3.03 3.03 3.03 3.03 2 16.49 16.55 16.42 16.35 16.56 16.75 16.74 16.72 16.63 16.53 In7 In6 In5 In4 In3 In2 In1 In0 InY 1 3.03 3.03 3.03 3.03 3.03 3.03 3.03 3.03 10.78 2 16.32 16.08 15.77 15.47 14.81 14.30 13.70 13.35 10.65 The variables from In17 to In0 correspond to the registered values of SO 2 at a specific temporal instant.In0 denotes the zero instant (X t ), In1 corresponds to the 5-min temporal instant before (X t−1 ), In2 is the 10-min temporal instant before (X t−2 ), and so on until the last variable.The last column of the data frame (InY) refers to the response variable, Y = X t+6 , the temporal instant that we wish to predict.For this purpose, we propose the underlying generalised additive model where m j , with j = 0, . . ., 17, are smooth and unknown functions and ε is the error which is assumed to have mean zero.To estimate the model in ( 5), FWDselect allows penalised regression splines, implemented in the mgcv package (Wood, 2003(Wood, , 2004(Wood, , 2011)).
It may often be of interest to determine the best subset of variables of size q needed to predict the response.The question that naturally arises in this application is, what is the best temporal instant for predicting an emission episode.This is easy to ascertain with the function selection and the argument q = 1.Also, based on the model that we want to estimate here (additive model), we have to use "gam" on the method argument.For more than one subset size, the qselection function returns a table for the different subset sizes, with the selected variables and the information criterion value.
> obj2 <-qselection(x, y, qvector = c(1:6), method = "gam", criterion = "deviance") [1] "Selecting subset of size 1 .. The above function output is a useful display that greatly helps determine the most relevant variables.A plot of this object can easily be obtained by using the following input command:

> plot(obj2)
Figure 1 shows the deviance values (obtained by cross-validation) corresponding to the different subsets.In each subset, q represents the number of temporal instants included in the model.Note, however, that only the results until subset of size q = 6 are shown because, from this size onwards, the rest of the obtained models have similar deviances.The performance of the proposed predictors was then evaluated in a new real pollution episode.We estimate firstly each of the proposed models using the gam function of the mgcv package with the training data set (pollution data).Then, we apply the predict.gamfunction to each model using, in this case, the test data set.These data are found in the episode data, also included in this package.The corresponding data frame is illustrated as follows: In17 In16 In15 In14 In13 In12 In11 In10 In9 In8 In7 In6 In5 1 3.02 3.02 3.03 3.10 3.10 3.10 3.10 3.22 3.27 3.33 3.36 3.38 3.47 2 3.02 3.03 3.10 3.10 3.10 3.10 3.22 3.27 3.33 3.36 3.38 3.47 3.50 In4 In3 In2 In1 In0 InY time 1 3.50 3.56 3.61 4.28 4.60 5.45 00:00 2 3.56 3.61 4.28 4.60 4.68 6.20 00:05 The course of the incident is depicted in Figure 2. Temporal instants are plotted on the horizontal axis and the real 2-hour mean SO 2 concentration that we seek to predict (Y = X t+6 ) is represented by a grey line.The predictions obtained by applying the different models are shown in the same figure.The code, both for the predictions as for the plot, is shown in the Appendix.
The prediction obtained with the inclusion of just one variable in the model, X t , is far from the optimum.However, the addition of one more variable, X t−2 , resulted in a remarkable increase in the model predictive capability.It makes possible for predictions close to real values to be obtained.Lastly, it can be seen that the incorporation of one more variable or temporal instant (X t−1 ) in the model does not produce any improvement in pollution-incident prediction.Numerically speaking, the same results can be observed by taking into account the Mean Square Error for each model (Table 3).The question that now arises is what is the minimum number of variables that must be used in order to obtain the best prediction.It is possible to deduce that there is an optimal intermediate point between the number of variables that enters the model (preferably low) and the deviance value (preferably also low).To find this point, the test described in the previous section for the null hypothesis H 0 (q) is applied for each size, q (through the input command shown below).The procedure stops when a certain null hypothesis is accepted.The most computationally demanding parts are those that involve the bootstrap and the cross-validation techniques.This can be parallelized using the argument cluster = TRUE (default).This should considerably increase the performance on multi-core / multi-threading machines.
> test(x, y, nboot = 100, method = "gam", bootseed = 0413) [1] "Processing IC bootstrap for H_0 ( 1 ).. The deduction to be drawn is that, for a 5% significance level, the null hypothesis is rejected with q = 1 and accepted thereafter.From these results, it can be concluded that the best temporal instants for prediction of an emission episode would be X t and X t−2 .
The R Journal Vol.8/1, Aug. 2016 ISSN 2073-4859 Lastly, as we mention before, there are other alternatives for variable selection in additive models.One of the best-known and used procedures is the argument select of the gam function from the mgcv package (Marra and Wood, 2011).To illustrate and compare its usage with our procedure, we have estimated the model in ( 5) by means of the cited function using the pollution data.Then, its performance was evaluated using again the episode data.The prediction obtained using this double penalty GAM is far from what it should be (see Figure 3), actually, the mean square error obtained (5 024.29) is the worst of all so far (see the code in Appendix).It seems that, in situations with a large number of variables, the selection of the best subset could be a better approach.

Conclusions
This paper discusses implementation in R of a new algorithm for the problem of variable selection in a regression framework.The FWDselect package provides R users a simple method for ascertaining the relevant variables for prediction purposes and how many of these should be included in the model.The proposed method is a new forward stepwise-based selection procedure that selects a model containing a subset of variables according to an information criterion, and also takes into account the computational cost.Bootstrap techniques have been used to determine the minimum number of variables needed to obtain an appropriate prediction.
In some situations, several statistically equivalent optimal models of size q may exist.In such cases, FWDselect allows the user to visualise those models and select the most interesting one.This is obtained with the argument seconds = TRUE of the selection functions.In addition, the software provides the user with a way of easily obtaining the best subset of variables using different types of data in different frameworks, by applying the lm, glm and gam functions already implemented in R. The use of these classical R functions nevertheless entails a high computational cost.Hence, a further interesting extension would be the implementation of this package using C, C++ or Fortran as the programming language.R users could profit from this advantage in a future version of this package.
Insofar as the validity of the method is concerned, we think that the results obtained with simulated data are correct, and the results with the diabetes data are in accordance with other methodologies.This suggest that the behavior of the procedure in a nonparametric framework will be also adequate.
The results in this paper were obtained using R 3.2.0.The FWDselect package (Sestelo et al., 2015) is available from the Comprehensive R Archive Network at the URL http://cran.r-project.org/web/packages/FWDselect/.The regsubsets function is based on all subsets or, in other words, exhaustive variable selection.The method identifies the best subsets of linear predictors using a branch-and-bound algorithm (Miller, 2002).Since this function returns separate best models of all sizes, we consider only the results obtained for a subset of size two.In this case, the procedure works properly returning the X 1 and X 5 variables as the best subset of size two.The model-selection oriented function is a widely used methodology for jointly determining the number and choice of variables.In this case, this procedure fails returning a model which includes the effects of four covariates (X 1 , X 4 , X 5 and X 9 ).The results obtained with the glmulti package, another option for model selection, are also mistaken.The procedure returns the same model obtained with the previous method (step).Finally, in order to ascertain the performance of FWDselect, we firstly apply the test function with the purpose of determine the number of variable that have to be included in the model.Then, once this number is obtained (saved in the returned list as $nvar), the selection function determines correctly the X 1 and X 5 variables.
According to the computation time of these four methods, the fastest procedure is the implemented in the leaps package taking only 0.001 secs.The second one is the step function which runs in 0.037 secs.The next one is the glmulti function which takes 3.149 secs.Lastly, the most computationally demanding code is the implemented in the FWDselect package which requires 9.181 secs.All the results have been obtained using the R's system.timecommand on a 2.4 GHz Intel Core i5, with 4 cores and 4 Gb of RAM.
The previous results have been obtained using simulated data with a linear effect of the covariates.However, in practice, the user does not know the dependence structure, i. e., how the response variable depends on the covariates.With this in mind, we have considered and applied again the four procedures on another scenario where the response variable depends again on the same two covariates, but in this case, the effect of them is nonlinear.Particularly, the Y is now generated in accordance with Y = 2 (X 1 ) 2 + 2 sin (2πX 5 ) + ε, being both ε and the explanatory covariates the same of the previous scenario.Note that we have now a nonlinear scenario in which the response variable depends only on two covariates.> y <-2 * x[, 1]**2 + 2 * sin(2 * pi * x[, 5]) + e > data <-data.frame(x,y) > res1 <-regsubsets(x, y) > summary(res1)$outmat[2, ] a b c d e f g h i j " " " " " " " " "*" "*" " " " " " " " " > res2 <-step(lm(y ~., data = data), trace = 0) > res2 In this case, the performance of the methods changes.Excluding the FWDselect, all the procedures fail to select the correct model.The leaps package returns the X 5 and X 6 variables whereas the others two packages only retrieve the effect of X 5 .
The results presented in this appendix have been obtained with one simulated sample of n = 100.In order to evaluate the real performance of the methods, a simulation study using five hundred independent samples with different sample sizes (n = 50, 100, 200) was carried out.Focusing on the linear scenario, the leaps and FWDselect packages work well with 100% and close to 95% of successes, respectively (for any sample sizes).The success rate for the other two packages is around 22%.Note that the results of leaps have been obtained assuming a subset of size two, and thus providing an advantage to this method over the others.In relation with the nonlinear scenario, the proportion of failures is very high for all procedures excepting FWDselect.The latter performs correctly close to 30% of the times for the smallest sample size, around 63% for n = 100 while it reaches 91.6% of successes for n = 200.

Figure 1 :
Figure 1: For each subset of size q, cross-validation deviance obtained by the best model for the pollution data.

TheFigure 2 :
Figure 2: Example of an SO 2 pollution incident that occurred on 4 July 2003.Temporal instants are shown on the horizontal axis.The grey line represents the known response of SO 2 levels in µg/m 3 N. Estimation of SO 2 levels with one, two and three covariates are represented by circles, squares and diamonds respectively.

Figure 3 :
Figure 3: Example of an SO 2 pollution incident that occurred on 4 July 2003.Temporal instants are shown on the horizontal axis.The grey line represents the known response of SO 2 levels in µg/m 3 N. Estimation of SO 2 levels obtained by means of the double penalty GAM is represented by circles.

Table 1 :
Summary of functions in the FWDselect package.

Table 2 :
Arguments of selection, qselection and test functions.

Table 3 :
Mean Square Error of the selected models.