In multiple regression models, when there is a large number (\(p\)) of explanatory variables which may or may not be relevant for predicting the response, it is useful to be able to reduce the model. To this end, it is necessary to determine the best subset of \(q\) (\(q\le p\)) predictors which will establish the model with the best prediction capacity. The FWDselect package introduces a new forward stepwise-based selection procedure to select the best model in different regression frameworks (parametric or nonparametric). The developed methodology, which can be equally applied to linear models, generalized linear models or generalized additive models, aims to provide solutions to two topics: i) selection of the best combination of \(q\) variables by using a step-by-step method; and, perhaps most importantly, ii) determination of the number of covariates to be included in the model based on bootstrap resampling techniques. The software is illustrated using real and simulated data.
In a multivariate regression framework, the target response \(Y\) can depend on a set of \(p\) initial covariates \(X_1, X_2, \ldots, X_p\) but in practical situations we often would like to determine which covariates are “relevant” to describe this response.
The question of how to choose a subset of predictors of size \(q\) (\(q\le p\)) has not yet been satisfactorily solved. This problem is particularly important for large \(p\) and/or when there are redundant predictors. As a general rule, an increase in the number of variables included in a model provides an “apparently” better fit of the observed data; however, such estimates are not always satisfactory, for different reasons. On the one hand, the inclusion of irrelevant variables would increase the variance of the estimates, resulting in a partial loss of the predictive capability of the model. On the other hand, the inclusion of too many variables may lead to unnecessary complexity in the resulting model, making its interpretation difficult.
Model selection (and variable selection in regression, in particular) is a trade-off between bias and variance. Inference based on models with few variables can be biased; however, models that take too many variables into account may result in a lack of precision or false effects. These considerations call for a balance between under- and over-fitted models, the so-called model-selection problem (Forster 2000).
To solve this problem, several strategies have been proposed. One common option is to use iterative procedures, such as the leaps and bounds algorithm (Furnival and Wilson 1974), through which the best subset selection is obtained. This is a full information criteria-based approach, which compares all possible models and ranks them (Calcagno and Mazancourt 2010). Nevertheless, the problem of selecting the best model from among all possible combinations of \(p\) predictors is not trivial. In the presence of a large number of variables this selection procedure may require an excessively high computational cost and thus, in some cases, the problem becomes intractable. In order to relax this exhaustive search, heuristic iterative procedures such as forward- and backward-stepwise selection (Hocking 1976) have been developed. These greedy algorithms produce a nested sequence of models based on the use of some information criterion which compares the models obtained in the course of the simplification or complexification scheme. Several criteria have been used for this purpose (Venables and Ripley 1997; Miller 2002), including Mallows' \(C_p\) (Mallows 1973) and the Akaike information criterion (AIC) (Akaike 1973). Apart from the iterative procedures, other strategies applied in the variable selection problem are, e.g., shrinkage regression methods —such as ridge regression or the Lasso (least absolute shrinkage and selection operator) (Tibshirani 1996; Hastie et al. 2003)— or the Bayesian approach (Green 1995; Kuo and Mallick 1998; Park and Casella 2008; Hans 2009).
Several R packages have been developed to carry out automatic variable
selection or model selection. For instance, the
meifly package (Wickham 2014)
can be used to search through all the different models. This type of
exhaustive search can be also addressed using some other algorithm, such
as the branch-and-bound algorithm in the
leaps package (Lumley and Miller 2009) or
the leaps-and-bounds algorithm in the
subselect package
(Orestes Cerdeira et al. 2015). Both packages also implement other selection methods
(heuristics). The leaps
package includes the forward or backward stepwise, or sequential
replacement while the
subselect package
provides a simulated annealing-type search algorithm, a genetic
algorithm, and a restricted local improvement algorithm. To use the
Lasso method, the user can apply, for example, the lars
function
implemented in the lars
package (Hastie and Efron 2013) or the glmnet
function, which fits a
generalized linear model via penalized maximum likelihood, implemented
in the glmnet package
(Friedman et al. 2015). Additionally, another procedure used by the R
community seems to be the model-selection-oriented step
function
(Hastie and Pregibon 1992) built into the stats package. When it comes to model
selection with generalized linear models, one option could be to use the
glmulti package
(Calcagno 2013) or
the bestglm package
(Mcleod and Xu 2014). Finally, within the class of generalized
additive models, other algorithms have also been introduced to achieve
component selection, see Lin and Zhang (2006) and references therein, the boosting
technique of Tutz and Binder (2006) or the generalization of the approach of
Belitz and Lang (2008). More recently, and widely applied by R users, the gam
function of the mgcv
package (Wood 2006, 2011)
includes an argument (select = TRUE) for model selection and fitting
in a single step by adding a second penalty term in the estimation
scheme of continuous covariates (Marra and Wood 2011).
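As a point of reference for the approaches just mentioned, the following minimal sketch (toy data, not taken from the article) applies two of them: an AIC-based stepwise search with step from the stats package and the Lasso via cv.glmnet from the glmnet package.

```r
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 10), ncol = 10,
            dimnames = list(NULL, paste0("X", 1:10)))
y <- 2 * x[, 1] + 4 * x[, 5] + rnorm(n)
dat <- data.frame(x, y)

# AIC-based stepwise search with step() from the stats package
fit_step <- step(lm(y ~ ., data = dat), trace = FALSE)
formula(fit_step)

# Lasso path, with the penalty chosen by cross-validation (glmnet)
library(glmnet)
fit_lasso <- cv.glmnet(x, y)
coef(fit_lasso, s = "lambda.min")
```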
The FWDselect package introduces an alternative to existing approaches in the form of a methodology whereby R users can select the best variables in different regression contexts, both parametric and nonparametric. Unlike other existing packages, the procedure implemented in it can be equally applied to linear models, generalized linear models and/or generalized additive models, with Gaussian, binary or Poisson response. The proposed forward selection algorithm is based on a greedy procedure which changes one variable at a time in the model — keeping the others fixed — and does this repeatedly until none of the selected variables can be exchanged to improve the model fit. This is a greedy algorithm, which may not find the actual best solution, but it is less greedy than other methods such as step. In addition, in contrast with other packages in which the users must decide — either before or after selection — the number of variables to be included, the bootstrap test introduced in our procedure allows them to determine this number at a given significance level.
The remainder of this paper is organised as follows. First we describe the algorithm used to select the best subset of size \(q\), along with the bootstrap techniques that are used to determine the number of variables to be included in the model. Then a detailed description of the package is presented, and its usage is illustrated through the analysis of three data sets. Finally, the last section contains the main conclusions of this work.
This section introduces the developed methodology and gives a description of the variable selection algorithm. The implemented package can be used with Gaussian, binary or Poisson responses; however, motivated by the application data, we explain the procedure for a nonparametric regression model with a Gaussian response.
Let \(\textbf{X} = (X_1, X_2, \ldots, X_p)\) be a vector of \(p\) initial variables and \(Y\) the response. An additive regression model can be expressed as
\[\label{eq:chapter4gam} Y=m\left(\textbf{X}\right)+ \varepsilon, \tag{1} \]
where
\[m \left( \textbf{X} \right)= \alpha+m_{1}\left(X_{1}\right)+m_{2}\left(X_{2}\right)+\cdots+m_{p}\left(X_{p}\right),\]
where \(m_j (j=1, \ldots, p)\) are smooth and unknown functions and \(\varepsilon\) is the zero-mean error. Additionally, to guarantee the identification of the above model, a constant \(\alpha\) is introduced in the model and it is required that the partial functions satisfy
\[E \left [ m_j \left( X_j \right) \right ] = 0, \quad j=1,\ldots,p.\]
This implies that \(E \left[ Y \right] = \alpha\).
To date, several approaches to estimating the model in ((1)) have been suggested in the statistical literature, e.g., Buja et al. (1989), Härdle and Hall (1993), Mammen et al. (1999). In this package penalized regression splines, as implemented in the mgcv package, are used.
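As a brief illustration of this kind of fit, the following minimal sketch (toy data, mgcv's default thin-plate regression splines) estimates an additive model such as (1) with penalized regression splines:

```r
library(mgcv)
set.seed(1)
n  <- 200
x1 <- runif(n, -1, 1); x2 <- runif(n, -1, 1); x3 <- runif(n, -1, 1)
y  <- 1 + x1^2 + sin(2 * pi * x2) + rnorm(n, sd = 0.3)  # x3 is irrelevant

fit <- gam(y ~ s(x1) + s(x2) + s(x3))   # penalized regression splines
summary(fit)                            # approximate significance of each smooth
plot(fit, pages = 1)                    # estimated partial functions (centered)
```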
It is important to highlight that, in situations involving a large number of variables, correct estimation of the response will be obtained on the basis of selecting the appropriate predictors. In the case that we have a priori information about which of the initial set of variables are relevant, it would be possible to apply a likelihood ratio test (Neyman and Pearson 1928) or an F-type test (Seber and Wild 1989; Seber 1997) in a parametric framework, or a generalized likelihood ratio test (Fan et al. 2001; Fan and Jiang 2005, 2007) in a nonparametric one. However, in situations where we do not have information in advance, it will be necessary to select the model according to a selection algorithm.
According to this, we propose a procedure that addresses two topics: i) selecting the best combination of \(q\) variables by using a new forward stepwise-based selection procedure; and ii) determining the minimum number of covariates to be included in the model. Both topics are explained below.
The first topic of our procedure is, given a number \(q\) \(\left(q \le p \right)\), to select the best combination of \(q\) variables. For this purpose, one option is to use a complete subset selection method, as in Roca-Pardiñas et al. (2009), which requires all possible models to be considered. When confronted with a large number of variables, however, the computational cost of the procedure can be very high or even prohibitive. In view of this, we provide a new method that speeds up the process, based on a heuristic search which aims to approximate the optimal solution. There is, however, no guarantee that the procedure finds the best subset of covariates — this could only be achieved by searching through all the possible subsets — but it has the advantage of requiring a smaller number of computations to reach the optimal solution or, at least, one close to it.
Let \(X_{j_1}\), …, \(X_{j_k}\) be a subset of variables of size \(k\) (\(k \le q\)). We define \(IC_{j_{1} ,\ldots ,j_{k}}\) as one possible information criterion (such as AIC, deviance, residual variance, etc.) of the nonparametric model
\[\label{eq:chapter4model1} Y=\alpha + m_{j_1} \left(X_{j_1}\right)+m_{j_2} \left(X_{j_2}\right)+\ldots+m_{j_k}\left(X_{j_k} \right)+\varepsilon', \tag{2} \]
where \(\varepsilon'\) is the zero-mean error. Based on this information criterion, \(IC\), the proposed automatic forward stepwise selection method is given in Algorithm 1. Note that any criterion can be used without correcting it to take account of the number of variables. This is possible because the models being compared always have the same number of variables.
Algorithm 1 Modified forward stepwise selection method
1. Given a number \(q\), select the \(q\) variables \(X_{l_1}, \ldots, X_{l_q}\) which minimise the following expression
\[\label{eq:suboptimo} \left (l_1, l_2, \ldots, l_q\right) = \displaystyle\mathop{\mathrm{arg\,min}}_{\substack{ \left(j_{1}, \ldots, j_{q}\right)\\ 1\le j_{1}\le \cdots \le j_{q} \le p}} IC_{j_{1} ,\ldots ,j_{q} }. \tag{3} \]
2. The elements of the vector of indices \((l_1, l_2, \ldots, l_q)\) are selected consecutively in the following manner:
Firstly, determine the variable of the first position \(X_{l_1}\) where
\[l_1= \displaystyle\mathop{\mathrm{arg\,min}}_{\substack{ j_1 \\1\le j_{1} \le p }} \hspace{0.1cm}IC_{j_1}.\]
Note that all possible models of one variable must be estimated.
Fix the first variable obtained previously, \(X_{l_1}\) , and obtain the second one, \(X_{l_2}\), with
\[l_2= \displaystyle\mathop{\mathrm{arg\,min}}_{\substack{ j_{2}\\1\le j_{2} \le p, \hspace{0.1cm} j_{2}\neq l_1}} \hspace{0.1cm} IC_{l_1,j_2}.\]
Fix \(X_{l_1}\) and \(X_{l_2}\), and obtain the third one, \(X_{l_3}\), where
\[l_3= \displaystyle\mathop{\mathrm{arg\,min}}_{\substack{j_{3} \\ 1\le j_{3} \le p, \hspace{0.1cm} j_{3}\notin \left\{ l_1,l_2\right\}}} \hspace{0.1cm} IC_{l_1,l_2,j_3}.\]
Fix \(X_{l_1}, X_{l_2}, \ldots, X_{l_{q-1}}\), and repeat the procedure analogously until the \(q\)-th variable, \(X_{l_q}\), with
\[l_q= \displaystyle\mathop{\mathrm{arg\,min}}_{\substack{j_{q}\\ 1\le j_{q} \le p, \hspace{0.1cm} j_{q}\notin \left\{ l_1,\ldots,l_{q-1}\right\}}} \hspace{0.0cm} IC_{l_1,\ldots,j_q}.\]
3. Once variables \(X_{l_1}, X_{l_2}, \ldots, X_{l_q}\) have been selected, run through positions \(j=1, \ldots, q\) and replace each \(l_j\) element as follows, only if the obtained \(IC\) is less than the minimum criterion obtained with the previous \(l_j\),
\[l_j= \displaystyle\mathop{\mathrm{arg\,min}}_{\substack{ j_j\\ j_j \notin \left\{ l_1,\ldots,l_{j-1}, l_{j+1},\ldots,l_q \right\}}} \hspace{0.0cm} IC_{l_1,\ldots,l_{j-1},j_j, l_{j+1},\ldots,l_q}.\]
4. Step 3 is repeated until there is no change in the selected covariates, i.e., the algorithm stops when it has gone through a complete cycle without changing any of the \(q\) positions.
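For concreteness, the sketch below is an illustrative R implementation of Algorithm 1 (it is not the package's internal code). Here the criterion \(IC\) is simply the residual sum of squares of a linear fit, but any criterion computed from the selected columns could be plugged in instead.

```r
# Illustrative information criterion: residual sum of squares of a linear fit
ic <- function(vars, x, y) {
  dat <- data.frame(x[, vars, drop = FALSE], y = y)
  fit <- lm(y ~ ., data = dat)
  sum(residuals(fit)^2)
}

fwd_select <- function(x, y, q, ic) {
  p <- ncol(x)
  sel <- integer(0)
  # Steps 1-2: add, one at a time, the variable that minimises the criterion
  for (k in seq_len(q)) {
    cand <- setdiff(seq_len(p), sel)
    ics  <- sapply(cand, function(j) ic(c(sel, j), x, y))
    sel  <- c(sel, cand[which.min(ics)])
  }
  # Steps 3-4: cycle through the q positions, exchanging a variable only when
  # the exchange lowers the criterion; stop after a full cycle with no change
  repeat {
    changed <- FALSE
    for (pos in seq_len(q)) {
      cand <- setdiff(seq_len(p), sel[-pos])
      ics  <- sapply(cand, function(j) { v <- sel; v[pos] <- j; ic(v, x, y) })
      best <- cand[which.min(ics)]
      if (best != sel[pos]) { sel[pos] <- best; changed <- TRUE }
    }
    if (!changed) break
  }
  sel
}

# Example with toy data: the procedure should typically recover columns 1 and 5
set.seed(1)
x <- matrix(runif(100 * 10, -1, 1), ncol = 10)
y <- 2 * x[, 1] + 4 * x[, 5] + rnorm(100)
fwd_select(x, y, q = 2, ic)
```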
In the previous step, the best subset of \(q\) variables was selected according to an information criterion. However, the question that arises in this procedure is how to determine the optimal number \(q\). Thus, the second topic of our methodology is to decide the number of covariates that should be included in the model, i.e., to determine the number of significant variables.
Accordingly, we propose a procedure to test the null hypothesis of \(q\) significant variables in the model versus the alternative in which the model contains more than \(q\) variables. Based on the additive model
\[Y=m\left(\textbf{X}\right)+\varepsilon \quad {\text{where}} \quad m \left( \textbf{X} \right)= \alpha+m_{1}\left(X_{1}\right)+m_{2}\left(X_{2}\right)+\cdots+m_{p}\left(X_{p}\right),\]
the following strategy is considered: for a subset of size \(q\), considerations will be given to test the null hypothesis
\[H_{0} \left(q\right): \sum_{j=1}^p I_{ \left\{m_j \ne 0 \right\} } \le q\]
versus the general hypothesis
\[H_{1} \left(q\right): \sum_{j=1}^p I_{\left\{m_j \ne 0\right\}} > q,\]
where \(I\) is the indicator function and considering that the \(m_j\)’s are not equal to zero on a set of positive probability.
Given an i.i.d. sample \(\left\{ \left (\textbf{X}_{i},Y_{i}\right) \right\}_{i=1}^{n}\), with \(\textbf{X}=\left(X_1,\ldots,X_p\right)\), we propose the following strategy to test the above null hypothesis:
Obtain the best subset of \(q\) predictor variables. To this end we use the selection algorithm described in Algorithm 1. Without loss of generality, we assume that the \(q\) selected variables occupy the first \(q\) positions of the \(\textbf{X}\) vector.
Obtain the nonparametric estimates of the null model as
\[\hat{m}_0\left(\textbf{X}_i\right)=\alpha + \hat{m}_{1}\left(X_{i1}\right)+\ldots+\hat{m}_{q}\left(X_{iq}\right).\]
Compute the residuals as \(r_i=Y_i -\hat{m}_0 \left( \textbf{X}_i \right)\) and obtain the nonparametric estimates of \(g\left(\textbf{X}_i \right)\) according to the model
\[\label{eq:eqres} r_i=g\left(\textbf{X}_i\right)+\varepsilon, \tag{4} \]
where \(g\) is an unknown and smooth function which is applied to a single covariate. This covariate is chosen from \(X_{q+1}, \ldots, X_{p}\) by applying the selection algorithm described in Algorithm 1. Without loss of generality, we assume that \(g\left(\textbf{X}\right)=\alpha+g_{q+1}\left(X_{q+1}\right)\).
The purpose of this step is to assess whether there is enough structure left in the residuals that could be modeled by the predictors not included in the null model. Note that, within these possible predictors, we select only one of them in order to reduce the computational cost of the algorithm. However, the ideal solution would be to estimate the model in ((4)) by determining the best subset of predictors among the remaining variables, instead of selecting only one of them.
Both options are implemented in the package by means of the speedup argument of the test function. If speedup = TRUE is specified, a single predictor for the residuals is used. If speedup = FALSE is specified, the user can choose more than one predictor. With this latter option, when the number of variables is large, selecting the best subset of predictors for the residuals entails a high computational burden. Therefore, in practice, we propose a solution through the qmin argument, which must be filled in by the user. This argument corresponds to the size of the best subset of predictors. In order to help the user select it, it is recommended to visualize the graphical output of the plot function and to choose the number \(q\) which minimizes the curve.
Finally, we propose the following test statistic, based on the estimations of \(g\)
\[T=\sum_{i=1}^n \left | \hat{g} \left( \textbf{X}_i \right) \right |.\]
It is important to stress that, if the null hypothesis holds, \(T\) should be close to zero. Thus, the test rule for checking \(H_0(q)\) with a significance level of \(\alpha\) is that the null hypothesis is rejected if \(T\) is larger than its \((1-\alpha)\)-percentile. To approximate the distribution of the test statistic, resampling methods such as the bootstrap, introduced by Efron (1979; see also Efron and Tibshirani 1993; Härdle and Mammen 1993; Kauermann and Opsomer 2003), can be applied. Here we use the wild bootstrap (Wu 1986; Liu 1988; Mammen 1993) because this method is valid both for homoscedastic and heteroscedastic models where the variance of the error is a function of the covariate. The testing procedure consists of the following steps:
Step 1: Obtain \(T\) from the sample data, as explained above.
Step 2: For \(i=1, \ldots, n\), obtain \(\hat{m}_0 \left(\textbf{X}_i\right)\) and the bootstrap residuals as
\[\varepsilon_i^{\bullet b} = \hat \varepsilon_i V_{i}\]
where \(\hat \varepsilon_i=Y_i- \hat m_0\left(\textbf{X}_i\right)\) are the residuals of the null model and \(V_1, \ldots, V_n\) are i.i.d. random variables with masses \(\frac{5+\sqrt{5}}{10}\) and \(\frac{5-\sqrt{5}}{10}\) at the points \(\frac{1-\sqrt{5}}{2}\) and \(\frac{1+\sqrt{5}}{2}\), respectively. Note that this distribution satisfies \(E\left(V_i\right)=0\) and \(E\left(V_i^2\right)=E\left(V_i^3\right)=1\).
Step 3: For \(b = 1, \ldots, B\), simulate the bootstrap sample \(\left\{\textbf{X}_{i},Y_{i}^{\bullet b}\right\}_{i=1}^{n}\) with \(Y_{i}^{\bullet b}=\hat{m}_0\left(\textbf{X}_i\right)+\varepsilon_i^{\bullet b}\), and compute the bootstrap estimates of \(T^{\bullet b}\).
The test rule based on \(T\) is given by rejecting the null hypothesis if \(T>T^{1-\alpha}\), where \(T^{1-\alpha}\) is the empirical \(\left(1-\alpha\right)\)-percentile of values \(T^{\bullet b}\) (\(b=1,\ldots,B\)).
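To make the resampling scheme concrete, the following illustrative sketch (not the package's internal code) carries out the wild bootstrap test for the simplest situation, in which the null model contains a single covariate x1 and x2 is the only candidate left-out predictor.

```r
library(mgcv)
set.seed(1)
n <- 200; B <- 200
x1 <- runif(n, -1, 1); x2 <- runif(n, -1, 1)
y  <- sin(2 * pi * x1) + 0.5 * x2 + rnorm(n, sd = 0.3)

null_fit <- gam(y ~ s(x1))                 # estimate of the null model m_0
res      <- residuals(null_fit)
g_fit    <- gam(res ~ s(x2))               # estimate of g from the residuals
T_obs    <- sum(abs(fitted(g_fit)))        # test statistic T

# Two-point ("golden section") wild bootstrap weights V_i
p1 <- (5 + sqrt(5)) / 10
v  <- c((1 - sqrt(5)) / 2, (1 + sqrt(5)) / 2)

T_boot <- replicate(B, {
  ystar <- fitted(null_fit) + res * sample(v, n, replace = TRUE, prob = c(p1, 1 - p1))
  rstar <- ystar - fitted(gam(ystar ~ s(x1)))   # residuals of the refitted null model
  sum(abs(fitted(gam(rstar ~ s(x2)))))          # bootstrap statistic T^{*b}
})
mean(T_boot >= T_obs)                           # bootstrap p-value for H_0(1)
```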
Applying this test to \(q=1,\ldots,p-1\) can be an important part of a covariate selection procedure. If \(H_0\left(q\right)\) is not rejected, only the subset of covariates \(X_{j_1},\ldots,X_{j_q}\) is retained, and the remaining variables are eliminated from the model. Otherwise, the test is repeated with \(q+1\) variables until the null hypothesis is not rejected. For example, if \(H_0\left(1\right)\) is not rejected, just one variable should be included in the model. If this hypothesis is rejected, it is necessary to test \(H_0\left(2\right)\). If this new hypothesis is again rejected, \(H_0\left(3\right)\) should be tested, and so on, until a certain \(H_0\left(q\right)\) is accepted.
The validation of the approach relying on the bootstrap-based test can be consulted in Sestelo (2013), where type I error and power have been calculated for different test statistics. Also, the performance of the test for different levels of correlation between covariates has been analyzed. All the test statistics perform reasonably well, with the level coming relatively close to the nominal size and the probability of rejection rising as we move away from the null hypothesis, especially with large sample sizes. Furthermore, several simulation studies have been considered in order to compare the methodology proposed in this paper with other procedures reported in the literature that carry out automatic variable selection.
This section introduces an overview of how the package is structured. FWDselect is a shortcut for “Forward selection” and this is its major functionality: to provide a forward stepwise-based selection procedure. This software helps the user select relevant variables and evaluate how many of these need to be included in a regression model. In addition, it enables both numerical and graphical outputs to be displayed.
Our package includes several functions that enable users to select the variables to be included in linear models, generalized linear models or generalized additive models. The functions within FWDselect are briefly described in Table 1.
Users can obtain the best combinations of \(q\) variables by means of the
main function, selection. Additionally, if one wants to obtain
the results for more than one subset size, it is possible to apply the
qselection
function, which returns a summary table showing the
different subsets, selected variables and information criterion values.
These values are obtained by cross-validation in order to correctly compare the
resulting models, which include different numbers of variables. The
object obtained with this last function is the
argument required for plot
, which provides a graphical output.
Finally, to determine the number of variables that should be introduced
in the model, only the test
function needs to be applied. Table
2 provides a summary of the arguments of the
selection
, qselection
and test
functions. The most computationally
demanding parts of the code, namely those that involve the estimation of
the models, the cross-validation and the bootstrap, have been
parallelized by means of the parallel package via forking on
Unix-alike platforms or creating a PSOCK cluster on Windows systems.
Table 1: Summary of the functions included in the FWDselect package.

Function | Description |
---|---|
selection | Main function for selecting a subset of \(q\) variables. Note that the selection procedure can be used with lm, glm or gam functions. |
print.selection | Method of the generic print function for "selection" objects, which returns a short summary. |
qselection | Function that enables users to obtain the selected variables for more than one size of subset. |
print.qselection | Method of the generic print function for "qselection" objects. Returns a table showing the chosen covariates to be introduced into the models and their information criteria obtained by cross-validation. |
plot.qselection | Visualisation of "qselection" objects. Plots the cross-validation information criterion for several subsets with size \(q\) chosen by users. |
test | Function that applies a bootstrap-based test for covariate selection. Helps determine the precise number of variables to be included in the model. |
Table 2: Summary of the arguments of the selection, qselection and test functions.

selection() arguments | |
---|---|
x | A data frame containing all the covariates. |
y | A vector with the response values. |
q | An integer specifying the size of the subset of variables to be selected. |
prevar | A vector containing the numbers of the \(q-1\) variables selected in the previous step. By default it is NULL. |
criterion | The information criterion to be used. Default is the "deviance". Other criteria provided are the coefficient of determination ("R2"), the residual variance ("variance"), the Akaike information criterion ("aic"), AIC with a correction for finite sample sizes ("aicc") and the Bayesian information criterion ("bic"). The deviance, coefficient of determination and variance are calculated by cross-validation. |
method | A character string specifying which regression method is used, "lm", "glm" or "gam". |
family | A family object specifying the distribution and link to use in fitting. |
seconds | A logical value. If TRUE then, rather than returning the single best model only, the function returns a few of the best models. |
nmodels | Number of secondary models to be returned. |
nfolds | Number of folds for the cross-validation procedure, for the deviance, R2 or variance criterion. |
cluster | A logical value. If TRUE (default) the code is parallelized. Note that in some cases with few repetitions (e.g., a low number of initial variables) R will perform better with serial computation: distributing the tasks across the processors and binding the results together afterwards takes time, so if this overhead exceeds the time needed for single-thread computing it could be better not to parallelize. |
ncores | An integer value specifying the number of cores to be used in the parallelized procedure. If NULL, the number of cores used is equal to the number of cores of the machine minus 1. |

qselection() arguments | |
---|---|
x | A data frame containing all the covariates. |
y | A vector with the response values. |
qvector | A vector with more than one variable-subset size to be selected. |
criterion | The information criterion to be used. Default is the "deviance". Other criteria provided are the coefficient of determination ("R2"), the residual variance ("variance"), the Akaike information criterion ("aic"), AIC with a correction for finite sample sizes ("aicc") and the Bayesian information criterion ("bic"). The deviance, coefficient of determination and variance are calculated by cross-validation. |
method | A character string specifying which regression method is used, "lm", "glm" or "gam". |
family | A family object specifying the distribution and link to use in fitting. |
nfolds | Number of folds for the cross-validation procedure, for the deviance, R2 or variance criterion. |
cluster | A logical value. If TRUE (default) the code is parallelized. |
ncores | An integer value specifying the number of cores to be used in the parallelized procedure. If NULL, the number of cores used is equal to the number of cores of the machine minus 1. |

test() arguments | |
---|---|
x | A data frame containing all the covariates. |
y | A vector with the response values. |
method | A character string specifying which regression method is used, "lm", "glm" or "gam". |
family | A family object specifying the distribution and link to use in fitting. |
nboot | Number of bootstrap repeats. |
speedup | A logical value. If TRUE (default), the testing procedure is computationally efficient since it considers one more variable to fit the alternative model than the number of variables used to fit the null. If FALSE, the fit of the alternative model is based on the best subset of variables of size greater than \(q\), the one that minimizes an information criterion. The size of this subset must be given by the user through the argument qmin. |
qmin | By default NULL. If speedup is FALSE, qmin is an integer number selected by the user. To help with the selection of this argument, it is recommended to visualize the graphical output of the plot function and choose the number \(q\) which minimizes the curve. |
unique | A logical value. By default FALSE. If TRUE, the test is performed only for one null hypothesis, given by the argument q. |
q | By default NULL. If unique is TRUE, q is the integer number \(q\) of \(H_0(q)\) to be tested. |
bootseed | Seed to be used in the bootstrap procedure. |
cluster | A logical value. If TRUE (default), the testing procedure is parallelized. |
ncores | An integer value specifying the number of cores to be used in the parallelized procedure. If NULL, the number of cores used is equal to the number of cores of the machine minus 1. |
In this section we illustrate the use of the FWDselect package using a real data set, the pollution data (included in the package). The software is applied to the prediction of atmospheric SO\(_2\) pollution incidents by means of additive models. Combustion of fuel oil or coal releases sulphur dioxide into the atmosphere in different quantities. Current Spanish legislation governing environmental pollution controls the vicinity of potential point sources of pollution, such as coal-fired power stations. It places a limit on the mean of 24 successive determinations of SO\(_2\) concentration taken at 5-minute intervals. An emission episode is said to occur when the series of bi-hourly means of SO\(_2\) is greater than a specific level, \(r\). In this framework, it is of interest for a plant, both economically and environmentally, to be able to predict when the legal limit will be exceeded with sufficient time for effective countermeasures to be taken.
In previous studies (Prada-Sánchez and Febrero-Bande 1997; Prada-Sánchez et al. 2000; Roca-Pardiñas et al. 2004; García-Jurado et al.), semiparametric, partially linear models and generalized additive models with unknown link functions were applied to the prediction of atmospheric SO\(_2\) pollution incidents in the vicinity of a coal/oil-fired power station. Here, we present a new approach to this problem, whereby we try to predict a new emission episode, focusing our attention on the importance of ascertaining the best combinations of time instants for the purpose of obtaining the best prediction. Bearing this in mind, the selection of the optimal subset of variables could be a good approach to this issue.
Let \(t\) be the present time, and \(X_t\) the value obtained by the series of bi-hourly means for SO\(_2\) at instant \(t\) (5-minute temporal instants). Setting \(r=150\) \(\mu\)g/m\(^3\)N as the maximum value permitted for the SO\(_2\) concentration, and half an hour (6 instants) as the prediction horizon, it is of interest to predict \(Y=X_{t+6}\) using the best subset of the vector \(X_l=\left(X_t,X_{t-1},X_{t-2},\ldots,X_{t-17}\right)\). Note that one of the problems that arises is to decide which temporal instants \(\left(X_{t},X_{t-1},X_{t-2},\ldots,X_{t-17}\right)\) are relevant for prediction purposes, since inclusion of all the instants in \(X_l\) may well degrade the overall performance of the prediction model. Based on this, we demonstrate the package's capabilities using these data. An excerpt of the data frame included in the package is shown below:
> library(FWDselect)
> data(pollution)
> head(pollution)[1:2, ]
   In17  In16  In15  In14  In13  In12  In11  In10   In9   In8
1  3.02  3.01  3.01  3.01  3.01  3.03  3.03  3.03  3.03  3.03
2 16.49 16.55 16.42 16.35 16.56 16.75 16.74 16.72 16.63 16.53
    In7   In6   In5   In4   In3   In2   In1   In0   InY
1  3.03  3.03  3.03  3.03  3.03  3.03  3.03  3.03 10.78
2 16.32 16.08 15.77 15.47 14.81 14.30 13.70 13.35 10.65
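As an aside, a data frame with this lagged structure could be built from a raw series of bi-hourly means along the following lines; this is a hedged sketch with a placeholder series so2, since the actual pollution data are already provided by the package.

```r
# Hypothetical raw series of bi-hourly SO2 means (placeholder values only)
so2  <- cumsum(abs(rnorm(500)))
# Each row of embed() holds 24 consecutive values, most recent first:
# column 1 is X_{t+6} (the response), columns 7..24 are X_t, ..., X_{t-17}
lags <- stats::embed(so2, 18 + 6)
dat  <- data.frame(lags[, 24:7], InY = lags[, 1])
names(dat)[1:18] <- paste0("In", 17:0)
head(dat, 2)
```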
The variables from In17
to In0
correspond to the registered values
of SO\(_2\) at a specific temporal instant. In0
denotes the zero instant
(\(X_t\)), In1
corresponds to the 5-min temporal instant before
(\(X_{t-1}\)), In2
is the 10-min temporal instant before (\(X_{t-2}\)),
and so on until the last variable. The last column of the data frame
(InY
) refers to the response variable, \(Y=X_{t+6}\), the temporal
instant that we wish to predict. For this purpose, we propose the
underlying generalised additive model
\[\label{eq:chapter6modelapp} Y=m_0\left(X_{t}\right)+m_1\left(X_{t-1}\right)+\ldots+m_{17}\left(X_{t-17}\right)+\varepsilon \tag{5} \]
where \(m_j\), with \(j=0, \ldots,17\), are smooth and unknown functions and \(\varepsilon\) is the error which is assumed to have mean zero. To estimate the model in ((5)), FWDselect allows penalised regression splines, implemented in the mgcv package (Wood 2003; Wood 2004, 2011).
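For instance, the full model in (5) can be fitted directly with mgcv as in the following sketch, which assumes that the columns of the pollution data are ordered as in the excerpt above (In17, ..., In0, InY).

```r
library(mgcv)
data(pollution, package = "FWDselect")

covar <- names(pollution)[1:18]        # In17, ..., In0
form  <- as.formula(paste("InY ~", paste0("s(", covar, ")", collapse = " + ")))
fit_full <- gam(form, data = pollution)   # penalised regression splines
summary(fit_full)
```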
It may often be of interest to determine the best subset of variables of size \(q\) needed to predict the response. The question that naturally arises in this application is: what is the best temporal instant for predicting an emission episode? This is easy to ascertain with the function selection and the argument q = 1. Also, based on the model that we want to estimate here (an additive model), we have to use "gam" in the method argument.
> x <- pollution[, -19]
> y <- pollution[, 19]
> obj1 <- selection(x, y, q = 1, method = "gam", criterion = "deviance")
> obj1
****************************************************
Best subset of size q = 1 : In0
Information Criterion Value - deviance : 278663
****************************************************
For more than one subset size, the qselection
function returns a table
for the different subset sizes, with the selected variables and the
information criterion value.
> obj2 <- qselection(x, y, qvector = c(1:6), method = "gam", criterion = "deviance")
1] "Selecting subset of size 1 ..."
[1] "Selecting subset of size 2 ..."
[1] "Selecting subset of size 3 ..."
[1] "Selecting subset of size 4 ..."
[1] "Selecting subset of size 5 ..."
[1] "Selecting subset of size 6 ..."
[> obj2
q deviance selection1 1 278662.959 In0
2 2 201474.673 In0, In2
3 3 232164.509 In0, In2, In1
4 4 219941.426 In0, In3, In1, In5
5 5 184293.934 In0, In3, In1, In7, In6
6 6 200877.902 In0, In3, In1, In7, In6, In5
The above function output is a useful display that greatly helps determine the most relevant variables. A plot of this object can easily be obtained by using the following input command:
> plot(obj2)
Figure 1 shows the deviance values (obtained by cross-validation) corresponding to the different subsets. In each subset, \(q\) represents the number of temporal instants included in the model. Note, however, that only the results up to the subset of size \(q=6\) are shown because, from this size onwards, the remaining models have similar deviances.
The performance of the proposed predictors was then evaluated in a new
real pollution episode. We first estimate each of the proposed models
using the gam
function of the mgcv
package with the training data
set (pollution
data). Then, we apply the predict.gam
function to
each model using, in this case, the test data set. These data are found
in the episode
data, also included in this package. The corresponding
data frame is illustrated as follows:
> data(episode)
> head(episode)[1:2, ]
  In17 In16 In15 In14 In13 In12 In11 In10  In9  In8  In7  In6  In5
1 3.02 3.02 3.03 3.10 3.10 3.10 3.10 3.22 3.27 3.33 3.36 3.38 3.47
2 3.02 3.03 3.10 3.10 3.10 3.10 3.22 3.27 3.33 3.36 3.38 3.47 3.50
   In4  In3  In2  In1  In0  InY  time
1 3.50 3.56 3.61 4.28 4.60 5.45 00:00
2 3.56 3.61 4.28 4.60 4.68 6.20 00:05
The course of the incident is depicted in Figure 2. Temporal instants are plotted on the horizontal axis and the real 2-hour mean SO\(_2\) concentration that we seek to predict (\(Y=X_{t+6}\)) is represented by a grey line. The predictions obtained by applying the different models are shown in the same figure. The code, both for the predictions and for the plot, is shown in the Appendix.
The prediction obtained with the inclusion of just one variable in the model, \(X_{t}\), is far from the optimum. However, the addition of one more variable, \(X_{t-2}\), resulted in a remarkable increase in the model's predictive capability, making it possible to obtain predictions close to the real values. Lastly, it can be seen that the incorporation of one more variable or temporal instant (\(X_{t-1}\)) in the model does not produce any improvement in pollution-incident prediction. Numerically speaking, the same results can be observed by taking into account the mean squared error (MSE) for each model (Table 3).
Model | MSE |
---|---|
\(Y= X_t\) | 1 682.14 |
\(Y= X_t+X_{t-2}\) | 366.44 |
\(Y= X_t+X_{t-2}+X_{t-1}\) | 556.49 |
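The article's prediction code is given in its Appendix; the hedged sketch below shows one way in which the fits and the MSE values of Table 3 might be reproduced.

```r
library(mgcv)
data(pollution, package = "FWDselect")
data(episode, package = "FWDselect")

# Models suggested by the selection step: q = 1, 2 and 3 temporal instants
fits <- list(
  q1 = gam(InY ~ s(In0), data = pollution),
  q2 = gam(InY ~ s(In0) + s(In2), data = pollution),
  q3 = gam(InY ~ s(In0) + s(In2) + s(In1), data = pollution)
)
# Mean squared prediction error on the new emission episode
sapply(fits, function(f) mean((episode$InY - predict(f, newdata = episode))^2))
```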
The question that now arises is what is the minimum number of variables
that must be used in order to obtain the best prediction. It is possible
to deduce that there is an optimal intermediate point between the number
of variables that enters the model (preferably low) and the deviance
value (preferably also low). To find this point, the test described in
the previous section for the null hypothesis \(H_0(q)\) is applied for
each size, \(q\) (through the input command shown below). The procedure
stops when a certain null hypothesis is accepted. The most
computationally demanding parts are those that involve the bootstrap and
the cross-validation techniques. This can be parallelized using the
argument cluster = TRUE (default). This should considerably increase
the performance on multi-core/multi-threaded machines.
> test(x, y, nboot = 100, method = "gam", bootseed = 0413)
1] "Processing IC bootstrap for H_0 ( 1 )..."
[1] "Processing IC bootstrap for H_0 ( 2 )..."
[
*************************************
Hypothesis Statistic pvalue Decision1 H_0 (1) 5779.03 0 Rejected
2 H_0 (2) 959.21 0.78 Not Rejected
The deduction to be drawn is that, for a 5% significance level, the null hypothesis is rejected with \(q=\) 1 and accepted thereafter. From these results, it can be concluded that the best temporal instants for prediction of an emission episode would be \(X_{t}\) and \(X_{t-2}\).
Lastly, as we mentioned before, there are other alternatives for variable
selection in additive models. One of the best-known and most used procedures
is the argument select
of the gam
function from the mgcv package
(Marra and Wood 2011). To illustrate and compare its usage
with our procedure, we have estimated the model in
((5)) by means of the cited function using the
pollution
data. Then, its performance was evaluated using again the
episode
data. The prediction obtained using this double penalty GAM is
far from what it should be (see Figure 3); in fact, the mean
squared error obtained (5 024.29) is the worst of all so far (see the
code in the Appendix). It seems that, in situations with a large number of
variables, the selection of the best subset could be a better approach.
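A hedged sketch of this comparison, refitting the full model in (5) with select = TRUE and evaluating it on the episode data, could look as follows.

```r
library(mgcv)
data(pollution, package = "FWDselect")
data(episode, package = "FWDselect")

covar  <- names(pollution)[1:18]
form   <- as.formula(paste("InY ~", paste0("s(", covar, ")", collapse = " + ")))
fit_dp <- gam(form, data = pollution, select = TRUE)        # double penalty GAM
mean((episode$InY - predict(fit_dp, newdata = episode))^2)  # MSE on the episode
```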
This paper discusses implementation in R of a new algorithm for the problem of variable selection in a regression framework. The FWDselect package provides R users a simple method for ascertaining the relevant variables for prediction purposes and how many of these should be included in the model. The proposed method is a new forward stepwise-based selection procedure that selects a model containing a subset of variables according to an information criterion, and also takes into account the computational cost. Bootstrap techniques have been used to determine the minimum number of variables needed to obtain an appropriate prediction.
In some situations, several statistically equivalent optimal models of
size \(q\) may exist. In such cases, FWDselect allows the user to
visualise those models and select the most interesting one. This is
obtained with the argument seconds = TRUE
of the selection
functions. In addition, the software provides the user with a way of
easily obtaining the best subset of variables using different types of
data in different frameworks, by applying the lm
, glm
and gam
functions already implemented in R. The use of these classical R
functions nevertheless entails a high computational cost. Hence, a
further interesting extension would be the implementation of this
package using C, C++ or Fortran as the programming language. R users
could profit from this advantage in a future version of this package.
Insofar as the validity of the method is concerned, we think that the results obtained with simulated data are correct, and the results with the diabetes data are in accordance with other methodologies. This suggests that the behavior of the procedure in a nonparametric framework will also be adequate.
The results in this paper were obtained using R 3.2.0. The FWDselect package (Sestelo et al. 2015) is available from the Comprehensive R Archive Network at the URL http://cran.r-project.org/web/packages/FWDselect/.
This work was supported by research grant SFRH/BPD/93928/2013 of “Fundação para a Ciência e a Tecnologia” (FCT) and by FEDER Funds through “Programa Operacional Factores de Competitividade - COMPETE”, by Portuguese Funds through FCT, in the form of grant PEst-OE/MAT/UI0013/2014, by grant MTM2011-23204 (FEDER support included) of the Spanish Ministry of Science and Innovation and by grant 10PXIB300068PR from the Galician Regional Authority (Xunta de Galicia).
Here we illustrate the use of FWDselect using simulated data. The use of simulated data allows us to ascertain the behavior of our software and to compare it with other available tools designed for the same purpose.
Consider a vector of 10 covariates, \(\textbf{X}=\left(X_1, \ldots, X_{10}\right)\), and a continuous response, \(Y\), generated in accordance with
\[Y = 2 \hspace{0.1cm} X_1+ 4 \hspace{0.1cm} X_5+ \varepsilon,\] where \(\varepsilon\) is the error, distributed in accordance with a \(N(0, 1)\), and the explanatory covariates are random variables obtained from a uniform distribution on \([-1,1]\). Note that we have a linear scenario in which the response variable depends only on two covariates.
The following code will simulate 100 observations (\(n =\) 100) from the above model. To ensure the reproducibility of the results reported in the paper a seed was considered (0413).
> library(glmulti)
Loading required package: rJava
> library(leaps)
> rm(list = ls())
> set.seed(0413)
> n <- 100
> x <- matrix(runif(10 * n, -1, 1), ncol = 10, nrow = n, byrow = FALSE)
> e <- rnorm(n, 0, 1)
> y <- 2 * x[, 1] + 4 * x[, 5] + e
> data <- data.frame(x, y)
Now we compare our method against other existing methodologies developed
to perform automated variable selection. We choose the leaps package
(regsubsets
function), which selects the best variables for each
subset of size \(q\) without determining the number of variables that
users have to include in the model; the step
function from the stats
package which selects a formula-based model using the AIC; and the
glmulti package, which compares all possible models through an exhaustive
screening of the candidates, a genetic algorithm, or a very fast
exhaustive branch-and-bound algorithm.
> res1 <- regsubsets(x, y)
> summary(res1)$outmat[2, ]
  a   b   c   d   e   f   g   h   i   j 
"*" " " " " " " "*" " " " " " " " " " " 
> res2 <- step(lm(y ~ ., data = data), trace = FALSE)
> res2
Call:
lm(formula = y ~ X1 + X4 + X5 + X9, data = data)

Coefficients:
(Intercept)           X1           X4           X5           X9
     0.1538       1.9691       0.4285       3.5774       0.2875
> res3 <- glmulti(y ~ ., data = data, level = 1, plotty = F, report = FALSE)
> summary(res3)$bestmodel
1] "y ~ 1 + X1 + X4 + X5 + X9"
[
> res4aux <- test(x, y, nboot = 100)
1] "Processing IC bootstrap for H_0 ( 1 )..."
[1] "Processing IC bootstrap for H_0 ( 2 )..."
[
*************************************
Hypothesis Statistic pvalue Decision1 H_0 (1) 90.92 0 Rejected
2 H_0 (2) 21.11 0.06 Not Rejected
> res4 <- selection(x = x, y = y, q = res4aux$nvar, cluster = FALSE)
> res4
****************************************************
Best subset of size q = 2 : 5 1
Information Criterion Value - deviance : 16.70488
****************************************************
The regsubsets
function is based on all subsets or, in other words,
exhaustive variable selection. The method identifies the best subsets of
linear predictors using a branch-and-bound algorithm (Miller 2002).
Since this function returns separate best models of all sizes, we
consider only the results obtained for a subset of size two. In this
case, the procedure works properly, returning the \(X_1\) and \(X_5\)
variables as the best subset of size two. The model-selection-oriented
function step
is a widely used methodology for jointly determining the
number and choice of variables. In this case, this procedure fails,
returning a model which includes the effects of four covariates (\(X_1\),
\(X_4\), \(X_5\) and \(X_9\)). The results obtained with the glmulti
package, another option for model selection, are also mistaken. The
procedure returns the same model obtained with the previous method
(step
). Finally, in order to ascertain the performance of FWDselect,
we first apply the test
function with the purpose of determining the
number of variables that have to be included in the model. Then, once
this number is obtained (saved in the returned list as $nvar
), the
selection
function correctly identifies the \(X_1\) and \(X_5\) variables.
Regarding the computation time of these four methods, the fastest
procedure is the one implemented in the leaps package, taking only 0.001
seconds. The second one is the step function, which runs in 0.037 seconds.
The next one is the glmulti function, which takes 3.149 seconds. Lastly,
the most computationally demanding code is the one implemented in the
FWDselect package, which requires 9.181 seconds. All the results have been
obtained using R's system.time command on a 2.4 GHz Intel Core i5, with
4 cores and 4 GB of RAM.
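The timings can be reproduced along the following lines (a sketch only; the absolute values depend on the machine, and the objects x, y and data are those created above, with the packages already loaded).

```r
t_leaps   <- system.time(regsubsets(x, y))
t_step    <- system.time(step(lm(y ~ ., data = data), trace = FALSE))
t_glmulti <- system.time(glmulti(y ~ ., data = data, level = 1,
                                 plotty = FALSE, report = FALSE))
t_fwd     <- system.time({
  aux <- test(x, y, nboot = 100)
  selection(x, y, q = aux$nvar, cluster = FALSE)
})
# Elapsed times of the four procedures, in seconds
rbind(t_leaps, t_step, t_glmulti, t_fwd)[, "elapsed"]
```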
The previous results have been obtained using simulated data with a linear effect of the covariates. However, in practice, the user does not know the dependence structure, i.e., how the response variable depends on the covariates. With this in mind, we have considered and applied again the four procedures on another scenario where the response variable depends again on the same two covariates, but in this case their effect is nonlinear. In particular, \(Y\) is now generated in accordance with
\[Y = 2 \left(X_1\right)^2+ 2\sin \left(2 \pi X_5\right)+ \varepsilon,\]
where both \(\varepsilon\) and the explanatory covariates are the same as in the previous scenario. Note that we now have a nonlinear scenario in which the response variable depends only on two covariates.
> y <- 2 * x[, 1]**2 + 2 * sin(2 * pi * x[, 5]) + e
> data <- data.frame(x, y)
> res1 <- regsubsets(x, y)
> summary(res1)$outmat[2, ]
  a   b   c   d   e   f   g   h   i   j 
" " " " " " " " "*" "*" " " " " " " " " 
> res2 <- step(lm(y ~ ., data = data), trace = 0)
> res2
Call:
lm(formula = y ~ X5, data = data)

Coefficients:
(Intercept)           X5
     0.4764      -1.2377
> res3 <- glmulti(y ~ ., data = data, level = 1, plotty = F, report = FALSE)
> summary(res3)$bestmodel
1] "y ~ 1 + X5"
[
> res4aux <- test(x, y, nboot = 100, method = "gam")
1] "Processing IC bootstrap for H_0 ( 1 )..."
[1] "Processing IC bootstrap for H_0 ( 2 )..."
[
*************************************
Hypothesis Statistic pvalue Decision1 H_0 (1) 46.04 0 Rejected
2 H_0 (2) 20.89 0.15 Not Rejected
> res4 <- selection(x, y, q = res4aux$nvar, cluster = FALSE, method = "gam")
> res4
****************************************************
Best subset of size q = 2 : 5 1
Information Criterion Value - deviance : 18.41203
****************************************************
In this case, the performance of the methods changes. Excluding FWDselect, all the procedures fail to select the correct model. The leaps package returns the \(X_5\) and \(X_6\) variables, whereas the other two packages only retrieve the effect of \(X_5\).
The results presented in this appendix have been obtained with one simulated sample of \(n=100\). In order to evaluate the real performance of the methods, a simulation study using five hundred independent samples with different sample sizes (\(n = 50, 100, 200\)) was carried out. Focusing on the linear scenario, the leaps and FWDselect packages work well, with 100% and close to 95% of successes, respectively (for all sample sizes). The success rate for the other two packages is around 22%. Note that the results of leaps have been obtained assuming a subset of size two, thus giving this method an advantage over the others. In relation to the nonlinear scenario, the proportion of failures is very high for all procedures except FWDselect. The latter performs correctly close to 30% of the time for the smallest sample size, around 63% for \(n = 100\), and reaches 91.6% of successes for \(n = 200\).
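For reference, the kind of loop behind these success rates is sketched below for the linear scenario and the step procedure only; success is defined here as selecting exactly X1 and X5, and the number of replicates is illustrative.

```r
set.seed(1)
n_sim <- 100; n <- 100
hits <- replicate(n_sim, {
  x <- matrix(runif(10 * n, -1, 1), ncol = 10,
              dimnames = list(NULL, paste0("X", 1:10)))
  y <- 2 * x[, 1] + 4 * x[, 5] + rnorm(n)
  fit <- step(lm(y ~ ., data = data.frame(x, y)), trace = FALSE)
  setequal(attr(terms(fit), "term.labels"), c("X1", "X5"))
})
mean(hits)  # proportion of samples in which exactly {X1, X5} is selected
```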