A new package crs is introduced for computing nonparametric regression (and quantile) splines in the presence of both continuous and categorical predictors. B-splines are employed in the regression model for the continuous predictors and kernel weighting is employed for the categorical predictors. We also develop a simple R interface to NOMAD, which is a mixed integer optimization solver used to compute optimal regression spline solutions.
Regression splines constitute a popular approach for the nonparametric
estimation of regression functions though they are restricted to
continuous predictors as are smoothing splines (see e.g. smooth.spline
which is limited to one numeric predictor) and locally weighted
scatterplot smoothing (see loess
which is limited to four numeric
predictors). However, it is not uncommon to encounter relationships
involving a mix of numeric and categorical predictors, and the
crs package implements a
regression spline framework for accomplishing this task.
The crs package implements the approaches described in
Ma, J. S. Racine, and L. Yang (2012) and Ma and J. S. Racine (2012) when the option kernel=TRUE
is selected (default) as described below. The categorical predictors can
be handled in two ways, (i) using kernel weighting where the kernel
functions are tailored to the discrete support of the categorical
predictors (Racine and Q. Li 2004), and (ii) using indicator basis functions.
The fundamental difference between these two approaches is that the use
of indicator basis functions consumes degrees of freedom via the number
of columns in the basis matrix, while kernel weighting does not. As
well, this package implements a range of related methods and provides
options that we hope will make it appealing for applied projects,
research, and pedagogical purposes alike.
Both semiparametric additive models and fully nonparametric models are
implemented. Data-driven methods can be used for selecting the spline
degree, number of segments/knots, and bandwidths: leave-one-out cross-validation (cv.func="cv.ls") (Stone 1974, 1977), generalized cross-validation (cv.func="cv.gcv") (Craven and G. Wahba 1979), and the information-based criterion (cv.func="cv.aic") proposed by Hurvich, J. S. Simonoff, and C. L. Tsai (1998). Derivatives of arbitrary order can readily
be constructed. One of the computational challenges encountered is to
obtain the optimal spline degree (non-negative integer), number of
segments/knots (positive integer), and bandwidth (bounded and
real-valued) for each predictor. We overcome this challenge by providing
an interface to NOMAD (Abramson, C. Audet, G. Couture, J. Dennis Jr., and S. Le Digabel 2011; Le Digabel 2011).
Before proceeding we briefly provide an overview of some important implementation details:
The degree of the spline and number of segments (i.e. knots minus one) for each continuous predictor can be set manually as can the bandwidths for each categorical predictor (if appropriate).
Alternatively, any of the data-driven criteria (i.e. cv.func="…")
could be used to select either the degree of the spline (holding the
number of segments/knots fixed at any user-set value) and bandwidths
for the categorical predictors (if appropriate), or the number of
segments (holding the degree of the spline fixed at any user-set
value) and bandwidths for the categorical predictors (if
appropriate), or the number of segments and the degree of the spline
for each continuous predictor and bandwidths for each categorical
predictor (if appropriate).
When indicator basis functions are used instead of kernel smoothing,
whether to include each categorical predictor or not can be
specified manually or chosen via any cv.func
method.
We allow the degree of the spline for each continuous predictor to
include zero, the inclusion indicator for each categorical predictor
to equal zero, and the bandwidth for each categorical predictor to
equal one, and when the degree/inclusion indicator is zero or the
bandwidth is one, the variable is thereby removed from the
regression: in this manner, irrelevant predictors can be
automatically removed by any cv.func
method, negating the need for pre-testing (mirroring findings detailed in Hall, J. S. Racine, and Q. Li (2004) and Hall, Q. Li, and J. S. Racine (2007) for kernel regression).
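As a concrete (and purely illustrative) sketch of the two modes just described, with hypothetical data objects y, x, and z:

library(crs)
## Fully manual settings: spline degree and number of segments for the
## continuous predictor x, bandwidth (lambda) for the factor z;
## cv = "none" disables any data-driven search.
model.manual <- crs(y ~ x + z, degree = 3, segments = 2,
                    lambda = 0.1, cv = "none")
## Fully data-driven: degree, segments, and bandwidth selected by
## leave-one-out cross-validation via NOMAD-based search.
model.cv <- crs(y ~ x + z, cv = "nomad", cv.func = "cv.ls")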
The design philosophy underlying the crs package aims to closely mimic
the behaviour of the lm
function. Indeed, the implementation relies on
lm
and its supporting functions for computation of the spline
coefficients, delete-one residuals, fitted values, prediction and the
like. 95% confidence bounds for the fit and derivatives are constructed
from asymptotic formulae and automatically generated. Below we describe
in more detail the specifics of the implementation for the interested
reader. Others may wish to jump to the illustrative example that appears
towards the end of this article or simply install and load the package
and run example(crs)
.
Readers familiar with existing R functions and packages for spline modelling will naturally be wondering what the difference is between the crs package and existing spline-based procedures such as smooth.spline in base R, gam in the gam package (Hastie 2011), ssanova in the gss package, spm in the SemiPar package, and gam in the mgcv package.
First we note that the functions smooth.spline
and ssanova
are based
on smoothing spline methodology, while spm
uses penalized splines but
gam
in the gam/mgcv packages allows for smoothing splines,
penalized splines, and regression splines. The crs package is
restricted to ‘regression splines’ which differs in a number of ways
from ‘smoothing splines’ and ‘penalized splines’, the fundamental
difference being that smoothing/penalized splines use the data points
themselves as potential knots and penalize ‘roughness’ (typically the
second derivative of the estimate) while regression splines place knots
at equidistant/equiquantile points and do not explicitly penalize
roughness, rather, they rely on various cross-validatory approaches for
model selection. We direct the interested reader to Wahba (1990) for a
treatment of smoothing splines. The crs package is one of the few
packages devoted to regression spline techniques. We also provide quantile regression splines via the option tau=\(\tau\), \(\tau\in(0,1)\).
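For instance, with hypothetical data objects, the first-quartile regression spline would be requested via:

model.q25 <- crs(y ~ x + z, tau = 0.25)   # quantile regression spline at tau = 0.25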
Second, many of the aforementioned smoothing spline implementations are
semiparametric in nature, the semiparametric additive model being
particularly popular. Though semiparametric models exist to circumvent
the curse of dimensionality, this does not come without cost. That is, the burden of determining whether semiparametric or nonparametric approaches would be warranted in any given situation is placed squarely on the researcher’s shoulders. Unlike many existing spline methods in R, the
implementation in the crs package is designed so that every parameter
that must be chosen by the researcher can be data-driven (via
cross-validation) so that such choices adapt to the data at hand
including whether to use a semiparametric or nonparametric model. This
is accomplished using options such as basis="auto", knots="auto", and complexity="degree-knots" (basis="auto" deciding whether to use an additive basis or tensor product basis in multivariate settings, knots="auto" whether to use equispaced knots or quantile knots, and complexity="degree-knots" determining both the spline degrees and number of knots).
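A brief sketch with hypothetical data, spelling these options out explicitly:

model <- crs(y ~ x1 + x2 + z,
             basis = "auto",               # CV decides: additive vs. tensor product basis
             knots = "auto",               # CV decides: uniform vs. quantile knots
             complexity = "degree-knots")  # CV selects both spline degrees and knots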
Generally speaking, almost all of these existing smoothing spline approaches can handle mixed datatypes though their approaches to the treatment of categorical variables often differ (none use categorical kernel smoothing as in the crs package).
Regression spline methods can be limited in their potential applicability as they are based on continuous predictors. However, in applied settings we often encounter categorical predictors such as strength of preference (“strongly prefer”, “weakly prefer”, “indifferent”) and so forth.
We wish to model the unknown conditional mean in the following location-scale model, \[Y=g\left( \mathbf{X},\mathbf{Z}\right) +\sigma \left( \mathbf{X},\mathbf{Z} \right) \varepsilon,\] where \(g(\cdot )\) is an unknown function, \(\mathbf{X=}\left( X_{1},\ldots ,X_{q}\right) ^{\mathrm{T}}\) is a \(q\)-dimensional vector of continuous predictors, \(\mathbf{Z}=\left( Z_{1},\ldots ,Z_{r}\right) ^{\mathrm{T}}\) is an \(r\)-dimensional vector of categorical predictors, and \(\sigma^2\left(\mathbf{X},\mathbf{Z}\right)\) is the conditional variance of \(Y\) given \(\mathbf{X}\) and \(\mathbf{Z}\). Letting \(\mathbf{z} =\left( z_{s}\right) _{s=1}^{r}\), we assume that \(z_{s}\) takes \(c_{s}\) different values in \(D_{s}\equiv \left\{ 0,1,\ldots ,c_{s}-1\right\}\), \(s=1,\ldots ,r\), and \(c_{s}\) is a finite positive constant.
For the continuous predictors the regression spline model employs the B-spline routines in the GNU Scientific Library (Galassi, J. Davies, J. Theiler, B. Gough, G. Jungman, P. Alken, M. Booth, and F. Rossi 2009). The B-spline function is the maximally differentiable interpolative basis function (B-spline stands for ‘basis-spline’).
Heuristically, we conduct linear (in parameter) regression using the R function lm; however, we replace the continuous predictors with
B-splines of potentially differing order and number of segments for each
continuous predictor. For the tensor product bases we set
intercept=TRUE
for each univariate spline basis, while for the
additive spline bases we adopt the intercept=FALSE
variants and
include an intercept term in the model (the B-splines will therefore not
sum to one, i.e. an order \(m\) B-spline with one segment (two
knots/breakpoints) has \(m+1\) columns and we drop the first as is often
done, though see Zhou and D. A. Wolfe (2000) for an alternative approach based upon
shrinkage methods). This allows multiple bases to coexist when there is
more than one continuous predictor. The tensor product basis is given by
\[B_1\otimes B_2\otimes \dots \otimes B_p,\] where \(\otimes\) is the
Kronecker product where the products operate column-wise and \(B_j\) is
the basis matrix for predictor \(j\) as outlined above. We also support a
‘generalized’ polynomial B-spline basis that consists of a varying-order
polynomial with appropriate interactions. When additive B-spline bases
are employed we have a semiparametric ‘additive’ spline model (no
interaction among variables unless explicitly specified). When the
tensor product or generalized polynomial is employed we have a fully
nonparametric model (interaction among all variables). Whether to use
the additive, tensor product, or generalized polynomial bases can be
pre-specified or automatically determined via any cv.func method (see the options for basis= in ?crs).
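To illustrate the column-wise Kronecker (tensor) product of univariate B-spline bases described above, the following sketch uses bs() from the base splines package; it illustrates the construction only and is not the GSL-based code used internally by crs.

library(splines)
set.seed(1)
x1 <- runif(100)
x2 <- runif(100)
## Univariate B-spline bases (degree 3, one interior knot, intercept retained)
B1 <- bs(x1, degree = 3, df = 5, intercept = TRUE)
B2 <- bs(x2, degree = 3, df = 5, intercept = TRUE)
## Column-wise products: every column of B1 multiplied element-wise by
## every column of B2, giving the tensor product basis B1 (x) B2
tensor <- matrix(NA_real_, nrow = nrow(B1), ncol = ncol(B1) * ncol(B2))
k <- 1
for (i in seq_len(ncol(B1))) {
  for (j in seq_len(ncol(B2))) {
    tensor[, k] <- B1[, i] * B2[, j]
    k <- k + 1
  }
}
dim(tensor)   # 100 observations by 5 * 5 = 25 basis columns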
We offer the default option to use categorical kernel weighting
(lm(…,weights=L)
) to handle the presence of categorical predictors
(see below for a description of L
). We also offer the option of using
indicator basis functions for the categorical predictors.
The unknown function \(g\left( \mathbf{X},\mathbf{Z}\right)\) can be approximated by \(\mathcal{B}\left( \mathbf{X}\right) ^{\mathrm{T}}\beta \left( \mathbf{Z}\right)\), where \(\beta \left( \mathbf{Z}\right)\) is a vector of coefficients and \(\mathcal{B}\left( \mathbf{X}\right)\) the B-spline basis matrix described above.
We estimate \(\beta \left( \mathbf{z}\right)\) by minimizing the following weighted least squares criterion, \[\widehat{\beta }\left( \mathbf{z}\right) =\mathop{\mathrm{arg\!min}}_{\beta \left( \mathbf{z} \right) \in \mathbb{R}^{\mathbf{K}_{n}}}\sum\limits_{i=1}^{n}\left\{ Y_{i}- \mathcal{B}\left( \mathbf{X}_{i}\right) ^{\mathrm{T}}\beta \left( \mathbf{z}\right) \right\} ^{2}L\left( \mathbf{Z}_{i},\mathbf{z},\lambda\right).\]
The user can determine where knots are to be placed using one of two methods:
knots can be placed at equally spaced quantiles whereby an equal number of observations lie in each segment (‘quantile knots’).
knots can be placed at equally spaced intervals (‘uniform knots’).
If preferred, this can be determined automatically using the option
knots="auto"
.
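A small sketch of the two placement rules for, say, four segments (five knots) on a hypothetical continuous predictor x:

set.seed(1)
x <- rchisq(100, df = 5)
segments <- 4
## 'quantile knots': an (approximately) equal number of observations per segment
knots.quantile <- quantile(x, probs = seq(0, 1, length.out = segments + 1))
## 'uniform knots': knots equally spaced over the range of x
knots.uniform <- seq(min(x), max(x), length.out = segments + 1)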
Let \(Z_i\) be an \(r\)-dimensional vector of categorical/discrete predictors. We use \(z_s\) to denote the \(s\)-th component of \(z\), we assume that \(z_s\) takes \(c_s\) different values in \(D_s \equiv \{0,1,\dots ,c_s-1\}\), \(s=1,\dots,r\), and let \(c_s\geq 2\) be a finite positive constant.
For an unordered categorical predictor, we use a variant of the kernel function outlined in Aitchison and C. G. G. Aitken (1976) defined as \[l( Z_{is}, z_{s},\lambda_s) = \left\{ \begin{array}{ll} 1, & \text{when } Z_{is} = z_s,\\ \lambda_s, & \text{otherwise}. \end{array} \right.\] Let \({\bf 1}(A)\) denote the usual indicator function, which assumes the value one if \(A\) holds true, zero otherwise. Using this kernel function, we can construct a product kernel weight function given by \[L\left( \mathbf{Z}_{i},\mathbf{z},\lambda\right) = \prod_{s=1}^r l( Z_{is},z_s,\lambda_s) = \prod_{s=1}^{r} \lambda_s^{ {\bf 1}( Z_{is} \neq z_s) },\] while for an ordered categorical predictor we use the kernel function defined by \[\tilde l(Z_{is},z_s,\lambda_s) = \lambda_s^{|Z_{is}-z_s| }\] and modify the product kernel function appropriately. When \(\mathbf{Z}\) contains a mix of ordered and unordered variables we select the appropriate kernel for each variable’s type when constructing the product kernel \(L\left( \mathbf{Z}_{i},\mathbf{z},\lambda\right)\).
Note that when \(\lambda_s=1\) all observations are ‘pooled’ over categorical predictor \(s\) hence the variable \(z_s\) is removed from the resulting estimate, while when \(\lambda_s=0\) only observations lying in a given cell are used to form the estimate.
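A minimal sketch of this kernel-weighted least-squares step for a single unordered categorical predictor, using bs() from the splines package in place of the internal basis (illustration only):

library(splines)
set.seed(1)
n <- 100
x <- runif(n)
z <- factor(rbinom(n, 1, 0.5))                 # unordered categorical predictor
y <- cos(2 * pi * x) + (z == "1") + rnorm(n, sd = 0.25)
lambda <- 0.1                                   # bandwidth for z
z.eval <- "1"                                   # evaluation point z
## Aitchison-Aitken style weights: 1 when Z_i = z, lambda otherwise
L <- ifelse(z == z.eval, 1, lambda)
## Weighted least squares yields the estimate of beta(z) at z.eval
fit <- lm(y ~ bs(x, degree = 3, df = 5), weights = L)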
Estimating the model requires construction of the spline bases and their
tensor product (if specified) along with the categorical kernel
weighting function. Then, for a given degree and number of segments for
each continuous predictor and bandwidth for each categorical predictor
(or indicator bases if kernel=FALSE
), the model is fit via
least-squares.
All smoothing parameters can be set manually by the user if so desired; however, be forewarned that you must use the option cv="none", otherwise the values specified manually will become the starting points for the search when cv="nomad" (‘nonsmooth mesh adaptive direct search (NOMAD)’, see Abramson, C. Audet, G. Couture, J. Dennis Jr., and S. Le Digabel (2011) and Le Digabel (2011)). Currently, we provide a simple R interface to NOMAD (see the section below) in the crs package, which can also be applied to solve other mixed integer optimization problems.
The degree, segment, and bandwidth vectors can be jointly determined via
any cv.func
method by setting the option cv="nomad"
or
cv="exhaustive"
(exhaustive search). Here we have to solve nonlinear
non-convex and non-differentiable mixed integer constrained optimization
problems to obtain the optimal degree, number of segments, and bandwidth
for each predictor.
Setting the option cv="nomad"
(default) computes NOMAD-based
cross-validation directed search while setting cv="exhaustive"
computes exhaustive cross-validation directed search for each unique
combination of the degree and segment vector for each continuous
predictor from degree=degree.min
through degree=degree.max
(default
0 and 10, respectively) and from segments=segments.min
through
segments=segments.max
(default 1 and 10, respectively).
When kernel=TRUE
(default) setting the option cv="exhaustive"
computes bandwidths (\(\in[0,1]\)) obtained via numerical minimization
(using optim
) for each unique combination of the degree and segment
vectors (restarting can be conducted via restarts=
). When setting
cv="nomad"
the number of multiple starts can be controlled by
nmulti=
(default is 5). The model possessing the lowest criterion
function value over the nmulti
restarts is then selected as the final
model.
Note that cv="exhaustive" is often infeasible (this combinatoric problem can become impossibly large to compute in finite time), hence
cv="nomad"
is the default. However, with cv="nomad"
one should set
nmulti=
to some sensible value greater than zero to avoid becoming
trapped in local minima (default nmulti=5
).
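The following sketch (hypothetical data; argument values illustrative) shows how the search might be configured:

## NOMAD-based search (default) with 10 restarts and a reduced search space
model.nomad <- crs(y ~ x1 + x2 + z, cv = "nomad", cv.func = "cv.ls",
                   nmulti = 10, degree.max = 5, segments.max = 5)
## Exhaustive search over every degree/segment combination; feasible here
## only because degree.max and segments.max keep the space small
model.exhaustive <- crs(y ~ x1 + x2 + z, cv = "exhaustive", cv.func = "cv.aic",
                        degree.max = 5, segments.max = 5)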
We incorporate three popular approaches for setting the smoothing parameters of the regression spline model, namely least-squares cross-validation, generalized cross-validation, and an AIC method corrected for use in nonparametric settings.
Let the fitted value from the regression spline model be denoted \(\hat Y_i=B_{m}(X_i)^T\hat \beta(Z_i)\). Letting \(\hat\varepsilon_i=Y_i-\hat Y_i\) denote the \(i\)th residual from the categorical regression spline model, the least-squares cross-validation function is given by \[CV=\frac{1}{n}\sum_{i=1}^n\frac{\hat\varepsilon_i^2}{(1-h_{ii})^2},\] where \(h_{ii}\) denotes the \(i\)th diagonal element of the spline basis projection matrix (see below for details); this computation requires effectively one pass through the data set. Since \(h_{ii}\) is computed routinely for robust diagnostics in R, it can be obtained along with (and hence as cheaply as) the vector of spline coefficients themselves. Thus least-squares cross-validation is computationally appealing, particularly for large data sets.
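To see why this is cheap, the delete-one criterion can be computed from a single least-squares fit via its hat values; the helper below is a sketch and not the internal crs code.

## Least-squares cross-validation from one pass through the data
cv.ls <- function(fit) {
  eps <- residuals(fit)       # in-sample residuals
  h   <- hatvalues(fit)       # diagonal elements h_ii of the projection matrix
  mean((eps / (1 - h))^2)
}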
Let \(H\) denote the \(n\times n\) weighting matrix such that \(\hat Y=HY\), with \(i\)th diagonal element \(h_{ii}\), and let \(\text{tr}(H)=\sum_{i=1}^n h_{ii}\) denote its trace. The matrix \(H\) is often called the ‘hat matrix’ or ‘smoother matrix’ and depends on \(X\) but not on \(Y\). The ‘generalized’ cross-validation function is obtained by replacing \(h_{ii}\) in the above formula with its average value, \(\text{tr}(H)/n\) (Craven and G. Wahba 1979).
The information-based criterion proposed by Hurvich, J. S. Simonoff, and C. L. Tsai (1998) is given by \[\text{AIC}_c=\ln(\hat\sigma^2)+ \frac{1+\text{tr}(H)/n}{1-\{\text{tr}(H)+2\}/n}\] where \[\hat\sigma^2=\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^2 =Y'(I-H)'(I-H)Y/n.\] Each of these criterion functions can be minimized with respect to the unknown smoothing parameters either by numerical optimization procedures or by exhaustive search.
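Continuing the sketch above, the generalized cross-validation and corrected AIC criteria follow from the same per-fit quantities (again an illustration, not the internal code):

cv.gcv <- function(fit) {
  eps <- residuals(fit)
  n   <- length(eps)
  trH <- sum(hatvalues(fit))                 # tr(H)
  mean(eps^2) / (1 - trH / n)^2              # replace h_ii by its average tr(H)/n
}

cv.aic <- function(fit) {
  eps    <- residuals(fit)
  n      <- length(eps)
  trH    <- sum(hatvalues(fit))
  sigma2 <- mean(eps^2)
  log(sigma2) + (1 + trH / n) / (1 - (trH + 2) / n)
}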
Though each of the above criteria are asymptotically equivalent in terms of the bandwidths they deliver (\(\text{tr}(H)/n\to 0\) as \(n\to\infty\)), they may differ in finite-sample settings for a small smoothing parameter (large \(\text{tr}(H)/n\)) with the AIC\(_c\) criterion penalizing more heavily when undersmoothing than either the least-squares cross-validation or generalized cross-validation criteria (the AIC\(_c\) criterion effectively applies an infinite penalty for \(\text{tr}(H)/n\ge 1/2\)).
Once a model has been selected via cross-validation (i.e. degree, segments, include, or lambda have been selected), there is the
opportunity to (potentially) further refine the model by adding the
option prune=TRUE
to the crs
function call. Pruning is accomplished
by conducting stepwise cross-validated variable selection using a
modified version of the stepAIC
function in the R
MASS package where the
function extractAIC
is replaced with the function extractCV
with
additional modifications where necessary. Pruning of potentially
superfluous bases is undertaken, however, the pruned model (potentially
containing a subset of the bases) is returned only if its
cross-validation score improves upon the model being pruned. When this
is not the case a warning is issued to this effect. A final pruning
stage is commonplace in spline frameworks and may positively impact on
the finite-sample efficiency of the resulting estimator depending on the
rank of the model being pruned. Note that this option can only be
applied when kernel=FALSE
(or when there exist only numeric
predictors).
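A one-line sketch with hypothetical data:

## Post-cross-validation pruning of superfluous bases; requires kernel = FALSE
model.pruned <- crs(y ~ x1 + x2 + z, kernel = FALSE, prune = TRUE)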
The crs package has included a simple R interface to the NOMAD
optimization solver called snomadr
. snomadr
implements the NOMAD
library which is an open source C++ implementation of the Mesh Adaptive
Direct Search (MADS) algorithm designed for constrained optimization of
blackbox functions (Abramson, C. Audet, G. Couture, J. Dennis Jr., and S. Le Digabel 2011; Le Digabel 2011). snomadr
can be seen as a
standalone interface to this optimization solver, though we would like
to point out that the authors of NOMAD
are currently working on an R
package for NOMAD that we expect to be much more comprehensive than the
simple interface provided here. The principle behind developing our
interface is simply to shield the user from setting up the optimization
model underlying our regression spline implementation. For what follows
we will not discuss the performance of the solver NOMAD, rather we
direct the interested reader to Abramson, C. Audet, G. Couture, J. Dennis Jr., and S. Le Digabel (2011), Le Digabel (2011), and the
references therein.
The structure of the snomadr
interface is similar to the interface in
the R package ipoptr (http://www.ucl.ac.uk/~uctpjyy/ipoptr.html)
which exports the R interface to the optimization solver IPOPT (Wächter and L. T. Biegler 2006).
The interface is split into two major portions, one is the R code called
snomadr.R
and another is a C++ program called snomadr.cpp
. The role
of snomadr.R
is to set up the optimization model, define options and
then call the C++ code to solve the optimization problem. snomadr.cpp
will receive arguments from the R code and then call the NOMAD library
functions to solve the problem and return results to snomadr.R
.
.Call
is used to transfer arguments and results between the R and C++
languages. For further details see ?snomadr
.
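As a very rough sketch of how snomadr might be called on a small mixed integer problem, consider the following; the argument names (eval.f, n, x0, bbin, lb, ub, opts) and the integer/continuous coding in bbin are our assumptions based on a reading of ?snomadr and should be verified against the help page before use.

## Hypothetical problem: minimize (x1 - 2)^2 + (x2 - 0.5)^2 over an
## integer-valued x1 and a continuous x2 (a mixed integer problem).
eval.f <- function(x) {
  (x[1] - 2)^2 + (x[2] - 0.5)^2
}
## Argument names and the bbin coding below are assumptions; see ?snomadr.
sol <- snomadr(eval.f = eval.f, n = 2, x0 = c(0L, 0),
               bbin = c(1, 0),        # assumed: 1 = integer, 0 = continuous
               lb = c(0, 0), ub = c(10, 1),
               opts = list("MAX_BB_EVAL" = 500))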
By way of illustration we consider a simple simulated example involving one continuous and one categorical predictor.
set.seed(42)
n <- 1000
x <- runif(n)
z <- rbinom(n, 1, .5)
y <- cos(2 * pi * x) + z + rnorm(n, sd=0.25)
z <- factor(z)
model <- crs(y ~ x + z)
summary(model)

Call:
crs.formula(formula = y ~ x + z)

Kernel Weighting/B-spline Bases Regression Spline

There is 1 continuous predictor
There is 1 categorical predictor
Spline degree/number of segments for x: 3/4
Bandwidth for z: 0.0008551836
Model complexity proxy: degree-knots
Knot type: quantiles
Training observations: 1000
Trace of smoother matrix: 14
Residual standard error: 0.2453 on 993 degrees of freedom
Multiple R-squared: 0.927, Adjusted R-squared: 0.9265
F-statistic: 962.9 on 13 and 986 DF, p-value: 0
Cross-validation score: 0.061491531
Number of multistarts: 5
The function crs()
called in this example returns a "crs"
object.
The generic functions fitted
and residuals
extract (or generate)
estimated values and residuals. Furthermore, the functions summary, predict, and plot (options mean=FALSE, deriv=\(i\), ci=FALSE, plot.behavior = c("plot", "plot-data", "data"), where \(i\) is a positive integer) support objects of this type.
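Continuing the example above, a brief sketch of these generics (the newdata construction is purely illustrative):

fit.vals <- fitted(model)         # estimated conditional mean at the sample points
res      <- residuals(model)      # model residuals
## Evaluate the fit on a grid of new points, holding z at its first level
newdata <- data.frame(x = seq(0, 1, length.out = 50),
                      z = factor(0, levels = c(0, 1)))
yhat <- predict(model, newdata = newdata)
plot(model, mean = TRUE)          # partial regression plots (see Figure 1)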
Figure 1 presents summary output in
the form of partial regression surfaces (predictors not appearing on the
axes are held constant at their medians/modes). Note that for this simple example we used the option plot(model, mean=TRUE) which presents ‘partial regression plots’.
Next we plot the first partial derivative (which plots the partial
derivative with respect to the continuous predictor holding the
categorical predictor at its modal value and next the difference between
the regression function when the categorical predictor equals 0 and 1
holding the continuous predictor at its median) for this model in Figure
2. Note that here we used the
option plot(model, deriv=1)
.
We now compare and contrast some existing spline implementations in R with the regression spline implementation contained in the crs package via two illustrative Monte Carlo simulations.
We first consider the simple and popular case of nonparametrically modeling the relationship between a response \(Y\) and univariate predictor \(X\). Here the purpose is to help the reader assess the performance of regression splines versus smoothing splines. We consider four data generating processes (DGPs), the sine and cosine functions, the absolute value function, and the Doppler function given by \[g(x)=\sqrt{x(1-x)}\times\sin\left(\frac{2\pi(1+2^{-7/5})}{x+2^{-7/5}}\right)\] For the sine, cosine, and Doppler functions \(X\) is \(U[0,1]\), while for the absolute value function \(X\) is \(U[-1,1]\). We vary the signal/noise ratio by standardizing the DGP to have mean zero and unit variance and setting \(\sigma\) for the Gaussian noise to \(1/4,1/2,1,2\) which produces expected in-sample fits of \(R^2=0.95,0.80,0.50,0.20\) for the oracle estimator (i.e. one that uses knowledge of the DGP).
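For concreteness, a sketch of one replication from the Doppler DGP with the standardization just described (here with \(\sigma=1/2\)):

set.seed(1)
n <- 1000
x <- runif(n)
dgp <- sqrt(x * (1 - x)) * sin(2 * pi * (1 + 2^(-7/5)) / (x + 2^(-7/5)))
dgp <- (dgp - mean(dgp)) / sd(dgp)   # standardize to mean zero, unit variance
sigma <- 0.5                          # yields an expected in-sample R^2 of 0.80
y <- dgp + rnorm(n, sd = sigma)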
We compare crs
-based regression splines with smoothing spline methods
gam
in the mgcv package, spm
in the SemiPar package, ssanova
in the gss package, and loess
in base R. All methods in this first
simulation use default settings.
We draw \(M=1,000\) replications of size \(n=1,000\) from each DGP, compute
the mean square error (MSE) for each model, and report the median value
of each model’s MSE relative to that for crs
over the \(M=1,000\)
replications in Table 1 below (numbers less than one
indicate that a method is more efficient than crs
).
\(\sin(2\pi x)\) | gam | spm | ssanova | loess |
\(\sigma=0.25\) | 0.97 | 1.24 | 1.02 | 8.06 |
\(\sigma=0.50\) | 0.80 | 0.89 | 0.81 | 2.15 |
\(\sigma=1.00\) | 0.78 | 0.78 | 0.78 | 0.92 |
\(\sigma=2.00\) | 0.84 | 0.79 | 0.83 | 0.71 |
\(\sin(4\pi x)\) | gam | spm | ssanova | loess |
\(\sigma=0.25\) | 0.89 | 1.43 | 1.11 | 507.85 |
\(\sigma=0.50\) | 0.78 | 1.15 | 0.98 | 129.41 |
\(\sigma=1.00\) | 0.81 | 1.01 | 0.95 | 36.72 |
\(\sigma=2.00\) | 0.77 | 0.84 | 0.85 | 9.89 |
\(\cos(2\pi x)\) | gam | spm | ssanova | loess |
\(\sigma=0.25\) | 1.13 | 1.17 | 1.17 | 39.39 |
\(\sigma=0.50\) | 0.99 | 1.00 | 1.04 | 11.18 |
\(\sigma=1.00\) | 0.98 | 0.93 | 1.00 | 3.67 |
\(\sigma=2.00\) | 0.89 | 0.82 | 0.91 | 1.38 |
\(\cos(4\pi x)\) | gam | spm | ssanova | loess |
\(\sigma=0.25\) | 2.78 | 1.54 | 1.30 | 299.88 |
\(\sigma=0.50\) | 1.31 | 1.25 | 1.12 | 79.85 |
\(\sigma=1.00\) | 0.93 | 1.02 | 0.97 | 21.42 |
\(\sigma=2.00\) | 0.82 | 0.89 | 0.87 | 6.09 |
\(\text{abs}(x)\) | gam | spm | ssanova | loess |
\(\sigma=0.25\) | 1.55 | 0.91 | 0.98 | 14.59 |
\(\sigma=0.50\) | 1.20 | 1.05 | 1.11 | 5.97 |
\(\sigma=1.00\) | 1.23 | 1.18 | 1.26 | 2.73 |
\(\sigma=2.00\) | 1.41 | 1.36 | 1.43 | 1.61 |
\(\text{doppler}(x)\) | gam | spm | ssanova | loess |
\(\sigma=0.25\) | 65.45 | 1.67 | 1.68 | 359.63 |
\(\sigma=0.50\) | 18.53 | 1.39 | 1.42 | 99.51 |
\(\sigma=1.00\) | 5.15 | 1.14 | 1.19 | 25.40 |
\(\sigma=2.00\) | 1.78 | 1.03 | 1.04 | 6.81 |
Table 1 reveals some general patterns. For smooth low
frequency DGPs such as the sine and cosine, gam
, spm
, and ssanova
turn in similar performances depending on the signal/noise ratio.
However, loess
does not do well with these DGPs. For the absolute
value and Doppler DGPs gam
, spm
, ssanova
and loess
on balance
perform worse than crs
. From a minimax perspective, the regression
spline implementation in the crs package performs relatively well
overall. Naturally by modifying tuning parameters one could expect to
improve the performance of the smoothing spline methods (this would
apply to crs
as well as default settings are used throughout). The
point to be made is simply that the regression spline implementation in
crs appears to adapt fairly well to the underlying DGP and may
therefore be of interest to readers.
Next we consider mixed predictor multivariate DGPs with additive and multiplicative features. The additive DGP is of the form
x1 <- runif(n)
x2 <- runif(n)
z <- rbinom(n, 1, .5)
dgp <- cos(2*pi*x1) + sin(2*pi*x2) + z
dgp <- (dgp - mean(dgp))/sd(dgp)
y <- dgp + rnorm(n, sd=sigma)
z <- factor(z)
while the multiplicative DGPs are of the form
x1 <- runif(n)
x2 <- runif(n)
z <- rbinom(n, 1, .5)
dgp <- cos(2*pi*x1) * sin(2*pi*x2) * z
dgp <- (dgp - mean(dgp))/sd(dgp)
y <- dgp + rnorm(n, sd=sigma)
z <- factor(z)
and the radial function given by
x1 <- runif(n, -5, 5)
x2 <- runif(n, -5, 5)
z <- rbinom(n, 1, .5)
dgp <- sin(sqrt(x1^2 + x2^2 + z))/sqrt(x1^2 + x2^2 + z)
dgp <- (dgp - mean(dgp))/sd(dgp)
y <- dgp + rnorm(n, sd=sigma)
z <- factor(z)
We again vary the signal/noise ratio by standardizing the DGP and setting \(\sigma\) for the Gaussian noise to \(1/4,1/2,1,2\) which produces expected in-sample fits of \(R^2=0.95,0.80,0.50,0.20\) for the oracle estimator (i.e. one that uses knowledge of the DGP). We estimate the following eight models,
model <- crs(y ~ x1 + x2 + z)
# gam (mgcv) additive
model <- gam(y ~ s(x1) + s(x2) + z)
# gam (mgcv) tensor to admit interactions
model <- gam(y ~ t2(x1, x2, k=k) + z)
# gam (mgcv) tensor with smooth for every z
model <- gam(y ~ t2(x1, x2, k=k, by=z) + z)
# spm (SemiPar) additive
model <- spm(y ~ f(x1) + f(x2) + z)
# spm (SemiPar) tensor product
model <- spm(y ~ f(x1, x2) + z)
# ssanova (gss) additive
model <- ssanova(y ~ x1 + x2 + z)
# ssanova (gss) tensor product
model <- ssanova(y ~ x1 * x2 + z)
We draw \(M=1000\) replications of size \(n=1000\) from each DGP, compute
the MSE for each model, and report the median value of each model’s MSE
relative to that for crs
over the \(M\) replications in Table
2 below.
Additive DGP (\(\cos(2\pi x_1)+\sin(2\pi x_2)+z\)) | |||||||
gam | spm | ssanova | |||||
\(\sigma\) | (add) | (int) | (by) | (add) | (te) | (add) | (te) |
\(0.25\) | 0.60 | 1.57 | 2.11 | 0.64 | 1.71 | 0.62 | 0.70 |
\(0.50\) | 0.57 | 0.94 | 1.56 | 0.57 | 1.24 | 0.57 | 0.65 |
\(1.00\) | 0.55 | 0.83 | 1.44 | 0.51 | 0.95 | 0.54 | 0.65 |
\(2.00\) | 0.52 | 0.75 | 1.35 | 0.49 | 0.72 | 0.51 | 0.60 |
Additive DGP (\(\cos(4\pi x_1)+\sin(4\pi x_2)+z\)) | |||||||
gam | spm | ssanova | |||||
\(\sigma\) | (add) | (int) | (by) | (add) | (te) | (add) | (te) |
\(0.25\) | 0.72 | 1.07 | 1.88 | 0.77 | 5.76 | 0.65 | 0.72 |
\(0.50\) | 0.54 | 0.94 | 1.70 | 0.63 | 2.50 | 0.58 | 0.65 |
\(1.00\) | 0.52 | 0.90 | 1.65 | 0.57 | 1.53 | 0.57 | 0.62 |
\(2.00\) | 0.53 | 0.90 | 1.64 | 0.56 | 1.17 | 0.55 | 0.63 |
Multiplicative DGP (\(\cos(2\pi x_1)\times \sin(2\pi x_2)\times z\)) | |||||||
gam | spm | ssanova | |||||
\(\sigma\) | (add) | (int) | (by) | (add) | (te) | (add) | (te) |
\(0.25\) | 228.82 | 111.54 | 0.78 | 229.05 | 111.41 | 229.17 | 112.97 |
\(0.50\) | 95.18 | 46.93 | 1.09 | 95.26 | 46.81 | 95.31 | 47.46 |
\(1.00\) | 30.23 | 15.50 | 1.16 | 30.25 | 15.55 | 30.25 | 15.60 |
\(2.00\) | 9.61 | 5.39 | 1.14 | 9.60 | 5.49 | 9.59 | 5.32 |
Multiplicative DGP (\(\cos(4\pi x_1)\times \sin(4\pi x_2)\times z\)) | |||||||
gam | spm | ssanova | |||||
\(\sigma\) | (add) | (int) | (by) | (add) | (te) | (add) | (te) |
\(0.25\) | 93.39 | 44.36 | 0.62 | 93.52 | 51.31 | 93.58 | 54.14 |
\(0.50\) | 30.05 | 14.66 | 0.64 | 30.09 | 16.85 | 30.11 | 17.71 |
\(1.00\) | 9.92 | 5.28 | 0.74 | 9.93 | 5.92 | 9.94 | 6.56 |
\(2.00\) | 3.40 | 2.26 | 0.86 | 3.39 | 2.50 | 3.39 | 3.37 |
Multiplicative DGP (radial) | |||||||
gam | spm | ssanova | |||||
\(\sigma\) | (add) | (int) | (by) | (add) | (te) | (add) | (te) |
\(0.25\) | 89.66 | 2.18 | 1.16 | 89.07 | 2.29 | 89.60 | 2.21 |
\(0.50\) | 31.35 | 1.27 | 1.30 | 31.18 | 1.29 | 31.35 | 1.21 |
\(1.00\) | 12.65 | 1.19 | 1.72 | 12.56 | 1.09 | 12.68 | 1.08 |
\(2.00\) | 4.49 | 1.10 | 1.82 | 4.44 | 0.88 | 4.51 | 0.99 |
For the mgcv routine gam
, a referee noted that ?choose.k
could be
consulted where it suggests using gam.check()
to help guide the
appropriate number of knots and suggested non-stochastic values of k=5
for the additive DGP with \(2\pi\) and k=10
for the remaining DGPs
(these values were used in Table 2 for gam
via the
argument k=k
rather than the default values, while crs
uses
stochastic values for the basis and all smoothing parameters).
When the DGP is additive, it is clear that the smoothing splines that
presume additivity are more efficient than crs
for this DGP. Note that
this is an oracle type of property in the sense that the user does not
in general have knowledge of the true DGP, but if they did and used this
information their model would naturally be more efficient than a method
that did not (we elaborate further on this below). The regression spline
implementation in the crs package does not presume this knowledge when
basis="auto"
is used (the default) though it allows for the
possibility. However, were one to use the incorrect smoothing spline
model when the DGP is additive (e.g. assume interaction when it is not
present), one could do worse than crs
as can be seen in Table
2. But note that when the DGP is non-additive, crs
appears
to be more efficient than the smoothing spline approaches even when they
employ, say, tensor basis functions to model the non-additive
relationship and are adapted to admit the presence of the binary factor
predictor (except gam
which, as noted above, uses a number of knots
k
that is optimized for these DGPs, rather than the defaults which
were used for all other methods). From a minimax perspective, the
regression spline implementation in the crs package performs
relatively well overall. Apart from selecting the number of knots for
gam
as described above, default settings were used throughout and
therefore it is likely that efficiency could be improved in some cases
by further tuning (this would apply to crs
as well, naturally).
To elaborate further on the issue of tuning and efficiency, if instead
of using defaults for crs
we were to assume additivity as done for its
peers in columns 2, 5, and 7 in Table 2 (options
basis="additive"
, kernel=FALSE
), then for the additive DGP in Table
2 the first row would instead be 0.94, 2.43, 3.29, 1.00,
2.69, 0.98, and 1.08, and we observe that the efficiency of crs
improves substantially even when using a stochastic choice of the
B-spline degree and number of knots, while the other methods use
non-stochastic choices for the number of knots that are constant over
all \(M=1,000\) Monte Carlo replications. And if we followed the lead of
the anonymous referee who kindly suggested using non-stochastic values
for the number of knots (i.e. k=5
) for gam
in mgcv based on
replications from the additive DGP used for row 1 of Table
2, we might choose non-stochastic values degree=c(4,4)
and
segments=c(3,3)
based on cross-validation and the first row of Table
2 would instead be 1.18, 3.09, 4.19, 1.25, 3.42, 1.22, and
1.36. Furthermore, if in addition we used the option prune=TRUE
enabling post-estimation pruning of the model, the first row of Table
2 would instead be 1.48, 3.91, 5.35, 1.59, 4.28, 1.53, and
1.72 (note that this can only be used in conjunction with the option
kernel=FALSE
). The point to be made is that indeed efficiency can be
improved in some cases by tuning of smoothing parameters and choice of
the basis, and this also holds for crs
itself.
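A sketch of the two settings just described, using hypothetical data drawn from the additive DGP above:

## Impose additivity and indicator bases for z, mirroring the assumptions
## of the peer methods in columns 2, 5, and 7 of Table 2
model.add <- crs(y ~ x1 + x2 + z, basis = "additive", kernel = FALSE)
## Additionally enable post-estimation pruning of superfluous bases
model.add.prune <- crs(y ~ x1 + x2 + z, basis = "additive", kernel = FALSE,
                       prune = TRUE)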
A range of demonstrations is available via demo(), including
radial_persp
: R code for fitting and plotting a 3D perspective
plot for the ‘radial’ function.
radial_rgl
: R code for fitting and generating a 3D real-time
rendering plot for the ‘radial’ function using OpenGL (requires the
rgl package (Adler and D. Murdoch 2012)).
sine_rgl
: R code for fitting and generating a 3D real-time
rendering plot for a product sine function using OpenGL.
radial_constrained_mean
: R code for constrained radial function
estimation.
cqrs.R
: R code for fitting and plotting quantile regression
splines.
There exist a number of datasets including
cps71
: Canadian cross-section wage data consisting of a random
sample taken from the 1971 Canadian Census Public Use Tapes for male
individuals having common education (grade 13). There are 205
observations in total (contributed by Aman Ullah).
Engel95
: British cross-section data consisting of a random sample
taken from the British Family Expenditure Survey for 1995. The
households consist of married couples with an employed
head-of-household between the ages of 25 and 55 years. There are
1655 household-level observations in total (contributed by Richard
Blundell).
wage1
: Cross-section wage data consisting of a random sample taken
from the U.S. Current Population Survey for the year 1976. There are
526 observations in total (contributed by Jeffrey Wooldridge).
We have tried to overload the crs
method and associated S3 plot
method to accommodate the needs of a range of users and to automate a
number of routine tasks. Warning messages are produced where possible to
guide the user towards the most appropriate choice of settings. User
specified weights can be provided to allow for weighted least squares
and weighted cross-validation. In addition, we have a function crsiv
that implements instrumental variable regression splines that may be of
interest to some users (see ?crsiv
for details). Furthermore, quantile regression splines are supported by setting tau= to a scalar in the range \((0,1)\). Finally, we would like to direct the reader to the vignettes
crs
, crs_faq
, and spline_primer
for further examples, frequently
asked questions, and background information.
Racine would like to gratefully acknowledge support from the Natural
Sciences and Engineering Research Council of Canada
(http://www.nserc.ca), the Social Sciences and Humanities Research
Council of Canada (http://www.sshrc.ca), and the Shared Hierarchical
Academic Research Computing Network (http://www.sharcnet.ca). We would
also like to gratefully acknowledge Sébastien Le Digabel and members of
the NOMAD
project for their contributions to the FOSS community.
Finally, we are indebted to the editors and two anonymous referees for
their invaluable comments and suggestions.
For attribution, please cite this work as
Nie, "The crs Package: Nonparametric Regression Splines for Continuous and Categorical Predictors", The R Journal, 2012
BibTeX citation
@article{RJ-2012-012, author = {Nie, Zhenghua and Racine, Jeffrey S.}, title = {The crs Package: Nonparametric Regression Splines for Continuous and Categorical Predictors}, journal = {The R Journal}, year = {2012}, note = {https://rjournal.github.io/}, volume = {4}, issue = {2}, issn = {2073-4859}, pages = {48-56} }