A package is introduced that provides the weighted smooth backfitting estimator for a large family of popular semiparametric regression models. This family is known as generalized structured models, comprising, for example, generalized varying coefficient model, generalized additive models, mixtures, potentially including parametric parts. The kernel-based weighted smooth backfitting belongs to the statistically most efficient procedures for this model class. Its asymptotic properties are well-understood thanks to the large body of literature about this estimator. The introduced weights allow for the inclusion of sampling weights, trimming, and efficient estimation under heteroscedasticity. Further options facilitate easy handling of aggregated data, prediction, and the presentation of estimation results. Cross-validation methods are provided which can be used for model and bandwidth selection.
The classes of generalized structured models (GSM) of (Mammen and Nielsen 2003),
structured additive regression models (Brezger et al. 2005), and semiparametric
separable models (Rodrı́guez-Poó et al. 2003) are all devoted to harmonizing the
fundamental aspects of flexibility, dimensionality and interpretability
(c.f. also Stone 1986) for multidimensional regression. In some cases, the
particular structure is derived from pure theory, sometimes from
empirical knowledge, or it is chosen data-adaptively. The epithet
‘structured’ underlines the explicit modeling of the structure of a
regression in order to distinguish it from fully automatic black-box
regression or prediction. (Mammen and Nielsen 2003) define for response
Since (Hastie and Tibshirani 1990) introduced their backfitting algorithm, additive models have become quite popular in statistics, particularly in biometrics, technometrics, and environmetrics. (Opsomer and Ruppert 1997) and (Opsomer 2000) derived asymptotic theory for that classical backfitting estimator with kernel smoothing. (Mammen et al. 1999) developed asymptotic theory for a modified version, the smooth backfitting (SB) estimator, under weaker assumptions on the data like the allowance for strong correlation of the covariates. (Mammen and Nielsen 2003) extended this method to the general GSM class ((1)), and (Roca–Pardiñas and Sperlich 2010) proposed a common algorithm for it. Many extensions have been developed, procedures for bandwidth selection (e.g., Mammen and Park 2005), quantile regression (Lee et al. 2010), and further asymptotic theory for particular cases (see e.g., Yu et al. 2008 for GAM). Most recent contributions extend SB to additive inverse regression (Bissantz et al. 2016), proportional hazards (Hiabu et al. 2020), or regression with error-in-variables (Han and Park 2018). All SB procedures and their theory are kernel-based.
The main advantage of SB is, apart from its excellent numerical performance proven by Nielsen and Sperlich (2005) and (Roca–Pardiñas and Sperlich 2010), compared to the classical backfitting, that there exists a comprehensive literature that studies its statistical behavior and underlying assumptions. It provides the exact and complete asymptotic theory of SB, such that today this estimator is well understood. The only drawback has been that so far, there hardly existed an easily available software for this estimator, except the R-package sBF of (Arcagni and Bagnato 2014) for the basic additive model. But due to its complexity, practitioners typically abstain from implementing it themselves. Therefore, the wsbackfit R-package has been developed which provides the weighted SB for all models listed in the next section, including a data-driven bandwidth selector. The package is freely available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=wsbackfit (R Core Team 2016). Thus, this package closes the gap between the huge body of existing and still increasing literature about SB on the one side, and its potential use on the other, providing the necessary software. We hope it is soon extended by procedures for the various, partly above cited, extensions.
It is to be mentioned that there certainly exist R packages for alternative methods to estimate related models. Before briefly discussing some of the most advanced packages, let us mention the reviews of (Binder and Tutz 2008), which reviewed spline-based methods, and (Fahrmeier et al. 2004) which reviewed (spline-based) Bayesian methods.
Maybe the broadest set of models can be handled by the package BayesX (Umlauf et al. 2019). It embraces several well-known regression models such as GAM, generalized additive mixed models (GAMM), generalized geo-additive mixed models (GGAMM), dynamic models, varying coefficient models (VCM), and geographically weighted regression. Besides exponential family regression, BayesX also supports non-standard regression situations such as regression for categorical responses, hazard regression for continuous survival times, and continuous-time multi-state models; see also its support platform http://www.uni-goettingen.de/de/bayesx. It has been created by Brezger et al. (2005) and Kneib et al. (2008).
The R package gam (Hastie 2019)
presents considerable enhancements of the S-PLUS version going back to
Hastie and Tibshirani (1990). It uses classical backfitting to combine different smoothing or
fitting methods, particularly local regression and smoothing splines.
Another powerful package is
mgcv (Wood 2017), which allows
the fitting of generalized additive (mixed) models, with smoothing
parameter estimation done by (restricted) marginal likelihood or
generalized cross-validation, and uses iterated nested Laplace
approximation for fully Bayesian inference. Another powerful and
well-functioning package is
GAMLSS of Stasinopoulos and Rigby (2007). It is
based on penalized likelihood estimation combined with classical
backfitting. While mgcv models the index function, GAMLSS models the
location, scale, and shape functions by additive linear mixed models. It
has been created to tackle many interesting distributions of
Regarding kernel-based methods that consider related or specific cases of ((1)), there are, for example, marginal integration (Linton and Nielsen 1995) for additive interaction models (Sperlich et al. 2002), and local polynomials for smooth varying coefficients (Li and Racine 2010). The latter is implemented in the np package (Hayfield and Racine 2008), and turned out to be very competitive when compared to the before-mentioned spline-based packages (Sperlich and Theler 2015).
The aim is to estimate a GSM as introduced in ((1)). In the
moment of estimation, one has to be specific about
As said, the SB idea for GSM, together with some general results about
its asymptotic behavior, was introduced by (Mammen and Nielsen 2003); specific
algorithms and their implementation were introduced and studied in
Roca–Pardiñas and Sperlich (2010). Detailed information about the implementation is given in a
technical report (Roca-Pardiñas and Sperlich 2008). The implemented algorithms in wsbackfit
are modified versions to speed up the procedure by binning techniques
and a combination of parametric linear with local constant kernel
regression; see below. The models considered in this package are
semiparametric in the sense that they contain parametric as well as
nonparametric components. Most of them could be seen as extensions of a
generalized linear model (GLM) of type
Regarding the choice of
Roca–Pardiñas and Sperlich (2010) showed that the estimation procedure for all these models can be
summarized in one common feasible minimization problem, namely
We call our procedure ‘weighted smooth backfitting’ to emphasize that
the user has the option to include a vector of additional weights. As
said, by putting the usual kernel weights apart, part of the weighting
comes from local scoring in order to account for the link function
The models that package wsbackfit can presently estimate are: a
partial linear GAM, a generalized partial linear varying coefficient
model (GVCM), and combinations of them. The first one is a
generalization of a GLM by replacing some linear components with
additive nonparametric functions,
Finally, one can include all together, i.e., nonparametric additive
terms, nonparametric varying coefficients, and a parametric (linear)
part like
Cross-validation (CV) can be used for model selection in general. However, for the sake of presentation we describe here our implementation in the context of bandwidth selection.
All nonparametric estimates of the
Given sample
Gaussian | |
Binary | |
Poisson |
Generally spoken, unless bandwidths are fixed by the user, they can be
selected as
In order to simplify the problem, we provide the following three options to the user:
the user just prefixes all bandwidths that shall be used as final bandwidths;
the user prefixes starting values for the bandwidths, say
the user only prefixes a bandwidth grid for a scalar
The choice of prior
For the sake of presentation, we explain more details about the CV
implementation only along option 3; as for option 2, it works
analogously. Moreover, suppose the user chooses all bandwidths by option
3. As said, in option 3,
Unfortunately, a naive implementation of the leave-one-out CV technique
would still imply a high computational cost as for each potential value
of
We conclude with two remarks. First, as said, the wsbackfit package
also allows the user to specify the bandwidths seq(0.01, 0.99, length = 30)
but can optionally be set by
the user via option bw.grid
, see below. For option 2, the algorithm
looks for the optimal
As explained above, smooth backfitting is solved by an iterative
procedure to solve ((3)). When the link function is the
identity, then there is only the loop running over the different
additive components. If the link is not the identity, then there is also
an outer loop carrying out the local scoring iteration. Then, within
each of such outer iteration steps, the formerly mentioned loop of the
smooth backfitting is conducted. Both loops are triggered by two
factors, the tolerance
The inner loop stops when for iteration
Although we implemented some modifications and simplifications like the
above described k-fold CV, or the combination of parametric linear with
local constant estimation, for details see the Section
3.4, SB in high dimensions might still imply a high
computational cost. Therefore, as already indicated above, we
implemented the kernel smoothing inside the wsbackfit package using
binning-type procedures. These are used throughout, also for CV when
selecting bandwidths. The key idea of this binning is to reduce the
number of kernel evaluations (exploiting the fact that many of these are
nearly identical) by replacing the original data set (composed of
Note that for minimizing ((3)), we need to solve some univariate integrals over nonparametric estimates which is therefore done numerically. This is calculated by the Simpson rule with 51 equidistant grid points over the entire range of the respective covariate, i.e., from its smallest to the largest observation. Simulations showed that finer grids led to no improvement of the final estimates.
When we introduced the class of GSM in Section 2, we
restricted our presentation to models that can be estimated by the here
introduced package. At this stage – note that the package design
invites further contributions of SB-based methods – the package is able
to estimate model ((7)) only for some link functions, and all
Now recall models ((5)) and ((6)), which probably
represent the most common cases in practice. For these models, consider
the identification of the
The package is composed of several functions that enable users to fit the models with the methods described above. Table 1 provides a summary of the presently available functions.
wsbackfit functions | |
---|---|
sback |
Main function for fitting generalized structures models using smooth backfitting. |
sb |
Function used to indicate the nonparametric terms and varying coefficient terms in the sback() formula. |
plot |
Function that provides plots of sback objects produced by sback() . |
print |
The default print method for an sback() object. |
summary |
Function that takes a fitted object produced by sback() and produces various useful summaries from it. |
predict |
Function that provides predictions (e.g., fitted values and nonparametric terms) based on an sback object for a new data set (newdata ). |
residuals |
Returns residuals of sback objects produced by sback() . Deviance, Pearson, working and response residuals are available. |
summary |
Summary for objects of class sback . |
The package has been designed similarly to other regression packages.
The main function is sback
. It fits GSM with SB, and creates an object
of class sback
. Numerical and graphical summaries of the fitted model
can be obtained by using print
, summary
, and plot
, implemented for
sback
objects. Moreover, function predict
allows obtaining
predictions (e.g., fitted values and nonparametric terms) for data
different from those used for estimation, called, therefore, newdata
.
The main arguments of the sback
function are listed in Table
2, and the list of outputs is given in Table 3.
sback arguments – input values |
|
---|---|
formula |
A formula object specifying the model to be fitted. |
data |
Data frame representing the data and containing all needed variables. |
offset |
An optional numerical vector containing a priori known components to be included in the linear predictor during fitting. Default is zero. |
weights |
An optional numeric vector of ‘prior weights’ to be used in the fitting process. By default, the weights are set to one. |
kernel |
A character specifying the kernel function. Implemented are: Gaussian and Epanechnikov. By default "Gaussian" . |
bw.grid |
Numeric vector; a grid for searching the bandwidth factor seq(0.01,0.99,length = 30) |
c.bw.factor |
logical; indicates whether the common factor scheme for bandwidth selection proposed by (Han and Park 2018) is performed. If TRUE , and provided the user has specified the (marginal) bandwidths for all nonparametric functions, say FALSE . |
KfoldCV |
Number of cross-validation folds to be used for either (1) automatically selecting the optimal bandwidth (in the sequence given in argument bw.grid ) for each nonparametric function; or (2) automatically selecting the optimal common bandwidth factor (see argument c.bw.factor ). Default is 5. |
kbin |
An integer value specifying the number of binning knots. Default is 30. |
family |
A character specifying the distribution family. Implemented are: Gaussian, Binomial and Poisson. In all cases, the link function is the canonical one. By default "gaussian" . |
sback output values |
|
---|---|
call |
The matched call. |
formula |
The original supplied formula argument. |
data |
The original supplied data argument. |
weights |
The original supplied weights argument. |
offset |
The original supplied offset argument. |
kernel |
The original supplied kernel argument. |
kbin |
The original supplied kbin argument. |
family |
The original supplied family argument. |
effects |
Matrix with the estimated nonparametric functions (only the nonlinear component) for each covariate value in the original supplied data. |
fitted.values |
A numeric vector with the fitted values for the supplied data. |
residuals |
A numeric vector with the deviance residuals for the supplied data. |
h |
A numeric vector of the same length as the number of nonparametric functions, with the bandwidths used to fit the model. |
coeff |
A numeric vector with the estimated regression coefficients. This vector contains the estimates of the regression coefficients associated with the parametric part of the model (if present) as well as the linear components of the nonparametric functions. |
err.CV |
Matrix with the cross-validated error (deviance) associated with the sequence of tested bandwidths (those provided in argument bw.grid in function sback ). |
We call function sback
by
sback(formula, data, offset = NULL, weights = NULL,
kernel = c("Gaussian", "Epanechnikov"),
bw.grid = seq(0.01, 0.99, length = 30), c.bw.factor = FALSE,
KfoldCV = 5, kbin = 30,
family = c("gaussian", "binomial", "poisson"))
The argument formula
corresponds to the model for the conditional mean
function ((7)). This formula is similar to that used for
glm
, except that nonparametric functions can be added to the additive
predictor by means of function sb
(for details, see Table
4). For instance, specification y ~ x1 + sb(x2, h = -1)
assumes a parametric effect of x1
(with x1
either numerical or
categorical), and a nonparametric effect of x2
. Varying coefficient
terms get incorporated similarly. For example, y ~ sb(x1, by = x2)
indicates that the coefficient(s) of x2
depend nonparametrically on
x1
. In this case, both, x1
and x2
should be numerical predictors.
More examples are provided further below.
sb arguments |
|
---|---|
x1 |
The univariate predictor. |
by |
Numeric predictor of the same dimension as x1 . If present, the coefficients of this predictor depend nonparametrically on x1 , i.e., a varying coefficient term. |
h |
Bandwidth for this term. If h = -1 , the bandwidth is automatically selected using k-fold cross-validation. A value of 0 would indicate a linear fit. By default -1 . |
The bandwidths associated with the nonparametric functions are specified
inside sb
through argument h
(see Table 4), either as
final bandwidth (by setting h =
h =
c.bw.factor = TRUE
; option 2), or by demanding
a CV-bandwidth which is the product of a common factor bw.grid
) times the scale h = -1
; option 3); recall Section 3. Actually,
the user has even four options. These are specified through argument h
of function sb
, where option 4 is a mixture of options 1 and 3 by
setting h =
sb
for some nonparametric functions, and
h = -1
for the others. The number KfoldCV
, with
In Section 2, recall expression ((3))
with subsequent discussion, we saw that the SB algorithm is implemented
with potentially adjusted dependent variable offset
and weights
allow the user to modify them on
top of what is done automatically (e.g., due to the local scoring). More
specifically, the vector offset
is subtracted from weights
provided by the user. For an example in which this option is
used to improve the efficiency of the estimators, see Section
6.1.
The desired smoothing kernel, either Epanechnikov or Gaussian, is
specified through the argument kernel
(by default Gaussian). With
kbin
, the user indicates the number of binning knots (denoted as family
specifies the
conditional distribution of the response variable. So far, the user can
select among Gaussian, Poisson, and binary. In all cases, the canonical
link function is considered. Finally, recall that predictions for data
different from those being used for estimation can be obtained by means
of function predict
, specifying the new dataset in argument newdata
.
This section reports the results of a small simulation. Two different scenarios are considered, namely
Additive model. Covariates
where the scaling factor in the Bernoulli case is used to control the signal-to-noise ratio.
Varying coefficient model. Here, covariates
Results for Scenarios I and II are shown in Figure 1 and
Figure 2, respectively. In both cases, the true and
estimated functions are centered before plotting to make results
comparable. All results are based on a sample size of
|
|
|
|
|
|
|
|
The last section is dedicated to examples with generalized additive and/or varying coefficient, partial linear models with and without heteroscedasticity, given different link functions. In fact, we have examples for each of the three link functions presently available. Among other things, it is shown how the optional weighting can be used to improve efficiency. While some examples are simulated, others illustrate applications from biometrics and health. Finally, we also show how to create useful graphics for interpreting the estimates.
We start with the presentation of a simulation example for estimating an
additive model under heteroscedasticity. Consider the situation where
the variance is a function of a dummy variable, i.e., one faces two
noise levels. This is estimated and afterward used to improve the
efficiency of the mean regression. As explained in the above sections,
this requires a three-step procedure: first estimate the mean model,
then the variance function, and finally re-estimate the mean model but
including the inverse of the variance as an additional weight. Consider
model
R> library(wsbackfit)
R> set.seed(123)
R> # Define the data generating process
R> n <- 1000
R> x1 <- runif(n)*4-2
R> x2 <- runif(n)*4-2
R> x3 <- runif(n)*4-2
R> x4 <- runif(n)*4-2
R> x5 <- as.numeric(runif(n)>0.6)
R> g1 <- 2*sin(2*x1)
R> g2 <- x2^2
R> g3 <- 0
R> g4 <- x4
R> mu <- g1 + g2 + g3 + g4 + 1.5*x5
R> err <- (0.5 + 0.5*x5)*rnorm(n)
R> y <- mu + err
R> df_gauss <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4, x5 = as.factor(x5), y = y)
R> # Fit the model with a fixed bandwidth for each covariate
R> m0 <- sback(formula = y ~ x5 + sb(x1, h = 0.1) + sb(x2, h = 0.1) +
+ sb(x3, h = 0.1) + sb(x4, h = 0.1), kbin = 30, data = df_gauss)
A numerical summary of the fitted model can be obtained by calling
print.sback()
or summary.sback()
with shortcuts print()
and
summary()
.
R> summary(m0)
Generalized Smooth Backfitting/wsbackfit:
Call: sback(formula = y ~ x5 + sb(x1, h = 0.1) + sb(x2, h = 0.1) +
sb(x3, h = 0.1) + sb(x4, h = 0.1), data = df_gauss, kbin = 30)
Sample size: 1000
Bandwidths used in model:
Effect h
sb(x1, h = 0.1) 0.1
sb(x2, h = 0.1) 0.1
sb(x3, h = 0.1) 0.1
sb(x4, h = 0.1) 0.1
Linear/Parametric components:
Intercept x1 x2 x3 x4 x51
1.342678178 0.346506090 -0.040989607 -0.005250654 1.010634908 1.327833794
The output obtained from summary
(corresponding to the prior sback
call) includes the bandwidth of each nonparametric function, the
parameters of the parametric part (here the intercept and
To complement the numerical results, the wsbackfit package also
provides graphical outputs by the use of plot
. In particular, it
provides the plots of the estimated nonparametric functions. Figure
3 shows the figures that appear as a result of the
following code. We note that through argument select
, the user can
specify the model term to be plotted and use ylim
to indicate the
range for the y-axis. This, however, is optional. Alternatively, the
program provides plots that automatically explore the variation of the
estimates.
R> op <- par(mfrow = c(2,2))
R> plot(m0, select = 1, ylim = c(-2.5,2.5))
R> plot(m0, select = 2, ylim = c(-2.5,2.5))
R> plot(m0, select = 3, ylim = c(-2.5,2.5))
R> plot(m0, select = 4, ylim = c(-2.5,2.5))
R> par(op)
If the user is interested in plotting separately each component
(composed
is to be set
to FALSE
. The result is shown in Figure 4.
R> op <- par(mfrow = c(2,2))
R> plot(m0, select = 1, composed = FALSE, ylim = c(-2.5,2.5))
R> plot(m0, select = 2, composed = FALSE, ylim = c(-2.5,2.5))
R> plot(m0, select = 3, composed = FALSE, ylim = c(-2.5,2.5))
R> plot(m0, select = 4, composed = FALSE, ylim = c(-2.5,2.5))
R> par(op)
Note that both summary
and plot
make use of the information
contained in the m0
object.
R> names(m0)
[1] "call" "formula" "data" "weights"
[5] "offset" "kernel" "kbin" "family"
[9] "effects" "fitted.values" "residuals" "h"
[13] "coeff" "err.CV"
This is the list of outputs created by sback
. A detailed description
of what each component of this list contains was given in Table
3. The user can access this information explicitly and
individually, may it be to create its own plots or for further
reporting.
As a next step in the analyses of our example, we use the fitted model
to estimate the variance. In our example, this is considered to be a
function of the binary covariate
R> resid <- y - m0$fitted.values
R> sig0 <- var(resid[x5 == 0])
R> sig1 <- var(resid[x5 == 1])
The third and final step is to re-estimate the mean model for efficiency reasons with weights that are the inverse of the estimated variance. The code, including the summary, is
R> w <- x5/sig1 + (1-x5)/sig0
R> m1 <- sback(formula = y ~ x5 + sb(x1, h = 0.1) + sb(x2, h = 0.1) +
+ sb(x3, h = 0.1) + sb(x4, h = 0.1),
+ weights = w, kbin = 30, data = df_gauss)
R> summary(m1)
Generalized Smooth Backfitting/wsbackfit:
Call: sback(formula = y ~ x5 + sb(x1, h = 0.1) + sb(x2, h = 0.1) +
sb(x3, h = 0.1) + sb(x4, h = 0.1), data = df_gauss, weights = w,
kbin = 30)
Sample size: 1000
Bandwidths used in model:
Effect h
sb(x1, h = 0.1) 0.1
sb(x2, h = 0.1) 0.1
sb(x3, h = 0.1) 0.1
sb(x4, h = 0.1) 0.1
Linear/Parametric components:
Intercept x1 x2 x3 x4 x51
1.31707760 0.32888538 -0.01262394 0.01222234 1.00289877 1.33368035
In the previous fits of this example, we specified all bandwidths used.
For the rest of this example, let us consider the case when we ask the
program to choose the bandwidths via k-fold CV. We do this for all
nonparametric functions, using the following code in which, for the sake
of clarity and presentation, we specify h = -1
although this is
actually the default. For convenience, we also call the summary
command directly:
R> m1cv <- sback(formula = y ~ x5 + sb(x1, h = -1) + sb(x2, h = -1) +
+ sb(x3, h = -1) + sb(x4, h = -1), weights = w, kbin = 30,
+ bw.grid = seq(0.01, 0.99, length = 30), KfoldCV = 5, data = df_gauss)
R> summary(m1cv)
Generalized Smooth Backfitting/wsbackfit:
Call: sback(formula = y ~ x5 + sb(x1, h = -1) + sb(x2, h = -1) + sb(x3,
h = -1) + sb(x4, h = -1), data = df_gauss, weights = w, bw.grid = seq(0.01,
0.99, length = 30), KfoldCV = 5, kbin = 30)
Sample size: 1000
Bandwidths used in model:
Effect h
sb(x1, h = 0.0892) 0.0892
sb(x2, h = 0.0887) 0.0887
sb(x3, h = 0.0907) 0.0907
sb(x4, h = 0.0912) 0.0912
Linear/Parametric components:
Intercept x1 x2 x3 x4 x51
1.31708207 0.32881191 -0.01219359 0.01247328 1.00258427 1.33366258
We do not further discuss the results because their interpretation is the same as before, also because the automatically found data-driven optimal bandwidths are close to what we used as prefixed bandwidths in the former codes. We conclude this section with a brief example in which we specify the bandwidths for some of the nonparametric functions, while for the remaining ones, we let our CV procedure select the bandwidths. For brevity, we skip output and discussion.
R> m2cv <- sback(formula = y ~ x5 + sb(x1, h = 0.1) + sb(x2, h = -1) +
+ sb(x3, h = 0.1) + sb(x4, h = 0.1),
+ weights = w, kbin = 30, KfoldCV = 5, data = df_gauss)
The next example is an application with data studied, among others, in
Roca–Pardiñas and Sperlich (2010). They were taken from a prospective analysis conducted at the
University Hospital of Santiago de Compostela in the North of Spain. A
total of inf=1
) or not post-operative infection (inf=0
), and to see how they
relate to the risk of infection. Such predictive indicators could be
various, but given the previous studies we concentrate on the
pre-operative values of plasma glucose (gluc
) concentration (measured
in mg/dl), and lymphocytes (linf
, expressed as relative counts (in %
of the white blood cell count). The data can be found in wsbackfit
under the name infect
.
R> data(infect)
R> head(infect)
age sex linf gluc diab inf
1 85 2 28 55 2 0
2 38 1 18 56 2 1
3 49 2 29 56 2 1
4 63 2 20 60 2 0
5 91 2 17 62 2 0
6 26 2 22 66 2 0
In the original studies, it was controlled for other covariates like
age
(in years) and sex
(coded as 1 = male; 0 = female). For
illustrative purposes, we limit our analysis to the investigation of the
association of the risk of post-operative infections inf
with the
predictors linf
and gluc
, putting all other covariates aside. It is
well known that the effect of linf
on inf
varies strongly with the
concentration of gluc
. Therefore, one may think of a generalized
varying coefficient model of type
R> data(infect)
R> # Generalized varying coefficient model with binary response
R> m2 <- sback(formula = inf ~ sb(gluc, h = 10) + sb(gluc, by = linf, h = 10),
+ data = infect, kbin = 15, family = "binomial")
R> summary(m2)
Generalized Smooth Backfitting/wsbackfit:
Call: sback(formula = inf ~ sb(gluc, h = 10) + sb(gluc, by = linf,
h = 10), data = infect, kbin = 15, family = "binomial")
Sample size: 2312
Bandwidths used in model:
Effect h
sb(gluc, h = 10) 10
sb(gluc, by = linf, h = 10) 10
Linear/Parametric components:
Intercept gluc linf gluc:linf
-1.4155401353 0.0068313875 -0.0346648080 -0.0000456441
Note that this model, recall (14), contains, in addition to
the constant term (intercept), the main linear effects of gluc
linf
gluc
and linf
R> op <- par(mfrow = c(1,3))
R> plot(m2, composed = FALSE, ask = FALSE, cex.main = 2, cex = 2, cex.lab = 1.5,
+ cex.axis = 2)
R> par(op)
R> op <- par(mfrow = c(1,3))
R> plot(m2, composed = FALSE, ask = FALSE, cex.main = 2, cex = 2, cex.lab = 1.5,
+ cex.axis = 2)
R> par(op)
In Figure 5, you see the functionals of the nonparametric
additive effect of gluc
on the index
R> # Dataframe for prediction (and plotting)
R> ngrid <- 30
R> gluc0 <- seq(50,190, length.out=ngrid)
R> linf0 <- seq(0,45, length.out=ngrid)
R> infect_pred <- expand.grid(gluc = gluc0, linf = linf0)
R> m2p <- predict(m2, newdata = infect_pred)
R> n <- sqrt(nrow(infect_pred))
R> Z <- matrix(m2p$pfitted.values, n, n)
R > filled.contour(z = Z, x = gluc0, y = linf0,
+ xlab = "Glucose (mg/dl)", ylab = "Lymphocytes (%)",
+ col = cm.colors(21))
As can be seen from Figure 6, high levels of gluc
increase the post-operative infection risk, but higher linf
values can
mitigate this effect significantly.
Let us now consider a simulated example that illustrates the use of
Poisson regression with a nontrivial use of option offset
. We simulate
data where each subject may have different levels of exposure to the
event of interest. As explained in the above sections, this can be
handled with the offset
option. More specifically, for the level of
exposure
R> set.seed(123)
R> # Generate the data
R> n <- 1000
R> x1 <- runif(n,-1,1)
R> x2 <- runif(n,-1,1)
R> eta <- 2 + 3*x1^2 + 5*x2^3
R> exposure <- round(runif(n, 50, 500))
R> y <- rpois(n, exposure*exp(eta))
R> df_poiss <- data.frame(y = y, x1 = x1, x2 = x2)
R> # Fit the model
R> m4 <- sback(formula = y ~ sb(x1, h = 0.1) + sb(x2, h = 0.1),
+ data = df_poiss, offset = log(exposure),
+ kbin = 30, family = "poisson")
R> summary(m4)
Generalized Smooth Backfitting/wsbackfit:
Call: sback(formula = y ~ sb(x1, h = 0.1) + sb(x2, h = 0.1), data = df_poiss,
offset = log(exposure), kbin = 30, family = "poisson")
Sample size: 1000
Bandwidths used in model:
Effect h
sb(x1, h = 0.1) 0.1
sb(x2, h = 0.1) 0.1
Linear/Parametric components:
Intercept x1 x2
3.00099626 0.09698672 3.06092318
As for the previous examples, a graphical output can be obtained using
the plot
function like in the following code. The results are shown in
Figure 7, namely the additive components.
R> op <- par(mfrow = c(1,2))
R> plot(m4, ask = FALSE)
R> par(op)
We have first given a general introduction into the class of Generalized Structures Models, together with the powerful kernel-based smooth backfitting estimator to fit the members of this model class. This was accompanied by a (certainly incomplete) literature review and closed with a small review and discussion of existing methods and software for similar and related models. Except for the varying coefficient model estimator in the np package, they are all based on splines. We concluded that, while there is a huge body of literature on SB and its advantages, it is hardly used in practice due to the lack of software. The wsbackfit package intends to close this gap.
Next, we provided some insight into the weighted SB and the objective
function that is minimized by our algorithm. This allowed us to better
explain the users’ options like weights
and offset
. We outlined
which models can be estimated by the presently available package. The
description of the procedure was complemented by a section on the
implemented CV, bandwidth choice, binning, convergence, and
identification issues to clarify the location and scaling of the
resulting estimates.
The package description has been kept condense but its use has been illustrated along several examples that cover some of the estimable models. They comprise the use of all options provided. Moreover, the numerical examples give an idea of the estimators’ performance. For more details, we recommend consulting the cited articles dealing with the particular models.
We believe that this package is an important enrichment of the existing methods with many useful applications of flexible data analysis and prediction. It can almost straightforwardly be used for testing (Cadarso-Suárez et al. 2006; Mammen and Sperlich 2021) or studying the heterogeneity of causal effects (Benini and Sperlich 2021) and many other interesting applications. The next challenge will be the extension of this package to cover the analysis of more complex data (Jeon and Park 2020). The package is not just open for extensions. We explicitly invite people to contribute.
sBF, wsbackfit, BayesX, gam, mgcv, GAMLSS, GAMBoost, np
Bayesian, Econometrics, Environmetrics, MixedModels, Spatial
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Roca-Pardiñas, et al., "Package wsbackfit for Smooth Backfitting Estimation of Generalized Structured Models", The R Journal, 2021
BibTeX citation
@article{RJ-2021-042, author = {Roca-Pardiñas, Javier and Rodríguez-Álvarez, María Xosé and Sperlich, Stefan}, title = {Package wsbackfit for Smooth Backfitting Estimation of Generalized Structured Models}, journal = {The R Journal}, year = {2021}, note = {https://rjournal.github.io/}, volume = {13}, issue = {1}, issn = {2073-4859}, pages = {314-334} }