Package wsbackfit for Smooth Backfitting Estimation of Generalized Structured Models

A package is introduced that provides the weighted smooth backfitting estimator for a large family of popular semiparametric regression models. This family is known as generalized structured models, comprising, for example, generalized varying coefficient model, generalized additive models, mixtures, potentially including parametric parts. The kernel based weighted smooth backfitting belongs to the statistically most efficient procedures for this model class. Its asymptotic properties are well understood thanks to the large body of literature about this estimator. The introduced weights allow for the inclusion of sampling weights, trimming, and efficient estimation under heteroscedasticity. Further options facilitate an easy handling of aggregated data, prediction, and the presentation of estimation results. Cross-validation methods are provided which can be used for model and bandwidth selection.


Introduction and brief review
The classes of generalized structured models (GSM) of Mammen and Nielsen (2003), structured additive regression models (Brezger et al., 2005) and semiparametric separable models (Rodríguez-Poó et al., 2003) are all devoted to harmonize the fundamental aspects of flexibility, dimensionality and interpretability (c.f. also Stone, 1986) for multidimensional regression. In some cases, the particular structure is derived from pure theory, sometimes from empirical knowledge, or it is chosen data-adaptively. The epithet 'structured' underlines the explicit modeling of the structure of a regression in order to distinguish it from fully automatic black box regression or prediction. Mammen and Nielsen (2003) define for response Y with covariate vectors (Z, X, T, U) the GSM class by with Λ, G, S parametric known functions, β, δ unknown finite dimensional parameter, g(·), s(·) unknown nonparametric functions, and , ε fulfilling E[ |Z, X] = E[ε|Z, X] = 0. While Λ is a transformation with potentially unknown parts which can be estimated along Linton et al. (2008), G and S are link functions that also determine further structures. For instance, for a partial linear varying coefficient model (Park et al., 2015) with Z = (Z 1 , · · · , Z d , Zκ), X = (X 1 , ·, X d ), where Z 1 to Z d and X 1 to X d are scalars, and Zκ a vector of the length of β, function G defines with index η and a known link function G. You may also allow that some, or all of the X j , j = 1, . . . , d are identical; the same holds for the Z j , etc. Moreover, setting Z j ≡ 1 ∀j with all X j being different, you obtain the generalized additive model (GAM). A detailed discussion on identifiability is provided in Lee et al. (2012).
The main advantage of SB is, apart from its excellent numerical performance proven by Nielsen and Sperlich (2005) and Roca-Pardiñas and Sperlich (2010), in particular compared to the classical backfitting, that there exists a comprehensive literature that studies its statistical behavior and underlying assumptions. It provides the exact and complete asymptotic theory of SB, such that today this estimator is well understood. The only drawback has been that so far, there hardly existed an easily available software for this estimator, except the R-package sBF of Arcagni and Bagnato (2014) for the basic additive model. But due to its complexity, practitioners typically abstain from implementing it themselves. Therefore, the wsbackfit R-package has been developed which provides the weighted SB for all models listed in the next section, including a data driven bandwidth selector. The package is freely available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=wsbackfit (R Core Team, 2016). Thus, this package closes the gap between the huge body of existing and still increasing literature about SB on the one side, and its potential use on the other, providing the necessary software. We hope it is soon extended by procedures for the various, partly above cited, extensions.
It is to be mentioned that there certainly exist R packages for alternative methods to estimate related models. Before briefly discussing some of the most advanced packages, let us mention the reviews of Binder and Tutz (2008) which reviewed spline based methods, and Fahrmeier et al. (2004) which reviewed (spline-based) Bayesian methods.
Maybe the broadest set of models can be handled by the package BayesX (Umlauf et al., 2019). It embraces several well-known regression models such as GAM, generalized additive mixed models (GAMM), generalized geo-additive mixed models (GGAMM), dynamic models, varying coefficient models (VCM), and geographically weighted regression. Besides exponential family regression, BayesX also supports non-standard regression situations such as regression for categorical responses, hazard regression for continuous survival times, and continuous time multi-state models, see also its support platform http://www.uni-goettingen.de/de/bayesx. It has been created by Brezger et al. (2005) and Kneib et al. (2008).
The R package gam (Hastie, 2019) presents considerable enhancements of the S-PLUS version going back to Hastie and Tibshirani (1990). It uses classical backfitting to combine different smoothing or fitting methods, in particular local regression and smoothing splines. Another powerful package is mgcv (Wood, 2017), which allows fitting generalized additive (mixed) models, with smoothing parameter estimation done by (restricted) marginal likelihood or generalized cross validation, and uses iterated nested Laplace approximation for fully Bayesian inference. Another powerful and well functioning package is GAMLSS of Stasinopoulos and Rigby (2007). It is based on penalized likelihood estimation combined with classical backfitting. While mgcv models the index function, GAMLSS models the location, scale and shape functions by additive linear mixed models. It has been created to tackle many interesting distributions of Y . When speaking of likelihood based approaches, one should also mention a method introduced by Tutz and Binder (2006). Their R-package GAMBoost can be used to fit a GAM by likelihood based boosting, suited for a large number of predictors.
Regarding kernel based methods that consider related or specific cases of (1), there are for example marginal integration (Linton and Nielsen, 1995) for additive interaction models (Sperlich et al., 2002), and local polynomials for smooth varying coefficients (Li and Racine, 2010). The latter is implemented in the np package (Hayfield and Racine, 2008) and turned out to be very competitive when compared to the before-mentioned spline-based packages (Sperlich and Theler, 2015).

The models that can be estimated by wsbackfit
The aim is to estimate a GSM as introduced in (1). In the moment of estimation one has to be specific about Λ, G, and S. We concentrate on the popular cases, in particular on those that maintain additivity or a similar separability structure. This way the estimates provide easy an interpretation, and overcome the curse of dimensionality which else is inherited by more complex models. To the best of our knowledge, all existing smooth backfitting methods follow the suggestion of Mammen and Nielsen (2003) to estimate the mean and variance part subsequently, say, first G{· · · }, then S{· · · }. Our implementation follows the suggestion of Roca-Pardiñas and Sperlich (2010) to allow for a (re-)estimation of the mean part (G{· · · }) including weights obtained from the estimation of the variance part (S{· · · }) to potentially increase the efficiency. Note, however, that in nonparametrics it is often not clear to what extent an efficiency gain can be achieved this way, see for example the discussion in Xiao et al. (2003). All proposed methods we are aware of, are two-or three-step procedures similar to what we propose here. The estimation of the variance part is performed in the second step by regressing the squared residuals on (T, U), using the same procedure as for G{· · · }. Therefore it is sufficient to concentrate in the following on the mean regression which can equally well be applied to squared residuals for estimating S{· · · }.
As said, the SB idea for GSM, together with some general results about its asymptotic behavior, was introduced by Mammen and Nielsen (2003); specific algorithms and their implementation were introduced and studied in Roca-Pardiñas and Sperlich (2010). Detailed information about the implementation is given in a technical report (Roca-Pardiñas and Sperlich, 2008). The implementated algorithms in wsbackfit are modified version to speed up the procedure by binning techniques and a combination of parametric linear with local constant kernel regression, see below. The models considered in this package are semiparametric in the sense that they contain parametric as well as nonparametric components. Most of them could be seen as extensions of a generalized linear model (GLM) of type G(Z, β, g(X)) = G η(Z, β, g(X)) = G(g 0 + α t X + β t Z), see for example McCullagh and Nelder (1989). So far, this package does not tackle random effects.
Regarding the choice of G you have first to decide about the link G. For each conditional distribution there exists a canonical one: for the conditional Gaussian distribution this is the identity, for a binary response the Logit (1 + 1/exp(•)) −1 , and for a conditional Poisson it is exp(•). Note that the latter can also be used for Pseudo-Poisson estimation. The choice of G is certainly linked to the specification of Λ which is supposed to be known. Then such transformation of Y can be performed a priori by the practitioner. Therefore, we henceforth suppress Λ to simplify our notation. For Λ entailing unknown parameters consult Linton et al. (2008). Roca-Pardiñas and Sperlich (2010) showed that the estimation procedure for all these models can be summarized in one common feasible minimization problem, namely whereỸ i is the transformed (e.g. by Λ) or linearized (in local scoring if the link is not the identity) response Y i , and W i is a weight. For example, in the generalized additive model with β = 0 we have covariates Z j ≡ 1 for j = 1, . . . , d, W i contains the local scoring weights withỸ i being the accordingly adjusted dependent variable.
is the kernel function. It is well known that asymptotically, the choice of smoothing kernel does not have an important impact, as to a large part the kernel effect is compensated by an adequate bandwidth choice. We allow the user to choose between the Epanechnikov kernel which is asymptotically the most efficient one, and the Gaussian kernel which is popular as it helps to avoid some of the numerical problems that may arise in areas where data are sparse.
We call our procedure 'weighted smooth backfitting' to emphasize that the user has the option to include a vector of additional weights. As said, putting the usual kernel weights apart, part of the weighting comes from local scoring in order to account for the link function G. 2 But independently from the link function, the practitioner might also want to include sampling weights, e.g. when using administrative data, or trimming weights, e.g. for excluding boundary points. A particular case is when additional weights are included to improve the efficiency of your estimators, e.g. to account for the (co-)variance structure. Roca-Pardiñas and Sperlich (2010) estimated in a first step the mean function, afterwards the variance from the squared residuals, and used these in a third step as additional weights when re-estimating the conditional mean. The resulting average mean squared error was substantially smaller than the one of the original estimator which ignored the (co-)variance structure; recall also our discussion at the beginning of this section.
The models that package wsbackfit can presently estimate are: a partial linear GAM, a generalized partial linear varying coefficient model (GVCM), and combinations of them. The first one is a generalization of a GLM by replacing some linear components by additive nonparametric functions, where X 1 to X d are scalars, and Z is the vector of all covariates that are supposed to enter the index function η linearly. The g j are nonparametric functions. 3 It is actually true that this is a special case of the partial linear GVCM of the form (2), obtained by setting Z 1 = Z 2 = · · · = Z d = 1. We list them nonetheless separately, because besides the slightly different implementation, we want the reader to recognize the difference in the modeling approach. First, the GVCM is a generalization of a GLM in Z, as could be seen in (2). And second, here we allow for all the flexibility suggested by Lee et al. (2012) regarding the alterations of covariates X j and Z j . For example, all X 1 , . . . , X d1 with d 1 ≤ d could be the same scalar variable X 1 such that with g j0 unknown constants and g j1 unknown nonparametric functions; or, alternatively, all Z 1 , . . . , Z d1 with d 1 ≤ d could be the same scalar variable Z 1 such that (with g 10 still a constant) Certainly you can have a mixtures of both, as long as the identification conditions of Lee et al. (2012) are fulfilled to guarantee that the model does not suffer from concurvity. This includes the possibility that some variables appear in both sets, X and Z. This could be of particular interest when defining different types of interactions.
Finally, one can include all together, i.e. nonparametric additive terms, nonparametric varying coefficients, and a parametric (linear) part like with X as before, Z = (Z 1 , . . . , Z d1 , Zκ) a set of scalar variables Z 1 , . . . , Z d1 , and a vector Zκ. Again, some of the X j may represent the same variable; the same holds for the Z j , j = 1, ..., d.

Cross validation, bandwidths, and computational issues
Cross validation (CV) can be used for model selection in general. However, for the sake of presentation we describe here our implementation in the context of bandwidth selection.

Cross validation for bandwidth
All nonparametric estimates of the g j (X j ) in (6) depend on some bandwidths h 1 , . . . , h d . These can be preset by the user. Alternatively, our package provides for all models the option to choose the bandwidth automatically, that is, data-adaptivly via Cross Validation (CV). Our implementation even allows for a mixture of both, i.e. users can fix some bandwidths and choose the others by CV. Albeit we use binning techniques, performing CV can render the program pretty slow, especially for high dimensions and huge data sets. Generally, our implementation follows the ideas of Nielsen and Sperlich (2005) and Roca-Pardiñas and Sperlich (2010). It is to be mentioned that several alternatives exist, in particular for the additive model. For instance, Mammen and Park (2005) proposed bandwidths selectors based on penalized least squares and plug-in approaches.
. . , h d can be selected by minimizing some CV criterion in various ways. Allowing for limited dependent variables Y , the deviance is an appropriate measure of discrepancy between observed and fitted values. It is derived as a likelihood ratio test comparing the specified model with a so-called saturated one, when predicted values match the observed responses exactly. More specifically, denoting the fitted mean response given bŷ Generally spoken, unless bandwidths are fixed by the user, thy can be selected as withη whereĝ indicates the fit at X ij leaving out the ith data point based on the smoothing parameter h • j . One option to solve the minimization in (7) is to use a complete bandwidth selection that allows for all possible bandwidths combinations for the different covariates X j . When the number of covariates X j is large, the computational cost becomes very high or even prohibitive.
In order to simplify the problem, we provide the following three options to the user: 1. the user just prefixes all bandwidths that shall be used as final bandwidths; 2. the user prefixes starting values for the bandwidths, sayh j , and searches via CV for the optimal bandwidth vector (h 1 , . 3. the user only prefixes a bandwidth grid for a scalar hc such that (h 1 , . . . , h d ) = hc(σ 1 , . . . , σ d ) with σ j being the standard deviation of X j , and hc is chosen from the grid via CV.
The choice of priorh j typically follow some considerations of marginal distributions or marginal smoothing. For example, you could first perform a CV bandwidth choice for each nonparametric g j by setting all other g k =j to zero, or by restricting them to be linear functions. The second method follows ideas of Han and Park (2018), and the third follow a standard recommendation in the literature, see the review of Köhler et al. (2014). Combining options 1 and 3 is possible.
For the sake of presentation we explain more details about the CV implementation only along option 3, as for option 2 it works analogously. Moreover, suppose the user chooses all bandwidths by option 3. As said, in option 3, h j = hcσ j . While hc might be different for each j if h j is set by the user or chosen by option 2, in option 3 it is the same for all j. That is, we reduce the multidimensional search problem to a one-dimensional one. Specifically, if the user decides that all bandwidths are to be chosen by CV, hc := h j /σ j for all j, with where η (8)) leaving out the ith data point, and based on the smoothing parameters h • σ j , (j = 1, . . . , d).
Unfortunately, a naive implementation of the leave-one-out CV technique would still imply a high computational cost as for each potential value of h • , it is necessary to repeat the estimation as many times as we have data points. To speed up the process, the wsbackfit package uses k-fold CV instead. In brief, k-fold CV consists in randomly splitting the available sample into k complementary subsamples of (approximately) the same size such that each data point only belongs to one of the k subsamples, say κ(i). Then, the k-fold CV version of (9) is where η Xi,Zi indicates the fitted additive predictor at {X i , Z i } computed with the κ(i) subsample removed. In contrast to leave-one-out CV, in k-fold CV you repeat the estimations only k times, leaving-out one different subsample each time.
We conclude with two remarks. First, as said, the wsbackfit package also allows the user to specify the bandwidths h j for some nonparametric functions which are therefore treated as given in (10)), while letting the CV procedure select the others along option 3. A combination of option 2 with the others is not implemented. Examples can be found in Sections Package description and Examples and applications. Second, the minimum in (10) is determined by a grid search. The grid for the h • (option 3) is by default seq(0.01,0.99,length = 30) but can optionally be set by the user via option bw.grid, see below. For option 2 the algorithm looks for the optimal c h on an equispaced grid from 0.5 to 1.5.

Convergence Criteria
As explained above, smooth backfitting is solved by an iterative procedure to solve (3). When the link function is the identity, then there is only the loop running over the different additive components. If the link is not the identity, then there is also an outer loop carrying out the local scoring iteration. Then, within each of such outer iteration step, the formerly mentioned loop of the smooth backfitting is conducted. Both loops are triggered by two factors, the tolerance tol in deviations between subsequent iterations, and the maximum number maxit of iterations conducted. The defaults for the maximum number of iterations and the tolerance of deviation are maxit = 10 and tol = 0.01, respectively.
The inner loop stops when for iteration l the updatedĝ l j (·) comply with criterion where the g l−1 j refer to the estimates of the previous iteration. Similarly, the outer loop stops when for iteration k the convergence criterion are the estimates of µ i obtained in the present iteration and in the previous one (k − 1) respectively.

Binning and integration
Although we implemented some modifications and simplifications like the above described k-fold CV, or the combination of parametric linear with local constant estimation, for details see the Section Identification, SB in high dimensions might still imply a high computational cost. Therefore, as already indicated above, we implemented the kernel smoothing inside the wsbackfit package using binning-type procedures. These are used throughout, also for CV when selecting bandwidths. The key idea of this binning is to reduce the number of kernel evaluations (exploiting the fact that many of these are nearly identical) by replacing the original data set (composed of n data points) by a 'grouped' data set (with N groups as new data points with sampling weights, where N << n). The estimation is carried out on these N groups including the sampling weights in W i . For a detailed description of binning for kernel regression see Fan and Gijbels (1996).
Note that for minimizing (3) we need to solve some univariate integrals over nonparametric estimates which is therefore done numerically. This is calculated by the Simpson rule with 51 equidistant grid points over the entire range of the respective covariate, i.e. from its smallest to the largest observation. Simulations showed that finer grids led to no improvement of the final estimates.

Identification
When we introduced the class of GSM in Section The models that can be estimated by wsbackfit, we restricted our presentation to models that can be estimated by the here introduced package. At this stage -note that the package design invites further contributions of SB based methods -the package is able to estimate model (6) only for some link functions, and all g j being one-dimensional nonparametric functions. Lee et al. (2012) discuss identification of model (6) in a very general way, also allowing all g j to be multidimensional, and covariates X j being overlapping vectors (i.e. containing, at least partly, the same elements), and possibly also containing some elements of the Z covariate vector. Essentially, they clarify which overlaps would render a model unidentifiable. As such discussion is quite technical and would go beyond the slope of this paper, we only refer to them. Now recall models (4) and (5) which probably represent the most common cases in practice. In these models consider the identification of the g j with respect to location and scale. Each g j in model (6) has been decomposed into a linear effect α j · X j together with a purely nonparametric (beyond the linear) one, sayg j . In addition, for g j being varying coefficients, we have included constants g j0 , j = 1, . . . , d 1 . Then we can re-write model (6) as For this model we set E[g j (X j )] = 0 (j = 1, . . . , d) and E[X j ·g j (X j )] = 0 (j = 1, . . . , d 1 ). Apart from identification issues, this prevents us from biases in the linear direction, i.e., in the slope of the g j , such that we can estimate theg j by local constant SB speeding up the algorithm significantly. Finally, this reparametrization helps us to see how to choose the starting values for the iterative backfitting procedure. For the first step, you can simply start with a parametric GLM estimator, settingg j ≡ 0 for all j. Then, forĝ 0 ,β,ĝ j,0 , andα j , j = 1, . . . , d obtained from that GLM estimation, you proceed estimating theg j as outlined in Roca-Pardiñas and Sperlich (2010), to afterwards update all estimates via iteration.

Package description
The package is composed of several functions that enable users to fit the models with methods described above. Table 1 provides a summary of the presently available functions.

wsbackfit functions sback
Main function for fitting generalized structures models using smooth backfitting. sb Function used to indicate the nonparametric terms and varying coefficient terms in the sback() formula. plot Function that provides plots of sback objects produced by sback(). print The default print method for an sback() object. summary Function that takes a fitted object produced by sback() and produces various useful summaries from it. predict Function that provides predictions (e.g., fitted values and nonparametric terms) based on an sback object for a new data set (newdata). residuals Returns residuals of sback objects produced by sback(). Deviance, Pearson, working and response residuals are available. summary Summary for objects of class sback. The package has been designed similarly to other regression packages. The main function is sback. It fits GSM with SB, and creates an object of class sback. Numerical and graphical summaries of the fitted model can be obtained by using print, summary and plot, implemented for sback objects. Moreover, function predict allows obtaining predictions (e.g., fitted values and nonparametric terms) for data different from those used for estimation, called therefore newdata. The main arguments of the sback function are listed in Table 2, and the list of outputs is given in Table 3.

sback arguments -input values formula
A formula object specifying the model to be fitted. data Data frame representing the data and containing all needed variables. offset An optional numerical vector containing a priori known components to be included in the linear predictor during fitting. Default is zero. weights An optional numeric vector of 'prior weights' to be used in the fitting process. By default, the weights are set to one. kernel A character specifying the kernel function. Implemented are: Gaussian and Epanechnikov. By default "Gaussian". bw.grid Numeric vector; a grid for searching the bandwidth factor h c . The bandwidth for dimension j is h c σ j . Default is seq(0.01,0.99,length = 30) c.bw.factor logical; indicates whether the common factor scheme for bandwidth selection proposed by Han and Park (2018) is performed. If TRUE, and provided the user has specified the (marginal) bandwidths for all nonparametric functions, sayh j , the functions searches for the common factor c h that minimizes the deviance via (k-fold) cross-validation when the bandwidth used for dimension (covariate) j is c hhj . The search is done in an equispaced grid of length 15 between 0.5 and 1.5. The default is FALSE. KfoldCV Number of cross-validation folds to be used for either (1) automatically selecting the optimal bandwidth (in the sequence given in argument bw.grid) for each nonparametric function; or (2) automatically selecting the optimal common bandwidth factor (see argument c.bw.factor). Default is 5. kbin An integer value specifying the number of binning knots. Default is 30. family A character specifying the distribution family. Implemented are: Gaussian, Binomial and Poisson. In all cases, the link function is the canonical one. By default "gaussian".  sback(formula, data, offset = NULL, weights = NULL, kernel = c("Gaussian", "Epanechnikov"), bw.grid = seq(0.01, 0.99, length = 30), c.bw.factor = FALSE, KfoldCV = 5, kbin = 30, family = c("gaussian", "binomial", "poisson")) Argument formula corresponds to the model for the conditional mean function (6). This formula is similar to that used for glm, except that nonparametric functions can be added to the additive predictor by means of function sb (for details see Table 4). For instance, specification y~x1 + sb(x2,h = -1) assumes a parametric effect of x1 (with x1 either numerical or categorical), and a nonparametric effect of x2. Varying coefficient terms get incorporated similarly. For example, ys b(x1,by = x2) indicates that the coefficient(s) of x2 depend nonparametrically on x1. In this case both, x1 and x2, should be numerical predictors. More examples are provided further below. sb arguments x1 the univariate predictor. by numeric predictor of the same dimension as x1. If present, the coefficients of this predictor depend nonparametrically on x1, i.e., a varying coefficient term. h bandwidth for this term. If h = -1, the bandwidth is automatically selected using k-fold cross-validation. A value of 0 would indicate a linear fit. By default -1. The bandwidths associated with the nonparametric functions are specified inside sb through argument h (see Table 4), either as final bandwidth (by setting h = h j ; option 1), as starting value to be multiplied by an optimal constant c h found via CV from an equispaced grid of length 15 between 0.5 and 1.5 (by setting h =h j and c.bw.factor = TRUE; option 2), or by demanding a CV-bandwidth which is the product of a common factor hc (chosen from bw.grid) times the scale σ j of covariate X j (by setting h = -1; option 3), recall Section Cross validation, bandwidths, and computational issues. Actually, the user has even four options. These are specified through argument h of function sb, where option 4 is a mixture of options 1 and 3 by setting h = h j inside sb for some nonparametric functions, and h = -1 for the others. The number k of CV folds is specified through KfoldCV, with 5 being the default.
In Section The models that can be estimated by wsbackfit, recall expression (3) with subsequent discussion, we saw that the SB algorithm is implemented with potentially adjusted dependent variableỸ , and weights W . The arguments offset and weights allow the user to modify them on top of what is done automatically (e.g. due to the local scoring). More specifically, the vector offset is subtracted fromỸ when estimating. A standard application is the log of exposure in Poisson regression; see also the example in Section Poisson regression with offset. Regarding the weight W , recall that parts of it are automatically calculated (and updated in each backfitting iteration) by the local scoring procedure to account for the given link function, which is 1 if the link is the identity. This is multiplied by the optional vector weights, provided by the user. For an example in which this option is used to improve the efficiency of the estimators see Section Gaussian simulated data.
The desired smoothing kernel, either Epanechnikov or Gaussian, is specified through argument kernel (by default Gaussian). With kbin, the user indicates the number of binning knots (denoted as N in Section Cross validation, bandwidths, and computational issues). Argument family specifies the conditional distribution of the response variable. So far, the user can select among Gaussian, Poisson and binary. In all cases the canonical link function is considered. Finally, recall that predictions for data different from those being used for estimation can be obtained by means of function predict, specifying the new dataset in argument newdata.

Simulation study
This section reports the results of a small simulation. Two different scenarios are considered, namely Scenario I. Additive model. Covariates X 1 and X 2 are simulated independently from uniform distributions on the intervals [0, 1] and [−10, 1.5], respectively, and Here, Y is generated under two different distributions where the scaling factor in the Bernoulli case is used to control the signal-to-noise ratio.
Results for Scenarios I and II are shown in Figure 1 and Figure 2 respectively. In both cases, the true and estimated functions are centered before plotting to make results comparable. All results are based on a sample size of n = 500 with R = 500 replicates. Also, we use the Gaussian kernel, 30 binning knots, and all bandwidths being selected using 5-fold cross validation (option 3). We obtained essentially the same figures when we repeated these simulations with the Epanechnikov kernel. Not surprisingly, simulation results with options 1 and 2 for the bandwidth choice depended quite a bit on the setting of our (prior) bandwidths and are therefore not shown.

Examples and applications
The last section is dedicated to examples with generalized additive and/or varying coefficient, partial linear models with and without heteroscedasticity, given different link functions. In fact, we have examples for each of the three link functions presently available. Among other things, it is shown how the optional weighting can be used to improve efficiency. While some example are simulated, others illustrate applications from biometrics and health. Finally, we also show how to create useful graphics for interpreting the estimates.
To complement the numerical results, the wsbackfit package also provides graphical outputs by the use of plot. In particular, it provides the plots of the estimated nonparametric functions. Figure  3 shows the figures that appear as a result of the following code. We note that through argument select the user can specify the model term to be plotted and use ylim to indicate the range for the y-axis. This, however, is optional; alternatively the program provides plots that automatically explore the variation of the estimates.
R> names(m0) [1] "call" "formula" "data" "weights" [5] "offset" "kernel" "kbin" "family" [9] "effects" "fitted.values" "residuals" "h" [13] "coeff" "err.CV" This is the list of outputs created by sback. A detailed description of what each component of this list contains was given in Table 3. The user can access this information explicitly and individually, may it be to create its own plots or for further reporting.
As a next step in the analyses of our example, we use the fitted model to estimate the variance. In our example this is considered to be a function of the binary covariate X 5 . Call R> resid <-y -m0$fitted.values R> sig0 <-var(resid[x5 == 0]) R> sig1 <-var(resid[x5 == 1]) The third and final step is to re-estimate the mean model, for efficiency reasons with weights that are the inverse of the estimated variance. The code, including the summary, is R> w <-x5/sig1 + (1-x5)/sig0 R> m1 <-sback(formula = y~x5 + sb(x1, h = 0.1) + sb(x2, h = 0.1) + + sb ( In the previous fits of this example we specified all bandwidths used. For the rest of this example let us consider the case when we ask the program to choose the bandwidths via k-fold CV. We do this for all nonparametric functions, using the following code in which for the sake of clarity and presentation we specify h = -1 although this is actually the default. For convenience we also call directly the summary command: We do not further discuss the results because their interpretation is the same as before, also because the automatically found data-driven optimal bandwidths are close to what we used as prefixed bandwidths in the former codes. We conclude this Section with a brief example in which we specify the bandwidths for some of the nonparametric functions, while for the remaining ones we let our CV procedure select the bandwidths. For brevity we skip output and discussion. R> m2cv <-sback(formula = y~x5 + sb(x1, h = 0.1) + sb(x2, h = -1) + + sb(x3, h = 0.1) + sb(x4, h = 0.1), + weights = w, kbin = 30, KfoldCV = 5, data = df_gauss)

Postoperative infection data
The next example is an application with data studied, among others, in Roca-Pardiñas and Sperlich (2010). They were taken from a prospective analysis conducted at the University Hospital of Santiago de Compostela in the North of Spain. A total of n = 2318 patients who underwent surgery at this center between January 1996 and March 1997 were considered. The main interest is to learn about indicators that could predict whether patients may suffer (inf=1) or not post-operative infection (inf=0), and to see how they relate to the risk of infection. Such predictive indicators could be various, but given the previous studies we concentrate on the pre-operative values of plasma glucose (gluc) concentration (measured in mg/dl), and lymphocytes (linf, expressed as relative counts (in % of the white blood cell count). The data can be found in wsbackfit under the name infect: R> data(infect) R> head ( In the original studies it was controlled for other covariates like age (in years) and sex (coded as 1 = male; 0 = female). For illustrative purposes we limit here our analysis to the investigation of the association of the risk of post-operative infections inf with the predictors linf and gluc, putting all other covariates aside. It is well known that the effect of linf on inf varies strongly with the concentration of gluc. Therefore, one may think of a generalized varying coefficient model of type log P (inf = 1|linf, gluc) 1 − P (inf = 1|linf, gluc) = g 0 + g 1 (gluc) + g 2 (gluc)linf = g 0 + (α 1 · gluc +g 1 (gluc)) + (g 20 + α 2 · gluc +g 2 (gluc)) linf,  (13), contains in addition to the constant term (intercept), the main linear effects of gluc α 1 and linf g 20 , and the linear interaction between gluc and linf α 2 , all provided in the very last line. Our bandwidths have been chosen for graphical convenience. The graphical output, i.e., the plots of the estimated nonparametric functions, is obtained by the code R> op <-par(mfrow = c(1,3)) R> plot(m2, composed = FALSE, ask = FALSE, cex.main = 2, cex = 2, cex.lab = 1.5, + cex.axis = 2) R> par(op) R> op <-par(mfrow = c(1,3)) R> plot(m2, composed = FALSE, ask = FALSE, cex.main = 2, cex = 2, cex.lab = 1.5, + cex.axis = 2) R> par(op) Upper row: estimates of α 1 · gluc +g 1 (gluc) (left), α 2 · gluc +g 2 (gluc) (middle) and (α 2 · gluc +g 2 (gluc)) linf (right) obtained from model (13). The plots on the left and the center in the bottom line show the estimates of the linear and nonlinear components separately. In this example, the right plot is simply repeated.

Poisson regression with offset
Let us now consider a simulated example that illustrates the use of Poisson regression with a nontrivial use of option offset. We simulate data where each subject may have different levels of exposure to the event of interest. As explained in the above sections, this can be handled with the offset option. More specifically, for the level of exposure P we consider Y ∈ P oisson P · exp 2 + 3X 2 1 + 5X 3 2 .

Summary
We have first given a general introduction into the class of Generalized Structures Models, together with the powerful kernel based smooth backfitting estimator to fit the members of this model class. This was accompanied by a (certainly incomplete) literature review, and closed with a small review and discussion of existing methods and software for similar and related models. Except the varying coefficient model estimator in the np package, they are all based on splines. We concluded that, while there is a huge body of literature on SB and its advantages, it is hardly used in practice due to the lack of software. The wsbackfit package intends to close this gap.
Next we provided some insight into the weighted SB and the objective function that is minimized by our algorithm. This allowed us to better explain the users' options like weights and offset. We outlined which models can be estimated by the presently available package. The description of the procedure was complemented by a section on the implemented CV, bandwidth choice, binning, convergence and identification issues to clarify the location and scaling of the resulting estimates.
The package description has been kept condense but its use has been illustrated along several examples that cover some of the estimable models. They comprise the use of all options provided. Moreover, the numerical examples give an idea of the estimators' performance. For more details we recommend to consult the cited articles dealing with the particular models.
We believe that this package is an important enrichment of the existing methods with many useful applications of flexible data analysis and prediction. It can almost straightforwardly be used for testing (Cadarso-Suárez et al., 2006;Mammen and Sperlich, 2021) or studying the heterogeneity of causal effects (Benini and Sperlich, 2021), and many other interesting applications. Next challenges will be the extension of this package to cover the analysis of more complex data (Jeon and Park, 2020