The R package BNSP provides a unified framework for semiparametric location-scale regression and stochastic search variable selection. The statistical methodology that the package is built upon utilizes basis function expansions to represent semiparametric covariate effects in the mean and variance functions, and spike-and-slab priors to perform selection and regularization of the estimated effects. In addition to the main function that performs posterior sampling, the package includes functions for assessing convergence of the sampler, summarizing model fits, visualizing covariate effects, and obtaining predictions for new responses or their means given feature/covariate vectors.
There are many approaches to non- and semi-parametric modeling. From a Bayesian perspective, Müller and Mitra (2013) provide a review that covers methods for density estimation, modeling of random effects distributions in mixed effects models, clustering, and modeling of unknown functions in regression models.
Our interest is in Bayesian methods for modeling unknown functions in regression models. In particular, we are interested in modeling both the mean and variance functions non-parametrically, as general functions of the covariates. There are multiple reasons why allowing the variance function to be a general function of the covariates may be important (Chan et al. 2006). Firstly, it can result in more realistic prediction intervals than those obtained by assuming constant error variance or, as Müller and Mitra (2013) put it, it can result in a more honest representation of uncertainties. Secondly, it allows the practitioner to examine and understand which covariates drive the variance. Thirdly, it results in more efficient estimation of the mean function. Lastly, it produces more accurate standard errors of unknown parameters.
In the R (R Core Team 2016) package BNSP (Papageorgiou 2018) we implemented Bayesian regression models with Gaussian errors and with mean and log-variance functions that can be modeled as general functions of the covariates. Covariate effects may enter the mean and log-variance functions parametrically or non-parametrically, with the nonparametric effects represented as linear combinations of basis functions. The strategy that we follow in representing unknown functions is to utilize a large number of basis functions. This allows for flexible estimation and for capturing true effects that are locally adaptive. Potential problems associated with large numbers of basis functions, such as over-fitting, are avoided in our implementation, and efficient estimation is achieved, by utilizing spike-slab priors for variable selection. A review of variable selection methods is provided by O’Hara and Sillanpää (2009).
The methods described here belong to the general class of models known as generalized additive models for location, scale, and shape (GAMLSS) (Rigby and Stasinopoulos 2005; Stasinopoulos and Rigby 2007) or the Bayesian analogue termed as BAMLSS (Umlauf et al. 2018) and implemented in package bamlss (Umlauf et al. 2017). However, due to the nature of the spike-and-slab priors that we have implemented, in addition to flexible modeling of the mean and variance functions, the methods described here can also be utilized for selecting promising subsets of predictor variables in multiple regression models. The implemented methods fall in the general class of methods known as stochastic search variable selection (SSVS). SSVS has received considerable attention in the Bayesian literature and its applications range from investigating factors that affect individual’s happiness (George and McCulloch 1993), to constructing financial indexes (George and McCulloch 1997), and to gene mapping (O’Hara and Sillanpää 2009). These methods associate each regression coefficient, either a main effect or the coefficient of a basis function, with a latent binary variable that indicates whether the corresponding covariate is needed in the model or not. Hence, the joint posterior distribution of the vector of these binary variables can identify the models with the higher posterior probability.
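As a toy illustration of the spike-and-slab construction that underlies SSVS (a base-R prior simulation only, not BNSP's actual sampler), each coefficient is set to zero when its latent binary indicator is off, and is drawn from a normal "slab" when the indicator is on:

```r
set.seed(1)
n.draws <- 10000
# Latent inclusion indicators with prior inclusion probability 0.3
gamma <- rbinom(n.draws, size = 1, prob = 0.3)
# Spike at zero when excluded; normal slab when included
beta <- ifelse(gamma == 1, rnorm(n.draws, mean = 0, sd = 2), 0)
mean(gamma)      # close to the prior inclusion probability, 0.3
mean(beta == 0)  # mass of the spike at zero, close to 0.7
```

In posterior sampling the analogous indicators are updated by MCMC, and averaging them over the samples yields the marginal inclusion probabilities of the corresponding covariates.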
R packages that are related to BNSP include spikeSlabGAM (Scheipl 2016), which also utilizes SSVS methods (Scheipl 2011). A major difference between the two packages, however, is that whereas spikeSlabGAM utilizes spike-and-slab priors for function selection, BNSP utilizes spike-and-slab priors for variable selection. In addition, Bayesian GAMLSS models, also referred to as distributional regression models, can be fit with the R package brms using normal priors (Bürkner 2018). Further, the R package gamboostLSS (Hofner et al. 2018) includes a frequentist GAMLSS implementation based on boosting that can handle high-dimensional data (Mayr et al. 2012). Lastly, the R package mgcv (Wood 2018) can also fit generalized additive models with Gaussian errors and integrated smoothness estimation, with implementations that can handle large datasets.
In BNSP we have implemented functions for fitting such semi-parametric models, summarizing model fits, visualizing covariate effects, and predicting new responses or their means. The main functions are mvrm, mvrm2mcmc, print.mvrm, summary.mvrm, plot.mvrm, and predict.mvrm. A quick description of these functions follows. The first one, mvrm, returns samples from the posterior distributions of the model parameters, and it is based on an efficient Markov chain Monte Carlo (MCMC) algorithm in which we integrate out the coefficients in the mean function, generate the variable selection indicators in blocks (Chan et al. 2006), and choose the MCMC tuning parameters adaptively (Roberts and Rosenthal 2009). In order to minimize random-access memory utilization, posterior samples are not kept in memory, but instead written to files in a directory supplied by the user. The second function, mvrm2mcmc, reads in the samples from the posterior of the model parameters and creates an object of class "mcmc". This enables users to easily utilize functions from package coda (Plummer et al. 2006), including its plot and summary methods for assessing convergence and for summarizing posterior distributions. Further, functions print.mvrm and summary.mvrm provide summaries of model fits, including the models and priors specified, the marginal posterior probabilities of term inclusion in the mean and variance models, and the models with the highest posterior probabilities. Function plot.mvrm creates plots of parametric and nonparametric terms that appear in the mean and variance models. The function can create two-dimensional plots by calling functions from the package ggplot2 (Wickham 2009). It can also create static or interactive three-dimensional plots by calling functions from the packages plot3D (Soetaert 2016) and threejs (Lewis 2016). Lastly, function predict.mvrm provides predictions either for new responses or their means given feature/covariate vectors.
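Putting these pieces together, the typical workflow can be sketched as follows. This is a schematic outline only, with a placeholder formula and data frame; the detailed calls and argument choices are illustrated in the examples later in the paper.

```r
library(BNSP)

# Placeholder mean/variance specification 'y ~ sm(u) | sm(u)' on a
# user-supplied data frame 'data' with columns y and u
fit <- mvrm(formula = y ~ sm(u) | sm(u), data = data,
            sweeps = 10000, burn = 5000, StorageDir = getwd())
beta <- mvrm2mcmc(fit, "beta")              # posterior samples as a coda "mcmc" object
summary(beta)                               # coda summaries; plot(beta) for trace plots
print(fit)                                  # marginal inclusion probabilities
summary(fit)                                # most frequently visited models
plot(x = fit, model = "both", term = 1)     # visualize mean and st.dev. terms
predict(fit, newdata = data.frame(u = 0.5), # prediction at a new covariate value
        interval = "credible")
```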
We next provide a detailed model description followed by illustrations on the usage of the package and the options it provides. Technical details on the implementation of the MCMC algorithm are provided in the Appendix. The paper concludes with a brief summary.
Let
Let
In the following subsections we describe how, within model (3), both parametric and nonparametric effects of explanatory variables on the mean and variance functions can be captured utilizing regression splines and variable selection methods. We begin by considering the special case where there is a single covariate entering the mean model and a single covariate entering the variance model.
Suppose that the observed dataset consists of triplets
In package BNSP we implemented two sets of basis functions. Firstly,
radial basis functions
Secondly, we implemented thin plate splines
In addition, BNSP supports the smooth constructors from package mgcv (e.g., the low-rank thin plate splines, cubic regression splines, P-splines, their cyclic versions, and others). Examples on how these smooth terms are used within BNSP are provided later in this paper.
Locally adaptive models for the mean and variance functions are obtained
utilizing the methodology developed by Chan et al. (2006). Considering the mean
function, local adaptivity is achieved by utilizing a large number of
basis functions,
Given vectors
We note that, as was suggested by Chan et al. (2006), we work with mean corrected
columns in the design matrices
Let
Further, the prior for
Independent priors are specified for the indicator variables
Similarly, for the indicators
We specify inverse Gamma priors for
It is straightforward to extend the methodology described earlier to
allow fitting of flexible mean and variance surfaces. In fact, the only
modification required is in the basis functions and knots. For fitting
surfaces, in package BNSP we implemented radial basis functions
We note that the prior specification presented earlier for fitting flexible functions remains unchanged for fitting flexible surfaces. Further, for fitting bivariate or higher order functions, BNSP also supports the smooth constructors s, te, and ti from mgcv.
In the presence of multiple covariates, the effects of which may be
modeled parametrically or semiparametrically, the mean model in
(4) is extended to the following
Similarly, the variance model (5), in the
presence of multiple covariates, is expressed as
For additive models, local adaptivity is achieved using a similar
strategy, as in the single covariate case. That is, we utilize a
potentially large number of knots or basis functions in the flexible
components that appear in the mean model,
The model that we implemented in package BNSP specifies independent priors for the indicator variables
Similarly, for the indicators
We specify the following independent priors for the inclusion
probabilities.
The rest of the priors are the same as those specified for the single covariate models.
In this section we provide results on simulation studies and real data analyses. The purpose is twofold: firstly, to show that the package works well and provides the expected results (in simulation studies), and secondly, to illustrate the options available to users of BNSP.
Here we present results from three simulation studies, involving one, two, and multiple covariates. For the majority of these simulation studies, we utilize the same data-generating mechanisms as those presented by Chan et al. (2006).
We consider two mechanisms that involve a single covariate that appears
in both the mean and variance model. Denoting the covariate by
We generate a single dataset of size n = 500 using the following code
> mu <- function(u) {2 * u}
> stdev <- function(u) {0.1 + u}
> set.seed(1)
> n <- 500
> u <- sort(runif(n))
> y <- rnorm(n, mu(u), stdev(u))
> data <- data.frame(y, u)
Above we specified the seed value to be one, and we do so in what follows, so that our results are replicable.
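Before fitting, one may informally verify the generated heteroscedasticity by comparing the spread of the residuals around the true mean over the lower and upper halves of the covariate range; this quick base-R check uses the objects y, u, and mu generated above, and with stdev(u) = 0.1 + u the spread should be larger for larger u.

```r
# Residuals around the true mean; their spread grows with u by construction
res <- y - mu(u)
sd(res[u < 0.5])   # smaller spread on the lower half of the covariate range
sd(res[u >= 0.5])  # larger spread on the upper half
```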
To the generated dataset we fit a special case of the model that we presented, where the mean and variance functions in (4) and (5) are specified as
> model <- y ~ sm(u, k = 20, bs = "rd") | sm(u, k = 20, bs = "rd")
The above formula (Zeileis and Croissant 2010) specifies the response, mean, and variance models. Smooth terms are specified utilizing function sm, which takes as input the covariate along with, here, the number of knots (k = 20) and the chosen basis (bs = "rd", for radial basis functions).
Next we specify the hyper-parameter values for the priors in (8) and (9). The default priors may be replaced by user-specified ones, as the function call below illustrates. We choose to run the MCMC sampler for 10,000 sweeps, discarding the first 5,000 as burn-in and thinning the remainder by 2:
> DIR <- getwd()
> m1 <- mvrm(formula = model, data = data, sweeps = 10000, burn = 5000,
+ thin = 2, seed = 1, StorageDir = DIR,
+ c.betaPrior = "IG(0.5,0.5*n)", c.alphaPrior = "IG(1.1,1.1)",
+ pi.muPrior = "Beta(1,1)", pi.sigmaPrior = "Beta(1,1)", sigmaPrior = "HN(2)")
Samples from the posteriors of the model parameters are written to files in the directory specified by argument StorageDir. If a storage directory is not specified, then function mvrm returns an error message, as without these files there will be no output to process. Furthermore, the last two lines of the above function call show the specified priors, which here coincide with the defaults. The default prior on the scale parameter of the mean model coefficients may be changed by using, e.g., c.betaPrior = "IG(1.01,1.01)"; the second parameter of this prior may also be expressed in terms of the sample size n, so c.betaPrior = "IG(1,0.4*n)" is another acceptable specification. Further, Beta priors are available for the inclusion probabilities, specified via pi.muPrior and pi.sigmaPrior. Lastly, sigmaPrior = "HN(5)" defines a half-normal prior for the error standard deviation, while sigmaPrior = "IG(1.1,1.1)" defines an inverse Gamma alternative.
Function mvrm2mcmc reads in posterior samples from the files that the call to function mvrm generated and creates an object of class "mcmc". Hence, for summarizing posterior distributions and for assessing convergence, functions summary and plot from package coda can be used. As an example, here we read in the samples from the posterior of the mean model coefficients and summarize them utilizing summary. For the sake of economizing space, only the part of the output that describes the posteriors of the first three coefficients is shown below.
> beta <- mvrm2mcmc(m1, "beta")
> summary(beta)
Iterations = 5001:9999
Thinning interval = 2
Number of chains = 1
Sample size per chain = 2500
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
(Intercept) 9.534e-01 0.004399 8.799e-05 0.0002534
u 1.864e+00 0.042045 8.409e-04 0.0010356
sm(u).1 3.842e-04 0.016421 3.284e-04 0.0003284
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
(Intercept) 0.946 0.9513 0.9533 0.9554 0.960
u 1.833 1.8565 1.8614 1.8682 1.923
sm(u).1 0.000 0.0000 0.0000 0.0000 0.000
Further, we may obtain a plot using
> plot(beta)
Figure 1 shows the first three of the plots created by function plot. These are the plots of the samples from the posteriors of the three coefficients summarized above.
Returning to the function mvrm2mcmc, it requires two inputs: an object of class "mvrm" and the name of the file to be read into R. For the parameters in the current model, the corresponding file names are "beta", "gamma", "alpha", "delta", "cbeta", "calpha", and "sigma2", respectively.
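As an aside, the marginal posterior inclusion probabilities for the mean model can be recovered directly from the indicator samples stored in file "gamma"; a short sketch, assuming the fit m1 from above:

```r
# Binary inclusion indicators for the mean model, one column per term
gamma <- mvrm2mcmc(m1, "gamma")
# Column means of the binary samples estimate marginal inclusion probabilities
round(colMeans(gamma), 4)
```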
Summaries of mvrm fits may be obtained utilizing functions print.mvrm and summary.mvrm. The print method takes as input an object of class "mvrm". It returns basic information about the model fit, as shown below:
> print(m1)
Call:
mvrm(formula = model, data = data, sweeps = 10000, burn = 5000,
thin = 2, seed = 1, StorageDir = DIR, c.betaPrior = "IG(0.5,0.5*n)",
c.alphaPrior = "IG(1.1,1.1)", pi.muPrior = "Beta(1,1)",
pi.sigmaPrior = "Beta(1,1)", sigmaPrior = "HN(2)")
2500 posterior samples
Mean model - marginal inclusion probabilities
u sm(u).1 sm(u).2 sm(u).3 sm(u).4 sm(u).5 sm(u).6 sm(u).7
1.0000 0.0040 0.0036 0.0032 0.0084 0.0036 0.0044 0.0028
sm(u).8 sm(u).9 sm(u).10 sm(u).11 sm(u).12 sm(u).13 sm(u).14 sm(u).15
0.0060 0.0020 0.0060 0.0036 0.0056 0.0056 0.0036 0.0052
sm(u).16 sm(u).17 sm(u).18 sm(u).19 sm(u).20
0.0060 0.0044 0.0056 0.0044 0.0052
Variance model - marginal inclusion probabilities
u sm(u).1 sm(u).2 sm(u).3 sm(u).4 sm(u).5 sm(u).6 sm(u).7
1.0000 0.6072 0.5164 0.5808 0.5488 0.6760 0.5320 0.6336
sm(u).8 sm(u).9 sm(u).10 sm(u).11 sm(u).12 sm(u).13 sm(u).14 sm(u).15
0.6936 0.6708 0.5996 0.4816 0.4912 0.3728 0.6268 0.5688
sm(u).16 sm(u).17 sm(u).18 sm(u).19 sm(u).20
0.5872 0.6528 0.4428 0.6900 0.5356
The function returns a matched call, the number of posterior samples obtained, and marginal inclusion probabilities of the terms in the mean and variance models.
Whereas the output of the print method focuses on marginal inclusion probabilities, the output of the summary method focuses on the most frequently visited models. It takes as input an object of class "mvrm" and the number of (most frequently visited) models to be displayed, which by default is set to nModels = 5. Here, to economize space, we set nModels = 2. The information returned by the summary method is shown below:
> summary(m1, nModels = 2)
Specified model for the mean and variance:
y ~ sm(u, k = 20, bs = "rd") | sm(u, k = 20, bs = "rd")
Specified priors:
[1] c.beta = IG(0.5,0.5*n) c.alpha = IG(1.1,1.1) pi.mu = Beta(1,1)
[4] pi.sigma = Beta(1,1) sigma = HN(2)
Total posterior samples: 2500 ; burn-in: 5000 ; thinning: 2
Files stored in /home/papgeo/1/
Null deviance: 1299.292
Mean posterior deviance: -88.691
Joint mean/variance model posterior probabilities:
mean.u mean.sm.u..1 mean.sm.u..2 mean.sm.u..3 mean.sm.u..4 mean.sm.u..5
1 1 0 0 0 0 0
2 1 0 0 0 0 0
mean.sm.u..6 mean.sm.u..7 mean.sm.u..8 mean.sm.u..9 mean.sm.u..10
1 0 0 0 0 0
2 0 0 0 0 0
mean.sm.u..11 mean.sm.u..12 mean.sm.u..13 mean.sm.u..14 mean.sm.u..15
1 0 0 0 0 0
2 0 0 0 0 0
mean.sm.u..16 mean.sm.u..17 mean.sm.u..18 mean.sm.u..19 mean.sm.u..20 var.u
1 0 0 0 0 0 1
2 0 0 0 0 0 1
var.sm.u..1 var.sm.u..2 var.sm.u..3 var.sm.u..4 var.sm.u..5 var.sm.u..6
1 1 1 1 1 1 1
2 1 0 1 1 1 1
var.sm.u..7 var.sm.u..8 var.sm.u..9 var.sm.u..10 var.sm.u..11 var.sm.u..12
1 1 1 1 0 1 0
2 1 1 1 1 1 1
var.sm.u..13 var.sm.u..14 var.sm.u..15 var.sm.u..16 var.sm.u..17 var.sm.u..18
1 1 1 1 1 1 1
2 0 1 1 1 1 0
var.sm.u..19 var.sm.u..20 freq prob cumulative
1 1 1 141 5.64 5.64
2 1 0 120 4.80 10.44
Displaying 2 models of the 916 visited
2 models account for 10.44% of the posterior mass
Firstly, the method provides the specified mean and variance models and
the specified priors. This is followed by information about the MCMC
chain and the directory where files have been stored. In addition, the
function provides the null and the mean posterior deviance. Finally, the
function provides the specification of the joint mean/variance models
that were visited most often during MCMC sampling. This specification is
in terms of a vector of indicators, each consisting of zeros and ones,
that show which terms are in the mean and variance model. To make clear which terms pertain to the mean and which to the variance function, we have preceded the names of the model terms by "mean." or "var.". In the above output, we see that the most visited model specifies a linear mean model (only the linear term is included in the model), while the variance model includes the linear term and the majority of the spline terms. See also Figure 2.
We next describe the function plot.mvrm, which creates plots of terms in the mean and variance functions. Two calls to the plot method can be seen in the code below. Argument x expects an object of class "mvrm", as created by a call to the function mvrm. The model argument may take on one of three possible values: "mean", "stdev", or "both", specifying the model to be visualized. Further, the term argument determines the term to be plotted. In the current example there is only one term in each of the two models, which leaves us with only one choice, term = "sm(u)". Equivalently, term may be specified as an integer, term = 1. If term is left unspecified then, by default, the first term in the model is plotted. For creating two-dimensional plots, as in the current example, the plot method utilizes the package ggplot2. Users of BNSP may add their own options to plots via the argument plotOptions. The code below serves as an example.
> x1 <- seq(0, 1, length.out = 30)
> plotOptionsM <- list(geom_line(aes_string(x = x1, y = mu(x1)), col = 2, alpha = 0.5,
+ lty = 2), geom_point(data = data, aes(x = u, y = y)))
> plot(x = m1, model = "mean", term = "sm(u)", plotOptions = plotOptionsM,
+ intercept = TRUE, quantiles = c(0.005, 0.995), grid = 30)
> plotOptionsV = list(geom_line(aes_string(x = x1, y = stdev(x1)), col = 2,
+ alpha = 0.5, lty = 2))
> plot(x = m1, model = "stdev", term = "sm(u)", plotOptions = plotOptionsV,
+ intercept = TRUE, quantiles = c(0.05, 0.95), grid = 30)
The resulting plots can be seen in Figure 2, panels (a) and (b). Panel (a) displays the simulated dataset, showing the expected increase in both the mean and variance with increasing values of the covariate, along with the posterior mean of the mean function, computed over a grid of covariate values whose size is set by the grid = 30 option of plot. For each posterior sample, the plot method computes the value of the mean function over this grid; option intercept = TRUE specifies that the intercept is included in these computations, and it may be excluded with intercept = FALSE. The posterior means are accompanied by credible intervals, here requested via quantiles = c(0.005, 0.995). This option specifies that for each value on the grid a 99% credible interval is computed; credible intervals may be suppressed with quantiles = NULL. Figure 2, panel (b) displays the posterior mean of the standard deviation function. Again, intercept = TRUE specifies that the intercept is included in the computations; it may be excluded with intercept = FALSE, which will result in plots of the standard deviation function with the intercept term removed.
[Figure 2: posterior means and credible intervals for the mean and standard deviation functions; panels (a) and (b) correspond to the first simulated dataset, panels (c) and (d) to the second.]
We use the second simulated dataset to show how the s constructor from the package mgcv may be used. In our example, we use s to specify the model:
> model <- y ~ s(u, k = 15, bs = "ps", absorb.cons=TRUE) |
+ s(u, k = 15, bs = "ps", absorb.cons=TRUE)
Function BNSP::s calls in turn mgcv::s and mgcv::smoothCon. All options of the last two functions may be passed to BNSP::s. In the example above we used options k, bs, and absorb.cons.
The remaining R code for the second simulated example is precisely the same as the one for the first example, and hence omitted. Results are shown in Figure 2, panels (c) and (d).
We conclude the current section by describing the function predict.mvrm. The function provides predictions and posterior credible or prediction intervals for given feature vectors. The two types of intervals differ in the associated level of uncertainty: prediction intervals attempt to capture a future response and are usually much wider than credible intervals, which attempt to capture a mean response. The following code shows how credible and prediction intervals can be obtained for a sequence of covariate values stored in x1:
> x1 <- seq(0, 1, length.out = 30)
> p1 <- predict(m1, newdata = data.frame(u = x1), interval = "credible")
> p2 <- predict(m1, newdata = data.frame(u = x1), interval = "prediction")
where the first argument in the predict
method is a fitted mvrm
model, the second one is a data frame containing the feature vectors at
which predictions are to be obtained and the last one defines the type
of interval to be created. We applied the predict
method to the two
simulated datasets. To each of those datasets we fitted two models: the
first one is the one we saw earlier, where both the mean and variance
are modeled in terms of covariates, while the second one ignores the
dependence of the variance on the covariate. The latter model is
specified in R using
> model <- y ~ sm(u, k = 20, bs = "rd") | 1
Results are displayed in Figure 3. Each of the two figures displays a credible interval and two prediction intervals. The figure emphasizes a point that was discussed in the introductory section, that modeling the variance in terms of covariates can result in more realistic prediction intervals. The same point was recently discussed by Mayr et al. (2012).
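The gap between the two interval types can also be seen in a small stand-alone sketch (hypothetical posterior draws in base R, not BNSP output): a credible interval summarizes posterior draws of the mean at a covariate value, while a prediction interval summarizes draws of a new response, which additionally incorporate the error standard deviation.

```r
set.seed(1)
mu.draws    <- rnorm(2500, mean = 1, sd = 0.05)  # hypothetical posterior of the mean at x
sigma.draws <- rep(0.5, 2500)                    # hypothetical posterior of the st.dev. at x
y.draws     <- rnorm(2500, mean = mu.draws, sd = sigma.draws)  # new-response draws
quantile(mu.draws, probs = c(0.025, 0.975))  # 95% credible interval for the mean
quantile(y.draws,  probs = c(0.025, 0.975))  # much wider 95% prediction interval
```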
[Figure 3: credible and prediction intervals for the two simulated datasets; panels (a) and (b).]
Interactions between two predictors can be modeled by appropriate
specification of either the built-in sm
function or the smooth
constructors from mgcv. Function sm
can take up to two covariates,
both of which may be continuous or one continuous and one discrete. Next
we consider an example that involves two continuous covariates. An
example involving a continuous and a discrete covariate is shown later
on, in the second application to a real dataset.
Let
In R, we simulate data from the above mechanism using the code below; the bivariate normal density dmvnorm requires package mvtnorm.
> library(mvtnorm)
> mu1 <- matrix(c(0.25, 0.75))
> sigma1 <- matrix(c(0.03, 0.01, 0.01, 0.03), 2, 2)
> mu2 <- matrix(c(0.65, 0.35))
> sigma2 <- matrix(c(0.09, 0.01, 0.01, 0.09), 2, 2)
> mu <- function(x1, x2) {x <- cbind(x1, x2);
+ 0.1 + dmvnorm(x, mu1, sigma1) + dmvnorm(x, mu2, sigma2)}
> Sigma <- function(x1, x2) {x <- cbind(x1, x2);
+ 0.1 + (dmvnorm(x, mu1, sigma1) + dmvnorm(x, mu2, sigma2)) / 2}
> set.seed(1)
> n <- 500
> w1 <- runif(n)
> w2 <- runif(n)
> y <- vector()
> for (i in 1:n) y[i] <- rnorm(1, mean = mu(w1[i], w2[i]),
+ sd = sqrt(Sigma(w1[i], w2[i])))
> data <- data.frame(y, w1, w2)
We fit a model with mean and variance functions specified as
> Model <- y ~ sm(w1, w2, k = 10, bs = "rd") | sm(w1, w2, k = 10, bs = "rd")
> m2 <- mvrm(formula = Model, data = data, sweeps = 10000, burn = 5000, thin = 2,
+ seed = 1, StorageDir = DIR)
As in the univariate case, convergence assessment and univariate
posterior summaries may be obtained by using function mvrm2mcmc
in
conjunction with functions plot.mcmc
and summary.mcmc
. Further,
summaries of the mvrm
fits may be obtained using functions
print.mvrm
and summary.mvrm
. Plots of the bivariate effects may be
obtained using function plot.mvrm
. This is shown below, where argument
plotOptions
utilizes the package
colorspace (Zeileis et al. 2009).
> plot(x = m2, model = "mean", term = "sm(w1,w2)", static = TRUE,
+ plotOptions = list(col = diverge_hcl(n = 10)))
> plot(x = m2, model = "stdev", term = "sm(w1,w2)", static = TRUE,
+ plotOptions = list(col = diverge_hcl(n = 10)))
Results are shown in Figure 4. For bivariate predictors,
function plot.mvrm
calls function ribbon3D
from the package
plot3D. Dynamic plots, viewable in a browser, can be created by
replacing the default static=TRUE
by static=FALSE
. When the latter
option is specified, function plot.mvrm
calls the function
scatterplot3js
from the package threejs. Users may pass their own
options to plot.mvrm
via the plotOptions
argument.
[Figure 4: posterior means of the bivariate (a) mean and (b) standard deviation surfaces.]
We consider fitting general additive models for the mean and variance
functions in a simulated example with four independent continuous
covariates. In this scenario, we set
To the generated dataset we fit a model with mean and variance functions
modeled as
> Model <- y ~ sm(w1, k = 15, bs = "rd") + sm(w2, k = 15, bs = "rd") +
+ sm(w3, k = 15, bs = "rd") + sm(w4, k = 15, bs = "rd") |
+ sm(w1, k = 15, bs = "rd") + sm(w2, k = 15, bs = "rd") +
+ sm(w3, k = 15, bs = "rd") + sm(w4, k = 15, bs = "rd")
> m3 <- mvrm(formula = Model, data = data, sweeps = 50000, burn = 25000,
+ thin = 5, seed = 1, StorageDir = DIR)
By default the function sm
utilizes the radial basis functions, hence
there is no need to specify bs = "rd"
, as we did earlier, if radial
basis functions are preferred over thin plate splines. Further, we have
selected k = 15
for all smooth functions. However, there is no restriction on the number of knots, and certainly one can select a different number of knots for each smooth function.
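For instance, a hypothetical specification that assigns a different number of knots to each smooth term (same covariate names as above) could read:

```r
Model <- y ~ sm(w1, k = 10) + sm(w2, k = 15) + sm(w3, k = 20) + sm(w4, k = 25) |
  sm(w1, k = 10) + sm(w2, k = 15) + sm(w3, k = 20) + sm(w4, k = 25)
```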
As discussed previously, for each term that appears in the right-hand side of the mean and variance functions, the model incorporates indicator variables that specify which basis functions are to be included in and which are to be excluded from the model. For the current example, the default pi.muPrior = "Beta(1,1)" specifies a uniform prior for the inclusion probability of each term in the mean model. By pi.muPrior = "Beta(2,1)" we would specify a Beta(2,1) prior common to all four smooth terms. To choose a different prior for each smooth term, pi.muPrior will have to be specified as a vector of length four, as an example, pi.muPrior = c("Beta(1,1)","Beta(2,1)","Beta(3,1)","Beta(4,1)"). Specification of the priors for the inclusion probabilities in the variance model is done analogously via argument pi.sigmaPrior.
We conclude this section by presenting plots of the four terms in the mean and variance models. The plots are presented in Figure 5. We provide a few details on how the plot method works in the presence of multiple terms, and how the comparison between true and estimated effects is made. Starting with the mean function, to create the relevant plots, which appear in the left panels of Figure 5, the plot method considers only the part of the mean function that corresponds to the selected term, while leaving all other terms out. For instance, in the code below we choose term = "sm(w1)" and hence we plot the posterior mean and a posterior credible interval for the corresponding smooth term only, with the intercept excluded by intercept = FALSE. Further, comparison is made with a centered version of the true curve, represented by the dashed (red color) line and obtained by the first three lines of code below.
> x1 <- seq(0, 1, length.out = 30)
> y1 <- mu1(x1)
> y1 <- y1 - mean(y1)
> PlotOptions <- list(geom_line(aes_string(x = x1, y = y1),
+ col = 2, alpha = 0.5, lty = 2))
> plot(x = m3, model = "mean", term = "sm(w1)", plotOptions = PlotOptions,
+ intercept = FALSE, centreEffects = FALSE, quantiles = c(0.005, 1 - 0.005))
The plots of the four standard deviation terms are shown in the right panels of Figure 5. Again, these are created by considering only the part of the model that corresponds to the selected term. For instance, below we choose term = "sm(w1)". Hence, in this case the plot will present the posterior mean and a posterior credible interval for the corresponding standard deviation term, with the intercept excluded by intercept = FALSE. Option centreEffects = TRUE scales the posterior realizations of the standard deviation term, dividing each by its mean, and the true curve is scaled similarly:
> y1 <- stdev1(x1) / mean(stdev1(x1))
> PlotOptions <- list(geom_line(aes_string(x = x1, y = y1),
+ col = 2, alpha = 0.5, lty = 2))
> plot(x = m3, model = "stdev", term = "sm(w1)", plotOptions = PlotOptions,
+ intercept = FALSE, centreEffects = TRUE, quantiles = c(0.025, 1 - 0.025))
[Figure 5: posterior means and credible intervals for the four terms of the mean function (left panels) and the four terms of the standard deviation function (right panels).]
In this section we present four empirical applications.
In the first empirical application, we analyse a dataset from Pagan and Ullah (1999)
that is available in the R package
np (Hayfield and Racine 2008). The dataset
consists of logwage
, the
logarithm of the individual’s wage, and covariate age
, the
individual’s age. The dataset comes from the 1971 Census of Canada
Public Use Sample Tapes and the sampling units it involves are males of
common education. Hence, the investigation of the relationship between
age and the logarithm of wage is carried out controlling for the two
potentially important covariates education and gender.
We utilize the following R code to specify flexible models for the mean and variance functions, and to obtain samples from the posteriors of the model parameters.
> data(cps71)
> DIR <- getwd()
> model <- logwage ~ sm(age, k = 30, bs = "rd") | sm(age, k = 30, bs = "rd")
> m4 <- mvrm(formula = model, data = cps71, sweeps = 50000,
+ burn = 25000, thin = 5, seed = 1, StorageDir = DIR)
After checking convergence, we use the following code to create the plots that appear in Figure 6.
> wagePlotOptions <- list(geom_point(data = cps71, aes(x = age, y = logwage)))
> plot(x = m4, model = "mean", term = "sm(age)", plotOptions = wagePlotOptions)
> plot(x = m4, model = "stdev", term = "sm(age)")
Figure 6 (a) shows the posterior mean and a credible interval for the mean function, describing the relationship between age and logwage. Figure 6 (b) shows the posterior mean and a credible interval for the standard deviation function, describing the relationship between age and the variability in logwage. The relationship suggested by Figure 6 (b) is also suggested by the spread of the data-points around the estimated mean in Figure 6 (a): the variability in logwage is high at young ages, reduces until about middle age, and increases again at older ages.
[Figure 6: posterior means and credible intervals for (a) the mean and (b) the standard deviation functions of logwage.]
In the second empirical application, we analyse a dataset from
Wooldridge (2008) that is also available in the R package np. The response
variable here is the logarithm of the individual’s hourly wage (lwage
)
while the covariates include the years of education (educ
), the years
of experience (exper
), the years with the current employer (tenure
),
the individual’s gender (named as female
within the dataset, with
levels Female
and Male
), and marital status (named as married
with
levels Married
and Notmarried
). The dataset consists of
As the variance function is modeled in terms of an exponential, see (2), to avoid potential numerical problems we transform the three continuous variables to have range in the unit interval:
> data(wage1)
> wage1$ntenure <- wage1$tenure / max(wage1$tenure)
> wage1$nexper <- wage1$exper / max(wage1$exper)
> wage1$neduc <- wage1$educ / max(wage1$educ)
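A quick sanity check that the transformed covariates indeed lie in the unit interval:

```r
# Range of each rescaled covariate; maxima equal 1 by construction
sapply(wage1[, c("ntenure", "nexper", "neduc")], range)
```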
We choose to fit the following mean and variance models to the data. An interaction between nexper and female is not necessary for the current data analysis; however, we choose to add this term to the mean model in order to illustrate how interaction terms can be specified. We illustrate further options below.
> knots1 <- seq(min(wage1$nexper), max(wage1$nexper), length.out = 30)
> knots2 <- c(0, 1)
> knotsD <- expand.grid(knots1, knots2)
> model <- lwage ~ fmarried + sm(ntenure) + sm(neduc, knots=data.frame(knots =
+ seq(min(wage1$neduc), max(wage1$neduc), length.out = 15))) +
+ sm(nexper, ffemale, knots = knotsD) | sm(nexper, knots=data.frame(knots =
+ seq(min(wage1$nexper), max(wage1$nexper), length.out=15)))
The first three lines of the R code above specify the matrix of (potential) knots to be used for representing the interaction between nexper and ffemale; this matrix is passed to the knots argument of sm. Furthermore, in the specification of the mean model we use sm(ntenure); by this, we choose to represent the effect of ntenure utilizing the default knots. Fifteen knots are used to represent the effect of neduc, and fifteen knots are also used to represent the effect of nexper in the variance model.
The following code is used to obtain samples from the posteriors of the model parameters.
> DIR <- getwd()
> m5 <- mvrm(formula = model, data = wage1, sweeps = 100000,
+ burn = 25000, thin = 5, seed = 1, StorageDir = DIR)
After summarizing results and checking convergence, we create plots of posterior means along with credible intervals. The results suggest that married does not have an effect on the mean of lwage. For this reason, we do not provide further results on the posterior of the coefficient of covariate married.
> PlotOptionsT <- list(geom_point(data = wage1, aes(x = ntenure, y = lwage)))
> plot(x = m5, model = "mean", term="sm(ntenure)", quantiles = c(0.025, 0.975),
+ plotOptions = PlotOptionsT)
> PlotOptionsEdu <- list(geom_point(data = wage1, aes(x = neduc, y = lwage)))
> plot(x = m5, model = "mean", term = "sm(neduc)", quantiles = c(0.025, 0.975),
+ plotOptions = PlotOptionsEdu)
> pchs <- as.numeric(wage1$female)
> pchs[pchs == 1] <- 17; pchs[pchs == 2] <- 19
> cols <- as.numeric(wage1$female)
> cols[cols == 2] <- 3; cols[cols == 1] <- 2
> PlotOptionsE <- list(geom_point(data = wage1, aes(x = nexper, y = lwage),
+ col = cols, pch = pchs, group = wage1$female))
> plot(x = m5, model = "mean", term="sm(nexper,female)", quantiles = c(0.025, 0.975),
+ plotOptions = PlotOptionsE)
> plot(x = m5, model = "stdev", term = "sm(nexper)", quantiles = c(0.025, 0.975))
> PlotOptionsF <- list(geom_boxplot(fill = 2, color = 1))
> plot(x = m5, model = "mean", term = "married", quantiles = c(0.025, 0.975),
+ plotOptions = PlotOptionsF)
Figure 7: (a) f1(ntenure), (b) f2(neduc), (c) f3(nexper, female), and (d) the standard deviation function σi = σ exp[f4(nexper)/2].
Figure 7, panels (a) and (b), shows the posterior means and 95% credible intervals of the effects of ntenure and neduc.
Lastly, we show how to obtain predictions and credible intervals for the levels "Married" and "Notmarried" of variable fmarried and the levels "Female" and "Male" of variable ffemale, with variables ntenure, neduc, and nexper fixed at their mid-range.
> p1 <- predict(m5, newdata = data.frame(fmarried = rep(c("Married", "Notmarried"), 2),
+ ntenure = rep(0.5, 4), neduc = rep(0.5, 4), nexper = rep(0.5, 4),
+ ffemale = rep(c("Female", "Male"), each = 2)), interval = "credible")
> p1
fit lwr upr
1 1.321802 1.119508 1.506574
2 1.320400 1.119000 1.505272
3 1.913341 1.794035 2.036255
4 1.911939 1.791578 2.034832
The predictions are suggestive of no “marriage” effect but of a “gender” effect.
Here we analyse brain activity level data obtained by functional
magnetic resonance imaging. The dataset is available in package
gamair (Wood 2006) and it
was previously analysed by Landau et al. (2003). We are interested in three of the columns in the dataset: the response variable, medFPQ, which is calculated as the median over three measurements of “Fundamental Power Quotient”, and the two covariates, X and Y, which show the location of each voxel.
The following R code loads the relevant data frame, removes two outliers
and transforms the response variable, as was suggested by Wood (2006). In
addition, it plots the brain activity data using function levelplot
from the package lattice
(Sarkar 2008).
> data(brain)
> brain <- brain[brain$medFPQ > 5e-5, ]
> brain$medFPQ <- (brain$medFPQ) ^ 0.25
> levelplot(medFPQ ~ Y * X, data = brain, xlab = "Y", ylab = "X",
+ col.regions = gray(10 : 100 / 100))
The plot of the observed data is shown in Figure 8, panel
(a). Its distinctive feature is the noise level, which makes it
difficult to decipher any latent pattern. Hence, the goal of the current
data analysis is to obtain a smooth surface of brain activity level from
the noisy data. It was argued by Wood (2006) that for achieving this goal a
spatial error term is not needed in the model. Thus, we analyse the
brain activity level data using a model in which the mean is a smooth bivariate function of the two voxel coordinates and the variance is constant.
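A sketch of this model in symbols (our notation, reconstructed from the formula below rather than reproduced from the original display):

\[
\mathtt{medFPQ}_i = \beta_0 + f(X_i, Y_i) + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2).
\]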
The R code that fits the above model is
> Model <- medFPQ ~ sm(Y, X, k = 10, bs = "rd") | 1
> m6 <- mvrm(formula = Model, data = brain, sweeps = 50000, burn = 20000, thin = 2,
+ seed = 1, StorageDir = DIR)
From the fitted model we obtain a smooth brain activity level surface using function predict. The function estimates the average activity at each voxel of the brain. Further, we plot the estimated surface using function levelplot.
> p1 <- predict(m6)
> levelplot(p1[, 1] ~ Y * X, data = brain , xlab = "Y", ylab = "X",
+ col.regions = gray(10 : 100 / 100), contour = TRUE)
Results are shown in Figure 8, panel (b). The smooth surface makes it much easier to see and understand which parts of the brain have higher activity.
Figure 8: (a) the observed brain activity data; (b) the fitted smooth brain activity surface.
In the fourth and final application we use the function mvrm to identify the best subset of predictors in a regression setting. Usually stepwise model selection is performed, using functions step from base R and stepAIC from the MASS package. Here we show how mvrm can be used as an alternative to those two functions. The data frame that we apply mvrm on is mtcars, where the response variable is mpg and the explanatory variables that we consider are disp, hp, wt, and qsec. The code below loads the data frame, specifies the model and obtains samples from the posteriors of the model parameters.
> data(mtcars)
> Model <- mpg ~ disp + hp + wt + qsec | 1
> m7 <- mvrm(formula = Model, data = mtcars, sweeps = 50000, burn = 25000, thin = 2,
+ seed = 1, StorageDir = DIR)
The following is an excerpt of the output that the summary
method
produces showing the three models with the highest posterior
probability.
> summary(m7, nModels = 3)
Joint mean/variance model posterior probabilities:
mean.disp mean.hp mean.wt mean.qsec freq prob cumulative
1 0 1 1 0 1085 43.40 43.40
2 0 0 1 1 1040 41.60 85.00
3 0 0 1 0 128 5.12 90.12
Displaying 3 models of the 11 visited
3 models account for 90.12% of the posterior mass
The model with the highest posterior probability (43.40%) includes hp and wt. The model that includes wt and qsec has almost equal posterior probability (41.60%). The third model includes only wt as predictor, but its posterior probability is much lower, at 5.12%.
In this section we present the technical details of how the MCMC algorithm is designed for the case where there is a single covariate in the mean and variance models. We first note that, to improve mixing of the sampler, we integrate a vector of coefficients out of the likelihood. The sampler then consists of six steps that update the remaining blocks of model parameters in turn, with the updates constructed similarly to those of Chan et al. (2006). The sampler utilizes the marginal in (14) to improve mixing; however, if samples are required from the posterior of the integrated-out coefficient vector, they can be obtained in an additional step.
We have presented a tutorial on several functions from the R package
BNSP. These functions are used for specifying, fitting and summarizing
results from regression models with Gaussian errors and with mean and
variance functions that can be modeled nonparametrically. Function sm
is utilized to specify smooth terms in the mean and variance functions
of the model. Function mvrm
calls an MCMC algorithm that obtains
samples from the posteriors of the model parameters. Samples are
converted into an object of class "mcmc"
by the function mvrm2mcmc, which facilitates the use of multiple functions from package coda.
Functions print.mvrm
and summary.mvrm
provide summaries of fitted
"mvrm"
objects. Further, function plot.mvrm
provides graphical
summaries of parametric and nonparametric terms that enter the mean or
variance function. Lastly, function predict.mvrm
provides predictions
for a future response or a mean response along with the corresponding
prediction/credible intervals.
CRAN packages used: BNSP, bamlss, spikeSlabGAM, brms, gamboostLSS, mgcv, coda, ggplot2, plot3D, threejs, colorspace, np, gamair, lattice
CRAN Task Views implied by cited packages: Bayesian, Econometrics, Environmetrics, GraphicalModels, MachineLearning, MetaAnalysis, MixedModels, Phylogenetics, Spatial, TeachingStatistics
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Papageorgiou, "BNSP: an R Package for Fitting Bayesian Semiparametric Regression Models and Variable Selection", The R Journal, 2018
BibTeX citation
@article{RJ-2018-069,
  author  = {Papageorgiou, Georgios},
  title   = {BNSP: an R Package for Fitting Bayesian Semiparametric Regression Models and Variable Selection},
  journal = {The R Journal},
  year    = {2018},
  note    = {https://rjournal.github.io/},
  volume  = {10},
  issue   = {2},
  issn    = {2073-4859},
  pages   = {526-548}
}