Two important recent advances in areal modeling are the centered autologistic model and the sparse spatial generalized linear mixed model (SGLMM), both of which are reparameterizations of traditional models. The reparameterizations improve regression inference by alleviating spatial confounding, and the sparse SGLMM also greatly speeds computing by reducing the dimension of the spatial random effects. Package ngspatial (‘ng’ = non-Gaussian) provides routines for fitting these new models. The package supports composite likelihood and Bayesian inference for the centered autologistic model, and Bayesian inference for the sparse SGLMM.
The traditional autologistic model (Besag 1972) and areal GLMM (Besag et al. 1991) have enjoyed widespread popularity: they have been applied thousands of times in many fields, e.g., epidemiology, marketing, agriculture, ecology, forestry, geography, and image analysis. But it was recently discovered that both models are spatially confounded (Clayton et al. 1993; Caragea and Kaiser 2009). This confounding can cause bias and/or variance inflation in the estimators of regression coefficients, leading to erroneous regression inference. This is a serious drawback because many spatial modelers are interested in regression (rather than, or in addition to, prediction).
To address the confounding of the traditional autologistic model, Caragea and Kaiser (2009) devised the centered autologistic model, so named because it replaces the traditional model’s autocovariate with a centered autocovariate (see below for details).
The confounding of the mixed model was first addressed by Reich et al. (2006) using a technique now known as restricted spatial regression (Hodges and Reich 2010). This technique alleviates spatial confounding and also yields a faster mixing Markov chain, but the computational burden remains high because the dimension of the spatial random effects is reduced only slightly relative to the traditional model. By using the so called Moran operator, Hughes and Haran (2013) were able to reparameterize the mixed model in a way that not only improves regression inference but also dramatically reduces the dimension of the random effects. The resulting model, which we will call the sparse SGLMM, can be fitted so efficiently that even the largest areal datasets can be analyzed quickly.
These promising new models cannot be applied using existing software, and so we have provided support for the models in version 1.0 of R package ngspatial, the subject of this article. First we discuss the two models in some detail. Then we present ngspatial 1.0, which permits composite likelihood and Bayesian inference for the centered autologistic model, and Bayesian inference for the sparse SGLMM. We conclude with a summary.
The autologistic model and the sparse SGLMM are areal models, i.e.,
models for data observed at the vertices of a graph
The traditional autologistic model was proposed by Besag (1972). The
model is a Markov random field (MRF) model (Kindermann and Snell 1980),
which is to say that
We see that
Caragea and Kaiser (2009) showed that the traditional autologistic model is
confounded. This is because the traditional autocovariate is not well
suited to the task of fitting small-scale structure in the data, i.e.,
clustering induced by spatial dependence and residual to the large-scale
structure
Caragea and Kaiser (2009) reparameterized the model by centering the
autocovariate. The resulting conditional log odds are
Maximum likelihood and Bayesian inference for the autologistic model are
complicated by an intractable normalizing function. To see this, first
let
where
The normalizing function is intractable for all but the smallest
datasets because the sample space
The traditional SGLMM for areal data—sometimes referred to as the BYM model—was proposed by Besag et al. (1991). The BYM model is hierarchical, inducing spatial dependence by way of a latent autonormal random vector. Conditional on these spatial random effects, the observations are independent and follow an ordinary GLM. Specifically, the transformed conditional means are
where
The most common specification for
Reich et al. (2006) showed that the BYM model is spatially confounded in
the sense that the random effects can “pollute” the regression manifold
Since the columns of
Restricted spatial regression is not only an effective remedy for
confounding but also speeds computing. Because the columns of
By taking full advantage of
In other words, the eigenvectors of the Moran operator form a
multiresolutional spatial basis for
By retaining only eigenvectors that exhibit positive spatial dependence, we can usually reduce the model dimension by at least half a priori. And Hughes and Haran (2013) showed that a much greater reduction is possible in practice, with 50–100 eigenvectors being sufficient for most datasets.
Let
We note that the sparse SGLMM was partly inspired by spatial filtering, also known as principle coordinates of neighbor matrices. See, e.g., Griffith (2003; Dray et al. 2006; Borcard et al. 2011) for more information. Package spdep provides tools for spatial filtering.
Package ngspatial supports composite likelihood and Bayesian inference
for the centered autologistic model, and Bayesian inference for the
sparse SGLMM. Usage of the package’s two main functions, autologistic
and sparse.sglmm
, is simple and intuitive while offering considerable
flexibility.
The fitting function for the centered autologistic model is called
autologistic
. This function returns an object of class
"autologistic"
. Auxiliary functions residuals
, summary
, and vcov
accept an object of type "autologistic"
and return various kinds of
residuals, summarize the fit, and return an estimated covariance matrix,
respectively.
A key component of our autologistic framework is function
rautologistic
, which simulates a draw from a centered autologistic
model. The function has the following signature.
rautologistic(X, A, theta)
The three arguments are as described above. The functions of ngspatial
require that an adjacency matrix be binary and symmetric (see function
isSymmetric
from the base package) and of type matrix
. Note that
package spdep provides function nb2mat
, which constructs an
adjacency matrix from a neighbors list. spdep also provides many other
functions that are useful for handling adjacency structures.
This function employs coupling from the past (Propp and Wilson 1996) to generate a vector distributed exactly according to ((1)). We use perfect sampling, rather than ordinary MCMC, for two reasons. First, the MCMC algorithm we use to do Bayesian inference requires that we draw a perfect sample during each iteration (Møller et al. 2006). Were we to use an approximation, the resulting Markov chain would not necessarily have the true posterior as its stationary distribution (Murray et al. 2006). Second, although perfect sampling can be computationally burdensome, a carefully implemented perfect sampler is fast enough to permit composite likelihood analysis of even very large datasets, while obviating convergence diagnosis.
Our perfect sampler was implemented in C++. More specifically, we used
the Armadillo linear algebra library (Sanderson 2010), which
provides two benefits: (1) intuitive syntax for linear algebra (e.g.,
Z.t() * X * beta
), and (2) speed (Armadillo uses delayed evaluation to
combine multiple operations, thus reducing or eliminating the need for
temporaries). We integrated the C++ and R code using the
Rcpp and
RcppArmadillo
packages (Eddelbuettel and François 2011). We tested our
Armadillo-based sampler in a number of circumstances and found it to be
over three times as fast as an optimal pure R version.
One way to overcome the intractability of the normalizing function
Although ((3)) is not the true log likelihood unless
We find the MPLE
optim
function to minimize
Confidence intervals can be obtained using a parametric bootstrap or
sandwich estimation. For the former we generate
The second approach for computing confidence intervals is based on (Varin et al. 2011)
where
optim
) in place of
Because the bootstrap sample can be generated in parallel (using the parallel package) and little subsequent processing is required, these approaches to inference are very efficient computationally, even for large datasets. We note that sandwich estimation is over twice as fast as the full bootstrap. Moreover, asymptotic inference and bootstrap inference yield comparable results for practically all sample sizes because ((4)) is not, in fact, an asymptotic result. This is because the log pseudolikelihood is approximately quadratic with Hessian approximately invariant in law, which implies that the MPLE is approximately normally distributed irrespective of sample size (Geyer 2013).
For Bayesian inference we use the auxiliary-variable MCMC algorithm of
Møller et al. (2006), which allows us to construct a proposal
distribution so that the normalizing constant cancels from the
Metropolis-Hastings ratio. The method requires that we can draw
independent realizations from the unnormalized density for any value of
Let
Because the auxiliary proposals cannot be generated in parallel, this approach to Bayesian analysis is computationally expensive. Our optimized perfect sampler eases the burden somewhat, as does our Armadillo implementation of ((5)). We achieve an additional gain in efficiency as follows.
We use a normal random walk Metropolis-Hastings algorithm, and so our
proposal for
As for the prior sigma
. The prior for eta.max
.
Now we present some usage examples for the autologistic model. The
fitting function is called autologistic
and has the following
signature.
autologistic(formula, data, A, method = c("PL", "Bayes"), model = TRUE, x = FALSE,
y = FALSE, verbose = FALSE, control = list())
Arguments formula
, data
, model
, x
, and y
are analogous to
those that appear in the signatures of other commonly used R model
fitting functions, e.g., lm
and glm
. A
is of course the binary
adjacency matrix for method
argument is used to select
pseudolikelihood inference or Bayesian inference. If verbose
is equal
to TRUE
, various messages may be sent to the standard output stream.
Most of those messages have to do with the default values for control
parameters that were not set by the user via the control
argument.
For the following example the underlying graph is the
R> library("ngspatial")
R> library("lattice")
R> n <- 50
R> A <- adjacency.matrix(n)
R> x <- rep(0:(n - 1) / (n - 1), times = n) - 0.5
R> y <- rep(0:(n - 1) / (n - 1), each = n) - 0.5
R> X <- cbind(x, y)
R> beta <- c(2, 2)
R> mu <- exp(X %*% beta)
R> mu <- mu / (1 + mu)
R> col1 <- "white"
R> col2 <- "black"
R> colfunc <- colorRampPalette(c(col1, col2))
R> dev.new()
R> levelplot(mu ~ x * y, aspect = "iso", col.regions = colfunc(n^2))
R> theta <- c(beta, eta = 0.6)
R> set.seed(123456)
R> Z <- rautologistic(X, A, theta)
R> dev.new()
R> levelplot(Z ~ x * y, aspect = "iso", col.regions = c("white", "black"),
+ colorkey = FALSE)
R> fit <- autologistic(Z ~ X - 1, A = A, control = list(confint = "none"))
R> summary(fit)
Call:
autologistic(formula = Z ~ X - 1, A = A, control = list(confint = "none"))
Control parameters:
confint none
Coefficients:
Estimate Lower Upper MCSE
Xx 1.846 NA NA NA
Xy 1.920 NA NA NA
eta 0.483 NA NA NA
Number of iterations: 0
The next two examples compute confidence intervals for the same dataset. The first example uses the normal approximation. The "filling" in the sandwich matrix is estimated using a parallel parametric bootstrap, where the computation is distributed across six cores on the host workstation. The second example computes bootstrap confidence intervals, where the size of the bootstrap sample is 500. The computation is once again parallelized.
R> set.seed(123456)
R> fit <- autologistic(Z ~ X - 1, A = A, verbose = TRUE,
+ control = list(confint = "sandwich", nodes = 6))
R> summary(fit)
Control parameter 'bootit' must be a positive whole number. Setting it to the default
value of 1,000.
Control parameter 'parallel' must be a logical value. Setting it to the default value
of TRUE.
Control parameter 'type' must be "SOCK", "PVM", "MPI", or "NWS". Setting it to "SOCK".
Warning: Bootstrapping may be time consuming.
Call:
autologistic(formula = Z ~ X - 1, A = A, verbose = TRUE,
control = list(confint = "sandwich", nodes = 6))
Control parameters:
confint sandwich
bootit 1000
parallel TRUE
type SOCK
nodes 6
Coefficients:
Estimate Lower Upper MCSE
Xx 1.846 1.4630 2.2300 NA
Xy 1.920 1.5410 2.2990 NA
eta 0.483 0.3619 0.6042 NA
Number of iterations: 1000
R> set.seed(123456)
R> fit <- autologistic(Z ~ X - 1, A = A, verbose = TRUE,
+ control = list(confint = "bootstrap", bootit = 500, nodes = 6))
R> summary(fit)
Control parameter 'parallel' must be a logical value. Setting it to the default value
of TRUE.
Control parameter 'type' must be "SOCK", "PVM", "MPI", or "NWS". Setting it to "SOCK".
Warning: Bootstrapping may be time consuming.
Call:
autologistic(formula = Z ~ X - 1, A = A, verbose = TRUE,
control = list(confint = "bootstrap", bootit = 500, nodes = 6))
Control parameters:
confint bootstrap
bootit 500
parallel TRUE
type SOCK
nodes 6
Coefficients:
Estimate Lower Upper MCSE
Xx 1.846 1.4980 2.2510 0.007607
Xy 1.920 1.5400 2.3270 0.007098
eta 0.483 0.3595 0.6063 0.003406
Number of iterations: 500
It took several minutes to run each of the two preceding examples on a
Mac Pro with two quad-core 2.8 GHz Intel Xeon CPUs. Since Bayesian
inference would be prohibitively expensive for a dataset so large, the
next example was run on a dataset for which
We use, and recommend, fixed-width output analysis
(Flegal et al. 2008). In fixed-width analysis, one chooses a tolerance
and stops sampling when all Monte Carlo standard errors fall below the
tolerance. The output shown below indicates that the Monte Carlo
standard errors fell below the default tolerance of 0.01 after 714,025
draws were made from the posterior, which is why the analysis was
terminated before having reached the (default) maximum number of
iterations (maxit = 1e6
). We use the
batchmeans package
(Haran and Hughes 2012) to compute Monte Carlo standard errors (denoted MCSE
in
the output).
In this example we use a training run of length 10,000, and the length of the subsequent inferential run will be at least 10,000.
R> n <- 20
R> A <- adjacency.matrix(n)
R> x <- rep(0:(n - 1) / (n - 1), times = n) - 0.5
R> y <- rep(0:(n - 1) / (n - 1), each = n) - 0.5
R> X <- cbind(x, y)
R> set.seed(123456)
R> Z <- rautologistic(X, A, theta)
R> fit <- autologistic(Z ~ X - 1, A = A, verbose = TRUE, method = "Bayes",
+ control = list(trainit = 10000, minit = 10000))
R> summary(fit)
Control parameter 'tol' must be a positive number. Setting it to the default value
of 0.01.
Control parameter 'maxit' must be a positive whole number greater than or equal to
'minit'. Setting it to 1e+06.
Control parameter 'sigma' must be a positive number. Setting it to the default value
of 1,000.
Control parameter 'eta.max' must be a positive number. Setting it to the default
value of 2.
Warning: MCMC may be time consuming.
Progress => 5%
Progress => 10%
Progress => 15%
Call:
autologistic(formula = Z ~ X - 1, A = A, method = "Bayes", verbose = TRUE,
control = list(trainit = 10000, minit = 10000))
Control parameters:
trainit 1e+04
tol 1e-02
minit 1e+04
maxit 1e+06
sigma 1e+03
eta.max 2e+00
Coefficients:
Estimate Lower Upper MCSE
Xx 2.4450 1.2120 3.667 0.009917
Xy 1.6800 0.4767 2.832 0.009800
eta 0.7363 0.4805 1.028 0.002631
Number of iterations: 714025
The object returned by function autologistic
is a list with many
useful elements. The following output resulted from applying the names
function to the object returned in the second example above.
R> names(fit)
[1] "coefficients" "fitted.values" "linear.predictors" "residuals" "convergence"
[6] "message" "value" "iter" "sample" "mcse"
[11] "xlevels" "call" "terms" "method" "control"
[16] "model"
The standard coef
and fitted
functions can be applied to an object
of type ‘autologistic
Ṫhe residuals
method for’autologistic
’
objects can be used to extract deviance (default), Pearson, or response
residuals.
R> library(lattice)
R> dev.new()
R> levelplot(fitted(fit) ~ x * y, aspect = "iso", col.regions = colfunc(n^2))
R> dev.new()
R> levelplot(residuals(fit) ~ x * y, aspect = "iso", col.regions = colfunc(n^2))
R> dev.new()
R> levelplot(residuals(fit, type = "pearson") ~ x * y, aspect = "iso",
+ col.regions = colfunc(n^2))
R> dev.new()
R> levelplot(residuals(fit, type = "response") ~ x * y, aspect = "iso",
+ col.regions = colfunc(n^2))
Other fields of interest are value
, which contains
sample
, which
contains the bootstrap or posterior sample.
The fitting function for the sparse SGLMM is called sparse.sglmm
. The
function returns an object of class "sparse.sglmm"
. Auxiliary
functions residuals
, summary
, and vcov
accept an object of type
"sparse.sglmm"
and return various kinds of residuals, summarize the
fit, and return an estimated covariance matrix, respectively.
The second stage of the sparse SGLMM, i.e., the prior for
sigma.b
, which defaults to 1,000 but can be controlled
by the user via argument hyper
. The prior for
When the response is normally distributed, the identity link is assumed,
in which case the first stage is
a.h
and b.h
. The default
values for these parameters are 0.01 and 100, respectively, or their
values can be set by the user through argument hyper
.
If the response is Bernoulli or Poisson, glm
fit to the data. The proposal for sigma.s
. If the
response is Poisson distributed and heterogeneity random effects are
included, those random effects are updated using a Metropolis-Hastings
random walk with a spherical normal proposal. The common standard
deviation is sigma.h
. Both sigma.s
and sigma.h
default to 0.01,
but they can be set via argument tune
.
The updates for
Function sparse.sglmm
has the following signature.
sparse.sglmm(formula, family = gaussian, data, offset, A, attractive = 50,
repulsive = 0, tol = 0.01, minit = 10000, maxit = 1e+06, tune = list(),
hyper = list(), model = TRUE, x = FALSE, y = FALSE, verbose = FALSE)
Most of the arguments are analogous to those already described above.
Arguments attractive
and repulsive
are used to select the number of
Moran eigenvectors. Both attractive and repulsive eigenvectors are
permitted, although repulsive
defaults to 0, which corresponds to pure
spatial smoothing. Function sparse.sglmm
checks the validity of
attractive
and repulsive
by eigendecomposing the Moran operator and
examining the spectrum. Execution is terminated (with an informative
error message) if either value is invalid. Arguments tune
and hyper
can be used to supply values for tuning parameters and hyperparameters,
respectively, as described above.
We will illustrate the usage of our sparse SGLMM function by analyzing the data shown in Figure 2. The plot shows infant mortality data for 3,071 US counties. Each shaded peak represents a ratio of deaths to births, i.e., an empirical infant mortality rate, for a given county. The data were obtained from the 2008 Area Resource File (ARF), a county-level database maintained by the Bureau of Health Professions, Health Resources and Services Administration, US Department of Health and Human Services. Specifically, three variables were extracted from the ARF: the three-year (2002–2004) average number of infant deaths before the first birthday, the three-year average number of live births, and the three-year average number of low birth weight infants.
To these data we fit the sparse Poisson SGLMM with
where deaths is the number of infant deaths;
births is the number of live births; low is
the rate of low birth weight; black is the percentage of
black residents (according to the 2000 US Census); hisp is
the percentage of hispanic residents (2000 US Census);
gini is the Gini coefficient, a measure of income
inequality (Gini 1921); aff is a
composite score of social affluence (Yang et al. 2009); and
stab is residential stability, an average
These data were analyzed previously in Hughes and Haran (2013), where it was found that the model with 50 attractive eigenvectors yields the smallest value of DIC among the sequence of models with 25, 50, 100, and 200 attractive eigenvectors. The following code duplicates the analysis of Hughes and Haran (2013).
R> data(infant)
R> infant$low_weight <- infant$low_weight / infant$births
R> attach(infant)
R> Z <- deaths
R> X <- cbind(1, low_weight, black, hispanic, gini, affluence, stability)
R> data(A)
R> set.seed(123456)
R> fit <- sparse.sglmm(Z ~ X - 1 + offset(log(births)), family = poisson, A = A,
+ tune = list(sigma.s = 0.02), verbose = TRUE)
R> summary(fit)
Hyperparameter 'sigma.b' must be a positive number. Setting it to the default
value of 1,000.
The Moran operator is being eigendecomposed. This operation may be time consuming.
Warning: MCMC may be time consuming.
Progress => 5%
.
.
.
Progress => 100%
Call:
sparse.sglmm(formula = Z ~ X - 1 + offset(log(births)), family = poisson,
A = A, tune = list(sigma.s = 0.02), verbose = TRUE)
Tuning parameters:
sigma.s 0.02
Hyperparameters:
sigma.b 1000
Coefficients:
Estimate Lower Upper MCSE
X -5.435000 -5.621000 -5.250000 9.466e-04
Xlow_weight 8.821000 7.552000 10.050000 5.673e-03
Xblack 0.004184 0.002864 0.005545 6.568e-06
Xhispanic -0.003792 -0.004870 -0.002678 6.342e-06
Xgini -0.553200 -0.984300 -0.126000 2.070e-03
Xaffluence -0.075600 -0.087760 -0.063740 5.148e-05
Xstability -0.028670 -0.043350 -0.013980 6.360e-05
DIC: 10110
Number of iterations: 1000000
The analysis took just under two hours. Fitting the RHZ model to these data took approximately two weeks and required over 40 GB of secondary storage.
We also used CARBayes to apply the ICAR model to these data. The results for the two fits are shown together in Table 1. We see that the ICAR fit suffers from spatial confounding. Specifically, some of the ICAR point estimates differ markedly from the sparse SGLMM estimates, the ICAR intervals are wider, and the Gini coefficient is not a significant predictor in the ICAR fit.
Sparse SGLMM | ICAR | |||
Covariate | Estimate | 95% HPD Interval | Estimate | 95% HPD Interval |
Intercept | ||||
Low Weight | 8.821 | (7.552, 10.050) | 7.754 | (6.396, 9.112) |
Black | 0.004 | (0.003, 0.006) | 0.004 | (0.002, 0.006) |
Hispanic | ||||
Gini | ||||
Affluence | ||||
Stability |
The object returned by function sparse.sglmm
is a list. The following
output resulted from applying the names
function to the object
produced in the example above.
R> names(fit)
[1] "coefficients" "fitted.values" "linear.predictors" "residuals" "beta.sample"
[6] "gamma.sample" "tau.s.sample" "beta.mcse" "gamma.mcse" "tau.s.mcse"
[11] "gamma.est" "tau.s.est" "iter" "D.bar" "pD"
[16] "dic" "beta.accept" "gamma.accept" "xlevels" "call"
[21] "terms" "formula" "family" "offset" "model"
[26] "tune" "hyper"
The standard coef
and fitted
functions can be applied to an object
of type ‘sparse.sglmm
’. The residuals
method for ‘sparse.sglmm
’
objects can be used to extract deviance (default), Pearson, or response
residuals.
Other particularly useful fields include beta.sample
, gamma.sample
,
and tau.s.sample
, which contain the posterior samples for
This article introduced version 1.0 of R package ngspatial, which
supports two promising new models for areal data, namely, the centered
autologistic model and the sparse SGLMM. The package is user friendly
because its model-fitting functions and auxiliary functions were
designed to behave like the analogous functions (e.g., lm
, glm
,
summary
) in the stats package. The package is also efficient because
the code uses vector and matrix operations where possible, and because
key portions of the code were written in C++.
This work was supported by a University of Minnesota Grant-in-Aid of Research, Artistry, and Scholarship.
ngspatial, CARBayes, spdep, Rcpp, RcppArmadillo, batchmeans
Bayesian, HighPerformanceComputing, NumericalMathematics, Spatial
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hughes, "ngspatial: A Package for Fitting the Centered Autologistic and Sparse Spatial Generalized Linear Mixed Models for Areal Data", The R Journal, 2015
BibTeX citation
@article{RJ-2014-026, author = {Hughes, John}, title = {ngspatial: A Package for Fitting the Centered Autologistic and Sparse Spatial Generalized Linear Mixed Models for Areal Data}, journal = {The R Journal}, year = {2015}, note = {https://rjournal.github.io/}, volume = {6}, issue = {2}, issn = {2073-4859}, pages = {81-95} }