The Method of Anchored Distributions (MAD) is a method for Bayesian inversion designed for inferring both local (e.g. point values) and global properties (e.g. mean and variogram parameters) of spatially heterogenous fields using multi-type and multi-scale data. Software implementations of MAD exist in C++ and C# to import data, execute an ensemble of forward model simulations, and perform basic post-processing of calculating likelihood and posterior distributions for a given application. This article describes the R package anchoredDistr that has been built to provide an R-based environment for this method. In particular, anchoredDistr provides a range of post-processing capabilities for MAD software by taking advantage of the statistical capabilities and wide use of the R language. Two examples from stochastic hydrogeology are provided to highlight the features of the package for MAD applications in inferring anchored distributions of local parameters (e.g. point values of transmissivity) as well as global parameters (e.g. the mean of the spatial random function for hydraulic conductivity).
The field of geostatistics originated in the 1950s with the pioneering work of Krige (1951) and Matheron (1962) who tried to estimate the characteristics of subsurface properties with the limited measurements typically available in this field. This scarcity, caused by the high explorations costs, is exacerbated by the strong heterogeneity that many such subsurface properties exhibit. Both these factors combined make it impossible to describe any subsurface process with certainty, therefore necessitating the application of statistical tools. Today, geostatistics is used in many fields of earth science such as geology (Hohn 1962), hydrogeology (Kitanidis 2008), plus hydrology and soil science (Goovaerts 1999). To meet this demand, many software packages have been developed that provide practitioners and scientists alike with the much needed tools to apply geostatistics. In R, the best collection of such tools is arguably found in the gstat package (Pebesma 2004) developed and maintained by Pebesma and colleagues. With gstat, it is possible to estimate (Kriging) and simulate (Gaussian process generation) heterogenous fields in one, two or three dimensions, therefore providing necessary tools for geostatistical analysis.
Any such statistical analysis should draw on all available data that are connected to the variable of interest to infer, i.e. to learn about, its spatial distribution as much as possible. Examples for such spatially distributed variables in earth sciences would be, e.g. the hydraulic conductivity of an aquifer, evapotranspiration rates of different land surface areas, and soil moisture. In classical statistics, such information may consist of measurements of the variable itself or so-called local variables. Here, local means that a point-by-point relationship between both variables exists. However, many data are non-local, which means they are connected to the variable of interest via a complicated forward model. For instance, hydraulic conductivity may be connected by a solute transport model to break-through curves of said solutes and soil moisture may be connected by a hydraulic catchment model to river discharge. To learn about the input from the output of such forward models means to invert them, hence the name inversion for such techniques.
The Method of Anchored Distributions (MAD) provides a Bayesian framework for the geostatistical inversion of spatially heterogeneous variables. MAD solves the aforementioned problem by converting non-local data into equivalent local data using the tools of Bayesian inference. The result of such a conversion is the consistent representation of all data (local and non-local) as local data only, which is then amendable to further geostatistical analysis (Rubin et al. 2010). So far, applications of MAD have been focused on hydrogeology (Murakami et al. 2010; Chen et al. 2012; Heße et al. 2015) as well as soil science (Over et al. 2015). However, given the explanations above, MAD is in no way limited to these fields and can be employed wherever non-local data need to be incorporated into a geostatistical framework. This generality also extends to the spatial model being inferred. While there are R packages utilizing Bayesian inference for spatial models such as spBayes (Finley et al. 2015), spTimer (Bakar and Sahu 2015), and INLA (Lindgren and HåRue 2015 software available from http://www.r-inla.org/), these packages have several constraints compared to anchoredDistr. First, each method assumes a Gaussian process for the spatial variability. MAD has no inherent distributional assumptions, which allows its application to a wide variety of scenarios where, for example, Gaussian fields are not justified. In addition, these packages are either geared toward large data sets (spBayes and spTimer) or applied to only local data (spBayes, spTimer, and INLA) while MAD focuses on addressing uncertainty due to sparse data sets by incorporating non-local data. Finally, MAD employs a non-parameteric likelihood estimation, which allows for great flexibility, in particular for non-linear forward models. The presented R package anchoredDistr provides an interface to the C# implementation of MAD. It allows post-processing of calculating likelihood and posterior distributions as well as visualization of the data.
Equation (1) displays the general procedure of Bayesian
inference where
An important element of MAD is a strict classification of all data into
local
Open-source software implementations for applying the entirety of MAD are available both with a graphical interface and a command-line interface to guide users through connecting their forward models and random field generators and to execute the ensemble of forward simulations (Osorio-Murillo et al. 2015). This software (available at http://mad.codeplex.com) was inspired by the claim that inverse modeling will be widely applied in hydrogeology only if user-friendly software tools are available (Carrera et al. 2005).
The package anchoredDistr described here focuses on extending the
post-processing capabilities of MAD software, particularly the
calculation of the likelihood distribution
In the current version of anchoredDistr, which only handles the
post-processing of a MAD application, it is assumed that prior
distributions of local and global parameters,
MADproject
" that extract and analyze data from these
databases, i.e. handling information regarding the samples from the
prior distributions and the resulting ensemble of simulated MADproject
" manually. The usage of anchoredDistr will generally
follow the workflow below (also see Figure 1):
Create "MADproject"
object with new()
function (passing slot
information if manually filling data)
Read data from MAD# databases, if being used, into "MADproject"
object with readMAD()
View the observations and realizations with plotMAD()
Apply any necessary dimension reduction techniques to reduceData()
Test the convergence of the likelihood distribution with respect to
the number of realizations with testConvergence()
(return to MAD
software to run additional realizations if unsatisfactory)
Calculate likelihood and posterior distributions with
calcLikelihood()
and calcPosterior()
, respectively
View the posterior distribution with plotMAD()
.
To install the anchoredDistr package, the release version is available from CRAN:
install.packages("anchoredDistr")
library(anchoredDistr)
Alternatively, the development version can be obtained by using the devtools package (Wickham and Chang 2016) to download the necessary files from GitHub:
library(devtools)
install_github("hsavoy/anchoredDistr")
library(anchoredDistr)
Other packages used by anchoredDistr include RSQLite (Wickham et al. 2014) for reading from MAD databases, np (Hayfield and Racine 2008) for estimating non-parametric density distributions, plyr (Wickham 2011) and dplyr (Wickham and Francois 2016) for efficient data manipulation, and ggplot2 (Wickham 2009) for plotting. The methods included in anchoredDistr are listed in Table 1 and two examples utilizing these methods are provided next.
Method | Description |
---|---|
readMAD() |
Reads data from databases generated by MAD software |
reduceData() |
Applies dimension reduction to |
testConvergence() |
Tests for convergence of likelihood values for |
increasing number of realizations | |
calcLikelihood() |
Calculates the likelihood values for the samples |
calcPosterior() |
Calculates the posterior values for the samples |
plotMAD() |
Plots the observations, realizations, reduced data, |
and/or posteriors |
In this first example, we will use the tutorial example available from
the MAD website http://mad.codeplex.com. Within the anchoredDistr
package, this tutorial example is available as MAD# databases, as well
as a "MADproject"
object accessed by data(tutorial)
. The variable of
interest is transmissivity
In the first step, a "MADproject"
object is created with the new()
function. Three arguments must be provided to read the MAD databases:
madname
(the name of the MAD project, e.g. the filename for the .xmad
database), resultname
(the name of the result from MAD, e.g. the
result folder name), and xpath
(the path to where the .xresult
database and result folder are located). These three arguments ensure
the MAD databases can be read by the method readMAD()
, which will read
in the prior distribution samples for the global and local parameters
plus the observations and forward model predictions for the readMAD()
(see
Table 2) are provided manually (see next example). To
create a "MADproject"
object for this tutorial example, the code below
will read the MAD# databases stored in the anchoredDistr package
files.
tutorial <- new("MADproject", madname="Tutorial", resultname="example1",
xpath=paste0(system.file("extdata", package = "anchoredDistr"),"/"))
tutorial <- readMAD(tutorial, 1:3)
Slot | Description | Source |
---|---|---|
madname |
MAD project name | user provided |
resultname |
MAD result name | user provided |
xpath |
Path to .xresult database | user provided |
numLocations |
Number of |
readMAD() |
numTimesteps |
Number of time steps measured at each |
readMAD() |
numSamples |
Number of samples drawn from prior distributions | readMAD() |
numAnchors |
Number of local parameters / anchors placed in field | readMAD() |
numTheta |
Number of random global parameters to infer | readMAD() |
truevalues |
True values for the parameters to infer, if known | readMAD() |
observations |
Observed values of the |
readMAD() |
realizations |
Simulated values of the |
readMAD() |
priors |
Samples from the prior distributions of each parameter | readMAD() |
likelihoods |
Likelihood values for each sample | calcLikelihoods() |
posteriors |
Posterior values for each sample of each parameter | calcPosteriors() |
The prior distributions can be viewed by calling the plotMAD()
function with the "MADproject"
object and the string "priors" (see
below). Figure 3 shows the prior distributions for the
four anchors in Example 1. The distributions roughly follow a Gaussian
distribution due to the baseline field being a Gaussian field and the
prior distributions based on the kriging mean and variance at these four
locations from the "MADproject"
object’s priors
slot,
which contains the random parameter names as provided in the MAD
software.
plotMAD(tutorial, "priors")
After the information contained in the MAD databases has been read into
the "MADproject"
object, the likelihood and posterior distributions
can be calculated by calcLikelihood()
and calcPosterior()
,
respectively. The method calcLikelihood()
uses non-parametric kernel
density estimation (from the package np) to estimate the probability
density of measured inversion data from the probability density function
of inversion data simulated from the realizations per sample. The method
calcPosterior()
multiplies the resulting likelihood distribution
across the samples and the provided prior distribution to calculate the
posterior.
First, we can call the testConvergence()
method to visually inspect if
we have enough realizations for the likelihood values of samples to
converge (this method calls the calcLikelihood()
internally to perform
this test). Figure 4 depicts this qualitative
convergence test for Example 1 by plotting the likelihood values of a
sample with increasing number of realizations. In order to prevent
cluttering, the default number of samples to display is set to seven
samples randomly selected from those available in the project.
Convergence is achieved when the likelihood stabilizes with increasing
realizations. For this example, it appears that the log likelihood of
the samples have started to stabilize by 50 realizations, but more
realizations may be warranted.
The posterior distributions for each random parameter can be seen by
calling plotMAD()
with the "MADproject"
object and the string
"posteriors". Figure 5 shows the posteriors for
Example 1 along with the prior distribution and the true values for each
of the four anchors. The posterior distributions for Anchors 2 and 3,
which were surrounded by
testConvergence(tutorial)
tutorial <- calcLikelihood(tutorial)
tutorial <- calcPosterior(tutorial)
plotMAD(tutorial, "posteriors")
The second example depicts a different aquifer characterization scenario
for a two-dimensional field where the natural log transform of hydraulic
conductivity (new()
as shown below, as well as a pre-made "MADproject"
object
accessed with data(pumping)
.
load(system.file("extdata", "pumpingInput.RData", package = "anchoredDistr"))
pumping <- new("MADproject",
numLocations = 1,
numTimesteps = 100,
numSamples = 50,
numAnchors = 0,
numTheta = 1,
observations = obs,
realizations = realizations,
priors = priors)
When the pumping
dataset is initially loaded, we can view the
observation of
plotMAD(pumping, "observations")
plotMAD(pumping, "priors")
plotMAD(pumping, "realizations")
Even though we have the time series of drawdown, we cannot use these 100
individual values to calculate the likelihood because they are
correlated and the multivariate likelihood distribution would be
100-dimensional. Such dimensionality would require an unrealistic number
of realizations to resolve, known as "the curse of dimensionality." To
overcome this obstacle, dimension reduction is needed and the method to
use depends on the type of non-local data min()
function to collect the minimum head value
in the time series since the observed head reduces and converges to a
stable head value with time (Figure 7). The
anchoredDistr package can handle any non-parameterized function, such
as min()
, or a parameterized function if initial values for each
parameter are given and the nls()
function (R Core Team 2016) can perform the
fitting (see the package vignette for an example). The reduceData()
function is used to perform the dimension reduction on the time series:
pumping.min <- reduceData(pumping, min)
plotMAD(pumping.min, "realizations")
The reduceData()
function returns a "MADproject"
object with a
realizations
slot with reduced dimensions. The reduced data can be
viewed by calling plotMAD()
with the string "realizations". The plot
shows the distributions of each parameter for each sample. In this case,
Figure 10 shows the minimum head value distribution for
the three samples, which will be used to calculate the three likelihood
samples.
With this new "MADproject"
object, calcLikelihoods()
and
calcPosteriors()
can be called. In Figure 11, the
posterior distributions are shown for the three samples along with the
true value of
pumping.min <- calcLikelihood(pumping.min)
pumping.min <- calcPosterior(pumping.min)
plotMAD(pumping.min, "posteriors")
The examples given above show how the anchoredDistr package allows
flexible post-processing of results by virtue of the MAD software such
that users can apply their own post-processing analyses, such as
dimension-reduction techniques. The first example shown here is
available as external and internal datasets in the anchoredDistr
package. The second example is also included in anchoredDistr and is
further detailed in the package vignette. The release version of the
anchoredDistr package is hosted on CRAN and the development version is
hosted on GitHub, which can be accessed by calling
devtools::install_github("hsavoy/anchoredDistr")
or by downloading
from http://hsavoy.github.io/anchoredDistr.
This work was supported by the National Science Foundation under grant EAR-1011336, "The Method of Anchored Distributions (MAD): Principles and Implementation as a Community Resource." Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
gstat, spBayes, spTimer, anchoredDistr, devtools, RSQLite, np, plyr, dplyr, ggplot2
Bayesian, Databases, Econometrics, ModelDeployment, Phylogenetics, Spatial, SpatioTemporal, TeachingStatistics, TimeSeries
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Savoy, et al., "anchoredDistr: a Package for the Bayesian Inversion of Geostatistical Parameters with Multi-type and Multi-scale Data", The R Journal, 2017
BibTeX citation
@article{RJ-2017-034, author = {Savoy, Heather and Heße, Falk and Rubin, Yoram}, title = {anchoredDistr: a Package for the Bayesian Inversion of Geostatistical Parameters with Multi-type and Multi-scale Data}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {9}, issue = {2}, issn = {2073-4859}, pages = {6-17} }