Many environmental and geographical models, such as those used in land degradation, agro-ecological and climate studies, make use of spatially distributed inputs that are known imperfectly. The R package spup provides functions for examining the uncertainty propagation from input data and model parameters onto model outputs via the environmental model. The functions include uncertainty model specification, stochastic simulation and propagation of uncertainty using Monte Carlo (MC) techniques. Uncertain variables are described by probability distributions. Both numerical and categorical data types are handled. The package also accommodates spatial auto-correlation within a variable and cross-correlation between variables. The MC realizations may be used as input to the environmental models written in or called from R. This article provides theoretical background and three worked examples that guide users through the application of spup.
Environmental models that are used to understand and manage natural systems often rely on spatial data as input. Uncertainty in the input data propagates into model predictions. There is a need for systematic analysis of spatial uncertainty propagation in all major inputs and an accounting of spatial auto- and cross-correlation between input uncertainties. Spatial uncertainty propagation analysis has been widely recommended and practised across environmental disciplines, for example, hydrology and water quality (e.g. Hengl et al. 2010; Hamel and Guswa 2015; Muthusamy et al. 2016; Nijhof et al. 2016; Yu et al. 2016), biogeochemistry (e.g. Boyer et al. 2006; Nol et al. 2010; Pennock et al. 2010; Hugelius 2012; Vanguelova et al. 2016), or ecology and conservation planning (e.g. Jager et al. 2005; Fisher et al. 2010; Lechner et al. 2014). However, as far as we know, there is no widely applicable software that facilitates such analysis.
Currently, a growing number of software tools are available for uncertainty propagation and uncertainty assessment, some of which have functionality for dealing with spatial uncertainties. These include both free software, like OpenTURNS (Andrianov et al. 2007), Dacota (Adams et al. 2009) and DUE (Brown and Heuvelink 2007); commercial options, like COSSAN (Schueller and Pradlwarter 2006); or free packages for licenced software, like the SAFE (Pianosi et al. 2015) or UQLab (Marelli and Sudret 2014) toolboxes for MATLAB. A broad review of existing software packages are available in Bastin et al. (2013). The use of powerful but complex languages like C++ in Dakota, Python in OpenTURNS or Java in DUE often discourage relevant portions of the scientific community without formal programming skills from the adoption of otherwise powerful tools. There are a number of R packages relating to uncertainty analysis through sensitivity analysis or use of Bayesian frameworks for model calibration. We have found only three packages: propagate (Spiess 2014), errors (Ucar 2017) and metRology (Ellison 2017) that deal with uncertainty propagation explicitly. However, none of these packages provides functionality for spatial models and variables. Therefore, we undertook a project to develop an R package that facilitates uncertainty propagation analysis in spatial environmental and geographical modelling.
In this article, we present the spup package for R (Sawicka et al. 2017), which implements a methodology for spatial uncertainty propagation analysis using Monte Carlo (MC) methods. The spup package includes functions for uncertainty model specification, sampling, propagation of uncertainty using MC techniques and examples of results visualization. In this way, it provides support for the entire chain from uncertainty model definition, simulation and propagation to visualization. It is also a generic tool that can handle a great variety of cases. It is suitable for models that consider spatial variables saved in a spatial data frame or raster, as well as non-spatial variables, both continuous and categorical. It works with models written in R or other programming languages that can be called from R. The spup package is intended for researchers and practitioners who recognise the problems of uncertainty in data and models, and are looking for a simple, accessible implementation of a methodology for uncertainty propagation assessment and visualization. At the same time, it is designed to enable more experienced users to easily understand, customise, and possibly further develop the code.
Here, we provide a description of the implemented methodology, describe the package design and illustrate the usage with three simple examples. We demonstrate that the spup package is an effective and easy tool that can be used for multi-disciplinary research, model-based decision support and educational purposes.
Uncertainty propagation aims to analyse how uncertainty (e.g. from measurement error, sampling, or interpolation) in spatial data, combined with modeling uncertainty (e.g. in model parameters and structure) propagate through the model (Heuvelink et al. 2007). Many environmental and geographical phenomena of interest are spatial in nature and often have strong spatial correlations imposed by the physics and dynamics of the natural systems, making uncertainty evaluation difficult.
A common approach to uncertainty propagation analysis makes use of MC stochastic simulation (Hammersley and Handscomb 1979; Lewis and Orav 1989). The MC approach is very flexible and can reach an arbitrary level of accuracy, and is therefore generally preferred over analytical methods such as the Taylor series method (Heuvelink et al. 1989). The MC method for uncertainty propagation analysis consists of the following main steps: (i) uncertainty about the variables is expressed by probability distribution functions (pdfs); (ii) many sets of possible uncertain inputs are generated from their marginal or joint probability distribution using a pseudo-random number generator; (iii) the model is run for each of the simulated input sets; (iv) the set of model outputs forms a random sample from the output pdf, so that the parameters of the output distribution, such as the mean, standard deviation and quantiles, can be estimated from the sample. In other words, the spread in the model outputs characterises how uncertainty about the model inputs has propagated to the model output. Note that the above ignores uncertainty in model parameters and model structure, but these can easily be included if available as pdfs. For example, these might have been obtained through Bayesian calibration (Van Oijen et al. 2005). The above four steps are described in detail further below.
The most frequently used uncertainty quantification approach represents uncertainty with probability distribution functions (pdfs). In order to represent input uncertainty with a pdf we have to define, for each possible input value, the corresponding probability or probability density. Probability distribution functions can be very complex and can have many parameters. In practice, when information and data from which to derive the pdf are limited, assumptions and simplifications have to be made to obtain reliable estimates of the pdf. In order to reduce the complexity of a pdf for a continuous variable, the distribution function may be described with a simple parametric shape, such as the normal, uniform or lognormal distribution (Bertsekas and Tsitsiklis 2008). Categorical uncertain variables can be represented by non-parametric distribution, comprising a set of possible outcomes and their probabilities. The probabilities should be non-negative and the sum of the outcome probabilities should equal 1 (Heuvelink 2007).
Individual variables and the uncertainties associated with them may also be statistically dependent on other variables and their uncertainties. For example, visibility and water vapour in the atmosphere are correlated, and so is the uncertainty between these two properties. Note that correlation between variables as derived from pair observations need not be the same as the correlation between measurement or interpolation errors associated with these variables. For example, elevation and temperature in mountainous area are negatively correlated, while the associated uncertainties may not be correlated at all because of independent measurements methods. Thus, when multiple inputs are uncertain, cross-correlations in their uncertainties must be addressed. In that case, the uncertainty model is described using a joint probability distribution function (jpdf). If the uncertainties are independent, the jpdf is the product of the univariate pdfs and can be estimated by estimating each pdf separately. If the uncertainties are statistically dependent, these dependencies must be estimated alongside the pdfs. In practice, few parametric shapes are available to describe the jpdf of variables that are statistically dependent. For continuous numerical variables, a common assumption is that the multivariate uncertainty model follows a joint-normal distribution with a vector of means \(\mu\) and variance-covariance matrix \(\Sigma\), where the latter must be square, symmetric and positive-semidefinite.
In different environmental or geographical studies, many variables are spatially distributed and it is likely that there is a relationship between nearby values (i.e. they are auto-correlated). Uncertainty in such cases may be described by the following geostatistical model:
\[Z(x)=\mu(x) + \sigma(x)\varepsilon(x) \label{eq:Eq1} \tag{1}\] where \(\mu(x)\) is the mean of the variable \(Z(x)\) at any location \(x \in D\), \(D\) is the geographic domain of interest, \(\sigma(x)\) is a spatially variable standard deviation associated with \(\mu(x)\) (i.e. \(\sigma\) parameterises uncertainty such that it may be greater in some regions than in others), and \(\varepsilon\) is a zero-mean, unit-variance, second-order stationary residual (i.e. spatial auto-correlation depends only on a distance vector), modelled with a semivariogram or a correlogram (Diggle and Ribeiro 2007; Webster and Oliver 2007; Plant 2012). For defining spatial correlation, spup supports a wide range of positive-definite functions implemented in the gstat package (Pebesma 2004; Gräler et al. 2016). The mathematical background for these calculations is presented in Brown and Heuvelink (2007) and Heuvelink (2007). In case of spatial variables that are cross-correlated with other spatial variables, the linear model of coregionalization (Goovaerts 1997) is assumed.
Table 1 summarises common assumptions and simplifications in uncertainty propagation analysis for three categories of data: continuous, discrete and categorical. These assumptions are included in the implementation of spup.
Quantifying the univariate pdf | Uncertainty in spatially distributed variables | Cross-correlation between variables | |
---|---|---|---|
Continuous variable (e.g. rainfall, pollutant concentration, elevation) | A pdf is often assumed to follow the normal distribution with mean \(\mu\) and standard deviation \(\sigma\). Non-normal distributions (such as the exponential or lognormal distribution) may also be assumed. For distributions supported in spup see the spup manual. Alternatively, variables may be transformed to ensure that the transformed variable has a normal distribution. Common transformations are square root, logarithm and the Box-Cox transform. Hard minima and maxima can be imposed by using truncated distributions that leave the shape of distributions unchanged except that below the minimum and above the maximum the probability (density) is set to zero. | For continuous uncertain inputs that are spatially correlated, the normal distribution is often imposed because without it the uncertainty quantification would become too complex. Therefore, it is recommended to assume that the input or some transformation of it is multivariate normally distributed and define the uncertainty in terms of the normally distributed (transformed) variable. In addition, ‘second-order stationarity’ is also often assumed (Goovaerts 1997; Diggle and Ribeiro 2007), indicating that the spatial correlation between errors at two locations depends only on the distance vector between the locations. This enables us to represent spatial correlation with a simple function (correlogram). | The statistical dependencies between jointly normally-distributed uncertain variables are fully characterised with a correlation matrix. The correlation matrix is a square, symmetric and positive-semidefinite matrix with unit values on the diagonal and values between -1 and 1 on the off-diagonals. Correlations can be estimated using paired observations of the variables of interest. However, it is important to note that the correlation between variables as derived from paired observations need not be the same as the correlation between measurement or interpolation errors associated with these variables. E.g., DOC and NH4 concentrations of river water are typically positively correlated, while their respective measurement errors are not, if measured independently. |
Discrete variable (e.g. fish count, number of wildfires in a given area and time period) | The pdf can be represented by a parametric shape, e.g. Poisson distribution, with mean or rate \(\lambda\), where \(\mu_{Z} = \sigma_{Z}= \lambda\). In case no appropriate shape is available, all possible outcomes and associated probabilities must be tabulated or listed in a non-parametric pdf. | When discrete or categorical variables are spatially distributed, the simplest approach is to assume spatial statistical independence or perfect spatial dependence. More advanced approaches that allow intermediate levels of statistical dependence include the Markov random field approach (Blake et al. 2011) and the Generalised Linear Geostatistical Model (Diggle and Ribeiro 2007). It is also possible to divide or cluster the area into sub-regions, with a complete dependence within and independence between sub-regions. | Cross-dependencies between uncertain categorical variables are addressed by treating all combinations of categories as separate classes. Each combination is assigned a probability. This increases the number of classes dramatically, so that generalisation is often needed in practice and the total number of combined classes is reduced to ten or fewer. |
Categorical variable (e.g. land use, building type) | For most categorical variables, a parametric shape may not be available. In that case, each possible outcome and associated probability must be listed in a non-parametric pdf. |
When an uncertain variable is characterised by a pdf, simulation relies on a pseudo-random number generator. The most common sampling method is independent identically distributed (IID) sampling, in which case each realization is generated as independent draws from the same pdf. In case of a single parameter or univariate distribution, a realization can be easily drawn from various pdfs using algorithms implemented in the stats package in R (see https://cran.r-project.org/web/views/Distributions.html for details). For a spatially distributed variable (e.g. a DEM) that has a multivariate normal distribution, spup relies on the unconditional Gaussian simulation algorithm implemented in the gstat package (Pebesma 2004). For simulation from multivariate normal pdfs of non-spatial variables, spup uses the mvtnorm package (Genz and Bretz 2009; Genz et al. 2017).
In case the input is high-dimensional, straightforward random drawing from the jpdf may not yield a sample that is representative across the whole range of allowable values for each variable, unless the sample size is extremely large. Therefore stratified sampling methods, where values are selected for each variable from each of a pre-specified number of intervals with equal probability mass (Iman and Conover 1982), are also implemented in spup. The latter can also be done in a spatial context (Pebesma and Heuvelink 1999).
Step 3 consists of running a model (e.g. an environmental model) for all sample elements (“simulated realities”) generated in Step 2. It is crucial that all models are automated and can be run in batch mode, as in many practical cases, the number of model runs for a Monte Carlo analysis should be at least 100, and often many more (Heuvelink 1998). In general, the MC sample size must be large enough to guarantee that the outcome of the uncertainty propagation analysis is sufficiently accurate (Heuvelink 2006). The limitation here is that the required sample size cannot be calculated beforehand because it is derived from the MC output.
It is important to plan what information one needs to retrieve from the set of model outputs. Strategies will vary if the user is interested in the combined or separate effect of uncertain input data or parameters. The simplest approach to estimating the uncertainty contribution of a single uncertain input or parameter is to perform the uncertainty propagation analysis with only the input or, one or more parameters made uncertain. The remaining uncertain parameters are then assumed certain and fixed on their central value. Comparison of the output uncertainties of different inputs (e.g. ratio of variances) provides a measure of the relative contribution. More details on stochastic sensitivity analysis can be found in Saltelli et al. (2000).
Running the model for multiple simulated realities resulting from step 2 may be computationally demanding, particularly when the model is complex and requires much computing time. The computation time can be reduced, as the MC method is very suited for parallel computing with multi-core machines and for grid computing technology (Leyva-Suarez et al. 2015). In this way, computation times can be dramatically reduced. Numerous high-performance and parallel computing packages have been developed in R suiting various approaches. A continuously maintained descriptive list of these is available online at https://cran.r-project.org/web/views/HighPerformanceComputing.html (Eddelbuettel 2017).
The resulting set of model outputs forms a random sample from the model
output probability distribution. This means that output pdf parameters,
such as the mean, median, variance, standard deviation, interquartile
range, percentiles, coefficient of variation and threshold exceedance
probabilities can be estimated from the sample. For example, the
percentiles of the sample can be computed by the quantile()
function
in R. The accuracy of the sample estimates can also be computed
(Gruijter et al. 2006), although this is currently not implemented in
spup. The sample of output values should be stored in case the model
output is used as uncertain input for a next model.
It is important that the uncertainty statistics are communicated in an efficient way that is simultaneously informative and understandable to users that have varied backgrounds in statistics. There should be ample explanation of the results and a visual representation of the uncertainty (MacEachren 1992; Hengl et al. 2004). For non-spatial and non-temporal data, a number of simple statistical plots can be used to visualise uncertainty represented by probability distributions. If the full probability distribution function is known, this can be visualised in pdf or cumulative distribution function (cdf) plots. Error bars, interquartile range and box plots can be used to show the distribution of values. Uncertainties about categorical data can be shown using stacked bars for the probability for each of the categories, or entropy that provides a summary measure of the spread of probabilities over the categories.
It is important to distinguish between presentation techniques and presentation modes for uncertaintly visualisation of spatial data. General presentation techniques are (i) adjacent maps, (ii) sequential presentation and (iii) bi-variate maps (MacEachren et al. 2005). These techniques can be presented in three modes: static, dynamic and interactive (Kinkeldey et al. 2014). A common example of adjacent maps for continuous variables is to put maps of the mean and standard deviation next to each other. For visualising categorical attribute uncertainty, the simplest method is also using adjacent maps, for example by showing a map of the the most probable category next to a map of the probability of that category. Sequential presentation may be shown in dynamic mode.
The spup package is developed in R to provide a simple solution for those who wish to include uncertainty analysis in their studies, with emphasises on studies including spatial variables. The spup package implements the methodology described in Section 2. We describe spup version 1.3-1 available on CRAN. The development version is available on gitHub at https://github.com/ksawicka/spup. The package is freely available under the GPL3.
Three example datasets are provided with the spup package:
Digital Elevation Model of Zlatibor region in Serbia (3x4.5km, 30m resolution) (Hengl et al. 2008)
Soil organic carbon and total nitrogen content of the south region of Lake Alaotra in Madagascar (33x33km, 250m resolution) (Hengl et al. 2017)
Rotterdam neighbourhood buildings distribution (Kadaster, The Netherlands).
These datasets are used in Section 4 and in the vignettes included in the package. Additionally, the package includes simple R and C functions that represent environmental models used in examples.
Figure 1 presents a flowchart with the workflow and key functions implemented in spup. The package is designed in a way that the key functions correspond with the steps presented in Section 2. The output of functions for defining the uncertainty model becomes an input to functions for generating a MC sample. The generated MC sample in turn is used as input to functions facilitating propagation of uncertainty. Qualitative descriptions of the key functions are given in Table 2. More details about functions arguments and output can be found in the package manual (Sawicka et al. 2017) and in the examples below.
Function | Description |
---|---|
makeCRM() |
Define a spatial correlogram model. |
defineUM() |
Define marginal uncertainty distributions for model inputs/parameters for subsequent Monte Carlo analysis. Output class depends on the arguments provided. |
defineMUM() |
Allows the user to define joint uncertainty distributions for continuous model inputs/parameters for subsequent Monte Carlo analysis. |
genSample() |
Methods for generating a Monte Carlo sample of uncertain variables. |
template() |
Defines a "container" class to store all templates with model inputs. |
executable() |
Produces an R function wrapper around a model that can be called using system2() from R. |
render() |
Replaces the tags in moustaches in template file or character object by text. |
propagate() |
Runs the model repeatedly with Monte Carlo realizations of the uncertain input/parameters. |
In many geographical studies, variables are cross-correlated. As a result, uncertainty in one variable may be statistically dependent on uncertainty in the other. For example, soil properties such as organic carbon (OC) and total nitrogen (TN) content are cross-correlated. These variables are used to derive the C/N ratio, which is important information to evaluate soil management and increase crop productivity. Errors in OC and TN maps will propagate into the C/N ratio map. The cross-correlation between the uncertainties in OC and TN will affect the degree of uncertainty in the C/N ratio.
The example data for C/N calculations are a 250m resolution mean OC and TN of a 33km x 33km area adjacent to lake Alaotra in Madagascar (Figure 2) (Hengl et al. 2017). The datasets include four spatial objects: a mean OC and TN of the area, with their standard deviations assumed to be equal to \(10\%\) of the mean.
# load packages
library(spup)
library(raster)
library(purrr)
# set seed
set.seed(12345)
# load and view the data
data(OC, OC_sd, TN, TN_sd)
class(OC); class(TN)
par(mfrow = c(1, 2), mar = c(1, 1, 2, 2), mgp = c(1.7, 0.5, 0), oma = c(0, 0, 0, 1),
las = 1, cex.main = 1, tcl = -0.2, cex.axis = 0.8, cex.lab = 0.8)
plot(OC, main = "Mean of Organic Carbon", xaxt = "n", yaxt = "n")
plot(TN, main = "Mean of Total Nitrogen", xaxt = "n", yaxt = "n")
The first step in uncertainty propagation analysis is to define an
uncertainty model for the uncertain input variables, here OC and TN.
First, the marginal uncertainty model is defined for each variable
separately, and next the joint uncertainty model is defined for the
variables together. In case of OC and TN, the \(\varepsilon\) (Eq.
(1)) are spatially correlated. For each of the variables, the
makeCRM()
function collates all necessary information into a list. We
assume that the spatial autocorrelation of the OC and TN errors are both
described by a spherical correlation function with a short-distance
correlation of 0.6 for OC and 0.4 for TN, with a range parameter of
1000m (Figure 3). It is important at this step to ensure
that the correlation function types as well as the ranges are the same
for each variable. This is a requirement for further analysis, because
spup employs the linear model of co-regionalization
(Wackernagel 2003).
# define spatial correlogram models
<- makeCRM(acf0 = 0.6, range = 1000, model = "Sph")
OC_crm <- makeCRM(acf0 = 0.4, range = 1000, model = "Sph")
TN_crm
par(mfrow = c(1, 2), mar = c(3, 2.5, 2, 1), mgp = c(1.7, 0.5, 0), oma = c(0, 0, 0, 0),
las = 1, cex.main = 1, tcl = -0.2, cex.axis = 0.8, cex.lab = 0.8)
plot(OC_crm, main = "OC correlogram")
plot(TN_crm, main = "TN correlogram")
Spatial correlograms summarise patterns of spatial autocorrelation in
data and model residuals. They show the degree of correlation between
values at two locations as a function of the separation distance between
the locations. In the case above, as is usually the case, the
correlation declines with distance. Here, the correlation is zero for
distances greater than 1000m. More information about correlograms is
included in the spup DEM vignette. In order to complete the
description of each individual uncertain variable, the defineUM()
function collates all information about the OC and TN uncertainty.
# define uncertainty model for the OC and TN
<- defineUM(distribution = "norm", distr_param = c(OC, OC_sd),
OC_UM crm = OC_crm, id = "OC")
<- defineUM(distribution = "norm", distr_param = c(TN, TN_sd),
TN_UM crm = TN_crm, id = "TN")
class(OC_UM)
1] "MarginalNumericSpatial"
[class(TN_UM)
1] "MarginalNumericSpatial" [
Both variables are of the same class "MarginalNumericSpatial"
. This is
one of the requirements for defining a multivariate uncertainty model.
The defineMUM()
function collates information about uncertainty in
each variable, and information about their cross-correlation.
# define multivariate uncertainty model
<- defineMUM(UMlist = list(OC_UM, TN_UM),
mySpatialMUM cormatrix = matrix(c(1, 0.7, 0.7, 1),
nrow = 2, ncol = 2))
class(mySpatialMUM)
1] "JointNumericSpatial" [
In this example, a geostatistical model for OC and TN is defined that assumes multivariate normality and has spatial correlation functions as follows: \[\begin{split} \rho_{OC}(h) &= 1~~~\text{for}~h=0 \\ \rho_{OC}(h) &= 0.6 \cdot \lbrace 1-\frac{3}{2} \frac{h}{1000} + \frac{1}{2} (\frac{h}{1000})^3 \rbrace~~~\text{for}~0<h<1000 \\ \rho_{OC}(h) &= 0 ~~~\text{for}~h\geq 1000 \end{split} \label{eq:corOC} \tag{2}\]
\[\begin{split} \rho_{TN}(h) &= 1~~~\text{for}~h=0 \\ \rho_{TN}(h) &= 0.4 \cdot \lbrace 1-\frac{3}{2} \frac{h}{1000} + \frac{1}{2} (\frac{h}{1000})^3 \rbrace~~~\text{for}~0<h<1000 \\ \rho_{TN}(h) &= 0 ~~~\text{for}~h\geq 1000 \end{split} \label{eq:corTN} \tag{3}\]
\[\begin{split} \rho_{OC,TN}(h) &= 0.7~~~\text{for}~h=0 \\ \rho_{OC,TN}(h) &= 0.7 \cdot \sqrt{0.6 \cdot 0.4} \cdot \lbrace 1-\frac{3}{2} \frac{h}{1000} + \frac{1}{2} (\frac{h}{1000})^3 \rbrace~~~\text{for}~0<h<1000 \\ \rho_{OC,TN}(h) &= 0 ~~~\text{for}~h\geq 1000 \end{split} \label{eq:corOCTN} \tag{4}\] where \(h\) is Euclidean distance and where the mathematical definition of the spherical correlation function (i.e., the semivariogram) has been used (Webster and Oliver 2007).
Generating possible realities of the selected variables is completed by
using the genSample()
function:
# create possible realizations from the joint distribution of OC and TN
<- 100
MC <- genSample(UMobject = mySpatialMUM, n = MC, samplemethod = "ugs",
OCTN_sample nmax = 20, asList = FALSE)
Note the argument asList
has been set to FALSE
. This indicates that
the sampling function will return an object of a class of the
distribution parameters class. This is useful if there is a need to
visualise the sample or compute summary statistics quickly.
A first assessment of uncertainty is done by plotting the means and standard deviations of the sampled OC and TN. Note that if the sample size is very large, then the sample mean and standard deviation would approximate the mean and standard deviation (Figure 4).
# compute and plot OC and TN sample statistics
# e.g. mean and standard deviation
<- OCTN_sample[[1:MC]]
OC_sample <- OCTN_sample[[(MC+1):(2*MC)]]
TN_sample <- mean(OC_sample)
OC_sample_mean <- mean(TN_sample)
TN_sample_mean <- calc(OC_sample, fun = sd)
OC_sample_sd <- calc(TN_sample, fun = sd)
TN_sample_sd
par(mfrow = c(1, 2), mar = c(1, 1, 2, 2), mgp = c(1.7, 0.5, 0), oma = c(0, 0, 0, 1),
las = 1, cex.main = 1, tcl = -0.2, cex.axis = 0.8, cex.lab = 0.8)
plot(OC_sample_mean, main = "Mean of OC realizations", xaxt = "n", yaxt = "n")
plot(TN_sample_mean, main = "Mean of TN realizations", xaxt = "n", yaxt = "n")
In order to perform uncertainty propagation analysis using spup, the model through which uncertainty is propagated needs to be defined as an R function:
# C/N model
<- function (OC, TN) OC/TN C_N_model_raster
The propagation of uncertainty occurs when the model is run with
uncertain inputs. Running the model with a sample of realizations of
uncertain input variable(s) yields an equally large sample of model
outputs that can be further analysed. We use the propagate()
function
to run the C/N ratio model with the OC and TN realizations. To run
propagate()
, the samples of the uncertain input variables must be
saved in lists and then collated into a list of lists. The existing
OCTN_sample
object can be coerced or generated automatically by
setting asList = TRUE
in genSample()
.
# coerce a raster stack to a list
<- list()
l 1]] <- map(1:100, function(x){OCTN_sample[[x]]})
l[[2]] <- map(101:200, function(x){OCTN_sample[[x]]})
l[[<- l
OCTN_sample
# run uncertainty propagation
<- propagate(realizations = OCTN_sample, model = C_N_model_raster, n = MC) CN_sample
Uncertainty in C/N predictions can be visualised by calculating and plotting the relative error (Figure 5). The relative error gives an indication of the accuracy relative to its size.
# coerce C/Ns list to a raster stack
<- stack(CN_sample)
CN_sample names(CN_sample) <- paste("CN.", c(1:nlayers(CN_sample)), sep = "")
# compute and plot the slope sample statistics
<- mean(CN_sample)
CN_mean <- calc(CN_sample, fun = sd)
CN_sd
par(mfrow = c(1, 2), mar = c(1, 1, 2, 2), mgp = c(1.7, 0.5, 0), oma = c(0, 0, 0, 1),
las = 1, cex.main = 1, tcl = -0.2, cex.axis = 0.8, cex.lab = 0.8)
plot(CN_mean, main = "Mean of C/N", xaxt = "n", yaxt = "n")
plot((CN_sd/CN_mean)*100, main = "Relative error of C/N [\%]", xaxt = "n", yaxt = "n")
Further analysis may include identification of all locations where the C/N ratio is in a specific range with decreasing probability. For example identifying areas where the C/N ratio is higher than 20 might help farmers to identify which plots require action to improve soil quality (Figure 6).
# calculate quantiles
<- as(CN_sample, "SpatialGridDataFrame")
CN_sample_df <- quantile_MC_sgdf(CN_sample_df, probs = c(0.01, 0.1, 0.5, 0.9), na.rm = TRUE)
CN_q
$good4crops99perc <- ifelse(CN_q$prob1perc > 20, 1, 0)
CN_q$good4crops90perc <- ifelse(CN_q$prob10perc > 20, 1, 0)
CN_q$good4crops50perc <- ifelse(CN_q$prob50perc > 20, 1, 0)
CN_q$good4crops10perc <- ifelse(CN_q$prob90perc > 20, 1, 0)
CN_q
$good4crops <- CN_q$good4crops99perc + CN_q$good4crops90perc +
CN_q$good4crops50perc + CN_q$good4crops10perc
CN_q
$good4crops[CN_q$good4crops == 4] <- "No improvement needed"
CN_q$good4crops[CN_q$good4crops == 3] <- "Possibly improvement needed"
CN_q$good4crops[CN_q$good4crops == 2] <- "Likely improvement needed"
CN_q$good4crops[CN_q$good4crops == 1] <- "Definitely improvement needed"
CN_q$good4crops[CN_q$good4crops == 0] <- "Definitely improvement needed"
CN_q
$good4crops <- factor(CN_q$good4crops,
CN_qlevels = c("Definitely improvement needed",
"Likely improvement needed",
"Possibly improvement needed",
"No improvement needed"))
spplot(CN_q, "good4crops", col.regions = c("red3", "sandybrown", "greenyellow",
"forestgreen"), main = "Areas with sufficient C/N for cropping")
In many aspects of life, information take the form of categorical data. For example, in a city neighbourhood or a district, each building may be assigned a function such as housing, office, recreation, or other. The city council may impose different tax levels depending on the building function. Suppose that the function of a building is assigned depending on the percentage of the building used for living (i.e., the residential use) and number of addresses in the building. If the residential area is greater than 80% and has at least one address registered, then the building is classified as “residential”. If the residential area is smaller than 80% and at least one address is present, then the building is classified as “office”. If no address is present, the building function is classified as “other”. The city council imposes annual tax of €1000 for residential buidings, €10000 for office buildlings, and €10 for other buildings. The 80% threshold is approximate, and some buildings classified as “other” could have an assigned address that has not yet been entered into the tax system. Therefore, the council wishes to calculate the uncertainty in tax revenue introduced by erroneous building classification. For this example, we will use the spatial distribution of buildings in a neighborhood of Rotterdam, NL (Figure 7).
# load packages
library(sp)
library(spup)
library(purrr)
library(png)
set.seed(12345)
# load data
data(woon)
# Netherlands contour and Rotterdam location
<- readPNG("RotterdamNL.png")
NL # collate info about figure size and type
<- list("grid.raster", NL, x = 89987, y = 436047, width = 120, height = 152,
NL_map default.units = "native")
# collate info about a scale bar location in the figure, type and colour
<- list("SpatialPolygonsRescale", layout.scale.bar(),
scale offset = c(90000, 435600), scale = 100,
fill = c("transparent", "black"))
# collate info about minimum value on a scale bar
<- list("sp.text", c(89990, 435600), "0", cex = 0.8)
text1 # collate info about maximum value on a scale bar
<- list("sp.text", c(90130, 435600), "500 m", cex = 0.8)
text2 # collate info about North arrow
<- list("SpatialPolygonsRescale", layout.north.arrow(),
arrow offset = c(89990, 435630), scale = 50)
# plot "woon" object with a location miniature,
# North arrow and scale bar defined above
spplot(woon, "check", do.log = TRUE, col.regions = "transparent",
colorkey = FALSE, key.space = list(x = 0.1, y = 0.93, corner = c(0,1)),
sp.layout = list(scale, text1, text2, arrow, NL_map),
main = "Neighbourhood in Rotterdam, NL")
The woon
object is a SpatialPolygonDataFrame
where each building is
represented by one polygon (Figure 7). The attributes
include:
vbos
: The number of addresses present in the building
woonareash
: The percentage residential area
Function
: The assigned category depending on vbos
and
woonareash
, where 1 represents residental; 2, office, and 3,
other.
residential
: Probability that the building is residential
office
: Probability that the building is an office buildling
other
: Probability that the building has another function
Figure 8 illustrates the probabilities associated with each building for the three possible categories.
# plot probabilities for each polygon
spplot(woon[c(4,5,6)])
In case of categorical variables, the uncertainty is described by a
non-parametric pdf. In this case, it is a vector of probabilities that a
building belongs to a certain category. In case of spatially distributed
variables, this must be done for each polygon, hence the dataset has
maps of these probabilities saved in the same object. We further assume
that uncertainties are spatially independent, implying that the joint
pdf of all buildings in the neighbourhood is the product of the marginal
pdfs for all buildings. Thus, we can generate possible realities for all
buildings independently. To unite all information of the uncertainty
model for the building function, we use the defineUM()
function that
collates all information into one object.
# define uncertainty model for the building function
<- defineUM(TRUE, categories = c("residential", "office", "other"),
woonUM cat_prob = woon[, c(4:6)])
# create possible realizations of the building function
<- genSample(woonUM, 100, asList = FALSE)
woon_sample
# view several realizations
<- woon_sample@data
woon_s <- lapply(woon_sample@data, as.factor)
woon_s @data <- as.data.frame(woon_s)
woon_samplerm(woon_s)
spplot(woon_sample[c(3,4,1,2)], col.regions = c("#5ab4ac", "#f5f5f5", "#d8b365"),
main = list(label = "Examples of building function realizations", cex = 1))
Examples of building functions Monte Carlo realizations are shown in Figure 9.
To propagate uncertainty, we run the model repeatedly with the sample created above. The tax model is defined as:
# define tax model
<- function (building_Function)
tax
{$tax2pay <- NA
building_Function$tax2pay[building_Function$Function == "residential"] <- 1000
building_Function$tax2pay[building_Function$Function == "office"] <- 10000
building_Function$tax2pay[building_Function$Function == "other"] <- 10
building_Function<- sum(building_Function$tax2pay)
total_tax
total_tax }
As in the previous C/N ratio example, in order to run the propagation
function, the sample of an uncertain input variable must be saved in a
list. We must either coerce the existing woon_sample object or get it
automatically by setting up asList = TRUE
argument in genSample()
.
# coerce SpatialPolygonDataFrame to a list of individual SpatialPolygonDataFrames
<- map(1:ncol(woon_sample), function(x){woon_sample[x]})
woon_sample for (i in 1:100) names(woon_sample[[i]]) <- "Function"
# run uncertainty propagation
<- propagate(woon_sample, model = tax, n = 100)
tax_sample
<- unlist(tax_sample)
tax_sample <- c(mean(tax_sample) - 1.96*(sd(tax_sample)/10),
ci95perc mean(tax_sample) + 1.96*(sd(tax_sample)/10))
ci95perc1] 2281978 2312449 [
The result of the uncertainty propagation shows that on average, the city council should obtain a tax revenue of approximately €2,300,000 and that the associated 95% confidence interval is (€2,281,978, €2,312,449).
Environmental models are often developed in languages other than R, such as C or FORTRAN, that can significantly speed up processing. In this example, we demonstrate how to perform uncertainty analysis with a basic model written in C.
We begin by showcasing functions to manipulate model inputs stored in ASCII files. For external models, this additional code is needed to (i) modify ASCII input files, and (ii) run the external model. For rendering ASCII input files, the mustache templating framework is available on GitHub at https://mustache.github.io, and in the R package whisker.
Suppose we have a simple linear regression model: \(Y = b0 + b1*X\) that
requires an input file input.txt
. The file contains values for the two
parameters \(b0\) and \(b1\) as follows:
library(spup)
library(dplyr)
library(readr)
library(whisker)
library(purrr)
set.seed(12345)
read_lines("input.txt")
1] "-0.789 0.016" [
Function template()
allow user to define a "container"
class to
store templates with model inputs. The aim of this class is to organise
model input files and perform necessary checks. A print()
method is
available for the class "template"
. A template is simply an ASCII file
with:
The additional extension .template.
Input that needs to be modified is replaced by mustache-style tags.
The corresponding template file should be named input.txt.template
and
should replace the original numbers with b0
and b1
placed in
moustaches: {{...}}.
# function template() reads the file into R as an object of class "template"
<- template("input.txt.template")
my_template %>%
my_template
read_lines1] "{{b0}} {{b1}}" [
Rendering is the process of replacing the tags in moustaches with text.
Rendering creates a new file, called input.txt
.
%>%
my_template render(b0 = 3, b1 = 4)
1] "input.txt" [
Below is an example external model written in the C language. This code can be saved in a file with the .c extension, for example, dummy_model.c. The model below calculates a value of a simple linear regression:
#include <stdio.h>
main() {
int *fp;
FILE 9] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0};
double x[
double y;
double b0;
double b1;
int i;
= fopen("input.txt", "r");
fp if (fp == NULL) {
printf("Can't read input.txt\n");
1;
return
}fscanf(fp, "%lf %lf\n", &b0, &b1);
fclose(fp);
= fopen("output.txt", "w");
fp if (fp == NULL) {
printf("Can't create output.txt\n");
1;
return
}else {
for (i=0; i<9; i++) {
= b0 + b1 * x[i];
y fprintf(fp, "%10.2f\n", y);
}fclose(fp);
}0;
return }
To compile this code to a MS-Windows executable one can use GCC at the
DOS command prompt as follows: gcc dummy_model.c -o dummy_model
. This
creates a file dummy_model.exe
, which can be wrapped in R as follows:
<- executable("dummy_model.exe") dummy_model
Running the rendering procedure passes values for \(b0\) and \(b1\) to the model, which gives:
# render the template
render(my_template, b0 = 3.1, b1 = 4.2)
# run external model
dummy_model()
# read output (output file of dummy_model is "output.txt")
scan(file = "output.txt", quiet = TRUE)
1] 7.3 11.5 15.7 19.9 24.1 28.3 32.5 36.7 40.9 [
To perform the uncertainty propagation analysis, we derive multiple realizations of the model output in steps as follows: (i) render the template, (ii) run the model, (iii) read the results, and (iv) process the results. For example:
# number of Monte Carlo runs
<- 100
n_realizations
%>%
n_realizations ::rerun({
purrr# render template
render(my_template, b0 = rnorm(n = 1), b1 = runif(n = 1))
# run model
dummy_model()
# read output
scan("examples/output.txt", quiet = TRUE)
%>%
}) set_names(paste0("r", 1:n_realizations)) %>%
%>%
as_data_frame apply(MARGIN = 1, FUN = quantile)
1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[,0% -1.660 -0.9300 -0.7400 -0.730 -0.710 -0.700 -0.680 -0.660 -0.650
25% -0.060 0.4775 0.8625 1.203 1.720 1.970 2.345 2.607 2.848
50% 0.745 1.2400 1.7200 2.310 2.965 3.375 3.910 4.405 4.850
75% 1.518 2.1425 2.7725 3.453 4.147 4.832 5.453 6.135 6.947
100% 2.990 3.3800 4.2700 5.170 6.080 6.990 7.890 8.800 9.710
In this paper, we presented the R package spup which implements methodologies for spatial uncertainty propagation analysis. The package architecture and its core components were described and three examples were presented. We explained the package functions for examining the uncertainty propagation, starting from input data and model parameters via the environmental model to model predictions. Sources of uncertainty include model specification, stochastic simulation and propagation of uncertainty using MC techniques. We include recommendations for visualizing results. We also explained how numerical and categorical data types are handled, and how we accomodate spatial auto-correlation within an attribute and cross-correlation between attributes. The MC realizations may be used as an input to environmental models written in R or other programming languages that can be called from R. The presented examples and the vignettes included in the spup package provide insight into the selection of appropriate static uncertainty visualizations that are understandable to audiences of differing levels of statistical literacy. As an effective and easy tool, spup has potential to be used for multi-disciplinary research, model-based decision support and educational purposes.
The package could benefit from further development including facilitating uncertainty analysis with time series, one of the most common types of data in environmental studies. For uncertainty propagation, the package has implemented the MC approach with efficient sampling algorithms such as stratified random sampling; implementing Latin Hypercube sampling would complement this approach. Further direction may include interactive visualization methods could be developed via the shiny package (Chang et al. 2017). As it is based on MC methodology, spup is suitable to be used with parallel computing.
We thank Sytze de Bruin, Damiano Luzzi and Stefan van Dam for valuable contributions to the development of spup. We thank Filip Biljecki and Branislav Bajat for providing the data used in the Rotterdam and Zlatibor examples. This project has received funding from the European Union’s Seventh Framework Programme for Research, Technological Development and Demonstration under grant agreement no 607000.
Kasia Sawicka is the main developer of spup and author of the text in this article. Gerard Heuvelink is a co-author of spup, he contributed to this article with his statistical and programming knowledge and edited the text. Dennis Walvoort is a co-author of spup and contributed to the development of the article.
propagate, errors, metRology, spup, gstat, stats, mvtnorm, whisker, shiny
ChemPhys, Distributions, Finance, Spatial, SpatioTemporal, WebTechnologies
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Sawicka, et al., "Spatial Uncertainty Propagation Analysis with the spup R Package", The R Journal, 2018
BibTeX citation
@article{RJ-2018-047, author = {Sawicka, Kasia and Heuvelink, Gerard B.M. and Walvoort, Dennis J.J.}, title = {Spatial Uncertainty Propagation Analysis with the spup R Package}, journal = {The R Journal}, year = {2018}, note = {https://rjournal.github.io/}, volume = {10}, issue = {2}, issn = {2073-4859}, pages = {180-199} }