Spatial Uncertainty Propagation Analysis with the spup R Package

Many environmental and geographical models, such as those used in land degradation, agro-ecological and climate studies, make use of spatially distributed inputs that are known imperfectly. The R package spup provides functions for examining the uncertainty propagation from input data and model parameters onto model outputs via the environmental model. The functions include uncertainty model speciﬁcation, stochastic simulation and propagation of uncertainty using Monte Carlo (MC) techniques. Uncertain variables are described by probability distributions. Both numerical and categorical data types are handled. The package also accommodates spatial auto-correlation within a variable and cross-correlation between variables. The MC realizations may be used as input to the environmental models written in or called from R. This article provides theoretical background and three worked examples that guide users through the application of spup .


Introduction and motivation
Environmental models that are used to understand and manage natural systems often rely on spatial data as input. Uncertainty in the input data propagates into model predictions. There is a need for systematic analysis of spatial uncertainty propagation in all major inputs and an accounting of spatial auto-and cross-correlation between input uncertainties. Spatial uncertainty propagation analysis has been widely recommended and practised across environmental disciplines, for example, hydrology and water quality (e.g. Hengl et al., 2010;Hamel and Guswa, 2015;Muthusamy et al., 2016;Nijhof et al., 2016;Yu et al., 2016), biogeochemistry (e.g. Boyer et al., 2006;Nol et al., 2010;Pennock et al., 2010;Hugelius, 2012;Vanguelova et al., 2016), or ecology and conservation planning (e.g. Jager et al., 2005;Fisher et al., 2010;Lechner et al., 2014). However, as far as we know, there is no widely applicable software that facilitates such analysis.
Currently, a growing number of software tools are available for uncertainty propagation and uncertainty assessment, some of which have functionality for dealing with spatial uncertainties. These include both free software, like OpenTURNS (Andrianov et al., 2007), Dacota (Adams et al., 2009) and DUE ; commercial options, like COSSAN (Schueller and Pradlwarter, 2006); or free packages for licenced software, like the SAFE (Pianosi et al., 2015) or UQLab (Marelli and Sudret, 2014) toolboxes for MATLAB. A broad review of existing software packages are available in Bastin et al. (2013). The use of powerful but complex languages like C++ in Dakota, Python in OpenTURNS or Java in DUE often discourage relevant portions of the scientific community without formal programming skills from the adoption of otherwise powerful tools. There are a number of R packages relating to uncertainty analysis through sensitivity analysis or use of Bayesian frameworks for model calibration. We have found only three packages: propagate (Spiess, 2014), errors (Ucar, 2017) and metRology (Ellison, 2017) that deal with uncertainty propagation explicitly. However, none of these packages provides functionality for spatial models and variables. Therefore, we undertook a project to develop an R package that facilitates uncertainty propagation analysis in spatial environmental and geographical modelling.
In this article, we present the spup package for R (Sawicka et al., 2017), which implements a methodology for spatial uncertainty propagation analysis using Monte Carlo (MC) methods. The spup package includes functions for uncertainty model specification, sampling, propagation of uncertainty using MC techniques and examples of results visualization. In this way, it provides support for the entire chain from uncertainty model definition, simulation and propagation to visualization. It is also a generic tool that can handle a great variety of cases. It is suitable for models that consider spatial variables saved in a spatial data frame or raster, as well as non-spatial variables, both continuous and categorical. It works with models written in R or other programming languages that can be called from R. The spup package is intended for researchers and practitioners who recognise the problems of uncertainty in data and models, and are looking for a simple, accessible implementation of a methodology for uncertainty propagation assessment and visualization. At the same time, it is designed to enable more experienced users to easily understand, customise, and possibly further develop the code.
Here, we provide a description of the implemented methodology, describe the package design and illustrate the usage with three simple examples. We demonstrate that the spup package is an effective and easy tool that can be used for multi-disciplinary research, model-based decision support and educational purposes.

Step 1 -Characterise uncertain model inputs with pdfs
The most frequently used uncertainty quantification approach represents uncertainty with probability distribution functions (pdfs). In order to represent input uncertainty with a pdf we have to define, for each possible input value, the corresponding probability or probability density. Probability distribution functions can be very complex and can have many parameters. In practice, when information and data from which to derive the pdf are limited, assumptions and simplifications have to be made to obtain reliable estimates of the pdf. In order to reduce the complexity of a pdf for a continuous variable, the distribution function may be described with a simple parametric shape, such as the normal, uniform or lognormal distribution (Bertsekas and Tsitsiklis, 2008). Categorical uncertain variables can be represented by non-parametric distribution, comprising a set of possible outcomes and their probabilities. The probabilities should be non-negative and the sum of the outcome probabilities should equal 1 (Heuvelink, 2007).
Individual variables and the uncertainties associated with them may also be statistically dependent on other variables and their uncertainties. For example, visibility and water vapour in the atmosphere are correlated, and so is the uncertainty between these two properties. Note that correlation between variables as derived from pair observations need not be the same as the correlation between measurement or interpolation errors associated with these variables. For example, elevation and temperature in mountainous area are negatively correlated, while the associated uncertainties may not be correlated at all because of independent measurements methods. Thus, when multiple inputs are uncertain, cross-correlations in their uncertainties must be addressed. In that case, the uncertainty model is described using a joint probability distribution function (jpdf). If the uncertainties are independent, the jpdf is the product of the univariate pdfs and can be estimated by estimating each pdf separately. If the uncertainties are statistically dependent, these dependencies must be estimated alongside the pdfs. In practice, few parametric shapes are available to describe the jpdf of variables that are statistically dependent. For continuous numerical variables, a common assumption is that the multivariate uncertainty model follows a joint-normal distribution with a vector of means µ and variance-covariance matrix Σ, where the latter must be square, symmetric and positive-semidefinite.
In different environmental or geographical studies, many variables are spatially distributed and it is likely that there is a relationship between nearby values (i.e. they are auto-correlated). Uncertainty in such cases may be described by the following geostatistical model: where µ(x) is the mean of the variable Z(x) at any location x ∈ D, D is the geographic domain of interest, σ(x) is a spatially variable standard deviation associated with µ(x) (i.e. σ parameterises uncertainty such that it may be greater in some regions than in others), and ε is a zero-mean, unit-variance, second-order stationary residual (i.e. spatial auto-correlation depends only on a distance vector), modelled with a semivariogram or a correlogram (Diggle and Ribeiro, 2007;Webster and Oliver, 2007;Plant, 2012). For defining spatial correlation, spup supports a wide range of positive-definite functions implemented in the gstat package (Pebesma, 2004;Gräler et al., 2016). The mathematical background for these calculations is presented in  and Heuvelink (2007). In case of spatial variables that are cross-correlated with other spatial variables, the linear model of coregionalization (Goovaerts, 1997) is assumed. Table 1 summarises common assumptions and simplifications in uncertainty propagation analysis for three categories of data: continuous, discrete and categorical. These assumptions are included in the implementation of spup.
Step 2 -Repeatedly sample from spatial pdfs for uncertain inputs When an uncertain variable is characterised by a pdf, simulation relies on a pseudo-random number generator. The most common sampling method is independent identically distributed (IID) sampling, in which case each realization is generated as independent draws from the same pdf. In case of a single parameter or univariate distribution, a realization can be easily drawn from various pdfs using algorithms implemented in the stats package in R (see https://cran.r-project.org/web/ views/Distributions.html for details). For a spatially distributed variable (e.g. a DEM) that has a multivariate normal distribution, spup relies on the unconditional Gaussian simulation algorithm implemented in the gstat package (Pebesma, 2004). For simulation from multivariate normal pdfs of non-spatial variables, spup uses the mvtnorm package (Genz et al., 2017;Genz and Bretz, 2009).
In case the input is high-dimensional, straightforward random drawing from the jpdf may not yield a sample that is representative across the whole range of allowable values for each variable, unless the sample size is extremely large. Therefore stratified sampling methods, where values are selected for each variable from each of a pre-specified number of intervals with equal probability mass (Iman and Conover, 1982), are also implemented in spup. The latter can also be done in a spatial context (Pebesma and Heuvelink, 1999).
Step 3 -Run a model with sampled inputs and store model outputs Step 3 consists of running a model (e.g. an environmental model) for all sample elements ("simulated realities") generated in Step 2. It is crucial that all models are automated and can be run in batch mode, as in many practical cases, the number of model runs for a Monte Carlo analysis should be at least 100, and often many more (Heuvelink, 1998). In general, the MC sample size must be large enough to guarantee that the outcome of the uncertainty propagation analysis is sufficiently accurate (Heuvelink, 2006). The limitation here is that the required sample size cannot be calculated beforehand because it is derived from the MC output.
It is important to plan what information one needs to retrieve from the set of model outputs. Strategies will vary if the user is interested in the combined or separate effect of uncertain input data or parameters. The simplest approach to estimating the uncertainty contribution of a single uncertain input or parameter is to perform the uncertainty propagation analysis with only the input or, one or more parameters made uncertain. The remaining uncertain parameters are then assumed certain and fixed on their central value. Comparison of the output uncertainties of different inputs (e.g. ratio of variances) provides a measure of the relative contribution. More details on stochastic sensitivity analysis can be found in Saltelli et al. (2000).
Running the model for multiple simulated realities resulting from step 2 may be computationally demanding, particularly when the model is complex and requires much computing time. The computation time can be reduced, as the MC method is very suited for parallel computing with multi-core machines and for grid computing technology (Leyva-Suarez et al., 2015). In this way, computation times can be dramatically reduced. Numerous high-performance and parallel computing packages have been developed in R suiting various approaches. A continuously maintained descriptive list of these is available online at https://cran.r-project.org/web/views/HighPerformanceComputing.html (Eddelbuettel, 2017).

Step 4 -Compute and visualise summary statistics of model outputs
The resulting set of model outputs forms a random sample from the model output probability distribution. This means that output pdf parameters, such as the mean, median, variance, standard deviation, interquartile range, percentiles, coefficient of variation and threshold exceedance probabilities can be estimated from the sample. For example, the percentiles of the sample can be computed by the quantile() function in R. The accuracy of the sample estimates can also be computed (de Gruijter et al., 2006), although this is currently not implemented in spup. The sample of output values should be stored in case the model output is used as uncertain input for a next model. Table 1: Common assumptions for defining uncertainty models with a probability distribution function for continuous and discrete variables, and categorical variables.

Cross-correlation between variables
Continuous variable (e.g. rainfall, pollutant concentration, elevation) A pdf is often assumed to follow the normal distribution with mean µ and standard deviation σ.
Non-normal distributions (such as the exponential or lognormal distribution) may also be assumed. For distributions supported in spup see the spup manual. Alternatively, variables may be transformed to ensure that the transformed variable has a normal distribution. Common transformations are square root, logarithm and the Box-Cox transform. Hard minima and maxima can be imposed by using truncated distributions that leave the shape of distributions unchanged except that below the minimum and above the maximum the probability (density) is set to zero. For continuous uncertain inputs that are spatially correlated, the normal distribution is often imposed because without it the uncertainty quantification would become too complex. Therefore, it is recommended to assume that the input or some transformation of it is multivariate normally distributed and define the uncertainty in terms of the normally distributed (transformed) variable. In addition, 'second-order stationarity' is also often assumed (Goovaerts, 1997;Diggle and Ribeiro, 2007), indicating that the spatial correlation between errors at two locations depends only on the distance vector between the locations. This enables us to represent spatial correlation with a simple function (correlogram). The statistical dependencies between jointly normally-distributed uncertain variables are fully characterised with a correlation matrix. The correlation matrix is a square, symmetric and positivesemidefinite matrix with unit values on the diagonal and values between -1 and 1 on the off-diagonals. Correlations can be estimated using paired observations of the variables of interest. However, it is important to note that the correlation between variables as derived from paired observations need not be the same as the correlation between measurement or interpolation errors associated with these variables. E.g., DOC and NH4 concentrations of river water are typically positively correlated, while their respective measurement errors are not, if measured independently.
Discrete variable (e.g. fish count, number of wildfires in a given area and time period) The pdf can be represented by a parametric shape, e.g. Poisson distribution, with mean or rate λ, In case no appropriate shape is available, all possible outcomes and associated probabilities must be tabulated or listed in a non-parametric pdf.
When discrete or categorical variables are spatially distributed, the simplest approach is to assume spatial statistical independence or perfect spatial dependence. More advanced approaches that allow intermediate levels of statistical dependence include the Markov random field approach (Blake et al., 2011) and the Generalised Linear Geostatistical Model (Diggle and Ribeiro, 2007). It is also possible to divide or cluster the area into sub-regions, with a complete dependence within and independence between sub-regions.
Cross-dependencies between uncertain categorical variables are addressed by treating all combinations of categories as separate classes. Each combination is assigned a probability. This increases the number of classes dramatically, so that generalisation is often needed in practice and the total number of combined classes is reduced to ten or fewer.
Categorical variable (e.g. land use, building type) For most categorical variables, a parametric shape may not be available. In that case, each possible outcome and associated probability must be listed in a non-parametric pdf.
The R Journal Vol. 10/2, December 2018 ISSN 2073-4859 It is important that the uncertainty statistics are communicated in an efficient way that is simultaneously informative and understandable to users that have varied backgrounds in statistics. There should be ample explanation of the results and a visual representation of the uncertainty (MacEachren, 1992;Hengl et al., 2004). For non-spatial and non-temporal data, a number of simple statistical plots can be used to visualise uncertainty represented by probability distributions. If the full probability distribution function is known, this can be visualised in pdf or cumulative distribution function (cdf) plots. Error bars, interquartile range and box plots can be used to show the distribution of values. Uncertainties about categorical data can be shown using stacked bars for the probability for each of the categories, or entropy that provides a summary measure of the spread of probabilities over the categories.
It is important to distinguish between presentation techniques and presentation modes for uncertaintly visualisation of spatial data. General presentation techniques are (i) adjacent maps, (ii) sequential presentation and (iii) bi-variate maps (MacEachren et al., 2005). These techniques can be presented in three modes: static, dynamic and interactive (Kinkeldey et al., 2014). A common example of adjacent maps for continuous variables is to put maps of the mean and standard deviation next to each other. For visualising categorical attribute uncertainty, the simplest method is also using adjacent maps, for example by showing a map of the the most probable category next to a map of the probability of that category. Sequential presentation may be shown in dynamic mode.

The spup package
The spup package is developed in R to provide a simple solution for those who wish to include uncertainty analysis in their studies, with emphasises on studies including spatial variables. The spup package implements the methodology described in Section Methodology implemented in spup. We describe spup version 1.3-1 available on CRAN. The development version is available on gitHub at https://github.com/ksawicka/spup. The package is freely available under the GPL3.
Three example datasets are provided with the spup package: 1. Digital Elevation Model of Zlatibor region in Serbia (3x4.5km, 30m resolution) (Hengl et al., 2008) 2. Soil organic carbon and total nitrogen content of the south region of Lake Alaotra in Madagascar (33x33km, 250m resolution) (Hengl et al., 2017) 3. Rotterdam neighbourhood buildings distribution (Kadaster, The Netherlands).   becomes an input to functions for generating a MC sample. The generated MC sample in turn is used as input to functions facilitating propagation of uncertainty. Qualitative descriptions of the key functions are given in Table 2. More details about functions arguments and output can be found in the package manual (Sawicka et al., 2017) and in the examples below.

Application examples Example 1 -Uncertainty propagation analysis with auto-and cross-correlated variables
In many geographical studies, variables are cross-correlated. As a result, uncertainty in one variable may be statistically dependent on uncertainty in the other. For example, soil properties such as organic carbon (OC) and total nitrogen (TN) content are cross-correlated. These variables are used to derive the C/N ratio, which is important information to evaluate soil management and increase crop productivity. Errors in OC and TN maps will propagate into the C/N ratio map. The cross-correlation between the uncertainties in OC and TN will affect the degree of uncertainty in the C/N ratio.
The example data for C/N calculations are a 250m resolution mean OC and TN of a 33km x 33km area adjacent to lake Alaotra in Madagascar (Figure 2) (Hengl et al., 2017). The datasets include four spatial objects: a mean OC and TN of the area, with their standard deviations assumed to be equal to 10% of the mean. # load packages library(spup) library(raster) library(purrr) # set seed set.seed (12345) # load and view the data data(OC, OC_sd, TN, TN_sd) class(OC); class(TN) par(mfrow = c(1, 2), mar = c(1, 1, 2, 2), mgp = c(1.7, 0.5, 0), oma = c(0, 0, 0, 1), las = 1, cex.main = 1, tcl = -0.2, cex.axis = 0.8, cex.lab = 0.8) plot(OC, main = "Mean of Organic Carbon", xaxt = "n", yaxt = "n") plot(TN, main = "Mean of Total Nitrogen", xaxt = "n", yaxt = "n") The first step in uncertainty propagation analysis is to define an uncertainty model for the uncertain input variables, here OC and TN. First, the marginal uncertainty model is defined for each variable separately, and next the joint uncertainty model is defined for the variables together. In case of OC and TN, the ε (Eq. 1) are spatially correlated. For each of the variables, the makeCRM() function collates all necessary information into a list. We assume that the spatial autocorrelation of the OC and TN errors are both described by a spherical correlation function with a short-distance correlation of 0.6 for OC and 0.4 for TN, with a range parameter of 1000m ( Figure 3). It is important at this step to ensure that the correlation function types as well as the ranges are the same for each variable. This is a requirement for further analysis, because spup employs the linear model of co-regionalization (Wackernagel, 2003).
Generating possible realities of the selected variables is completed by using the genSample() function: # create possible realizations from the joint distribution of OC and TN MC <-100 OCTN_sample <-genSample(UMobject = mySpatialMUM, n = MC, samplemethod = "ugs", nmax = 20, asList = FALSE) Note the argument asList has been set to FALSE. This indicates that the sampling function will return an object of a class of the distribution parameters class. This is useful if there is a need to visualise the sample or compute summary statistics quickly.

Mean of OC realizations
The Further analysis may include identification of all locations where the C/N ratio is in a specific range with decreasing probability. For example identifying areas where the C/N ratio is higher than 20 might help farmers to identify which plots require action to improve soil quality ( Figure 6).

Example 2 -Uncertainty propagation analysis with categorical data
In many aspects of life, information take the form of categorical data. For example, in a city neighbourhood or a district, each building may be assigned a function such as housing, office, recreation, or other. The city council may impose different tax levels depending on the building function. Suppose that the function of a building is assigned depending on the percentage of the building used for living (i.e., the residential use) and number of addresses in the building. If the residential area is greater than 80% and has at least one address registered, then the building is classified as "residential". If the residential area is smaller than 80% and at least one address is present, then the building is classified as "office". If no address is present, the building function is classified as "other". The city council imposes annual tax of €1000 for residential buidings, €10000 for office buildlings, and €10 for other buildings. The 80% threshold is approximate, and some buildings classified as "other" could have an assigned

Areas with sufficient C/N for cropping
Definitely improvement needed Likely improvement needed

Possibly improvement needed
No improvement needed address that has not yet been entered into the tax system. Therefore, the council wishes to calculate the uncertainty in tax revenue introduced by erroneous building classification. For this example, we will use the spatial distribution of buildings in a neighborhood of Rotterdam, NL (Figure 7). The woon object is a SpatialPolygonDataFrame where each building is represented by one polygon (Figure 7). The attributes include: • vbos: The number of addresses present in the building • woonareash: The percentage residential area • Function: The assigned category depending on vbos and woonareash, where 1 represents residental; 2, office, and 3, other.
• residential: Probability that the building is residential • office: Probability that the building is an office buildling • other: Probability that the building has another function In case of categorical variables, the uncertainty is described by a non-parametric pdf. In this case, it is a vector of probabilities that a building belongs to a certain category. In case of spatially distributed variables, this must be done for each polygon, hence the dataset has maps of these probabilities saved in the same object. We further assume that uncertainties are spatially independent, implying that the joint pdf of all buildings in the neighbourhood is the product of the marginal pdfs for all buildings. Thus, we can generate possible realities for all buildings independently. To unite all information of the uncertainty model for the building function, we use the defineUM() function that collates all information into one object. # define uncertainty model for the building function woonUM <-defineUM(TRUE, categories = c("residential", "office", "other"), cat_prob = woon[, c(4:6)]) # create possible realizations of the building function woon_sample <-genSample(woonUM, 100, asList = FALSE) # view several realizations woon_s <-woon_sample@data woon_s <-lapply(woon_sample@data, as.factor) woon_sample@data <-as.data.frame(woon_s) rm ( Examples of building functions Monte Carlo realizations are shown in Figure 9. To propagate uncertainty, we run the model repeatedly with the sample created above. The tax model is defined as: # define tax model tax <-function (building_Function) { building_Function$tax2pay <-NA building_Function$tax2pay[building_Function$Function == "residential"] <-1000 building_Function$tax2pay[building_Function$Function == "office"] <-10000 building_Function$tax2pay[building_Function$Function == "other"] <-10 total_tax <-sum(building_Function$tax2pay) total_tax } As in the previous C/N ratio example, in order to run the propagation function, the sample of an uncertain input variable must be saved in a list. We must either coerce the existing woon_sample object or get it automatically by setting up asList = TRUE argument in genSample().

Example 3 -Uncertainty propagation analysis with a model written in C
Environmental models are often developed in languages other than R, such as C or FORTRAN, that can significantly speed up processing. In this example, we demonstrate how to perform uncertainty analysis with a basic model written in C.
We begin by showcasing functions to manipulate model inputs stored in ASCII files. For external models, this additional code is needed to (i) modify ASCII input files, and (ii) run the external model. For rendering ASCII input files, the mustache templating framework is available on GitHub at https://mustache.github.io, and in the R package whisker.

Summary and future directions
In this paper, we presented the R package spup which implements methodologies for spatial uncertainty propagation analysis. The package architecture and its core components were described and three examples were presented. We explained the package functions for examining the uncertainty propagation, starting from input data and model parameters via the environmental model to model predictions. Sources of uncertainty include model specification, stochastic simulation and propagation of uncertainty using MC techniques. We include recommendations for visualizing results. We also explained how numerical and categorical data types are handled, and how we accomodate spatial auto-correlation within an attribute and cross-correlation between attributes. The MC realizations may be used as an input to environmental models written in R or other programming languages that can be called from R. The presented examples and the vignettes included in the spup package provide insight into the selection of appropriate static uncertainty visualizations that are understandable to audiences of differing levels of statistical literacy. As an effective and easy tool, spup has potential to be used for multi-disciplinary research, model-based decision support and educational purposes.
The package could benefit from further development including facilitating uncertainty analysis with time series, one of the most common types of data in environmental studies. For uncertainty propagation, the package has implemented the MC approach with efficient sampling algorithms such as stratified random sampling; implementing Latin Hypercube sampling would complement this approach. Further direction may include interactive visualization methods could be developed via the shiny package (Chang et al., 2017). As it is based on MC methodology, spup is suitable to be used with parallel computing.