Non-Parametric Analysis of Spatial and Spatio-Temporal Point Patterns

The analysis of spatial and spatio-temporal point patterns is becoming increasingly necessary, given the rapid emergence of geographically and temporally indexed data in a wide range of fields. Non-parametric point pattern methods are a highly adaptable approach to answering questions about the real-world using complex data in the form of collections of points. Several methodological advances have been introduced in the last few years. This paper examines the current methodology, including the most recent developments in estimation and computation, and shows how various R packages can be combined to run a set of non-parametric point pattern analyses in a guided and intuitive way. An example of non-specific gastrointestinal disease reports in Hampshire, UK, from 2001 to 2003 is used to illustrate the methods, procedures and interpretations.

Jonatan A. González (King Abdullah University of Science and Technology (KAUST)) , Paula Moraga http://www.paulamoraga.com (King Abdullah University of Science and Technology (KAUST))
2023-08-26

1 Introduction

A spatial point pattern \(X=\{\mathbf{u}_i\}_{i=1}^n\) is a random collection of points observed within a bounded region of the plane, \(W \subset\mathbb{R}^2\). Spatial point patterns arise in a broad range of applied disciplines such as climatology, ecology, epidemiology, and seismology. A spatial point pattern can represent the locations of forest fires (Turner 2009), the occurrence of species (Moraga 2021) or the residence locations of people with a certain disease (Moraga and Montes 2011). When the points evolve in space and time, we call them spatio-temporal point patterns. In this case, we treat the data as being generated as a snapshot in space-time, and they may be viewed as a spatial point process with a further (temporal) dimension (González et al. 2016). Similarly to the spatial case, we consider a spatio-temporal point pattern as a countable set of random points \(\{(\mathbf{u}_i,v_i)\}_{i=1}^n\) (non-overlapping), where \(\mathbf{u}_i\) represents the spatial location of the \(i^{\text{th}}\) event and all of them belong to a bounded planar region \(W \subset\mathbb{R}^2\). Here, \(v_i\) is associated with time and it belongs to a compact positive interval \(T\subset \mathbb{R}_+\). \(W \times T\) is usually called the spatio-temporal observation window. Examples of spatio-temporal point patterns include the locations of individuals with an infectious disease in which the time of infection is known (Diggle 2013), the starting (or ending) locations of tornadoes with their registered times (González et al. 2019) or the location and time at which some crime was committed (Rodrigues and Diggle 2012). Point pattern analyses may also be performed in smaller dimensions, such as straight segments (Daley and Vere-Jones 2003). For example, we can consider a time segment as a straight segment and record the exact time the heartbeat of a patient with arrhythmia occurs during an electrocardiogram. In addition, we can also have marked point patterns in which each of the points has some additional characteristics. For example, a marked point pattern may represent a pattern of beta-type ganglion cells in a cat’s retina that can be classified anatomically as the marks “on” or “off” (Wässle et al. 1981).

When dealing with point patterns, it is desirable to learn as much as possible about the probability distribution that governs the occurrence of points in a given point pattern. This is a difficult task since, in general, there is only a single set of points or realisation of the point process. However, many methods have been developed that serve to estimate specific characteristics of the distribution, such as the intensity, which describes the expected number of points per unit area; or the clustering or aggregation behaviour, that is, whether the points have a particular attraction or repulsion mechanism. Statistical models for point processes can also be formulated, and these may consider previous knowledge to describe trends, variability, and different factors that may impact them. Point patterns are usually considered a random sample of a stochastic mechanism called point process. The most typical families of point processes are Poisson processes. There are two classes of Poisson processes; the homogeneous, whose realisations in any subset of the observation window are uniform random samples with a constant mean \(\lambda\), and the inhomogeneous, obtained by replacing the intensity \(\lambda\) by a spatial, temporal or spatio-temporal varying one (Illian et al. 2008; Diggle 2013). Homogeneous Poisson processes are benchmark models for spatial and spatio-temporal point pattern data, often called complete spatial random (CSR) or complete spatio-temporal random (CSTR) point processes; i.e., point processes whereby point events appear in a completely random fashion.

A large body of literature has been developed in recent years to study point patterns and many statistical software packages have been developed for their estimation and analysis. For instance, packages such as spatial (Venables and Ripley 2002) or splancs (Rowlingson and Diggle 1993) initially developed for S-PLUS but available on CRAN are classics for displaying and analysing spatial point pattern data. Packages as sp (Pebesma and B. 2005; Bivand et al. 2013), or more recently, sf (Pebesma 2018), were created to provide a uniform interface for handling 2D and 3D data. spatstat (Baddeley and Turner 2005; Baddeley et al. 2015) is one of the most popular and improved packages for analysing spatial point patterns It is divided into several subpackages: spatstat.utils, spatstat.sparse, spatstat.data, spatstat.geom, spatstat.random, spatstat.explore, spatstat.model and spatstat.linnet. Some extra specialised packages such as smacpod or sparr (Davies et al. 2018) have methods for analysing case-control point patterns; or spatgraphs for graphs-related analyses of locations in 1D, 2D, 3D.

Even though authors have considered the combination of space and time in theory, and nowadays, spatial data often includes a time component, only a few examples show this in practice. This problem may be due to the complexity of the computations, added to the fact that the authors tend to delve into particular contexts. There are some tools available for dealing with spatio-temporal point pattern data. The stpp package (Gabriel et al. 2013) covers many models and functional descriptors related to point process methods to study spatio-temporal phenomena. The PtProcess package (Harte 2010) was designed to fit time-indexed point process models in the same simplified fashion as generalised linear models software. The lgcp (Taylor et al. 2013), as well as the inla (Rue et al. 2009; Moraga 2019); http://www.r-inla.org and inlabru (Bachl et al. 2019) packages, are suitable options for statistical inference in space and time.

In this paper, we demonstrate how R (R Core Team 2023) can be used to study spatial and spatio-temporal point process data using a dataset of geographic locations and times of individuals with non-specific gastrointestinal infections in Hampshire, UK, from 2001 to 2003 (Diggle et al. 2003). We consider the data as realisations of spatial and spatio-temporal point processes that lack spatial and temporal homogeneity; i.e., the expected number of points in each area unit of the study region depends on their location and time. Consequently, the first problem we solve is the intensity estimation. We then investigate the nature of the stochastic interactions between the process points after adjusting for spatial and spatio-temporal inhomogeneity. We use second-order descriptors to describe these interactions.

Along with this paper, we employ several R packages to successfully analyse the complex point patterns seen in both the spatial and the spatio-temporal settings. For dealing with spatial point patterns, we mainly use the spatstat package. In addition, we use the sparr package for analysing the relative risk and the GET package (Myllymäki and Mrkvička 2020) for testing hypotheses with envelopes. For the spatio-temporal analysis, we use the stpp package. We also make use of other R packages such as parallel, foreach (Microsoft and Weston 2022b) and doParallel (Microsoft and Weston 2022a) to optimise the computing time by dividing the burden by several processors. An R script containing all the necessary code for reading the data and running the analyses shown in this paper is provided as supplementary material.

Spatio-temporal point pattern data

We use the geographic locations and time information of individuals who had non-specific gastrointestinal infections in Hampshire, UK, between 2001 and 2003 (Diggle et al. 2003). These data come from the Ascertainment and Enhancement of Gastrointestinal Infection Surveillance and Statistics (AEGISS) project (Diggle et al. 2003), which, using the Hampshire area as a test, seeks to identify irregularities, i.e., high or low intensity spots in the spatio-temporal distribution of non-specific gastrointestinal infections in the UK.

The data consist of a collection of points \(\{(\mathbf{u}_i,v_i)\}_{i=1}^n\), where \(\mathbf{u}_i\) is a spatial coordinate with two components, i.e., \(\mathbf{u}_i = (a_i, b_i), a_i,b_i \in \mathbb{R}\), and \(v_i\in \mathbb{R}_+\) is a temporal coordinate. These points represent the locations and times of daily healthcare providers reports of non-specific gastrointestinal disease in Hampshire from 2001 to 2003. After cleaning the data to remove non-Hampshire residents (out of the Hampshire region) and multiple (coincident) points, there were \(n = 10443\) cases of non-specific gastrointestinal disease by a phone-in triage service operating within the UK’s National Health Service. The command sequence

library(spatstat)
library(viridis)
load("AegissData.RData")

loads the data in the local environment of R. Each location corresponds to the centroid of the unit post-code of the residential address of the caller. The unit of distance is one kilometre. The unit of time is one day, with day 1 corresponding to 1 January 2001. We can inspect the first spatio-temporal coordinates of our dataset by writing

head(as.data.frame(Aegiss))
       x      y marks
1 478.55 134.85     1
2 458.15 107.35     1
3 449.05 128.95     1
4 450.75 106.95     1
5 461.05  99.55     1
6 446.35 109.15     1

Notice that x and y are the coordinates. Times are not labelled as dates, but they are integers, and their column is named as ‘marks’. Finally, the planar region that encloses the points corresponds to a 120-sided polygon representing the boundary of the study region of Hampshire. To visualise the spatio-temporal point pattern, we can depict a spatial map using the symbolmap() function of the spatstat.geom package setting the times as colours. The colours are selected by using the viridis (Garnier et al. 2023) (Figure 1).

Times <- Aegiss$marks
timelabels <- as.Date(Times - 1, origin = "2001-01-01")
colmap <- viridis(length(Times))
sy <- symbolmap(pch = 21, col = "black", bg = colmap, range = range(timelabels))

We use the locations and times of the point pattern and the population density of Hampshire to study the point process that generates these data. We obtain the mean population density of Hampshire from WorldPop (University of Southampton). The population information detailed for each squared kilometre is available in a yearly basis. Figure 1 shows the point pattern of the cases and the population density. First, we save only the spatial locations with no marks by writing X <- unmark(Aegiss). Then, we attach the date labels to the points using the operator %mark%. Note that symap = sy sets the colour and point shapes previously defined and that the six dates in the legend are a sample to represent the colour variation. To summarise the population density, given in a set of three images, 2001, 2002 and 2003, we use the function im.apply() of the package spatstat.geom to average them.

par(mfrow = c(1, 2), mar = c(0,0,2,0.5)) 
X <- unmark(Aegiss)
PopDens <- im.apply(Population, mean)
plot(X %mark% timelabels, symap = sy, 
     main = "Gastrointestinal disease reports")
plot(PopDens, col = viridis(prod(PopDens$dim)), log = T, box = F,
     main = "Population intensity")
Left: Locations of 10443 reports of non-specific gastrointestinal disease in Hampshire from January 2001 to July 2003. The time is treated as a quantitative mark where darker dots correspond to the older events, and lighter dots match the most recent reports. Right: Hampshire population density averaged over the years 2001-2003. The colour map is evenly-spaced on a logarithmic scale.

Figure 1: Left: Locations of 10443 reports of non-specific gastrointestinal disease in Hampshire from January 2001 to July 2003. The time is treated as a quantitative mark where darker dots correspond to the older events, and lighter dots match the most recent reports. Right: Hampshire population density averaged over the years 2001-2003. The colour map is evenly-spaced on a logarithmic scale.

2 Spatial analysis

Spatial intensity and relative risk

The first-order intensity function or the intensity of a point process can be defined as the expected number of events per unit space, or mathematically (Diggle 2013), \[ \lambda(\mathbf{u})=\lim_{|\text{d}\mathbf{u}|\rightarrow 0}\frac{\mathbb{E}N(\text{d}\mathbf{u})}{|\text{d}\mathbf{u}|}, \] where \(N()\) denotes the number of points of a region. In the inhomogeneous case, we can compute a kernel smoothed intensity function from a point pattern \(\{\mathbf{u}_i\}_{i=1}^n\) without the temporal coordinates. The estimator is given by \[\begin{equation} \hat{\lambda}(\mathbf{u})=\frac{1}{e^2(\mathbf{u})}\sum_{i=1}^{n}\kappa_{\epsilon}^{2}(\mathbf{u}-\mathbf{u}_i), \quad \text{and} \quad e^2(\mathbf{u})=\int_{W} \kappa_{\epsilon}^{2}(\mathbf{s}-\mathbf{u}) \text{d} \mathbf{s}, \tag{1} \end{equation}\] where \(\mathbf{u}\in W\) and \(\kappa_{\epsilon}^{2}\) is a two-dimensional Gaussian kernel. \(e()\) is known as uniform edge correction and is intended to correct the bias of the estimation at the edges of the region (Baddeley et al. 2015). One of the most challenging considerations is selecting the bandwidth \(\epsilon\); there is no single best procedure for doing that, but the estimation can be improved by using a spatially varying bandwidth despite its computational complexity (Davies and Baddeley 2018). For computing the adaptive estimate of the intensity function, we use the adaptative.density() function of the spatstat.explore package,

# This takes roughly 12 seconds to be executed in an
# iMAC 2019, 3GHz intel core i5 processor with 16GB of RAM (hereinafter)
SpatialDens <- adaptive.density(X, method = "kernel", edge = T)

The estimated intensity is shown in Figure 2 (left). We can see a high concentration of cases (more than two per square kilometre) in Southampton and its coastal environs. More to the north, there are also local foci in Winchester (the centre of the study region) and the primary urban centres located in the north and northeast of the study region.

The kernel log relative risk or density ratio function is a descriptor aimed to compare two estimated densities on the study region \(W\). In epidemiology, it is handy to examine disease risk fluctuations based on case and control samples. In recent years, many improvements have been developed to the estimation methodology, including spatially adaptive smoothers and inference based on asymptotic theory (Davies et al. 2018). The related techniques are implemented in the sparr package. Given two spatial point patterns, \(X\), with intensity \(\hat{\lambda}(\mathbf{u}|X)\) (that can be thought of as cases) and \(Y\), with intensity \(\hat{\lambda}(\mathbf{u}|Y)\) (that can be thought of as controls); the estimated log relative risk is defined as \[\begin{equation} \hat{r}(\mathbf{u})=\log \left\{\frac{\hat{\lambda}(\mathbf{u}|X)}{\hat{\lambda}(\mathbf{u}|Y) } \cdot \frac{n_Y}{n_X} \right\}, \quad \mathbf{u} \in W, \tag{2} \end{equation}\] where \(n_X\) and \(n_Y\) are the number of points of \(X\) and \(Y\), respectively. When \(\hat{r}(\mathbf{u}) \approx 0\), the densities are roughly the same; peaks greater than zero mean a higher local ratio of cases to controls, and \(\hat{r}(\mathbf{u}) < 0\) indicates lack of concentration of cases than controls.

Here, we compute the relative risk by comparing the point pattern of gastrointestinal infections and a point pattern with controls derived from a random sample of the underlying population. We first generate a set of random controls by simulating \(n=1443\) locations with intensity proportional to the population density using the rpoispp() function of package spatstat.random. In order to do that, we calculate the population density as the population intensity normalised by its mass (its integral over the observation window) calculated by using the function integral.im() of the package spatstat.geom in each region of Hampshire. Then we can evaluate expressions involving one or more pixel images through the eval.im() function of package spatstat.geom,

N <- integral.im(PopDens) 
n <- X$n
ControlDens <- eval.im((PopDens / N) * n)
Controls <- rpoispp(ControlDens)

Then we assign Hampshire’s window to the new point pattern by using the Window() function of package spatstat.geom,

Window(Controls) <- X$window

For estimating the relative risk, we use the sparr package to perform an adaptive smoothing adequately designed to deal with the reduced smoothing in densely populated areas. The code

# This takes roughly 22 seconds to be executed 
library(sparr)
RelativeRisk <- risk(f = X, g = Controls, adapt = T, 
                     pilot.symmetry = "pooled", tolerate = T)

takes both point patterns (cases f and controls g), computes an adaptive smoothing for estimating the risk, and computes asymptotic tolerance contours as per Hazelton and Davies (2009). Maps displayed in Figure 2 showing the intensity and relative risk estimates are produced by executing

par(mfrow = c(1, 2), mar = c(0,0,2,0.5)) 
plot(SpatialDens, col = viridis(1200), auto.axes = F,
     main = "Adaptative kernel intensity estimate", box = F)
plot(RelativeRisk, auto.axes = F, tol.type = "upper", 
     main = "Relative risk estimate")