The IDSpatialStats R Package: Quantifying Spatial Dependence of Infectious Disease Spread

John R Giles; Henrik Salje; Justin Lessler

1 Introduction

The transmission process which drives an epidemic can be characterized by the spatial distance separating linked cases. When these transmission events accumulate over time, they are observed as areas of elevated disease prevalence. Knowledge of the extent of the affected area and where new cases may arise is crucial for many disease control strategies (e.g. ring vaccination, vector control etc). In epidemiology, case occurrence data—(\(x\),\(y\)) coordinates with temporal information (\(t\)) and other covariates—are often used to understand these types of infectious disease dynamics. These data are typically treated as a generic point process so that they can be described in terms of spatial intensity (expected number of cases per unit area) or clustering due to spatial dependence (covariance in \(x\),\(y\) space).

In the broader field of spatial statistics, there are many methods that measure the spatial intensity or clustering of a generic point process on a Cartesian (\(x\),\(y\)) coordinate system (Table 1). These methods primarily fall into three categories: first-order first-moment (FOFM), first-order second-moment (FOSM), and second-order second-moment (SOSM). The FOFM measures use quadrature (aggregate counts of points within cells) to quantify intensity of the point pattern continuously over (\(x\),\(y\)) space. Packages such as lgcp (Taylor et al. 2013, 2015) and ppmlasso (Renner and Warton 2013) allow users to model first-order intensity as a count process using a regressive function of (\(x\),\(y\)) coordinates and other covariates. The FOSM measures—Moran’s I and Geary’s C (Moran 1950; Geary 1954)—also use quadrature, but they describe general covariance among cell counts across the (\(x\),\(y\)) dimensions. These spatial statistics can be calculated using the spdep R package (Bivand and Piras 2015). The SOSM measures, such as the \(K\)-function and its non-cumulative analogue, the pair correlation function, quantify clustering among neighboring points. Both the FOSM and SOSM measures are considered global spatial statistics because they describe spatial dependence for the entire study area. However, the SOSM can further describe how spatial dependence changes as a function of distance by comparing the observed intensity of neighboring points within distance \(d\) to that expected under complete spatial randomness. The \(K\)-function and pair correlation function can be calculated using the ads (Pélissier and Goreaud 2015), spatstat (Baddeley et al. 2016), and splancs (Rowlingson and Diggle 2017) R packages.

These classic spatial measures are limited in their ability to describe infectious disease dynamics primarily because they treat case occurrence data as a generic point process. The FOFM and FOSM measures use quadrature, which make them vulnerable to error associated with data aggregation (Robinson 2009) and the modifiable areal unit problem (Openshaw and Taylor 1979). The SOSM measures, like the \(K\)-function and pair correlation function, are more common in epidemiology. However, their statistical interpretation is less intuitive in terms of classic epidemiological quantities of relative disease risk, such as the incidence rate ratio or hazard ratio. Additionally, even the temporal forms of these functions (e.g. the space-time \(K\)-function) do not capture the typical distances traveled in a single transmission generation as they quantify the overall spatial dependence between all cases, not just those epidemiologically linked. The mean distance between sequential cases in a transmission chain is an important epidemiological quantity because it provides insight into potential mechanisms driving spread as well as helping inform interventions. Therefore, we developed novel measures that build upon concepts in spatial statistics to characterize infectious disease spread using case occurrence data. Importantly, these measures are robust to heterogeneities in the underlying population, and substantial case under-reporting, which is common in epidemiology.

We describe these two measures of spatial dependence for infectious diseases and show how they can be calculated with the IDSpatialStats R package in the following three sections. First, we introduce a function which simulates infectious disease spread as a spatial branching process. This function is primarily intended to simulate example datasets for the est.transdist family of functions and \(\tau\)-statistics that use temporal information to indicate linked cases. Second, we demonstrate how to estimate the mean and standard deviation of the spatial transmission kernel (Salje et al. 2016b). Estimating the spatial transmission kernel requires an understanding of the number of transmission generations separating cases at different time points of the epidemic. This method provides a measure of fine-scale spatial dependence between two cases, which can be interpreted as the mean distance between sequential cases in a transmission chain. Third, we describe a measure of global clustering—the \(\tau\)-statistic—that calculates the relative risk of infection given some criteria to identify cases closely related along a chain of transmission (Lessler et al. 2016). The \(\tau\)-statistic is a global clustering statistic—like the \(K\)-function and pair correlation function—that provides an overall measure of clustering for the entire course of an outbreak. Depending on the parameterization, the \(\tau\)-statistic represents the odds of observing another case with distance \(d\) of an infected case compared with either the underlying population or other pathogen types. The following sections contain a brief introduction to each statistic to provide context to the code implementation—for more detailed description of each statistic, see (Lessler et al. 2016) and (Salje et al. 2016b).

We have implemented these tools in the IDSpatialStats R package version 0.3.5 and above. The latest stable release depends on the doParallel and foreach packages (Microsoft and Weston 2017; Corporation and Weston 2018) and can be downloaded from CRAN. A development version of the package is also available on Github at https://github.com/HopkinsIDD/IDSpatialStats.

Table 1: A selective list of R packages for the analysis of spatial point pattern data. This list is not exhaustive. Visit the *Spatial* CRAN Task View for a more comprehensive list of resources.
Package	Description	Citation
ads	K function for enclosed point patterns	(Pélissier and Goreaud 2015)
DCluster	disease clustering for count data	(Gómez-Rubio et al. 2005)
lgcp	modeling point patterns	(Taylor et al. 2013, 2015)
	with log-Gaussian Cox processes
ppmlasso	modeling point patterns	(Renner and Warton 2013)
	with LASSO regularization
SGCS	third order clustering of point processes	(Rajala 2017)
sparr	spatial relative risk functions with kernel smoothing	(Davies et al. 2011)
spatstat	comprehensive tools for analyzing	(Baddeley et al. 2016)
	point patterns in many dimensions
spdep	classic statistics to test for spatial dependence	(Bivand and Piras 2015)
splancs	kernel smoothing and Poisson cluster processes	(Rowlingson and Diggle 2017)

Simulating spatial disease spread

We use a stochastic spatial branching process to simulate epidemiological data in the sim.epidemic function. Simulations begin with an index case at \((x,y,t) = (0,0,0)\) and transmission events that link two cases follow according to a random Markov process in \((x,y)\) space (i.e. Brownian motion). The number of transmission events occurs according to a Poisson distribution, with its mean and variance set to the effective reproduction number \(R\) of the infecting pathogen. The spatial distance traversed by each transmission event is given by a user specified probability distribution which serves as the dispersal kernel function. When specifying the dispersal kernel, the trans.kern.func argument expects a list object containing a probability distribution function and its named arguments. For example, to simulate an epidemic where transmission typically occurs at the local level, but long distance transmission events sometimes occur, an exponential transmission kernel might be used because of its long tail. Alternatively, if transmission is expected to consistently occur within a given range, then a normal kernel may be more appropriate.

In simulations where the basic reproductive number \(R_0\) is used to define a constant \(R\)-value and \(R_0>1\), the number of cases will continue to increase with each time step. This effect may not be appropriate when simulating settings where intervention efforts or depletion of susceptible individuals causes heterogeneity in \(R\) over the course of the epidemic. Thus, the sim.epidemic function accepts either a scalar value for a constant \(R\) value or a vector of \(R\) values with a length equal to tot.generations, allowing simulations with a variable \(R\) value, as shown in the following R code.

# Epidemic simulations with variable R value
R1 <- 2
R2 <- 0.5
tot.gen <- 12
R.vec <- seq(R1, R2, (R2 - R1)/(tot.gen - 1))
dist.func <- alist(n=1, a=1/100, rexp(n, a))
sim <- sim.epidemic(R=R.vec, gen.t.mean=7, gen.t.sd=2, min.cases=100,
                    tot.generations=tot.gen, trans.kern.func=dist.func)

head(sim, n=4)
          x          y t
1   0.00000   0.000000 0
2  24.46125   3.280527 3
3 -60.73475 184.885784 7
4 -12.79933 -57.798696 4

sim.plot(sim)

graphic without alt text — Figure 1: Left: the spatial distribution of simulated cases with the red cross showing the index case. Right: the epidemic curve for a simulation with an \(R\) value decreasing from 2 to 0.5 over the course of the epidemic.

2 The mean transmission distance

In (Salje et al. 2016b), we introduced a method to estimate the mean and standard deviation of the spatial transmission kernel using case occurrence data. These data include location \((x, y)\) and onset time \(t\) of each case (case times) and the infecting pathogen’s generation time \(g(x)\). To estimate these spatial statistics, we use the Wallinga-Teunis (WT) method (Wallinga and Teunis 2004) to probabilistically estimate the number of transmission events required to link two cases, denoted as \(\theta\). In settings where a phylogenetic model or contact tracing provide information on transmission pathways, the spatial kernel can be empirically estimated using the distribution of observed distances among all linked cases. The mean and standard deviation of this kernel can then be calculated for any time interval between \(t_1\) and \(t_2\) to give \(\mu^{obs}_t(t_1, t_2)\), with the assumption that the number of transmission events separating all case pairs is homogeneous (\(\theta = 1\)). When data that indicate case linkage are lacking, this assumption is incorrect because the distance between two cases depends on the number of transmission events separating them. In this case, the mean transmission distance at each time interval \(\mu_t\) must be estimated as a weighted mean: \[\mu_t(t_1,t_2,\mu_k,\sigma_k) = \sum_i w(\theta=i,t_1,t_2) \cdot \mu_a(\theta=i,\mu_k,\sigma_k).\] Where, \(w(\theta=i,t_1,t_2)\) gives the weight for each of the \(i\) elements of \(\theta\) and the second term \(\mu_a(\theta=i,\mu_k,\sigma_k)\) gives the mean distance separating case pairs that are linked by the \(i\)th value of \(\theta\).

We have implemented four nested functions that are used to estimate \(w(\theta=i,t_1,t_2)\) and describe them briefly below. Listed in order, they are comprised of est.wt.matrix.weights, est.wt.matrix, get.transdist.theta, and est.transdist.theta.weights. Although, these functions are documented separately, they are all driven by the est.transdist family of functions and do not need to be run manually unless desired.

Wallinga-Teunis matrices

The est.wt.matrix.weights function builds upon code from the R0 package (Obadia et al. 2012) to calculate the basic WT matrix (Wallinga and Teunis 2004). This matrix gives the probability that a case at time \(t_i\) (rows) infected a case at time \(t_j\) (columns), i.e. \(\theta = 1\), based on the generation time distribution of the pathogen \(g(x)\). For an epidemic with \(t\) unique case times, est.wt.matrix.weights gives a \(T \times T\) matrix.

The est.wt.matrix function produces a WT type matrix for all infector-infectee case pairs. Given the WT matrix produced by est.wt.matrix.weights and total case count \(n\), this function calculates an \(n \times n\) matrix giving the probability that case \(i\) (rows) infected case \(j\) (columns). The WT matrix object can be handed directly to est.wt.matrix via the basic.wt.weights argument, or if this argument is NULL, the est.wt.matrix.weights function is called automatically.

# Calculating Wallinga-Teunis matrices
case.times <- c(1,2,2,3,3) # times each case occurs
g.x <- c(0, 2/3, 1/3, 0, 0) # hypothetical generation time of a pathogen

mat.wts <- est.wt.matrix.weights(case.times=case.times, gen.t.dist=g.x)

# Calculate infector-infectee Wallinga-Teunis matrix
wt.mat1 <- est.wt.matrix(case.times=case.times, gen.t.dist=g.x,
                         basic.wt.weights=mat.wts)
wt.mat2 <- est.wt.matrix(case.times=case.times, gen.t.dist=g.x)

identical(wt.mat1, wt.mat2) # the two methods are equivalent
[1] TRUE

Estimation of \(\theta\) weights

The get.transdist.theta function estimates the number of transmission events \(\theta\) separating pairs of cases using the probabilities in the infector-infectee WT matrix produced by the est.wt.matrix function. Sampling all possible transmission trees is impractical for most datasets, so this function constructs a transmission tree by randomly selecting the infector of each case in the epidemic and then \(\theta\) is determined by finding the product of all probabilities in the chain of transmission that link the randomly sampled case pairs.

The object theta.wts (in the code segment below) contains a three-dimensional array [\(i\),\(j\),\(k\)], where the rows \(i\) and columns \(j\) represent unique case times and the third dimension \(k\) is the number of transmission events \(\theta\). Each cell gives the probability that two cases occurring at times \(i\) and \(j\) are connected by \(\theta\) transmission events in the randomly sampled transmission tree. Probabilities in each [\(i\),\(j\),\(\cdot\)] row are normalized across all \(\theta\) values. The get.transdist.theta function samples a single randomized transmission tree from the epidemic data, therefore we want to simulate many iterations of this random sampling to get a better estimate of the true distribution of \(\theta\).

The est.transdist.theta.weights function estimates the distribution of \(\theta\) across all \(t_i\) and \(t_j\) combinations by simulating many iterations of transmission trees using the get.transdist.theta function. Its output is the same as the get.transdist.theta function, however, it represents the normalized probabilities after n.rep number of simulations.

# Estimate theta weights
case.times <- c(1,2,2,3,3) # times each case occurs
g.x <- c(0, 2/3, 1/3, 0, 0) # hypothetical generation time distribution of a pathogen
gen.time <- 1 # mean generation time
n.gen <- round((max(case.times) - min(case.times)) / gen.time) + 1 # total generations

# Calculate infector-infectee Wallinga-Teunis matrix
wt.mat <- est.wt.matrix(case.times=case.times, gen.t.dist=g.x)

# Estimated theta weights from five randomized transmission trees
theta.wts <- est.transdist.theta.weights(case.times=case.times, n.rep=5,
                                         gen.t.mean=gen.time, t1=0, t.density=g.x)
theta.wts[,,1]
      [,1]   [,2] [,3]
[1,] 0.000    NaN  NaN
[2,] 0.625 0.0000  NaN
[3,] 0.000 0.4375    0

Estimating mean of transmission kernel

To estimate the mean transmission distance over the duration of the epidemic we must use the observed distances between case pairs given the time they occurred \(\mu^{obs}_t(t_i,t_j)\) and combine them into an overall estimate of the mean of the transmission kernel \(\mu_k\). The workhorse function est.trandsdist estimates the overall mean \(\mu_k\) and standard deviation \(\sigma_k\) of the kernel. This function first calls the est.wt.matrix.weights, est.wt.matrix, get.transdist.theta, and est.transdist.theta.weights functions described above to estimate the distribution of \(\theta\) across all case pairs and then calculates each of the weights \(w(\theta=i,t_1,t_2)\). The weights are calculated as the proportion of all case pairs occurring at \(t_i\) and \(t_j\) that are separated by each estimated \(\theta\) over all simulations: \[\hat{w}(\theta = i, t_1, t_2) = \frac{\sum^{N_{sim}}_{k=1} \sum^n_{i=1} \sum^n_{j=1} \boldsymbol{I}_1(t_i=t_1, t_j=t_2, \Theta_{ij} = \theta)} {N_{sim} \sum^n_{i=1} \sum^n_{j=1} \boldsymbol{I}_2(t_i=t_1, t_j=t_2)}.\] Here, the functions \(\boldsymbol{I}_1\) and \(\boldsymbol{I}_2\) indicate if two cases occurred at time \(t_i\) and \(t_j\) and were linked by \(\theta\) transmission events, or if they just occurred at \(t_i\) and \(t_j\) respectively. In words this can be written as: \[\hat{w}(\theta = i, t_1, t_2) = \frac{\text{Total cases at} \,t_1\, \text{and} \,t_2\, \text{across all simulations separated by} \,\theta\, \text{transmission events}} {\text{Total cases at} \,t_1\, \text{and} \,t_2\, \text{across all simulations}}.\] Once the weights of the \(\theta\) values are estimated, the est.transdist function calculates \(\mu_k\) and \(\sigma_k\) as the average weighted estimate over all combinations of \(t_i\) and \(t_j\). If we now let \(k\) index the vector of \(\theta\) values, then: \[\hat \mu_k = \hat \sigma_k = \frac{1}{\sum_i \sum_j n_{ij}} \sum_i \sum_j \frac{2 \cdot \mu^{obs}_t (t_i,t_j) \cdot n_{ij}}{\sum_k \hat w (\theta = k, t_i,t_j) \cdot \sqrt{2 \pi k}}.\] For a derivation of these equations, see sections 2.3 and 2.4 of (Salje et al. 2016b).

The est.trandist function requires case occurrence data—a matrix with three columns [\(x\),\(y\),\(t\)]—and the mean and standard deviation of the infecting pathogen’s generation time (for calculating WT matrices) as input. The function returns estimates of \(\mu_k\) and \(\sigma_k\) of the spatial transmission kernel. These estimates are made under the assumption that \(\mu_k = \sigma_k\), so the upper bound of \(\hat{\mu_k}\) and \(\hat{\sigma_k}\) are also provided for when this assumption is violated. Bound estimates are equal to \(\sqrt 2\) times the values estimated under the \(\mu_k = \sigma_k\) assumption (see section 2.5 of (Salje et al. 2016b)). Additional constraints on the estimation of \(\mu_k\) and \(\sigma_k\) can be defined in the remaining arguments, such as the time step in which the analysis should begin (t1), the maximum number of time steps (max.sep) and maximum spatial distance (max.dist) to consider when estimating \(\theta\), and the number of randomized transmission tree simulations to run (n.transtree.reps).

To estimate the uncertainty around \(\hat{\mu}_k\) due to sampling or observation error, we have implemented a wrapper function called est.transdist.bootstrap.ci that performs bootstrap iterations using the est.transdist function. Upon each iteration, the epidemiological data are resampled with replacement and \(\mu_k\) is re-estimated. The est.transdist.bootstrap.ci function contains all the same arguments as the est.transdist function, with additional arguments defining the number of bootstrapped iterations to perform, the high and low boundaries of the desired confidence interval, and options for running the bootstrap analysis in parallel.

When parallel computation is enabled (the default is parallel = FALSE), the function uses the makeCluster() function of the parallel package to make the appropriate cluster type for the operating system of the local machine (SOCK cluster for Windows or a Fork cluster for Unix-like machines). The cluster is then registered as the parallel backend for the foreach package, which is used to run the bootstraps in parallel. The user can define the number of cores to use when running in parallel using the n.cores argument. If parallel = TRUE and n.cores = NULL, the function will use half the total cores on the local machine.

# Estimate transmission distance for simulated data
set.seed(123)

dist.func <- alist(n=1, a=1/100, rexp(n, a)) # Dispersal kernel
sim <- sim.epidemic(R=2, gen.t.mean=7, gen.t.sd=2, min.cases=100,
                    tot.generations=8, trans.kern.func=dist.func)

# Estimate mean transmission distance
sim.transdist <- est.transdist(epi.data=sim, gen.t.mean=7, gen.t.sd=2, t1=0,
                               max.sep=1e10, max.dist=1e10, n.transtree.reps=10)

sim.transdist
$mu.est
[1] 92.79699
$sigma.est
[1] 91.45614
$bound.mu.est
[1] 131.2348
$bound.sigma.est
[1] 129.3385

# Estimate confidence intervals around mean
sim.transdist.ci <- est.transdist.bootstrap.ci(epi.data=sim,
                                               gen.t.mean=7,
                                               gen.t.sd=2,
                                               t1=0,
                                               max.sep=1e10,
                                               max.dist=1e10,
                                               n.transtree.reps=10,
                                               boot.iter=5,
                                               ci.low=0.025,
                                               ci.high=0.975)
sim.transdist.ci
$mu.est
[1] 131.2124
$mu.ci.low
2.5%
128.2505
$mu.ci.high
97.5%
134.3312

Change in mean transmission distance over time

An estimate of \(\mu_k\) over the duration of an epidemic is indicative of the overall spatial dependence. However, conditions may change over the course of an epidemic that alter the spatial scale upon which transmission operates. To quantify temporal heterogeneity in the mean transmission distance, we have implemented the est.transdist.temporal and est.transdist.temporal.bootstrap.ci functions, which estimate the change in \(\hat{\mu}_k\) over time and its bootstrapped confidence intervals respectively.

When applying the temporal versions of the est.transdist functions, it is important to consider the sample size at each time step because the est.transdist.temporal function estimates \(\mu_k\) for all cases leading up to each unique time step. Some time steps at the beginning of an epidemic may be returned as NA if there are not enough unique cases to estimate \(\mu_k\). Furthermore, in scenarios where time steps in the beginning of an epidemic have low sample sizes, such as an epidemic with a low \(R_0\), \(\hat{\mu}_k\) may be over- or under-estimated and display larger confidence intervals due to sampling error. Therefore, we recommend either setting the t1 argument to the first time step that contains a sufficient sample size, or plotting results along with cumulative sample size as we have done in Figure 3.

# Estimate temporal transmission distance for simulated data
set.seed(123)
dist.func <- alist(n=1, a=1/100, rexp(n, a)) # Dispersal kernel
sim <- sim.epidemic(R=2, gen.t.mean=7, gen.t.sd=2, min.cases=100,
                    tot.generations=8, trans.kern.func=dist.func)

# Estimate mean and confidence intervals at each time step
sim.temp.transdist.ci <- est.transdist.temporal.bootstrap.ci(epi.data=sim,
                                                             gen.t.mean=7,
                                                             gen.t.sd=2,
                                                             t1=0,
                                                             max.sep=1e10,
                                                             max.dist=1e10,
                                                             n.transtree.reps=10,
                                                             boot.iter=5,
                                                             ci.low=0.025,
                                                             ci.high=0.975)
head(sim.temp.transdist.ci)
   t    pt.est    ci.low   ci.high n
1  0        NA        NA        NA 1
2  3        NA        NA        NA 2
3  8        NA        NA        NA 3
4  9  44.61359  35.52047  52.24099 4
5 10 101.55313  43.99469 203.11247 5
6 11 189.29767 113.79560 224.79960 6

Application to foot-and-mouth disease

To provide an example of how the functions shown above can be applied to real data, we estimate the mean transmission distance for the 2001 foot-and-mouth epidemic among farms in Cumbria, UK. These data can be found in the fmd data object included in the sparr package (Davies et al. 2011). It contains transformed (\(x\),\(y\)) coordinates of the infected farms and the time step (\(t\)) in which it was infected, which is given in days since the index farm was infected (Figure 2). The generation time for foot-and-mouth disease is estimated to have a mean of 6.1 days and a standard deviation of 4.6 days (Haydon et al. 2003), so we use these in the gen.t.mean and gen.t.sd arguments in the est.transdist.temporal.bootstrap.ci function.

library(sparr)
data(fmd)
fmd <- cbind(fmd$cases$x, fmd$cases$y, fmd$cases$marks)

head(fmd, n=3)
[,1]     [,2] [,3]
[1,] 333.0328 541.3405   52
[2,] 336.1428 543.3462   46
[3,] 341.4762 551.1794   38

sim.plot(fmd)

# NOTE: this function may take a while depending on the data set
fmd.trans <- est.transdist.temporal.bootstrap.ci(epi.data=fmd,
                                                 gen.t.mean=6.1,
                                                 gen.t.sd=4.6,
                                                 t1=0,
                                                 max.sep=1e10,
                                                 max.dist=1e10,
                                                 n.transtree.reps=5,
                                                 boot.iter=10,
                                                 ci.low=0.025,
                                                 ci.high=0.975,
                                                 parallel=TRUE,
                                                 n.cores=detectCores())
par(mfrow=c(1,1))
fmd.trans[,2:4] <- fmd.trans[,2:4]/1000 # Convert to km
plot(fmd.trans$t, fmd.trans$pt.est, pch=19, col='grey', ylim=range(fmd.trans[,3:4], na.rm=T),
     xlab='Time step', ylab='Estimated mean of transmission kernel (km)')

tmp <- seq(1, nrow(fmd.trans), 5)
axis(3, tmp, fmd.trans[tmp,5])
mtext('Sample size (n)', side=3, line=3)

tmp <- which(fmd.trans$n >= 30)[1]
abline(v=tmp, lty=2)
text(16, 1, 'n = 30')

tmp <- tmp:nrow(fmd.trans)
lty <- c(NA,1,2,2)

for(i in 2:4) {
     low <- loess(fmd.trans[tmp,i] ~ as.vector(tmp), span=0.3)
     low <- predict(low, newdata=data.frame(as.vector(tmp)))
     lines(c(rep(NA, tmp[1]), low), lwd=2, lty=lty[i], col='blue')
}

Using our approach described above, we estimated the mean transmission distance between case farms in the sparr package foot-and-mouth disease data to be \(\hat{\mu}_k =\) 5.8 km (\(95\% \ \text{CI} =\) 5.7–6.1 km). Interestingly, this estimate of \(\mu_k\) is lower than that reported in (Salje et al. 2016b), where \(\hat{\mu}_k =\) 9.1 km (\(95\% \ \text{CI} =\) 8.4–9.7 km). The difference in \(\hat{\mu}_k\) is likely due to differences in data sources. The values estimated in (Salje et al. 2016b) include case farms from both Cumbria and Dumbfriesshire, UK with the additional constrain that only case farms where the source farm was confirmed were included. The sparr data set, on the other hand, contains all case farms from only Cumbria.

3 Global clustering: the \(\tau\)-statistic

Estimating the mean of the spatial transmission kernel (above) provides information on the small spatial scale of individual transmission events. After subsequent generations of transmission where different transmission chains overlap in space, a larger area of elevated disease prevalence will be observed. To describe this larger-scale process, we introduced the \(\tau\)-statistic in (Lessler et al. 2016). The \(\tau\)-statistic measures global clustering with an epidemiological interpretation—the relative risk of an individual being a related case (under some definition) given they are within a particular distance from another case. The spatial distances where the relative risk is high represent an area of elevated prevalence that is likely to have greater public health utility compared with the scale of individual transmission because interventions must account for ongoing transmission at the population level to contain an outbreak. Therefore, the \(\tau\)-statistic provides a more intuitive global measure of spatial clustering, which can be interpreted as the relative risk of infection.

Formulation of the \(\tau\)-statistic has a mathematical relationship to the \(K\)-function and pair correlation function. The \(K\)-function quantifies the expected number of neighboring points within distance \(d\) of a typical point \(\textbf{Z}\) relative to the intensity of the underlying population distribution \(\lambda\). \[K(d) = \frac{1}{\lambda} E[\text{neighbors within distance} \ d \ | \ x,y \ \text{coordinates of} \ \textbf{Z}]\] In the simplest scenario, \(\lambda\) is assumed to be homogeneously distributed, so that \(\lambda = N/A\), where \(N\) is the total number of cases and \(A\) is the total study area. Under the assumption of a heterogeneous underlying population distribution, \(\lambda\) becomes the location specific intensity \(\lambda(S)\), where \(S \subset A\) within distance \(d\) of location \(\textbf{Z}\). In both cases, the \(K\)-function is plotted with the theoretical value of the \(K\)-function for a homogeneous Poisson process \(\pi d^2\), which indicates clustering or dispersion relative to complete randomness. The cumulative aspect of the \(K\)-function (using all neighbors within distance \(d\)) is, however, a well-known constraint that makes it difficult to interpret changes in clustering over distance. The pair correlation function alleviates this constraint by applying the \(K\)-function within a distance range (\(d_1, d_2\)) and standardizing it by the complete spatial randomness expectation for a homogeneous Poisson process within this range: \[G(d_1, d_2) = \frac{K(d_1 + \Delta d) - K(d_1)}{2\pi d_1 \Delta d + \pi \Delta d^2},\] where, \(\Delta d = d_2 - d_1\). Both the \(K\)-function and pair correlation functions have seen general application due to developments that accommodate inhomogeneous underlying population distribution, clustering between typed points, and edge effects. However, these functions assume complete knowledge of the underlying population distribution and use a null statistical hypothesis of complete spatial randomness, which is precarious for scenarios in epidemiology where the underlying population is unknown and relative risk is used to understand disease dynamics.

To incorporate other metrics of global clustering, the IDSpatialStats package provides wrapper functions for calculating both the cross \(K\)- and cross pair correlation functions using the Kcross and PCFcross functions from the spatstat package (Baddeley et al. 2016). These wrapper functions allow for straightforward calculation of these statistics using typed epidemiological data that is formatted for IDSpatialStats functions (Figure 4).

# Calculate cross-K and cross pair correlation functions with simulated data
data(DengueSimRepresentative)

r.vals <- seq(0, 1000, 20)
labs <- seq(0, 1000, 200)

k <- get.cross.K(epi.data=DengueSimRepresentative, type=5, hom=1, het=NULL,
                 r=r.vals, correction='border')
head(k, n=3)
   r      theo    border
1  0     0.000     0.000
2 20  1256.637  2166.362
3 40  5026.548  5956.887

g <- get.cross.PCF(epi.data=DengueSimRepresentative, type=5, hom=1, het=NULL,
                   r=r.vals, correction='border')
head(g, n=3)
   r theo      pcf
1  0    1 1.000000
2 20    1 1.720848
3 40    1 1.178406

par(mfrow=c(1,2))
plot(k[,3], type='l', lwd=2, xaxt='n', xlab='distance (m)', ylab='cross K function')
lines(k[,2], col='red', lty=2, lwd=2)
axis(1, at=which(r.vals %in% labs), labels=labs)

legend(-3, 4.15e6, legend=c("Theoretical Poisson process", "Observed function"),
       col=c("red", "black"), lty=2:1, box.lty=0, bg='transparent',
       x.intersp=0.7, y.intersp = 1.2)

plot(g$pcf, type='l', lwd=2, xaxt='n', xlab='distance (m)',
     ylab='cross pair correlation function')
abline(h=1, col='red', lty=2, lwd=2)
axis(1, at=which(r.vals %in% labs), labels=labs)

A measure of relative risk that does not assume knowledge of the underlying population distribution was developed for point pattern data in veterinary epidemiology (Diggle et al. 2005). This function, which is implemented in the sparr (Davies et al. 2011) and spatstat (Baddeley et al. 2016) packages, uses spatial kernel functions to calculate a ratio of spatial intensity for two different point types \(\rho(S) = \lambda_1(S) / \lambda_0(S)\). This formulation can quantify the relative risk for case-control point data or case occurrences with multiple types, but it is not a global clustering statistic.

The \(\tau\)-statistic can be computed in two ways depending on the underlying assumption that the true population distribution is known. If this assumption is true, then the \(\tau\)-statistic is similar to other common measures of global clustering that rely on knowledge of the background population distribution to quantify generic clustering of a point process. \[\hat{\tau}(d_1, d_2) = \frac{\hat{\pi}(d_1, d_2)}{\hat{\pi}(0, \infty)},\] where \(\hat{\pi}(d_1, d_2)\) represents the incidence rate within distance \(d_1\) to \(d_2\) of a case and \(\hat{\pi}(0, \infty)\) represents the incidence rate over the entire extent of the study area. This can be implemented by defining the numerator and denominator using the get.pi function or by using the generalized get.tau function with comparison.type = ’representative’ argument. The \(\hat{\pi}(\cdot)\) terms in the numerator and denominator use the occurrence data to calculate the incidence rates for linked cases within some defined distance. Therefore, a critical step in performing an analysis with the \(\tau\)-statistic is specifying which cases are linked through some defined relationship (homologous) and those that are not (non-homologous). Homology can be defined statically or dynamically. When defined statically, the get.pi.typed and get.tau.typed functions can be used to assign case types based on a type column supplied by the data matrix. When defined dynamically, an indicator function \(I(\cdot)\) is used to delineate linked and unlinked cases in the data, which allows greater flexibility when defining case type homology.

# Calculate tau-statistic using get.pi.typed functions
data(DengueSimRepresentative)

type <- 2 - (DengueSimRepresentative[,'serotype'] == 1)
typed.data <- cbind(DengueSimRepresentative, type=type)
d2 <- seq(20, 1000, 20)
d1 <- d2 - 20

# Static definition of case type homology
num <- get.pi.typed(typed.data, typeA=1, typeB=2, r.low=d1, r=d2)
den <- get.pi.typed(typed.data, typeA=1, typeB=2, r.low=0, r=1e10)
head(num$pi/den$pi, n=4)
[1] 0.2641154 0.2104828 0.2451847 0.2487042

tau <- get.tau.typed(typed.data, typeA=1, typeB=2, r.low=d1, r=d2,
                     comparison.type = "representative") # Equivalent
head(tau, n=4)
  r.low  r       tau
1     0 20 0.2641154
2    20 40 0.2104828
3    40 60 0.2451847
4    60 80 0.2487042

# Calculate tau-statistic using dynamic expression indicating serotype homology
ind.func <- function(a, b){
     if (a[5] == 1 & b[5] == 1) {
          x <- 1
     } else {
          x <- 2
     }
     return(x)
}

num <- get.pi(posmat=DengueSimRepresentative, fun=ind.func, r.low=d1, r=d2)
den <- get.pi(posmat=DengueSimRepresentative, fun=ind.func, r.low=0, r=Inf)
head(num$pi/den$pi, n=4)
[1] 5.084735 4.967885 4.605805 4.409876

tau <- get.tau(posmat=DengueSimRepresentative, fun=ind.func, r.low=d1,  r=d2,
               comparison.type="representative") # Equivalent
head(tau, n=4)
  r.low  r      tau
1     0 20 5.084735
2    20 40 4.967885
3    40 60 4.605805
4    60 80 4.409876

plot(tau$r.low+tau$r/2, tau$tau, type='l', lwd=2, col='blue', xlab='distance (m)')
abline(h=1, lty=2, lwd=2, col='red')
abline(v=100)

The interpretation of the \(\tau\)-statistic is analogous to that of the pair-correlation function in two ways. First, the \(\tau\)-statistic is not compared to the theoretical measure of a random Poisson process because the metric is an incidence rate ratio with epidemiological meaning. Instead, this measure is plotted in comparison to a ratio of 1, indicating no relative difference in disease risk among homologous cases. Second, the \(\tau\)-statistic measures relative risk using case pairs within a distance range \((d_1 \leq d_{ij} < d_2)\). This approach can describe how fine-scale spatial dependence changes over distance. However, if the user wishes to estimate cumulative spatial dependence (analogous to the \(K\)-function), then \(d_1\) can be fixed at zero \((0 \leq d_{ij} < d_{2})\).

Estimating the \(\tau\)-statistic with \(\hat{\theta}\)

If the underlying population distribution is unknown, then the \(\tau\)-statistic can be computed so that it quantifies global clustering in terms of relative risk. This goes beyond classic measures of clustering by utilizing some relationship between linked and unlinked cases to distinguish transmission chains and quantify relative clustering between them. To do so, we use the function \(\hat{\theta}(\cdot)\) which gives the odds ratio of cases related to case \(i\) to those independent of \(i\) to give an estimate of the \(\tau\)-statistic that is not biased by assumptions about the underlying population distribution. \[\hat{\tau}(d_1, d_2) = \frac{\hat{\theta}(d_1, d_2)}{\hat{\theta}(0, \infty)},\] where, \[\hat{\theta}(d_1, d_2) = \frac{\sum_{\forall i} \sum_{\forall j} I_1(z_{ij}=1, \ d_1 \leq d_{ij} < d_2)}{\sum_{\forall i} \sum_{\forall j} I_2(z_{ij}=0, \ d_1 \leq d_{ij} < d_2)}.\] The indicator function \(I(\cdot)\) is applied to all \(ij\) case pairs within distance \(d_1\) and \(d_2\). It returns a binary response which is equal to 1 when case pairs meet user-specified criteria to be homologous and equal to 2 when they are non-homologous. The result is an \(i \times j\) relation matrix \(z_{ij}\) which is used to find the sums of homologous and non-homologous case pairs. Using an indicator function also allows additional criteria to be used to define case type homology, such as temporal proximity (Figure 6).

# Calculate tau-statistic using serotype homology and time
data(DengueSimR01)
d2 <- seq(20, 1000, 20)
d1 <- d2 - 20

# Dynamic expression indicating serotype homology and temporal proximity
ind.func <- function(a, b, t.limit=20){
     if (a[5] == b[5] & (abs(a[3] - b[3]) <= t.limit)){
          x <- 1
     } else {
          x <- 2
     }
     return(x)
}

num <- get.theta(DengueSimR01, ind.func, r.low=d1, r=d2)
den <- get.theta(DengueSimR01, ind.func, r.low=0, r=Inf)
head(num$theta/den$theta, n=4)
[1] 3.9148969 3.5145802 4.5963608 5.1082210

tau <- get.tau(posmat=DengueSimR01, fun=ind.func, r.low=d1, r=d2,
               comparison.type="independent") # Equivalent
head(tau, n=4)
  r.low  r      tau
1     0 20 3.914897
2    20 40 3.514580
3    40 60 4.596361
4    60 80 5.108221

plot(tau$r, tau$tau, type='l', lwd=2, col='blue', xlab='distance (m)')
abline(h=1, lty=2, lwd=2, col='red')
abline(v=100)

Calculating variance in point estimates

In the examples above, the get.pi, get.theta, and get.tau function families calculate point estimates for \(\hat{\pi}\), \(\hat{\theta}\), and \(\hat{\tau}\) respectively. In scenarios where observation error, sampling bias, or measurement error are expected to introduce additional variance, users may wish to place confidence intervals around these point estimates. For this purpose, each family of functions contains a function ending with a .bootstrap suffix, which generates point estimates for boot.iter number of bootstrapped samples of the data (Efron and Tibshirani 1994). Functions ending with a .ci suffix are wrappers that calculate user specified confidence intervals based on the bootstrapped samples (Figure 7).

# Calculate variance around point estimates of the tau-statistic
data(DengueSimR02)

d2 <- seq(20, 1000, 20)
d1 <- d2 - 20

# Function indicating genotype homology
ind.func <- function(a, b){
     if(a[4] == b[4]){
          x = 1
     } else{
          x = 2
     }
     return(x)
}

# Bootstrapped estimates of tau
tau.boot <- get.tau.bootstrap(DengueSimR02, ind.func, r.low=d1, r=d2, boot.iter=5)
head(tau.boot, n=4)
  r.low  r       X1       X2       X3       X4       X5
1     0 20 51.04283 49.17736 60.45922 43.36588 37.26332
2    20 40 20.80095 29.62483 26.54935 34.11416 31.32279
3    40 60 34.05415 35.66984 40.21975 31.02943 32.77966
4    60 80 30.52361 35.46972 27.77247 36.64628 32.43156

# Wrapper function of get.tau.bootstrap calculates confidence intervals
tau.ci <- get.tau.ci(DengueSimR02, ind.func, r.low=d1, r=d2, boot.iter=25)
head(tau.ci, n=4)
  r.low  r   pt.est   ci.low  ci.high
1     0 20 44.05161 22.73147 59.44465
2    20 40 30.83943 19.68758 42.05249
3    40 60 37.57434 30.48664 45.78121
4    60 80 33.54134 28.12390 38.76330

plot(tau.ci$r, tau.ci$pt.est, ylim=range(tau.ci[,4:5]), type="l", lwd=2, col='blue',
     xlab='distance (m)', ylab='tau')
lines(tau.ci$r, tau.ci$ci.low, lty=2, lwd=1, col='blue')
lines(tau.ci$r, tau.ci$ci.high, lty=2, lwd=1, col='blue')
abline(h=1, lty=2, lwd=2, col='red')

Null hypothesis testing

A common approach for interpreting spatial clustering statistics includes hypothesis testing using simulation envelopes to assess whether an observed spatial measure is statistically significant (Ripley 1979; Baddeley et al. 2014). To enable null hypothesis tests, we have implemented a permutation method (Good 2010) to simulate the nonparametric distribution of \(\hat{\pi}\), \(\hat{\theta}\), and \(\hat{\tau}\) under the null hypothesis of no spatial dependence. The permutation algorithm simulates the null distribution by randomly reassigning case coordinates to observations upon each permutation. Null distributions can be computed using functions ending in the .permute suffix and then plotted with observed measures to assess statistical significance as a function of distance (Figure 8).

# Compare tau statistic to its null distribution using permutation
data(DengueSimR02)
set.seed(123)

d2 <- seq(20, 1000, 20)
d1 <- d2 - 20

# Compare spatial dependence by time case occurs
type <- 2 - (DengueSimR02[,"time"] < 120)
typed.data <- cbind(DengueSimR02, type=type)

typed.tau <- get.tau.typed(typed.data, typeA=1, typeB=2, r.low=d1, r=d2,
                           comparison.type = "independent")

head(typed.tau, n=4)
  r.low  r       tau
1     0 20 0.4040661
2    20 40 0.5471728
3    40 60 0.7897655
4    60 80 0.8901166

# Perform permutations of observed case times and locations for null distribution
typed.tau.null <- get.tau.typed.permute(typed.data, typeA=1, typeB=2, r.low=d1, r=d2,
                                        permutations=100,
                                        comparison.type = "independent")
head(typed.tau.null[,1:7], n=4)
  r.low  r        X1        X2        X3        X4        X5
1     0 20 1.2570945 0.8530284 3.5019060 1.0326133 1.0775095
2    20 40 1.1448539 0.7045255 0.7224212 2.0742058 0.8754765
3    40 60 0.6947101 0.8904419 0.8249682 0.6990984 0.5128531
4    60 80 1.6266250 1.0916873 0.8326210 0.8432683 1.4815756

# 95% confidence intervals of null distribution
null.ci <- apply(typed.tau.null[,-(1:2)], 1, quantile, probs=c(0.025, 0.975))

plot(typed.tau$r, typed.tau$tau, type='l', lwd=2, ylim=range(c(typed.tau$tau, null.ci)),
     xlab="distance (m)", ylab="tau")
lines(typed.tau$r, null.ci[1,], lty=2)
lines(typed.tau$r, null.ci[2,], lty=2)
abline(h=1, lty=2, lwd=2, col='red')

4 Summary

Conventional spatial statistics are often used to describe the intensity or clustering of point processes. Quantifying spatial dependence of infectious disease spread, however, requires a modified approach that considers overlapping transmission chains and the likelihood of case linkage. Therefore, we have implemented two types of spatial statistics in the IDSpatialStats package (the mean transmission distance \(\mu_k\), and the \(\tau\)-statistic) that can be used along with other measures of spatial dependence (e.g. the cross \(K\)-function and cross pair correlation function) to understand the spatial spread of infectious diseases.

We showed how to simulate epidemiological data and estimate \(\mu_k\), and the \(\tau\)-statistic, which can be used as templates for other analyses. First, the transdist family of functions provides a measure of fine-scale spatial dependence by estimating the mean of the transmission distance \(\hat{\mu}_k\) between sequentially linked cases in a transmission chain (Salje et al. 2016b). Second, the get.tau family of functions measure spatial dependence on a larger-scale by estimating the \(\tau\)-statistic, which describes the area of elevated prevalence surrounding cases. This family of functions does so by estimating the relative risk of a case being homologous compared with non-homologous case types. The definition of case type homology is flexible and can utilize temporal or biological information, such as genotype and serotype of the pathogen.

The generalized structure of the get.tau family allows for diverse applications of the \(\tau\)-statistic to epidemiological data. Previous studies have used the \(\tau\)-statistic to quantify spatial and/or temporal dependence of transmission for Dengue, Cholera, HIV, and Measles disease systems (see Table 2 for detailed descriptions). These studies illustrate that, regardless of the system under study, analyses are enhanced when bootstrapping, permutation tests, and/or assessment of observation error is employed to understand the distribution of error and statistical significance for estimates of the \(\tau\)-statistic.

The IDSpatialStats package is undergoing continued development. Future directions include expanding the implementation of the \(\tau\)-statistic to facilitate estimation of spatio-temporal dependence by incorporating a temporal interval into the spatial search window. This technique was used by (Salje et al. 2012) in the form of the \(\phi\)-statistic to estimate \(\hat{\phi}(d_1, d_2, t_1, t_2)\). Additional developments include a theoretical framework for the \(\tau\)-statistic that incorporates uncertainty due to pathogen generation time, and to define case type homology more continuously using genetic distance matrices. We hope these developments will enable users to address more complex questions and incorporate more sources of uncertainty into estimates of spatial dependence. Check Github at https://github.com/HopkinsIDD/IDSpatialStats for latest development release.

Table 2: Descriptions of how previous studies have used the \(\tau\)-statistic to quantify spatial dependence for infectious diseases. Listed in chronological order.
Description	Citation
Spatial and temporal dependence of homotypic and heterotypic Dengue	(Salje et al. 2012)
virus serotypes over a 5 year period in Bangkok, Thailand
Clustering of HIV prevalence and incidence around HIV-seropositive	(Grabowski et al. 2014)
individuals using cohort data from rural Rakai District, Uganda
Overview of the \(\tau\)-statistic, its performance given observation error,	(Lessler et al. 2016)
and illustrations using Dengue, HIV, and Measles
Spatial dependence of seroconverted individuals in the 2012–2013	(Salje et al. 2016a)
Chikungunya outbreak in the Phillipines
Comparison of spatial dependence in endemic transmission of Dengue	(Quoc et al. 2016)
virus serotypes in Bangkok and Ho Chi Min City, Thailand
Risk of Cholera transmission within spatial and temporal zones after case	(Azman et al. 2018)
presentation during urban epidemics in Chad and D.R. Congo
Summary statistic to fit micro-simulations of Cholera interventions	(Finger et al. 2018)
to epidemic data using Approximate Bayesian Computation
Temporal clustering of subclinical infections and homologous serotypes	(Salje et al. 2018)
within schools using Dengue cohort data in Thailand

This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.