Spatial statistics for infectious diseases are important because the spatial and temporal scale over which transmission operates determine the dynamics of disease spread. Many methods for quantifying the distribution and clustering of spatial point patterns have been developed (e.g. \(K\)-function and pair correlation function) and are routinely applied to infectious disease case occurrence data. However, these methods do not explicitly account for overlapping chains of transmission and require knowledge of the underlying population distribution, which can be limiting when analyzing epidemic case occurrence data. Therefore, we developed two novel spatial statistics that account for these effects to estimate: 1) the mean of the spatial transmission kernel, and 2) the \(\tau\)-statistic, a measure of global clustering based on pathogen subtype. We briefly introduce these statistics and show how to implement them using the IDSpatialStats R package.
The transmission process which drives an epidemic can be characterized by the spatial distance separating linked cases. When these transmission events accumulate over time, they are observed as areas of elevated disease prevalence. Knowledge of the extent of the affected area and where new cases may arise is crucial for many disease control strategies (e.g. ring vaccination, vector control etc). In epidemiology, case occurrence data—(\(x\),\(y\)) coordinates with temporal information (\(t\)) and other covariates—are often used to understand these types of infectious disease dynamics. These data are typically treated as a generic point process so that they can be described in terms of spatial intensity (expected number of cases per unit area) or clustering due to spatial dependence (covariance in \(x\),\(y\) space).
In the broader field of spatial statistics, there are many methods that measure the spatial intensity or clustering of a generic point process on a Cartesian (\(x\),\(y\)) coordinate system (Table 1). These methods primarily fall into three categories: first-order first-moment (FOFM), first-order second-moment (FOSM), and second-order second-moment (SOSM). The FOFM measures use quadrature (aggregate counts of points within cells) to quantify intensity of the point pattern continuously over (\(x\),\(y\)) space. Packages such as lgcp (Taylor et al. 2013, 2015) and ppmlasso (Renner and Warton 2013) allow users to model first-order intensity as a count process using a regressive function of (\(x\),\(y\)) coordinates and other covariates. The FOSM measures—Moran’s I and Geary’s C (Moran 1950; Geary 1954)—also use quadrature, but they describe general covariance among cell counts across the (\(x\),\(y\)) dimensions. These spatial statistics can be calculated using the spdep R package (Bivand and Piras 2015). The SOSM measures, such as the \(K\)-function and its non-cumulative analogue, the pair correlation function, quantify clustering among neighboring points. Both the FOSM and SOSM measures are considered global spatial statistics because they describe spatial dependence for the entire study area. However, the SOSM can further describe how spatial dependence changes as a function of distance by comparing the observed intensity of neighboring points within distance \(d\) to that expected under complete spatial randomness. The \(K\)-function and pair correlation function can be calculated using the ads (Pélissier and Goreaud 2015), spatstat (Baddeley et al. 2016), and splancs (Rowlingson and Diggle 2017) R packages.
These classic spatial measures are limited in their ability to describe infectious disease dynamics primarily because they treat case occurrence data as a generic point process. The FOFM and FOSM measures use quadrature, which make them vulnerable to error associated with data aggregation (Robinson 2009) and the modifiable areal unit problem (Openshaw and Taylor 1979). The SOSM measures, like the \(K\)-function and pair correlation function, are more common in epidemiology. However, their statistical interpretation is less intuitive in terms of classic epidemiological quantities of relative disease risk, such as the incidence rate ratio or hazard ratio. Additionally, even the temporal forms of these functions (e.g. the space-time \(K\)-function) do not capture the typical distances traveled in a single transmission generation as they quantify the overall spatial dependence between all cases, not just those epidemiologically linked. The mean distance between sequential cases in a transmission chain is an important epidemiological quantity because it provides insight into potential mechanisms driving spread as well as helping inform interventions. Therefore, we developed novel measures that build upon concepts in spatial statistics to characterize infectious disease spread using case occurrence data. Importantly, these measures are robust to heterogeneities in the underlying population, and substantial case under-reporting, which is common in epidemiology.
We describe these two measures of spatial dependence for infectious
diseases and show how they can be calculated with the
IDSpatialStats R
package in the following three sections. First, we introduce a function
which simulates infectious disease spread as a spatial branching
process. This function is primarily intended to simulate example
datasets for the est.transdist
family of functions and
\(\tau\)-statistics that use temporal information to indicate linked
cases. Second, we demonstrate how to estimate the mean and standard
deviation of the spatial transmission kernel (Salje et al. 2016b).
Estimating the spatial transmission kernel requires an understanding of
the number of transmission generations separating cases at different
time points of the epidemic. This method provides a measure of
fine-scale spatial dependence between two cases, which can be
interpreted as the mean distance between sequential cases in a
transmission chain. Third, we describe a measure of global
clustering—the \(\tau\)-statistic—that calculates the relative risk of
infection given some criteria to identify cases closely related along a
chain of transmission (Lessler et al. 2016). The \(\tau\)-statistic is
a global clustering statistic—like the \(K\)-function and pair
correlation function—that provides an overall measure of clustering
for the entire course of an outbreak. Depending on the parameterization,
the \(\tau\)-statistic represents the odds of observing another case with
distance \(d\) of an infected case compared with either the underlying
population or other pathogen types. The following sections contain a
brief introduction to each statistic to provide context to the code
implementation—for more detailed description of each statistic, see
(Lessler et al. 2016) and (Salje et al. 2016b).
We have implemented these tools in the IDSpatialStats R package
version 0.3.5 and above. The latest stable release depends on the
doParallel
and foreach
packages
(Microsoft and Weston 2017; Corporation and Weston 2018) and can be downloaded from CRAN.
A development version of the package is also available on Github at
https://github.com/HopkinsIDD/IDSpatialStats.
Package | Description | Citation |
---|---|---|
ads | K function for enclosed point patterns | (Pélissier and Goreaud 2015) |
DCluster | disease clustering for count data | (Gómez-Rubio et al. 2005) |
lgcp | modeling point patterns | (Taylor et al. 2013, 2015) |
with log-Gaussian Cox processes | ||
ppmlasso | modeling point patterns | (Renner and Warton 2013) |
with LASSO regularization | ||
SGCS | third order clustering of point processes | (Rajala 2017) |
sparr | spatial relative risk functions with kernel smoothing | (Davies et al. 2011) |
spatstat | comprehensive tools for analyzing | (Baddeley et al. 2016) |
point patterns in many dimensions | ||
spdep | classic statistics to test for spatial dependence | (Bivand and Piras 2015) |
splancs | kernel smoothing and Poisson cluster processes | (Rowlingson and Diggle 2017) |
We use a stochastic spatial branching process to simulate
epidemiological data in the sim.epidemic
function. Simulations begin
with an index case at \((x,y,t) = (0,0,0)\) and transmission events that
link two cases follow according to a random Markov process in \((x,y)\)
space (i.e. Brownian motion). The number of transmission events occurs
according to a Poisson distribution, with its mean and variance set to
the effective reproduction number \(R\) of the infecting pathogen. The
spatial distance traversed by each transmission event is given by a user
specified probability distribution which serves as the dispersal kernel
function. When specifying the dispersal kernel, the trans.kern.func
argument expects a list object containing a probability distribution
function and its named arguments. For example, to simulate an epidemic
where transmission typically occurs at the local level, but long
distance transmission events sometimes occur, an exponential
transmission kernel might be used because of its long tail.
Alternatively, if transmission is expected to consistently occur within
a given range, then a normal kernel may be more appropriate.
In simulations where the basic reproductive number \(R_0\) is used to
define a constant \(R\)-value and \(R_0>1\), the number of cases will
continue to increase with each time step. This effect may not be
appropriate when simulating settings where intervention efforts or
depletion of susceptible individuals causes heterogeneity in \(R\) over
the course of the epidemic. Thus, the sim.epidemic
function accepts
either a scalar value for a constant \(R\) value or a vector of \(R\) values
with a length equal to tot.generations
, allowing simulations with a
variable \(R\) value, as shown in the following R code.
# Epidemic simulations with variable R value
<- 2
R1 <- 0.5
R2 <- 12
tot.gen <- seq(R1, R2, (R2 - R1)/(tot.gen - 1))
R.vec <- alist(n=1, a=1/100, rexp(n, a))
dist.func <- sim.epidemic(R=R.vec, gen.t.mean=7, gen.t.sd=2, min.cases=100,
sim tot.generations=tot.gen, trans.kern.func=dist.func)
head(sim, n=4)
x y t1 0.00000 0.000000 0
2 24.46125 3.280527 3
3 -60.73475 184.885784 7
4 -12.79933 -57.798696 4
sim.plot(sim)
In (Salje et al. 2016b), we introduced a method to estimate the mean and standard deviation of the spatial transmission kernel using case occurrence data. These data include location \((x, y)\) and onset time \(t\) of each case (case times) and the infecting pathogen’s generation time \(g(x)\). To estimate these spatial statistics, we use the Wallinga-Teunis (WT) method (Wallinga and Teunis 2004) to probabilistically estimate the number of transmission events required to link two cases, denoted as \(\theta\). In settings where a phylogenetic model or contact tracing provide information on transmission pathways, the spatial kernel can be empirically estimated using the distribution of observed distances among all linked cases. The mean and standard deviation of this kernel can then be calculated for any time interval between \(t_1\) and \(t_2\) to give \(\mu^{obs}_t(t_1, t_2)\), with the assumption that the number of transmission events separating all case pairs is homogeneous (\(\theta = 1\)). When data that indicate case linkage are lacking, this assumption is incorrect because the distance between two cases depends on the number of transmission events separating them. In this case, the mean transmission distance at each time interval \(\mu_t\) must be estimated as a weighted mean: \[\mu_t(t_1,t_2,\mu_k,\sigma_k) = \sum_i w(\theta=i,t_1,t_2) \cdot \mu_a(\theta=i,\mu_k,\sigma_k).\] Where, \(w(\theta=i,t_1,t_2)\) gives the weight for each of the \(i\) elements of \(\theta\) and the second term \(\mu_a(\theta=i,\mu_k,\sigma_k)\) gives the mean distance separating case pairs that are linked by the \(i\)th value of \(\theta\).
We have implemented four nested functions that are used to estimate
\(w(\theta=i,t_1,t_2)\) and describe them briefly below. Listed in order,
they are comprised of est.wt.matrix.weights
, est.wt.matrix
,
get.transdist.theta
, and est.transdist.theta.weights
. Although,
these functions are documented separately, they are all driven by the
est.transdist
family of functions and do not need to be run manually
unless desired.
The est.wt.matrix.weights
function builds upon code from the R0
package (Obadia et al. 2012) to calculate the basic WT matrix
(Wallinga and Teunis 2004). This matrix gives the probability that a
case at time \(t_i\) (rows) infected a case at time \(t_j\) (columns),
i.e. \(\theta = 1\), based on the generation time distribution of the
pathogen \(g(x)\). For an epidemic with \(t\) unique case times,
est.wt.matrix.weights
gives a \(T \times T\) matrix.
The est.wt.matrix
function produces a WT type matrix for all
infector-infectee case pairs. Given the WT matrix produced by
est.wt.matrix.weights
and total case count \(n\), this function
calculates an \(n \times n\) matrix giving the probability that case \(i\)
(rows) infected case \(j\) (columns). The WT matrix object can be handed
directly to est.wt.matrix
via the basic.wt.weights
argument, or if
this argument is NULL
, the est.wt.matrix.weights
function is called
automatically.
# Calculating Wallinga-Teunis matrices
<- c(1,2,2,3,3) # times each case occurs
case.times <- c(0, 2/3, 1/3, 0, 0) # hypothetical generation time of a pathogen
g.x
<- est.wt.matrix.weights(case.times=case.times, gen.t.dist=g.x)
mat.wts
# Calculate infector-infectee Wallinga-Teunis matrix
<- est.wt.matrix(case.times=case.times, gen.t.dist=g.x,
wt.mat1 basic.wt.weights=mat.wts)
<- est.wt.matrix(case.times=case.times, gen.t.dist=g.x)
wt.mat2
identical(wt.mat1, wt.mat2) # the two methods are equivalent
1] TRUE [
The get.transdist.theta
function estimates the number of transmission
events \(\theta\) separating pairs of cases using the probabilities in the
infector-infectee WT matrix produced by the est.wt.matrix
function.
Sampling all possible transmission trees is impractical for most
datasets, so this function constructs a transmission tree by randomly
selecting the infector of each case in the epidemic and then \(\theta\) is
determined by finding the product of all probabilities in the chain of
transmission that link the randomly sampled case pairs.
The object theta.wts
(in the code segment below) contains a
three-dimensional array [\(i\),\(j\),\(k\)], where the rows \(i\) and columns
\(j\) represent unique case times and the third dimension \(k\) is the
number of transmission events \(\theta\). Each cell gives the probability
that two cases occurring at times \(i\) and \(j\) are connected by \(\theta\)
transmission events in the randomly sampled transmission tree.
Probabilities in each [\(i\),\(j\),\(\cdot\)] row are normalized across all
\(\theta\) values. The get.transdist.theta
function samples a single
randomized transmission tree from the epidemic data, therefore we want
to simulate many iterations of this random sampling to get a better
estimate of the true distribution of \(\theta\).
The est.transdist.theta.weights
function estimates the distribution of
\(\theta\) across all \(t_i\) and \(t_j\) combinations by simulating many
iterations of transmission trees using the get.transdist.theta
function. Its output is the same as the get.transdist.theta
function,
however, it represents the normalized probabilities after n.rep
number
of simulations.
# Estimate theta weights
<- c(1,2,2,3,3) # times each case occurs
case.times <- c(0, 2/3, 1/3, 0, 0) # hypothetical generation time distribution of a pathogen
g.x <- 1 # mean generation time
gen.time <- round((max(case.times) - min(case.times)) / gen.time) + 1 # total generations
n.gen
# Calculate infector-infectee Wallinga-Teunis matrix
<- est.wt.matrix(case.times=case.times, gen.t.dist=g.x)
wt.mat
# Estimated theta weights from five randomized transmission trees
<- est.transdist.theta.weights(case.times=case.times, n.rep=5,
theta.wts gen.t.mean=gen.time, t1=0, t.density=g.x)
1]
theta.wts[,,1] [,2] [,3]
[,1,] 0.000 NaN NaN
[2,] 0.625 0.0000 NaN
[3,] 0.000 0.4375 0 [
To estimate the mean transmission distance over the duration of the
epidemic we must use the observed distances between case pairs given the
time they occurred \(\mu^{obs}_t(t_i,t_j)\) and combine them into an
overall estimate of the mean of the transmission kernel \(\mu_k\). The
workhorse function est.trandsdist
estimates the overall mean \(\mu_k\)
and standard deviation \(\sigma_k\) of the kernel. This function first
calls the est.wt.matrix.weights
, est.wt.matrix
,
get.transdist.theta
, and est.transdist.theta.weights
functions
described above to estimate the distribution of \(\theta\) across all case
pairs and then calculates each of the weights \(w(\theta=i,t_1,t_2)\). The
weights are calculated as the proportion of all case pairs occurring at
\(t_i\) and \(t_j\) that are separated by each estimated \(\theta\) over all
simulations:
\[\hat{w}(\theta = i, t_1, t_2) = \frac{\sum^{N_{sim}}_{k=1} \sum^n_{i=1} \sum^n_{j=1} \boldsymbol{I}_1(t_i=t_1, t_j=t_2, \Theta_{ij} = \theta)}
{N_{sim} \sum^n_{i=1} \sum^n_{j=1} \boldsymbol{I}_2(t_i=t_1, t_j=t_2)}.\]
Here, the functions \(\boldsymbol{I}_1\) and \(\boldsymbol{I}_2\) indicate
if two cases occurred at time \(t_i\) and \(t_j\) and were linked by
\(\theta\) transmission events, or if they just occurred at \(t_i\) and
\(t_j\) respectively. In words this can be written as:
\[\hat{w}(\theta = i, t_1, t_2) = \frac{\text{Total cases at} \,t_1\, \text{and} \,t_2\, \text{across all simulations separated by} \,\theta\, \text{transmission events}}
{\text{Total cases at} \,t_1\, \text{and} \,t_2\, \text{across all simulations}}.\]
Once the weights of the \(\theta\) values are estimated, the
est.transdist
function calculates \(\mu_k\) and \(\sigma_k\) as the
average weighted estimate over all combinations of \(t_i\) and \(t_j\). If
we now let \(k\) index the vector of \(\theta\) values, then:
\[\hat \mu_k = \hat \sigma_k = \frac{1}{\sum_i \sum_j n_{ij}} \sum_i \sum_j \frac{2 \cdot \mu^{obs}_t (t_i,t_j) \cdot n_{ij}}{\sum_k \hat w (\theta = k, t_i,t_j) \cdot \sqrt{2 \pi k}}.\]
For a derivation of these equations, see sections 2.3 and 2.4 of
(Salje et al. 2016b).
The est.trandist
function requires case occurrence data—a matrix
with three columns [\(x\),\(y\),\(t\)]—and the mean and standard deviation
of the infecting pathogen’s generation time (for calculating WT
matrices) as input. The function returns estimates of \(\mu_k\) and
\(\sigma_k\) of the spatial transmission kernel. These estimates are made
under the assumption that \(\mu_k = \sigma_k\), so the upper bound of
\(\hat{\mu_k}\) and \(\hat{\sigma_k}\) are also provided for when this
assumption is violated. Bound estimates are equal to \(\sqrt 2\) times the
values estimated under the \(\mu_k = \sigma_k\) assumption (see section
2.5 of (Salje et al. 2016b)). Additional constraints on the
estimation of \(\mu_k\) and \(\sigma_k\) can be defined in the remaining
arguments, such as the time step in which the analysis should begin
(t1
), the maximum number of time steps (max.sep
) and maximum spatial
distance (max.dist
) to consider when estimating \(\theta\), and the
number of randomized transmission tree simulations to run
(n.transtree.reps
).
To estimate the uncertainty around \(\hat{\mu}_k\) due to sampling or
observation error, we have implemented a wrapper function called
est.transdist.bootstrap.ci
that performs bootstrap iterations using
the est.transdist
function. Upon each iteration, the epidemiological
data are resampled with replacement and \(\mu_k\) is re-estimated. The
est.transdist.bootstrap.ci
function contains all the same arguments as
the est.transdist
function, with additional arguments defining the
number of bootstrapped iterations to perform, the high and low
boundaries of the desired confidence interval, and options for running
the bootstrap analysis in parallel.
When parallel computation is enabled (the default is
parallel = FALSE
), the function uses the makeCluster()
function of
the parallel
package to make the appropriate cluster type for the
operating system of the local machine (SOCK cluster for Windows or a
Fork cluster for Unix-like machines). The cluster is then registered as
the parallel backend for the foreach
package, which is used to run the
bootstraps in parallel. The user can define the number of cores to use
when running in parallel using the n.cores
argument. If
parallel = TRUE
and n.cores = NULL
, the function will use half the
total cores on the local machine.
# Estimate transmission distance for simulated data
set.seed(123)
<- alist(n=1, a=1/100, rexp(n, a)) # Dispersal kernel
dist.func <- sim.epidemic(R=2, gen.t.mean=7, gen.t.sd=2, min.cases=100,
sim tot.generations=8, trans.kern.func=dist.func)
# Estimate mean transmission distance
<- est.transdist(epi.data=sim, gen.t.mean=7, gen.t.sd=2, t1=0,
sim.transdist max.sep=1e10, max.dist=1e10, n.transtree.reps=10)
sim.transdist$mu.est
1] 92.79699
[$sigma.est
1] 91.45614
[$bound.mu.est
1] 131.2348
[$bound.sigma.est
1] 129.3385
[
# Estimate confidence intervals around mean
<- est.transdist.bootstrap.ci(epi.data=sim,
sim.transdist.ci gen.t.mean=7,
gen.t.sd=2,
t1=0,
max.sep=1e10,
max.dist=1e10,
n.transtree.reps=10,
boot.iter=5,
ci.low=0.025,
ci.high=0.975)
sim.transdist.ci$mu.est
1] 131.2124
[$mu.ci.low
2.5%
128.2505
$mu.ci.high
97.5%
134.3312
An estimate of \(\mu_k\) over the duration of an epidemic is indicative of
the overall spatial dependence. However, conditions may change over the
course of an epidemic that alter the spatial scale upon which
transmission operates. To quantify temporal heterogeneity in the mean
transmission distance, we have implemented the est.transdist.temporal
and est.transdist.temporal.bootstrap.ci
functions, which estimate the
change in \(\hat{\mu}_k\) over time and its bootstrapped confidence
intervals respectively.
When applying the temporal versions of the est.transdist
functions, it
is important to consider the sample size at each time step because the
est.transdist.temporal
function estimates \(\mu_k\) for all cases
leading up to each unique time step. Some time steps at the beginning of
an epidemic may be returned as NA
if there are not enough unique cases
to estimate \(\mu_k\). Furthermore, in scenarios where time steps in the
beginning of an epidemic have low sample sizes, such as an epidemic with
a low \(R_0\), \(\hat{\mu}_k\) may be over- or under-estimated and display
larger confidence intervals due to sampling error. Therefore, we
recommend either setting the t1
argument to the first time step that
contains a sufficient sample size, or plotting results along with
cumulative sample size as we have done in Figure 3.
# Estimate temporal transmission distance for simulated data
set.seed(123)
<- alist(n=1, a=1/100, rexp(n, a)) # Dispersal kernel
dist.func <- sim.epidemic(R=2, gen.t.mean=7, gen.t.sd=2, min.cases=100,
sim tot.generations=8, trans.kern.func=dist.func)
# Estimate mean and confidence intervals at each time step
<- est.transdist.temporal.bootstrap.ci(epi.data=sim,
sim.temp.transdist.ci gen.t.mean=7,
gen.t.sd=2,
t1=0,
max.sep=1e10,
max.dist=1e10,
n.transtree.reps=10,
boot.iter=5,
ci.low=0.025,
ci.high=0.975)
head(sim.temp.transdist.ci)
t pt.est ci.low ci.high n1 0 NA NA NA 1
2 3 NA NA NA 2
3 8 NA NA NA 3
4 9 44.61359 35.52047 52.24099 4
5 10 101.55313 43.99469 203.11247 5
6 11 189.29767 113.79560 224.79960 6
To provide an example of how the functions shown above can be applied to
real data, we estimate the mean transmission distance for the 2001
foot-and-mouth epidemic among farms in Cumbria, UK. These data can be
found in the fmd
data object included in the sparr
package
(Davies et al. 2011). It contains transformed (\(x\),\(y\)) coordinates of
the infected farms and the time step (\(t\)) in which it was infected,
which is given in days since the index farm was infected (Figure
2). The generation time for foot-and-mouth disease
is estimated to have a mean of 6.1 days and a standard deviation of 4.6
days (Haydon et al. 2003), so we use these in the gen.t.mean
and gen.t.sd
arguments in the est.transdist.temporal.bootstrap.ci
function.
library(sparr)
data(fmd)
<- cbind(fmd$cases$x, fmd$cases$y, fmd$cases$marks)
fmd
head(fmd, n=3)
1] [,2] [,3]
[,1,] 333.0328 541.3405 52
[2,] 336.1428 543.3462 46
[3,] 341.4762 551.1794 38
[
sim.plot(fmd)
# NOTE: this function may take a while depending on the data set
<- est.transdist.temporal.bootstrap.ci(epi.data=fmd,
fmd.trans gen.t.mean=6.1,
gen.t.sd=4.6,
t1=0,
max.sep=1e10,
max.dist=1e10,
n.transtree.reps=5,
boot.iter=10,
ci.low=0.025,
ci.high=0.975,
parallel=TRUE,
n.cores=detectCores())
par(mfrow=c(1,1))
2:4] <- fmd.trans[,2:4]/1000 # Convert to km
fmd.trans[,plot(fmd.trans$t, fmd.trans$pt.est, pch=19, col='grey', ylim=range(fmd.trans[,3:4], na.rm=T),
xlab='Time step', ylab='Estimated mean of transmission kernel (km)')
<- seq(1, nrow(fmd.trans), 5)
tmp axis(3, tmp, fmd.trans[tmp,5])
mtext('Sample size (n)', side=3, line=3)
<- which(fmd.trans$n >= 30)[1]
tmp abline(v=tmp, lty=2)
text(16, 1, 'n = 30')
<- tmp:nrow(fmd.trans)
tmp <- c(NA,1,2,2)
lty
for(i in 2:4) {
<- loess(fmd.trans[tmp,i] ~ as.vector(tmp), span=0.3)
low <- predict(low, newdata=data.frame(as.vector(tmp)))
low lines(c(rep(NA, tmp[1]), low), lwd=2, lty=lty[i], col='blue')
}
Using our approach described above, we estimated the mean transmission
distance between case farms in the sparr
package foot-and-mouth
disease data to be \(\hat{\mu}_k =\) 5.8 km (\(95\% \ \text{CI} =\) 5.7–6.1
km). Interestingly, this estimate of \(\mu_k\) is lower than that reported
in (Salje et al. 2016b), where \(\hat{\mu}_k =\) 9.1 km
(\(95\% \ \text{CI} =\) 8.4–9.7 km). The difference in \(\hat{\mu}_k\) is
likely due to differences in data sources. The values estimated in
(Salje et al. 2016b) include case farms from both Cumbria and
Dumbfriesshire, UK with the additional constrain that only case farms
where the source farm was confirmed were included. The sparr
data set,
on the other hand, contains all case farms from only Cumbria.
Estimating the mean of the spatial transmission kernel (above) provides information on the small spatial scale of individual transmission events. After subsequent generations of transmission where different transmission chains overlap in space, a larger area of elevated disease prevalence will be observed. To describe this larger-scale process, we introduced the \(\tau\)-statistic in (Lessler et al. 2016). The \(\tau\)-statistic measures global clustering with an epidemiological interpretation—the relative risk of an individual being a related case (under some definition) given they are within a particular distance from another case. The spatial distances where the relative risk is high represent an area of elevated prevalence that is likely to have greater public health utility compared with the scale of individual transmission because interventions must account for ongoing transmission at the population level to contain an outbreak. Therefore, the \(\tau\)-statistic provides a more intuitive global measure of spatial clustering, which can be interpreted as the relative risk of infection.
Formulation of the \(\tau\)-statistic has a mathematical relationship to the \(K\)-function and pair correlation function. The \(K\)-function quantifies the expected number of neighboring points within distance \(d\) of a typical point \(\textbf{Z}\) relative to the intensity of the underlying population distribution \(\lambda\). \[K(d) = \frac{1}{\lambda} E[\text{neighbors within distance} \ d \ | \ x,y \ \text{coordinates of} \ \textbf{Z}]\] In the simplest scenario, \(\lambda\) is assumed to be homogeneously distributed, so that \(\lambda = N/A\), where \(N\) is the total number of cases and \(A\) is the total study area. Under the assumption of a heterogeneous underlying population distribution, \(\lambda\) becomes the location specific intensity \(\lambda(S)\), where \(S \subset A\) within distance \(d\) of location \(\textbf{Z}\). In both cases, the \(K\)-function is plotted with the theoretical value of the \(K\)-function for a homogeneous Poisson process \(\pi d^2\), which indicates clustering or dispersion relative to complete randomness. The cumulative aspect of the \(K\)-function (using all neighbors within distance \(d\)) is, however, a well-known constraint that makes it difficult to interpret changes in clustering over distance. The pair correlation function alleviates this constraint by applying the \(K\)-function within a distance range (\(d_1, d_2\)) and standardizing it by the complete spatial randomness expectation for a homogeneous Poisson process within this range: \[G(d_1, d_2) = \frac{K(d_1 + \Delta d) - K(d_1)}{2\pi d_1 \Delta d + \pi \Delta d^2},\] where, \(\Delta d = d_2 - d_1\). Both the \(K\)-function and pair correlation functions have seen general application due to developments that accommodate inhomogeneous underlying population distribution, clustering between typed points, and edge effects. However, these functions assume complete knowledge of the underlying population distribution and use a null statistical hypothesis of complete spatial randomness, which is precarious for scenarios in epidemiology where the underlying population is unknown and relative risk is used to understand disease dynamics.
To incorporate other metrics of global clustering, the IDSpatialStats
package provides wrapper functions for calculating both the cross \(K\)-
and cross pair correlation functions using the Kcross
and PCFcross
functions from the spatstat package (Baddeley et al. 2016). These
wrapper functions allow for straightforward calculation of these
statistics using typed epidemiological data that is formatted for
IDSpatialStats functions (Figure 4).
# Calculate cross-K and cross pair correlation functions with simulated data
data(DengueSimRepresentative)
<- seq(0, 1000, 20)
r.vals <- seq(0, 1000, 200)
labs
<- get.cross.K(epi.data=DengueSimRepresentative, type=5, hom=1, het=NULL,
k r=r.vals, correction='border')
head(k, n=3)
r theo border1 0 0.000 0.000
2 20 1256.637 2166.362
3 40 5026.548 5956.887
<- get.cross.PCF(epi.data=DengueSimRepresentative, type=5, hom=1, het=NULL,
g r=r.vals, correction='border')
head(g, n=3)
r theo pcf1 0 1 1.000000
2 20 1 1.720848
3 40 1 1.178406
par(mfrow=c(1,2))
plot(k[,3], type='l', lwd=2, xaxt='n', xlab='distance (m)', ylab='cross K function')
lines(k[,2], col='red', lty=2, lwd=2)
axis(1, at=which(r.vals %in% labs), labels=labs)
legend(-3, 4.15e6, legend=c("Theoretical Poisson process", "Observed function"),
col=c("red", "black"), lty=2:1, box.lty=0, bg='transparent',
x.intersp=0.7, y.intersp = 1.2)
plot(g$pcf, type='l', lwd=2, xaxt='n', xlab='distance (m)',
ylab='cross pair correlation function')
abline(h=1, col='red', lty=2, lwd=2)
axis(1, at=which(r.vals %in% labs), labels=labs)
A measure of relative risk that does not assume knowledge of the
underlying population distribution was developed for point pattern data
in veterinary epidemiology (Diggle et al. 2005). This function,
which is implemented in the sparr
(Davies et al. 2011) and spatstat
(Baddeley et al. 2016) packages, uses spatial kernel functions to
calculate a ratio of spatial intensity for two different point types
\(\rho(S) = \lambda_1(S) / \lambda_0(S)\). This formulation can quantify
the relative risk for case-control point data or case occurrences with
multiple types, but it is not a global clustering statistic.
The \(\tau\)-statistic can be computed in two ways depending on the
underlying assumption that the true population distribution is known. If
this assumption is true, then the \(\tau\)-statistic is similar to other
common measures of global clustering that rely on knowledge of the
background population distribution to quantify generic clustering of a
point process.
\[\hat{\tau}(d_1, d_2) = \frac{\hat{\pi}(d_1, d_2)}{\hat{\pi}(0, \infty)},\]
where \(\hat{\pi}(d_1, d_2)\) represents the incidence rate within
distance \(d_1\) to \(d_2\) of a case and \(\hat{\pi}(0, \infty)\) represents
the incidence rate over the entire extent of the study area. This can be
implemented by defining the numerator and denominator using the get.pi
function or by using the generalized get.tau
function with
comparison.type = ’representative’
argument. The \(\hat{\pi}(\cdot)\)
terms in the numerator and denominator use the occurrence data to
calculate the incidence rates for linked cases within some defined
distance. Therefore, a critical step in performing an analysis with the
\(\tau\)-statistic is specifying which cases are linked through some
defined relationship (homologous) and those that are not
(non-homologous). Homology can be defined statically or dynamically.
When defined statically, the get.pi.typed
and get.tau.typed
functions can be used to assign case types based on a type
column
supplied by the data matrix. When defined dynamically, an indicator
function \(I(\cdot)\) is used to delineate linked and unlinked cases in
the data, which allows greater flexibility when defining case type
homology.
# Calculate tau-statistic using get.pi.typed functions
data(DengueSimRepresentative)
<- 2 - (DengueSimRepresentative[,'serotype'] == 1)
type <- cbind(DengueSimRepresentative, type=type)
typed.data <- seq(20, 1000, 20)
d2 <- d2 - 20
d1
# Static definition of case type homology
<- get.pi.typed(typed.data, typeA=1, typeB=2, r.low=d1, r=d2)
num <- get.pi.typed(typed.data, typeA=1, typeB=2, r.low=0, r=1e10)
den head(num$pi/den$pi, n=4)
1] 0.2641154 0.2104828 0.2451847 0.2487042
[
<- get.tau.typed(typed.data, typeA=1, typeB=2, r.low=d1, r=d2,
tau comparison.type = "representative") # Equivalent
head(tau, n=4)
r.low r tau1 0 20 0.2641154
2 20 40 0.2104828
3 40 60 0.2451847
4 60 80 0.2487042
# Calculate tau-statistic using dynamic expression indicating serotype homology
<- function(a, b){
ind.func if (a[5] == 1 & b[5] == 1) {
<- 1
x else {
} <- 2
x
}return(x)
}
<- get.pi(posmat=DengueSimRepresentative, fun=ind.func, r.low=d1, r=d2)
num <- get.pi(posmat=DengueSimRepresentative, fun=ind.func, r.low=0, r=Inf)
den head(num$pi/den$pi, n=4)
1] 5.084735 4.967885 4.605805 4.409876
[
<- get.tau(posmat=DengueSimRepresentative, fun=ind.func, r.low=d1, r=d2,
tau comparison.type="representative") # Equivalent
head(tau, n=4)
r.low r tau1 0 20 5.084735
2 20 40 4.967885
3 40 60 4.605805
4 60 80 4.409876
plot(tau$r.low+tau$r/2, tau$tau, type='l', lwd=2, col='blue', xlab='distance (m)')
abline(h=1, lty=2, lwd=2, col='red')
abline(v=100)
The interpretation of the \(\tau\)-statistic is analogous to that of the pair-correlation function in two ways. First, the \(\tau\)-statistic is not compared to the theoretical measure of a random Poisson process because the metric is an incidence rate ratio with epidemiological meaning. Instead, this measure is plotted in comparison to a ratio of 1, indicating no relative difference in disease risk among homologous cases. Second, the \(\tau\)-statistic measures relative risk using case pairs within a distance range \((d_1 \leq d_{ij} < d_2)\). This approach can describe how fine-scale spatial dependence changes over distance. However, if the user wishes to estimate cumulative spatial dependence (analogous to the \(K\)-function), then \(d_1\) can be fixed at zero \((0 \leq d_{ij} < d_{2})\).
If the underlying population distribution is unknown, then the \(\tau\)-statistic can be computed so that it quantifies global clustering in terms of relative risk. This goes beyond classic measures of clustering by utilizing some relationship between linked and unlinked cases to distinguish transmission chains and quantify relative clustering between them. To do so, we use the function \(\hat{\theta}(\cdot)\) which gives the odds ratio of cases related to case \(i\) to those independent of \(i\) to give an estimate of the \(\tau\)-statistic that is not biased by assumptions about the underlying population distribution. \[\hat{\tau}(d_1, d_2) = \frac{\hat{\theta}(d_1, d_2)}{\hat{\theta}(0, \infty)},\] where, \[\hat{\theta}(d_1, d_2) = \frac{\sum_{\forall i} \sum_{\forall j} I_1(z_{ij}=1, \ d_1 \leq d_{ij} < d_2)}{\sum_{\forall i} \sum_{\forall j} I_2(z_{ij}=0, \ d_1 \leq d_{ij} < d_2)}.\] The indicator function \(I(\cdot)\) is applied to all \(ij\) case pairs within distance \(d_1\) and \(d_2\). It returns a binary response which is equal to 1 when case pairs meet user-specified criteria to be homologous and equal to 2 when they are non-homologous. The result is an \(i \times j\) relation matrix \(z_{ij}\) which is used to find the sums of homologous and non-homologous case pairs. Using an indicator function also allows additional criteria to be used to define case type homology, such as temporal proximity (Figure 6).
# Calculate tau-statistic using serotype homology and time
data(DengueSimR01)
<- seq(20, 1000, 20)
d2 <- d2 - 20
d1
# Dynamic expression indicating serotype homology and temporal proximity
<- function(a, b, t.limit=20){
ind.func if (a[5] == b[5] & (abs(a[3] - b[3]) <= t.limit)){
<- 1
x else {
} <- 2
x
}return(x)
}
<- get.theta(DengueSimR01, ind.func, r.low=d1, r=d2)
num <- get.theta(DengueSimR01, ind.func, r.low=0, r=Inf)
den head(num$theta/den$theta, n=4)
1] 3.9148969 3.5145802 4.5963608 5.1082210
[
<- get.tau(posmat=DengueSimR01, fun=ind.func, r.low=d1, r=d2,
tau comparison.type="independent") # Equivalent
head(tau, n=4)
r.low r tau1 0 20 3.914897
2 20 40 3.514580
3 40 60 4.596361
4 60 80 5.108221
plot(tau$r, tau$tau, type='l', lwd=2, col='blue', xlab='distance (m)')
abline(h=1, lty=2, lwd=2, col='red')
abline(v=100)
In the examples above, the get.pi
, get.theta
, and get.tau
function
families calculate point estimates for \(\hat{\pi}\), \(\hat{\theta}\), and
\(\hat{\tau}\) respectively. In scenarios where observation error,
sampling bias, or measurement error are expected to introduce additional
variance, users may wish to place confidence intervals around these
point estimates. For this purpose, each family of functions contains a
function ending with a .bootstrap
suffix, which generates point
estimates for boot.iter
number of bootstrapped samples of the data
(Efron and Tibshirani 1994). Functions ending with a .ci
suffix are
wrappers that calculate user specified confidence intervals based on the
bootstrapped samples (Figure 7).
# Calculate variance around point estimates of the tau-statistic
data(DengueSimR02)
<- seq(20, 1000, 20)
d2 <- d2 - 20
d1
# Function indicating genotype homology
<- function(a, b){
ind.func if(a[4] == b[4]){
= 1
x else{
} = 2
x
}return(x)
}
# Bootstrapped estimates of tau
<- get.tau.bootstrap(DengueSimR02, ind.func, r.low=d1, r=d2, boot.iter=5)
tau.boot head(tau.boot, n=4)
r.low r X1 X2 X3 X4 X51 0 20 51.04283 49.17736 60.45922 43.36588 37.26332
2 20 40 20.80095 29.62483 26.54935 34.11416 31.32279
3 40 60 34.05415 35.66984 40.21975 31.02943 32.77966
4 60 80 30.52361 35.46972 27.77247 36.64628 32.43156
# Wrapper function of get.tau.bootstrap calculates confidence intervals
<- get.tau.ci(DengueSimR02, ind.func, r.low=d1, r=d2, boot.iter=25)
tau.ci head(tau.ci, n=4)
r.low r pt.est ci.low ci.high1 0 20 44.05161 22.73147 59.44465
2 20 40 30.83943 19.68758 42.05249
3 40 60 37.57434 30.48664 45.78121
4 60 80 33.54134 28.12390 38.76330
plot(tau.ci$r, tau.ci$pt.est, ylim=range(tau.ci[,4:5]), type="l", lwd=2, col='blue',
xlab='distance (m)', ylab='tau')
lines(tau.ci$r, tau.ci$ci.low, lty=2, lwd=1, col='blue')
lines(tau.ci$r, tau.ci$ci.high, lty=2, lwd=1, col='blue')
abline(h=1, lty=2, lwd=2, col='red')
A common approach for interpreting spatial clustering statistics
includes hypothesis testing using simulation envelopes to assess whether
an observed spatial measure is statistically significant
(Ripley 1979; Baddeley et al. 2014). To enable null hypothesis
tests, we have implemented a permutation method (Good 2010)
to simulate the nonparametric distribution of \(\hat{\pi}\),
\(\hat{\theta}\), and \(\hat{\tau}\) under the null hypothesis of no spatial
dependence. The permutation algorithm simulates the null distribution by
randomly reassigning case coordinates to observations upon each
permutation. Null distributions can be computed using functions ending
in the .permute
suffix and then plotted with observed measures to
assess statistical significance as a function of distance (Figure
8).
# Compare tau statistic to its null distribution using permutation
data(DengueSimR02)
set.seed(123)
<- seq(20, 1000, 20)
d2 <- d2 - 20
d1
# Compare spatial dependence by time case occurs
<- 2 - (DengueSimR02[,"time"] < 120)
type <- cbind(DengueSimR02, type=type)
typed.data
<- get.tau.typed(typed.data, typeA=1, typeB=2, r.low=d1, r=d2,
typed.tau comparison.type = "independent")
head(typed.tau, n=4)
r.low r tau1 0 20 0.4040661
2 20 40 0.5471728
3 40 60 0.7897655
4 60 80 0.8901166
# Perform permutations of observed case times and locations for null distribution
<- get.tau.typed.permute(typed.data, typeA=1, typeB=2, r.low=d1, r=d2,
typed.tau.null permutations=100,
comparison.type = "independent")
head(typed.tau.null[,1:7], n=4)
r.low r X1 X2 X3 X4 X51 0 20 1.2570945 0.8530284 3.5019060 1.0326133 1.0775095
2 20 40 1.1448539 0.7045255 0.7224212 2.0742058 0.8754765
3 40 60 0.6947101 0.8904419 0.8249682 0.6990984 0.5128531
4 60 80 1.6266250 1.0916873 0.8326210 0.8432683 1.4815756
# 95% confidence intervals of null distribution
<- apply(typed.tau.null[,-(1:2)], 1, quantile, probs=c(0.025, 0.975))
null.ci
plot(typed.tau$r, typed.tau$tau, type='l', lwd=2, ylim=range(c(typed.tau$tau, null.ci)),
xlab="distance (m)", ylab="tau")
lines(typed.tau$r, null.ci[1,], lty=2)
lines(typed.tau$r, null.ci[2,], lty=2)
abline(h=1, lty=2, lwd=2, col='red')
Conventional spatial statistics are often used to describe the intensity or clustering of point processes. Quantifying spatial dependence of infectious disease spread, however, requires a modified approach that considers overlapping transmission chains and the likelihood of case linkage. Therefore, we have implemented two types of spatial statistics in the IDSpatialStats package (the mean transmission distance \(\mu_k\), and the \(\tau\)-statistic) that can be used along with other measures of spatial dependence (e.g. the cross \(K\)-function and cross pair correlation function) to understand the spatial spread of infectious diseases.
We showed how to simulate epidemiological data and estimate \(\mu_k\), and
the \(\tau\)-statistic, which can be used as templates for other analyses.
First, the transdist
family of functions provides a measure of
fine-scale spatial dependence by estimating the mean of the transmission
distance \(\hat{\mu}_k\) between sequentially linked cases in a
transmission chain (Salje et al. 2016b). Second, the get.tau
family of functions measure spatial dependence on a larger-scale by
estimating the \(\tau\)-statistic, which describes the area of elevated
prevalence surrounding cases. This family of functions does so by
estimating the relative risk of a case being homologous compared with
non-homologous case types. The definition of case type homology is
flexible and can utilize temporal or biological information, such as
genotype and serotype of the pathogen.
The generalized structure of the get.tau
family allows for diverse
applications of the \(\tau\)-statistic to epidemiological data. Previous
studies have used the \(\tau\)-statistic to quantify spatial and/or
temporal dependence of transmission for Dengue, Cholera, HIV, and
Measles disease systems (see Table 2 for detailed
descriptions). These studies illustrate that, regardless of the system
under study, analyses are enhanced when bootstrapping, permutation
tests, and/or assessment of observation error is employed to understand
the distribution of error and statistical significance for estimates of
the \(\tau\)-statistic.
The IDSpatialStats package is undergoing continued development. Future directions include expanding the implementation of the \(\tau\)-statistic to facilitate estimation of spatio-temporal dependence by incorporating a temporal interval into the spatial search window. This technique was used by (Salje et al. 2012) in the form of the \(\phi\)-statistic to estimate \(\hat{\phi}(d_1, d_2, t_1, t_2)\). Additional developments include a theoretical framework for the \(\tau\)-statistic that incorporates uncertainty due to pathogen generation time, and to define case type homology more continuously using genetic distance matrices. We hope these developments will enable users to address more complex questions and incorporate more sources of uncertainty into estimates of spatial dependence. Check Github at https://github.com/HopkinsIDD/IDSpatialStats for latest development release.
Description | Citation |
---|---|
Spatial and temporal dependence of homotypic and heterotypic Dengue | (Salje et al. 2012) |
virus serotypes over a 5 year period in Bangkok, Thailand | |
Clustering of HIV prevalence and incidence around HIV-seropositive | (Grabowski et al. 2014) |
individuals using cohort data from rural Rakai District, Uganda | |
Overview of the \(\tau\)-statistic, its performance given observation error, | (Lessler et al. 2016) |
and illustrations using Dengue, HIV, and Measles | |
Spatial dependence of seroconverted individuals in the 2012–2013 | (Salje et al. 2016a) |
Chikungunya outbreak in the Phillipines | |
Comparison of spatial dependence in endemic transmission of Dengue | (Quoc et al. 2016) |
virus serotypes in Bangkok and Ho Chi Min City, Thailand | |
Risk of Cholera transmission within spatial and temporal zones after case | (Azman et al. 2018) |
presentation during urban epidemics in Chad and D.R. Congo | |
Summary statistic to fit micro-simulations of Cholera interventions | (Finger et al. 2018) |
to epidemic data using Approximate Bayesian Computation | |
Temporal clustering of subclinical infections and homologous serotypes | (Salje et al. 2018) |
within schools using Dengue cohort data in Thailand |
lgcp, ppmlasso, spdep, ads, spatstat, splancs, IDSpatialStats, DCluster, SGCS, sparr
Spatial, SpatioTemporal, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Giles, et al., "The IDSpatialStats R Package: Quantifying Spatial Dependence of Infectious Disease Spread", The R Journal, 2019
BibTeX citation
@article{RJ-2019-043, author = {Giles, John R and Salje, Henrik and Lessler, Justin}, title = {The IDSpatialStats R Package: Quantifying Spatial Dependence of Infectious Disease Spread}, journal = {The R Journal}, year = {2019}, note = {https://rjournal.github.io/}, volume = {11}, issue = {2}, issn = {2073-4859}, pages = {308-327} }