This paper introduces the R package *HDSpatialScan*. This package allows users to easily apply spatial scan statistics to real-valued multivariate data or to univariate and multivariate functional data. It also allows users to plot the detected clusters and to summarize them. In this article, the methods are presented and the use of the package is illustrated through examples on environmental data provided in the package.

Spatial cluster detection methods are useful tools for the objective
detection and localization of statistically significant aggregations of
events indexed in space. Examples of the applications of these methods
are numerous. In the field of epidemiology, these methods allow
epidemiologists to detect spatial clusters of disease cases and to
formulate etiological hypotheses. In the environmental sciences,
researchers may search for geographical areas that are particularly
polluted, either by a single pollutant or by several pollutants
simultaneously. In astronomy, researchers may want to identify star
clusters from telescope image data.

Several cluster detection methods have been proposed in the literature.
In particular, spatial scan statistics (originally proposed by
Kulldorff and Nagarwalla (1995) and Kulldorff (1997) for Bernoulli and Poisson models)
are powerful methods for detecting statistically significant spatial
clusters, which can be defined by an aggregation of sites presenting an
abnormal concentration (mean, etc.) of an observed variable, with a
variable scanning window and in the absence of pre-selection bias
(objective detection of the cluster). Following on from Kulldorff’s
initial work, several researchers have adapted spatial scan statistics
to other spatial data distributions: exponential (Huang et al. 2007),
ordinal (Jung et al. 2007), normal (Kulldorff et al. 2009), Weibull
(Bhatt and Tiwari 2014), etc. Others have proposed nonparametric approaches:
Jung and Cho (2015) and Cucala (2016) extended the
Wilcoxon-Mann-Whitney test to spatial scan statistics and to temporal
or spatial scan statistics, respectively. Note that in the case of spatial
data the two approaches are equivalent once the method of
Jung and Cho (2015) is generalized to detect either high or low clusters.

The applications of scan statistics are numerous. In the field of
epidemiology, Khan et al. (2021) detected significant clusters of
diabetes incidence in Florida between 2007 and 2010, which will help
guide local health policies. Marciano et al. (2018) sought to
detect spatial clusters of leprosy incidence in a hyperendemic Brazilian
municipality over the periods 2000-2005 and 2006-2010. The study showed a
high percentage of contact between people, which facilitates the
transmission of the disease. Genin et al. (2020) detected high-risk clusters
of Crohn’s disease in France over the period 2007-2014. As the causes of
this disease are still poorly understood, the detection of spatial
clusters of Crohn’s disease allows the researchers to make hypotheses on
possible risk factors, such as high social deprivation or high
urbanization. In the context of environmental science, the detection of
clusters of symptomatic exposure to pesticides in rural areas
(Sudakin et al. 2002) would allow the monitoring and prevention of
pesticide-related diseases. Gao et al. (2014) focused on the presence of iodine
in drinking water in Shandong Province, China. The detection of spatial
clusters of iodine presence in drinking water allows an improvement of
the monitoring of drinking water quality in these geographical areas.
Finally, in the context of pollution data, Wan et al. (2020) and
Shi et al. (2021) detected clusters of high concentrations
of \(\text{PM}_{2.5}\) in America and China, respectively. Such results may allow local
authorities to specifically monitor these areas and make decisions to
reduce pollution.

When multiple variables are observed simultaneously at each spatial
location, researchers may be interested in detecting spatial clusters
with anomalous values of all measured variables. In this context,
Kulldorff et al. (2007) proposed a multivariate spatial scan statistic using a
combination of independent univariate scan statistics. However, this approach fails
to take into account the potential correlations between the variables. A
first spatial scan statistic for multivariate data taking into account
the correlations was proposed by Cucala et al. (2017). Their method
is based on a multivariate normal probability model and a likelihood
ratio. Later, Cucala et al. (2019) proposed a nonparametric spatial scan
statistic for multivariate data based on a multivariate
Wilcoxon-Mann-Whitney test.

Technological developments in measurement tools and data storage
capacity have led to the increasing use of sensors, cell phones and,
more generally, connected devices that collect data continuously or
almost continuously over time. This has prompted the introduction of new
analysis methods for functional data (Ramsay and Silverman 2005), as well as the
adaptation of classical statistical methods such as principal component
analysis (Boente and Fraiman 2000; Berrendero et al. 2011) or regression
(Cuevas et al. 2002; Ferraty and Vieu 2002; Chiou and Müller 2007).

In the field of spatial scan statistics, Frévent et al. (2021a) and
Smida et al. (2022) proposed new methods for univariate processes. However,
in environmental surveillance for example, numerous variables are
measured simultaneously, making a multivariate functional approach
necessary to detect environmental black-spots. These can be defined as
geographical areas characterized by elevated concentrations of multiple
pollutants. Although Smida et al. (2022) only studied their approach in the
univariate functional framework, they suggested that it could also be
adapted for multivariate processes. Frévent et al. (2021b) studied this
adaptation and also developed new efficient methods for multivariate
functional spatial scan statistics.

In R several packages provide spatial scan statistics implementations.
The best known is certainly the
*rsatscan* (Kleinman 2015)
package which provides functions to interface R and the SaTScan software
(Kulldorff 2021), allowing the latter to be launched from R. It implements
many univariate methods (ordinal, Bernoulli, Poisson, …) as well as
the space-time scan statistic (Kulldorff et al. 1998) and the multivariate
extensions proposed by Kulldorff et al. (2007). The function `kulldorff`
implemented in the R package
*SpatialEpi*
(Chen et al. 2018) also performs spatial scan statistics based on the
Poisson and Bernoulli models. Other software has been developed to detect
clusters, such as ClusterSeer (Durbeck et al. 2012; Greiling et al. 2012), which
performs spatial, temporal and space-time clustering, and TreeScan
(Kulldorff 2018) which implements the tree-based scan statistic (Kulldorff et al. 2003). We
should also mention the R package
*DCluster* (Gómez-Rubio et al. 2015)
which implements the spatial scan statistics for Poisson or Bernoulli
models. The R package
*DClusterm*
(Gómez-Rubio et al. 2019; Gómez-Rubio et al. 2020) also implements a cluster detection method.
Briefly, it consists of fitting a large number of generalized linear
models, including potential cluster indicators one by one, and then
using a model selection procedure. The Shiny application SpatialEpiApp
(Moraga 2017a) and the R package
*SpatialEpiApp*
(Moraga 2017b) allow the detection and visualization of clusters by
using the scan statistics implemented in SaTScan. Finally the software
FlexScan (Takahashi et al. 2010) and the R package
*rflexscan* (Otani and Takahashi 2021)
implement the spatial scan statistic of Takahashi and Tango (2005), which uses
a scan window whose shape is not predefined. Other R packages also allow
cluster detection, such as
*graphscan* (Loche et al. 2016)
(the `cluster` function), *SPATCLUS* (Dematteï et al. 2006) or
*scanstatistics*
(Allévius 2018a,b) for spatial or space-time data. It
should be noted that these last two packages are no longer available on
the Comprehensive R Archive Network (CRAN) repository. Although
existing packages implement a large number of spatial scan statistic
models, none of them proposes multivariate scan models taking into
account the potential correlations between variables or scan models for
functional data. Thus, we have developed the R package *HDSpatialScan*
for high-dimensional spatial scan statistics. The package allows, on the
one hand, the detection of spatial clusters in multivariate or functional
data and, on the other hand, their display on a map and the description
of their characteristics.

This paper is organized as follows: The following section presents the
different models implemented in the R package *HDSpatialScan*. Then, the
implementation of the methods is described and examples of use of the
package are given. The last section concludes the paper.

Let \(s_1, \dots, s_n\) be \(n\) different locations of an observation domain \(S \subset \mathbb{R}^2\) and \(X_1, \dots, X_n\) be the observations of a variable \(X\) at \(s_1, \dots, s_n\). Hereafter all observations are considered to be independent, which is a classical assumption in scan statistics. Three types of spatial data can be considered: lattice data (the data are aggregated at some spatial level, e.g., counties), geostatistical data (the variable is defined on a continuous area and each individual measure corresponds to a unique fixed spatial location, e.g., pollutant concentrations measured by sensors over a region), or marked point data (each individual measure corresponds to a unique random spatial location, e.g., the heights of the trees in a forest, where the locations of the trees are random).

Spatial scan statistics aim at detecting spatial clusters and testing their statistical significance. Hence, one tests a null hypothesis \(\mathcal{H}_0\) (the absence of a cluster) against a composite alternative hypothesis \(\mathcal{H}_1\) (the presence of at least one cluster \(w \subset S\) presenting abnormal values of \(X\)). For this purpose, a spatial scan statistic consists of two steps. The first one is a detection phase using a scanning window of variable size and shape. We will focus here on the approach of Kulldorff and Nagarwalla (1995), who use a circular scanning window of variable center and radius. It should be noted, however, that other shapes can be considered (Kulldorff et al. 2006; Cucala et al. 2013). A common recommendation is to limit the maximum cluster size to half of the studied region: a larger cluster covering almost all the studied region would amount to detecting a “negative cluster” in the area outside it (Kulldorff and Nagarwalla 1995). The scanning window then defines a set of potential clusters \(\mathcal{W}\) by \[\label{eq:cluster} \mathcal{W} = \{ w_{i,j} \ / \ 1 \le |w_{i,j}| \le \frac{n}{2}, \ 1 \le i,j \le n \}, \tag{1}\] where \(w_{i,j}\) is the disc centered on \(s_i\) that passes through \(s_j\) and \(|w_{i,j}|\) corresponds to the number of sites in \(w_{i,j}\). Figure 1 illustrates the set of potential clusters defined with a circular scanning window with Equation (1) on a set of eight administrative areas in France.
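
To make the construction of \(\mathcal{W}\) in Equation (1) concrete, the following base R sketch enumerates the candidate clusters for a small set of sites. The helper `circular_candidates()` and its arguments are hypothetical names introduced here for illustration only and are not part of *HDSpatialScan*. The sketch assumes Euclidean distances between site coordinates and closed discs (sites lying on the boundary of the disc are included).

```r
# Hypothetical helper (not part of HDSpatialScan): enumerate the candidate
# clusters w_{i,j} of Equation (1) for a circular scanning window.
# Assumption: `coords` is an n x 2 numeric matrix of site coordinates.
circular_candidates <- function(coords, max_frac = 0.5) {
  n <- nrow(coords)
  d <- as.matrix(dist(coords))  # pairwise Euclidean distances between sites
  clusters <- list()
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Sites inside the closed disc centered on s_i and passing through s_j
      members <- unname(which(d[i, ] <= d[i, j]))
      # Keep only the windows containing at most n * max_frac sites
      if (length(members) <= n * max_frac) {
        clusters[[length(clusters) + 1L]] <- members
      }
    }
  }
  # Different pairs (i, j) can yield the same set of sites: keep each set once
  unique(clusters)
}

# Toy example: four sites on a line, the last one far away from the others
coords <- cbind(x = c(0, 1, 2, 10), y = c(0, 0, 0, 0))
W <- circular_candidates(coords)
length(W)  # 7 distinct candidate clusters of at most n / 2 = 2 sites
```

In this toy example the disc centered on the second site and passing through the first one contains three sites and is therefore discarded, which illustrates the role of the upper bound \(n/2\) in Equation (1).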