This paper introduces the R package HDSpatialScan. This package allows users to easily apply spatial scan statistics to real-valued multivariate data or both univariate and multivariate functional data. It also permits plotting the detected clusters and to summarize them. In this article the methods are presented and the use of the package is illustrated through examples on environmental data provided in the package.
Spatial cluster detection methods are useful tools for objective
detection and localization of statistically significant aggregation of
events indexed in space. Examples of the applications of these methods
are numerous: in the field of epidemiology, these methods allow
epidemiologists to detect spatial clusters of disease cases and to
formulate etiological hypotheses; in the environmental sciences,
researchers can be led to search for particularly polluted geographical
areas, either by one pollutant in particular or by several pollutants
simultaneously. In astronomy, researchers may want to identify star
clusters from telescope image data.
Several cluster detection methods have been proposed in the literature.
In particular, spatial scan statistics (originally proposed by
Kulldorff and Nagarwalla (1995) and Kulldorff (1997) for Bernoulli and Poisson models)
are powerful methods for detecting statistically significant spatial
clusters, which can be defined by an aggregation of sites presenting an
abnormal concentration (mean, etc) of an observed variable, with a
variable scanning window and in the absence of pre-selection bias
(objective detection of the cluster). Following on from Kulldorff’s
initial work, several researchers have adapted spatial scan statistics
to other spatial data distributions: exponential (Huang et al. 2007),
ordinal (Jung et al. 2007), normal (Kulldorff et al. 2009), Weibull
(Bhatt and Tiwari 2014), etc. Others use nonparametric approaches such as
Jung and Cho (2015) and Cucala (2016) who respectively extend the
Wilcoxon-Mann-Whitney test for spatial scan statistics and for temporal
or spatial scan statistics. Note that in the case of spatial data the
two approaches are equivalent by generalizing the method of
Jung and Cho (2015) to detect either high or low clusters.
The applications of scan statistics are numerous. In the field of
epidemiology, Khan et al. (2021) detected significant clusters of
diabetes incidence in Florida between 2007 and 2010, which will help
guide local health policies. Marciano et al. (2018) sought to
detect spatial clusters of leprosy incidence in a hyperendemic Brazilian
municipality between 2000 and 2005 and 2006 and 2010. The study showed a
high percentage of contact between people which facilitates the
transmission of the disease. Genin et al. (2020) detected high-risk clusters
of Crohn’s disease in France over the period 2007-2014. As the causes of
this disease are still poorly understood, the detection of spatial
clusters of Crohn’s disease allows the researchers to make hypotheses on
possible risk factors, such as high-social deprivation or high
urbanization. In the context of environmental science, the detection of
clusters of symptomatic exposure to pesticides in rural areas
(Sudakin et al. 2002) would allow the monitoring and prevention of
pesticide-related diseases. Gao et al. (2014) focused on the presence of iodine
in drinking water in Shandong Province, China. The detection of spatial
clusters of iodine presence in drinking water allows an improvement of
the monitoring of drinking water quality in these geographical areas.
Finally in the context of pollution data, (Wan et al. 2020) and
(Shi et al. 2021) respectively detected clusters of high concentrations
of \(\text{PM}_{2.5}\) in America and China. Such results may allow local
authorities to specifically monitor these areas and make decisions to
reduce pollution.
When multiple variables are observed simultaneously at each spatial
location, researchers may be interested in detecting spatial clusters
with anomalous values of all measured variables. In this context,
Kulldorff et al. (2007) proposed a multivariate spatial scan statistic using a
combination of independent univariate scan statistics. However it fails
to take into account the potential correlations between the variables. A
first spatial scan statistic for multivariate data taking into account
the correlations was proposed by Cucala et al. (2017). Their method
is based on a multivariate normal probability model and a likelihood
ratio. Later, Cucala et al. (2019) proposed a nonparametric spatial scan
statistic for multivariate data based on a multivariate
Wilcoxon-Mann-Whitney test.
Technological developments in measurement tools and data storage
capacity have yielded to the increasing use of sensors, cell phones and
more generally connected devices that collect data continuously or
almost continuously over time. This has led to the introduction of new
analysis methods for functional data (Ramsay and Silverman 2005), as well as the
adaptation of classical statistical methods such as principal component
analysis (Boente and Fraiman 2000; Berrendero et al. 2011) or regression
(Cuevas et al. 2002; Ferraty and Vieu 2002; Chiou and Müller 2007).
In the field of spatial scan statistics, Frévent et al. (2021a) and
Smida et al. (2022) proposed new methods for univariate processes. However for
example, in environmental surveillance, numerous variables are
simultaneously measured, making a multivariate functional approach
necessary to detect environmental black-spots. These can be defined as
geographical areas characterized by elevated concentrations of multiple
pollutants. Although Smida et al. (2022) only studied their approach in the
univariate functional framework, they suggest that it could also be
adapted for multivariate processes. Frévent et al. (2021b) studied this
adaptation and also developed new efficient methods for multivariate
functional spatial scan statistics.
In R several packages provide spatial scan statistics implementations.
The best known is certainly the
rsatscan (Kleinman 2015)
package which provides functions to interface R and the SaTScan software
(Kulldorff 2021), allowing the latter to be launched from R. It implements
lots of univariate methods (ordinal, Bernoulli, Poisson, …) but also
the space-time spatial scan statistic (Kulldorff et al. 1998) and the multivariate
extensions proposed by Kulldorff et al. (2007). The function kulldorff
implemented in the R package
SpatialEpi
(Chen et al. 2018) also performs the spatial scan statistics based on the
Poisson and the Bernoulli models. Other softwares were created to detect
clusters such as ClusterSeer (Durbeck et al. 2012; Greiling et al. 2012) which
performs spatial, temporal and space-time clustering, and TreeScan
(Kulldorff 2018) which implements the tree-based scan statistic (Kulldorff et al. 2003). We
should also mention the R package
DCluster (Gómez-Rubio et al. 2015)
which implements the spatial scan statistics for Poisson or Bernoulli
models. The R package
DClusterm
(Gómez-Rubio et al. 2019; Gomez-Rubio et al. 2020) also implements a cluster detection method.
Briefly, it consists in considering a large number of generalized linear
models by including potential cluster indicators one by one, and then to
use a model selection procedure. The Shiny application SpatialEpiApp
(Moraga 2017a) and the R package
SpatialEpiApp
(Moraga 2017b) allow the detection and visualization of clusters by
using the scan statistics implemented in SaTScan. Finally the software
FlexScan (Takahashi et al. 2010) and the R package
rflexscan (Otani and Takahashi 2021)
implement the spatial scan statistic using a scan window with a non
pre-defined shape, defined by Takahashi and Tango (2005). Other R packages also allow
clusters detection such as
graphscan (Loche et al. 2016)
(the cluster
function), SPATCLUS (Dematteı̈ et al. 2006) or
scanstatistics
(Allévius 2018a,b) for spatial or space-time data. It
should be noted that these last two packages are no longer available on
the CRAN (The Comprehensive R Archive Network) repository. Although
existing packages implement a large number of statistical spatial scan
models, none of them propose multivariate scan models taking into
account the potential correlations between variables or scan models for
functional data. Thus, we have developed the R package HDSpatialScan
for high-dimensional spatial scan statistics. The latter allows on the
one hand the detection of spatial clusters in multivariate or functional
data, and on the other hand, their display on a map and the description
of their characteristics.
This paper is organized as follows: The following section presents the
different models implemented in the R package HDSpatialScan. Then, the
implementation of the methods is described and examples of use of the
package are given. The last section concludes the paper.
Let \(s_1, \dots, s_n\) be \(n\) different locations of an observation domain \(S \subset \mathbb{R}^2\) and \(X_1, \dots, X_n\) be the observations of a variable \(X\) in \(s_1, \dots, s_n\). Hereafter all observations are considered to be independent, which is a classical assumption in scan statistics. Three types of spatial data can be considered: either lattice data (the data are aggregated at the spatial level, e.g.: county), geostatistical data (the variable is defined on a continuous area and each individual measure corresponds to a unique fixed spatial location, e.g.: pollutant concentration measured by sensors over a region), or marked point data (each individual measure corresponds to a unique random spatial location, e.g.: height of the trees in a forest, the location of the trees is random).
Spatial scan statistics aim at detecting spatial clusters and testing their statistical significance. Hence, one tests a null hypothesis \(\mathcal{H}_0\) (the absence of a cluster) against a composite alternative hypothesis \(\mathcal{H}_1\) (the presence of at least one cluster \(w \subset S\) presenting abnormal values of \(X\)). For this purpose, a spatial scan statistic consists of two steps. The first one is a detection phase using a scanning window of variable size and shape. We will focus here on the approach of Kulldorff and Nagarwalla (1995) which use a circular scanning window of variable center and radius, however it should be noted that other shapes can be considered (Kulldorff et al. 2006; Cucala et al. 2013). An approach often advised is to limit the maximum size to half of the studied region since otherwise it would be like detecting a “negative cluster” in the areas outside the clusters covering almost all the studied region (Kulldorff and Nagarwalla 1995). Then the scanning window allows to define a set of potential clusters \(\mathcal{W}\) by \[\label{eq:cluster} \mathcal{W} = \{ w_{i,j} \ / \ 1 \le |w_{i,j}| \le \frac{n}{2}, \ 1 \le i,j \le n \}, \tag{1}\] where \(w_{i,j}\) is the disc centered on \(s_i\) that passes through \(s_j\) and \(|w_{i,j}|\) corresponds to the number of sites in \(w_{i,j}\). Figure 1 illustrates the set of potential clusters defined with a circular scanning window with Equation (1) on a set of eight administrative areas in France.
Then the spatial scan statistic can be defined as the maximum of a concentration index over the set of potential clusters \(\mathcal{W}\). The second step is the determination of the statistical significance of the spatial scan statistic. For this, since the distribution of the scan statistic is intractable under \(\mathcal{H}_0\) due to the overlapping nature of \(\mathcal{W}\), a common approach, which will be considered here, is to use a Monte-Carlo method (see Section 2.4 for more details).
Here we consider the case where several continuous variables are
simultaneously observed in each spatial location:
\(X = (X^{(1)}, \dots, X^{(p)})^\top\) is a \(p\)-dimensional variable
(\(p \ge 2\)). In this context the objective is to identify multivariate
spatial clusters that are aggregations of sites in which \(X\) takes
higher or lower values (in terms of mean, median, etc.) than elsewhere.
For example one could observe the average concentrations of several
pollutants over a day: a vector can be associated with each site, each
element of which corresponds to the average concentration of one
pollutant. In this context a spatial cluster corresponds to a set of
sites under or overexposed to multiple pollutants. Different approaches
will be presented: a parametric method based on a Gaussian model and a
nonparametric one.
Figure 2 summarizes the different types of
multivariate data with examples, and provides guidelines on the spatial
scan statistics methods to be used for these data (and the argument to
use in the scan function of the package). More precisely, we distinguish
three types of spatial data: lattice data which are aggregated data for
example at the scale of the regions of a country, geostatistical data
which are defined on a continuous space (typically temperature,
sunshine, or atmospheric pressure) although they are observed only at
discrete sites, and marked point data for which the location is random
(for example the distribution of trees in a forest) and we observe at
each location the circumference and height of the tree for example. To
detect spatial clusters, in the case of Gaussian data we will prefer the
Gaussian approach (MG) and otherwise we will use the nonparametric
approach (MNP).
Cucala et al. (2017) proposed a parametric spatial scan statistic
for multivariate data based on a multivariate normal model taking into
account the correlations between the variables.
The null hypothesis \(\mathcal{H}_0\), corresponding to the absence of any
cluster in the data, is the following:
\(\forall i \in \left[\!\left[ 1 ; n \right]\!\right], \ X_i \sim \mathcal{N}_p(\mu, \Sigma)\)
and the alternative hypothesis \(\mathcal{H}_1^{(w)}\) associated with a
potential cluster \(w\) can be defined as:
\(\forall i \in \left[\!\left[ 1 ; n \right]\!\right], \ X_i \sim \left\{ \begin{array}{ll} \mathcal{N}_p(\mu_w, \Sigma_{w,w^\mathsf{c}}) & \text{ if } s_i \in w \\ \mathcal{N}_p(\mu_{w^\mathsf{c}}, \Sigma_{w,w^\mathsf{c}}) & \text{ otherwise} \end{array} \right.\).
Then we can compute the MLE estimates of
\(\mu, \mu_w, \mu_{w^\mathsf{c}}, \Sigma\) and \(\Sigma_{w,w^\mathsf{c}}\):
\(\hat{\mu}, \widehat{\mu}_w, \widehat{\mu}_{w^\mathsf{c}}, \widehat{\Sigma}\)
and \(\widehat{\Sigma}_{w,w^\mathsf{c}}\), and we can show that the
log-likelihood ratio between these two hypotheses is
\[\begin{aligned}
\widehat{LLR^w} = &-\frac{n}{2}\ln\bigg[\text{det}\Big(\sum_{\substack{i\\ s_i \in w}} \left(X_i - \widehat{\mu}_w \right) \left(X_i - \widehat{\mu}_w \right)^\top + \sum_{\substack{i \\ s_i \in w^\mathsf{c}}} \left(X_i - \widehat{\mu}_{w^\mathsf{c}} \right) \left(X_i - \widehat{\mu}_{w^\mathsf{c}} \right)^\top \Big)\bigg] \\
&+ \frac{n}{2}\ln\bigg[\text{det}\Big(\sum_{i=1}^{n} \left(X_i - \hat{\mu} \right) \left(X_i - \hat{\mu} \right)^\top \Big)\bigg],
\end{aligned}\]
where
\(\displaystyle{ \hat{\mu}_g = \dfrac{1}{|g|} \sum_{i, s_i \in g} X_i }\)
for \(g \in \{ w, w^\mathsf{c} \}\) and
\(\displaystyle{ \hat{\mu} = \dfrac{1}{n} \sum_{i=1}^n X_i }\).
Finally the log-likelihood ratio is used as a concentration index and
maximised over the set of potential clusters \(\mathcal{W}\).
Thus we can show that the multivariate Gaussian (MG) scan statistic is
\[\lambda_{\text{MG}} = \underset{w \in \mathcal{W}}{\min} \ \text{det}\Big(\sum_{\substack{i\\ s_i \in w}} \left(X_i - \widehat{\mu}_w \right) \left(X_i - \widehat{\mu}_w \right)^\top + \sum_{\substack{i \\ s_i \in w^\mathsf{c}}} \left(X_i - \widehat{\mu}_{w^\mathsf{c}} \right) \left(X_i - \widehat{\mu}_{w^\mathsf{c}} \right)^\top \Big).\]
This test performs very well against Gaussian alternatives but faces
problems when the data is not normal, which is often the case when
dealing with environmental data exhibiting extreme values. For that
reason Cucala et al. (2019) developed a nonparametric spatial scan statistic
for multivariate data based on a multivariate extension of the
Wilcoxon-Mann-Whitney test for multivariate data (Oja and Randles 2004).
In this context the null hypothesis \(\mathcal{H}_0\) can be rewritten as
\(\mathcal{H}_0: X_1, \dots, X_n\) are identically distributed, whatever
the associated location.
Let
\[\begin{array}{ccccl}
\text{sgn} & : & \mathbb{R}^p & \to & \mathbb{R}^p \\
& & x & \mapsto & \left\{
\begin{array}{cl}
||x||_2^{-1} x & \text{ if } x \neq 0 \\
0 & \text{ otherwise}
\end{array}
\right. \\
\end{array},\]
then the multivariate ranks \(R_i\) are defined by
\(\displaystyle{R_i = \dfrac{1}{n} \sum_{j=1}^{n} \text{sgn}(A_X(X_i - X_j))}\)
where the matrix \(A_X\) makes the ranks such that
\(\displaystyle{\dfrac{p}{n}\sum_{i=1}^{n} R_i R_i^\top = \dfrac{1}{n} \sum_{i=1}^{n} R_i^\top R_i I_p}\).
Note that this matrix can be easily computed using an iterative
procedure. Then the multivariate extension of the Wilcoxon-Mann-Whitney
statistic proposed by Oja and Randles (2004) is
\[U^2(w) = \dfrac{p}{c^2_X} \left[\ |w| \ ||\bar{R}_w||_2^2 + |w^\mathsf{c}| \ ||\bar{R}_{w^\mathsf{c}}||_2^2 \ \right], \text{ where } \displaystyle{c^2_X = \dfrac{1}{n}\sum_{i=1}^n R_i^\top R_i}.\]
Cucala et al. (2019) used \(U^2(w)\) as a concentration index to build the spatial scan statistic: the multivariate nonparametric (MNP) scan statistic is \(\lambda_{\text{MNP}} = \underset{w \in \mathcal{W}}{\max} \ U^2(w)\).
It should be noted that in the case \(p=1\), these statistics are respectively equivalent to the ones introduced by Kulldorff et al. (2009) (which is equivalent to the scan statistic developed by Cucala (2014), UG), and Jung and Cho (2015) (UNP).
Here we consider the case where a continuous variable is observed in
each spatial location over time: \(\{ X(t), \ t \in \mathcal{T} \}\) is a
real-valued stochastic process where \(\mathcal{T}\) is an interval of
\(\mathbb{R}\). In this context the objective is to identify functional
spatial clusters that are aggregations of sites in which the curves are
higher or lower than elsewhere. For example, one can observe the
concentration of an air pollutant over time in different geographical
areas. Then a cluster corresponds to an aggregation of sites in which
the concentration of the air pollutant is higher or lower over the time
than in the other spatial units. Several methods will be considered: a
parametric method based on a functional ANOVA, a nonparametric approach
using a Wilcoxon-Mann-Whitney test for high-dimensional data, a
distribution-free approach based on a pointwise Student’s t-test and
finally a pointwise rank-based method. On Gaussian data, for non
localized clusters in time all approaches show high power and high true
positive rates. However the performances of the ANOVA-based method
strongly decrease on non-normal data. For localized clusters in time
(that are aggregations of sites that take higher or lower values for \(X\)
only in a small interval of time (an interval of five days over a study
period of one month for example)) the pointwise approaches should be
favored.
Figure 3 summarizes the different types of
univariate functional data with examples, and provides recommendations
on the spatial scan statistics methods to be used for these data (and
the argument to use in the scan function of the package). More
precisely, we distinguish lattice functional data which are aggregated
functional data for example at the scale of the administrative areas of
a country (unemployment rate, percentage of the population over 65,
etc), geostatistical functional data which are defined on a continuous
space (typically temperature, sunshine, or atmospheric pressure over
time) although they are observed only at discrete sites, and marked
point data for which the location is random (for example the
distribution of trees in a forest) and we observe at each location the
circumference of the tree over time for example. To detect spatial
clusters, as mentioned before, in the case of Gaussian data we will
prefer the pointwise distribution-free functional approach (DFFSS) and
otherwise we will use the pointwise rank-based approach (URBFSS).
Frévent et al. (2021a) supposed that the process \(X\) takes values in a
semi-metric space, in particular in
\(\mathcal{L}^2(\mathcal{T},\mathbb{R})\) and proposed a parametric
spatial scan statistic for functional data, based on a functional ANOVA.
Here the null hypothesis \(\mathcal{H}_0\) can be rewritten:
\(\mathcal{H}_0: \forall w \in \mathcal{W}, \ \mu_{w} = \mu_{w^\mathsf{c}} = \mu_S\),
and the alternative hypothesis \(\mathcal{H}_1^{(w)}\) associated with a
potential cluster \(w\) can be defined as follows:
\(\mathcal{H}_1^{(w)}: \mu_{w} \neq \mu_{w^\mathsf{c}}\), where \(\mu_w\),
\(\mu_{w^\mathsf{c}}\) and \(\mu_S\) stand for the mean functions in \(w\),
outside \(w\) and over \(S\), respectively. Cuevas et al. (2004) and
Górecki and Smaga (2015) proposed the following ANOVA test statistic:
\[\displaystyle{F_n^{(w)} = \dfrac{
|w| \ ||\bar{X}_{w} - \bar{X}||_2^2 + |w^\mathsf{c}| \ ||\bar{X}_{w^\mathsf{c}} - \bar{X}||_2^2}{ \dfrac{1}{n-2}
\left[\sum_{j, s_j \in w} ||X_j - \bar{X}_{w}||_2^2 + \sum_{j, s_j \in w^\mathsf{c}} ||X_j - \bar{X}_{w^\mathsf{c}}||_2^2 \right]},}\]
where
\(\displaystyle{\bar{X}_{g}(t) = \dfrac{1}{|g|} \sum_{i, s_i \in g} X_i(t)}\)
are empirical estimators of \(\mu_g\) (\(g \in \{w, w^\mathsf{c}\}\)),
\(\displaystyle{\bar{X}(t) = \dfrac{1}{n} \sum_{i = 1}^n X_i(t)}\) is the
empirical estimator of \(\mu_S\) and
\(\displaystyle{||x||_2^2 = \int_{\mathcal{T}} x^2(t) \ \text{d}t}\).
Thus, Frévent et al. (2021a) proposed to use \(F_n^{(w)}\) as a concentration
index and the proposed parametric functional spatial scan statistic
(PFSS) is
\(\Lambda_{\text{PFSS}} = \underset{w \in \mathcal{W}}{\max} \ F_n^{(w)}\).
This method gives high powers and F-measures on normal data but as in
the multivariate framework the parametric method faces problems when the
data is not normal. Smida et al. (2022) proposed a nonparametric spatial scan
statistic for functional data based on a functional
Wilcoxon-Mann-Whitney test (Chakraborty and Chaudhuri 2014).
Here \(X\) is a process of a smooth Banach space \(\chi\), with a Gâteaux
differentiable norm \(||.||_{\chi}\). Let denote \(P_w\) and
\(P_{w^\mathsf{c}}\) the probability measures of \(X\) in \(w\) and in
\(w^\mathsf{c}\) respectively, then \(\mathcal{H}_0\) corresponds to:
\(\mathcal{H}_0: \forall w \in \mathcal{W}, \ P_w = P_{w^\mathsf{c}}\) and
the alternative hypothesis associated with a potential cluster \(w\) can
be rewritten as
\(\mathcal{H}_1^{(w)}: P_w(X) = P_{w^\mathsf{c}}(X-\Delta), \ \Delta \in \chi \backslash \{0\}\).
Chakraborty and Chaudhuri (2014) defined the sign function in the functional framework as
\[\forall h \in \chi, \ \text{Sgn}_X(h) = \left\{ \begin{array}{cl} \underset{v \rightarrow 0^+}{\lim} \ \dfrac{|| X + vh ||_\chi - ||X||_\chi}{v} &\text{ if } X \neq 0 \\
0 &\text{ if } X = 0 \end{array} \right. .\]
Then they proposed the following test statistic:
\[T_{WMW}(w) = \dfrac{1}{|w| |w^\mathsf{c}|} \sum_{i, s_i \in w} \sum_{j, s_j \in w^\mathsf{c}} \text{Sgn}_{X_j - X_i}.\]
Under \(\mathcal{H}_0\), if \(\dfrac{|w|}{n} \rightarrow \gamma \in [0;1]\)
as \(|w|, |w^\mathsf{c}| \rightarrow \infty\),
\(\sqrt{\dfrac{|w| |w^\mathsf{c}|}{n}} T_{WMW}(w)\) converges weakly to a
distribution that does not depend on \(|w|\). Thus Smida et al. (2022) proposed
to use
\(U(w) = \bigg| \bigg| \sqrt{\dfrac{|w| |w^\mathsf{c}|}{n}} T_{WMW}(w) \bigg| \bigg|\)
as a concentration index: the nonparametric functional scan statistic
(NPFSS) is
\(\Lambda_{\text{NPFSS}} = \underset{w \in \mathcal{W}}{\max} \ U(w)\).
It should be noticed that although Smida et al. (2022) only studied the
performances of the NPFSS in the univariate functional framework, their
method is also applicable on multivariate functional data as shown by
Frévent et al. (2021b).
Frévent et al. (2021a) also proposed to combine the distribution-free
spatial scan statistic for univariate data proposed by
Cucala (2014) and the max statistic of Lin et al. (2021). They supposed
that for each time \(t\), \(\mathbb{V}[X_i(t)] = \sigma^2(t)\) for all
\(i \in \left[\!\left[ 1 ; n \right]\!\right]\). Then for each \(t\), the
concentration index proposed by Cucala (2014) to test
\(\mathcal{H}_0: \forall w \in \mathcal{W}, \mu_w(t) = \mu_{w^\mathsf{c}}(t) = \mu_S(t)\)
was
\[I^{(w)}(t) = \dfrac{|\bar{X}_w(t) - \bar{X}_{w^\mathsf{c}}(t)|}{\sqrt{\hat{\mathbb{V}}[\bar{X}_w(t) - \bar{X}_{w^\mathsf{c}}(t)]}},\]
where
\(\hat{\mathbb{V}}[\bar{X}_w(t) - \bar{X}_{w^\mathsf{c}}(t)] = \widehat{\sigma^2}(t) \left[ \dfrac{1}{|w|} + \dfrac{1}{|w^\mathsf{c}|} \right]\),
\(\displaystyle{\widehat{\sigma^2}(t) = \dfrac{1}{n-2} \left[ \sum_{i, s_i \in w} (X_i(t) - \bar{X}_w(t))^2 + \sum_{i, s_i \in w^\mathsf{c}} (X_i(t) - \bar{X}_{w^\mathsf{c}}(t))^2 \right]}\).
Then the idea is to globalize the information by maximizing the previous
quantity over the time for each potential cluster \(w\), as suggested by
Lin et al. (2021):
\[I^{(w)} = \underset{t \in \mathcal{T}}{\sup} \ I^{(w)}(t).\]
For cluster detection, as for the PFSS, the null hypothesis
\(\mathcal{H}_0\) (the absence of cluster) is defined as follows:
\(\mathcal{H}_0: \forall w \in \mathcal{W}, \ \mu_{w} = \mu_{w^\mathsf{c}} = \mu_S\).
And the alternative hypothesis \(\mathcal{H}_1^{(w)}\) associated with a
potential cluster \(w\) can be defined as follows:
\(\mathcal{H}_1^{(w)}: \mu_{w} \neq \mu_{w^\mathsf{c}}\).
Frévent et al. (2021a) considered \(I^{(w)}\) as a concentration index and
maximized it over the set of potential clusters \(\mathcal{W}\) yielding
to the following distribution-free functional spatial scan statistic
(DFFSS):
\(\Lambda_{\text{DFFSS}} = \underset{w \in \mathcal{W}}{\max} \ I^{(w)}\).
A pointwise approach based on ranks and on the nonparametric scan statistic for univariate data (Jung and Cho 2015) can be proposed in the univariate functional framework by adapting the approach of Frévent et al. (2021b).
For a time \(t\), Jung and Cho (2015) proposed to test
\(\mathcal{H}_0: \forall w \in \mathcal{W}, F_{w,t} = F_{w^\mathsf{c},t}\)
where \(F_{w,t}\) and \(F_{w^\mathsf{c},t}\) are the cumulative distribution
functions of \(X(t)\) in \(w\) and outside \(w\), by using the Wilcoxon
rank-sum test statistic. For a time \(t\) and a potential cluster \(w\), the
Wilcoxon rank-sum test statistic is
\(\displaystyle{W(t)^{(w)} = \sum_{i, s_i \in w} R_i(t)}\) where \(R_i(t)\)
is the rank of \(X_i(t)\) in \(\{ X_1(t), \dots, X_n(t) \}\), using the
average rank in the case of tied observations.
Then the standardized version of this statistic is
\[T(t)^{(w)} = \dfrac{W(t)^{(w)} - \mathbb{E}[W(t)^{(w)}]}{\sqrt{\mathbb{V}[W(t)^{(w)}]}}\]
where \(\mathbb{E}[W(t)^{(w)}] = \dfrac{|w|(n+1)}{2}\) and
\(\mathbb{V}[W(t)^{(w)}] = \dfrac{|w| |w^\mathsf{c}| (n+1)}{12}\) are the
expected value and the variance of \(W(t)^{(w)}\) under \(\mathcal{H}_0\).
Jung and Cho (2015) proposed to minimize the p-value associated
with \(T(t)^{(w)}\) on the set of potential clusters \(\mathcal{W}\). We
propose to adapt their approach by simply using \(|T(t)^{(w)}|\) as a
pointwise statistic.
In the context of cluster detection, the null hypothesis is defined as
\(\mathcal{H}_0\):
\(\forall w \in \mathcal{W}, \ \forall t \in \mathcal{T}, \ F_{w,t} = F_{w^\mathsf{c},t}\).
The alternative hypothesis \(\mathcal{H}_1^{(w)}\) associated with a
potential cluster \(w\) is \(\mathcal{H}_1^{(w)}\):
\(\exists t \in \mathcal{T}, \ F_{w,t}(x) = F_{w^\mathsf{c},t}(x-\Delta_t)\),
\(\Delta_t \neq 0\).
As before, we propose to globalize the information over the time with
\(T^{(w)} = \underset{t \in \mathcal{T}}{\sup} \ |T(t)^{(w)}|\) and to use
this quantity as a concentration index, yielding to the following
univariate rank-based functional spatial scan statistic (URBFSS):
\(\Lambda_{\text{URBFSS}} = \underset{w \in \mathcal{W}}{\max} \ T^{(w)}\).
Here we consider the case where several continuous variables are
observed simultaneously in each spatial unit over time:
\(\{ X(t), \ t \in \mathcal{T} \}\) is a \(p\)-dimensional vector-valued
stochastic process (\(p \ge 2\)) where \(\mathcal{T}\) is an interval of
\(\mathbb{R}\). The objective is to detect multivariate functional spatial
clusters that are aggregations of sites in which the curves are higher
or lower than elsewhere. For example we can observe the concentration of
several pollutants over time in different locations. Thus at each
location we observe several processes (air pollutant concentrations) and
these processes can be correlated. In this context a cluster is an
aggregation of sites overexposed or underexposed to multiple pollutants
over time. Several methods will be presented: a parametric method based
on a functional MANOVA, a distribution-free approach based on a
pointwise Hotelling \(T^2\)-test and finally a pointwise rank-based
method. On normal data, all approaches show high power and high true
positive rates for non localized clusters in time. However the
performances of the methods based on the MANOVA and the Hotelling
\(T^2\)-test decrease on non-normal data. For localized clusters in time
the pointwise approaches should be favored, especially the pointwise
rank-based method on non-Gaussian data. By localized clusters in time we
mean aggregations of sites that take higher or lower values for \(X\) only
in a small interval of time (an interval of five days over a study
period of one month for example).
Figure 4 summarizes the different types of
multivariate functional data with examples, and provides guidelines on
the spatial scan statistics methods to be used for these data (and the
argument to use in the scan function of the package). To be more
precise, we can distinguish lattice functional data which are aggregated
data for example at the scale of the regions of a country (unemployment
rate and fraction of the population that has not graduated from high
school, over time, for example), geostatistical functional data which
are defined on a continuous space (temperature, sunshine, and
atmospheric pressure over time) although they are observed only at
discrete sites, and marked point data for which the location is random
(for example the distribution of trees in a forest) and we observe at
each location the circumference and height of the tree over time for
example. As previously mentioned, to detect spatial clusters, in the
case of Gaussian data we will prefer the multivariate distribution-free
functional spatial scan statistic (MDFFSS) and for non-Gaussian data we
will use the pointwise rank-based approach (MRBFSS).
Here, the process \(X\) is supposed to take values in a semi-metric space, in particular the Hilbert space \(\mathcal{L}^2(\mathcal{T}, \mathbb{R}^p)\) of \(p\)-dimensional vector-valued square-integrable functions on \(\mathcal{T}\), equipped with the inner product \(\displaystyle{\langle X, Y \rangle = \int_{\mathcal{T}} X(t)^\top Y(t) \ \text{d}t}\).
Frévent et al. (2021b) proposed a parametric scan statistic for
multivariate functional data based on a functional MANOVA
Lawley–Hotelling trace test (Górecki and Smaga 2017).
In this context, the null hypothesis \(\mathcal{H}_0\) is
\(\mathcal{H}_0: \forall w \in \mathcal{W}, \ \mu_{w} = \mu_{w^\mathsf{c}} = \mu_S\),
where \(\mu_w\), \(\mu_{w^\mathsf{c}}\) and \(\mu_S\) stand for the mean
functions in \(w\), outside \(w\) and over \(S\), respectively. And the
alternative hypothesis \(\mathcal{H}_1^{(w)}\) associated with a potential
cluster \(w\) is \(\mathcal{H}_1^{(w)}: \mu_{w} \neq \mu_{w^\mathsf{c}}\).
Górecki and Smaga (2017) presented the following adaptation of the
Lawley-Hotelling trace test statistic:
\[\mathrm{LH}^{(w)} = \text{Trace}(H_w E_w^{-1})\]
where
\(\displaystyle{H_w = |w| \int_{\mathcal{T}} \left[\bar{X}_w(t) - \bar{X}(t) \right] \left[\bar{X}_w(t) - \bar{X}(t) \right]^\top \ \text{d}t + |w^\mathsf{c}| \int_{\mathcal{T}} \left[\bar{X}_{w^\mathsf{c}}(t) - \bar{X}(t) \right] \left[\bar{X}_{w^\mathsf{c}}(t) - \bar{X}(t) \right]^\top \ \text{d}t}\)
and
\(\displaystyle{E_w = \sum_{j, s_j \in w} \int_{\mathcal{T}} \left[X_j(t) - \bar{X}_w(t) \right] \left[X_j(t) - \bar{X}_w(t) \right]^\top \text{d}t + \sum_{j, s_j \in w^\mathsf{c}} \int_{\mathcal{T}} \left[X_j(t) - \bar{X}_{w^\mathsf{c}}(t) \right] \left[X_j(t) - \bar{X}_{w^\mathsf{c}}(t) \right]^\top \text{d}t}\)
with
\(\displaystyle{\bar{X}_{g}(t) = \dfrac{1}{|g|} \sum_{i, s_i \in g} X_i(t)}\)
the empirical estimators of \(\mu_g(t)\) for \(g \in \{w, w^\mathsf{c}\}\)
and \(\displaystyle{\bar{X}(t) = \dfrac{1}{n} \sum_{i = 1}^n X_i(t)}\) the
empirical estimator of \(\mu_S(t)\).
Then Frévent et al. (2021b) considered \(\mathrm{LH}^{(w)}\) as a
concentration index and proposed the parametric multivariate functional
spatial scan statistic (MPFSS):
\(\Lambda_{\text{MPFSS}} = \underset{w \in \mathcal{W}}{\max} \ \mathrm{LH}^{(w)}\).
In fact Górecki and Smaga (2017) proposed four test statistics using the matrices \(H_w\) and \(E_w\) to compare the mean functions in \(w\) and \(w^\mathsf{c}\): (1) the Lawley–Hotelling trace test statistic \(\text{LH}^{(w)} = \text{Trace}(H_w E_w^{-1})\), (2) the Pillai trace test statistic \(\text{P}^{(w)} = \text{Trace}(H_w(H_w+E_w)^{-1})\), (3) the Roy’s largest root test statistic \(\text{R}^{(w)} = \lambda_{\text{max}}(H_w E_w^{-1})\) where \(\lambda_{\text{max}}(H_w E_w^{-1})\) is the maximum eigenvalue of \(H_w E_w^{-1}\) and (4) the Wilks lambda test statistic \(\text{W}^{(w)} = \dfrac{\text{det}(E_w)}{\text{det}(H_w+E_w)}\).
Thus each of these quantities (or the opposite for the Wilks lambda test statistic) can be considered as a concentration index and maximized over \(\mathcal{W}\) which results in the following parametric multivariate functional spatial scan statistics:
\[\Lambda_{\text{LH}} = \underset{w \in \mathcal{W}}{\max} \ \text{LH}^{(w)}, \ \ \Lambda_{\text{P}} = \underset{w \in \mathcal{W}}{\max} \ \text{P}^{(w)}, \ \ \Lambda_{\text{R}} = \underset{w \in \mathcal{W}}{\max} \ \text{R}^{(w)}, \ \ \Lambda_{\text{W}} = \underset{w \in \mathcal{W}}{\min} \ \text{W}^{(w)}.\]
These four approaches are implemented in the package HDSpatialScan.
Frévent et al. (2021b) proposed a distribution-free spatial scan statistic
for multivariate functional data which is the counterpart of the
distribution-free spatial scan statistic for univariate functional data
developed by Frévent et al. (2021a). They supposed that for each time \(t\),
\(\mathbb{V}[X_i(t)] = \Sigma(t,t)\) for all
\(i \in \left[\!\left[ 1 ; n \right]\!\right]\), where \(\Sigma\) is a \(p \times p\)
covariance matrix function.
Thus, as previously, in the context of cluster detection, the null
hypothesis \(\mathcal{H}_0\) can be defined as follows:
\(\mathcal{H}_0: \forall w \in \mathcal{W}, \ \mu_{w} = \mu_{w^\mathsf{c}} = \mu_S\),
where \(\mu_w\), \(\mu_{w^\mathsf{c}}\) and \(\mu_S\) stand for the mean
functions in \(w\), outside \(w\) and over \(S\), respectively. And the
alternative hypothesis \(\mathcal{H}_1^{(w)}\) associated with a potential
cluster \(w\) can be defined as follows:
\(\mathcal{H}_1^{(w)}: \mu_{w} \neq \mu_{w^\mathsf{c}}\). Next, Qiu et al. (2021)
proposed to compare the mean function \(\mu_w\) in \(w\) with the mean
function \(\mu_{w^\mathsf{c}}\) in \(w^\mathsf{c}\) by using the following
statistic:
\[T_{n,\text{max}}^{(w)} = \underset{t \in \mathcal{T}}{\sup} \ T_n(t)^{(w)}\]
where \(T_n(t)\) is a pointwise statistic defined by the Hotelling
\(T^2\)-test statistic
\[T_n(t)^{(w)} = \dfrac{|w| |w^\mathsf{c}|}{n} \left(\bar{X}_w(t) - \bar{X}_{w^\mathsf{c}}(t) \right)^\top \hat{\Sigma}(t,t)^{-1} \left(\bar{X}_w(t) - \bar{X}_{w^\mathsf{c}}(t) \right).\]
\(\bar{X}_w(t)\) and \(\bar{X}_{w^\mathsf{c}}(t)\) are the empirical
estimators of the mean functions defined previously, and
\(\displaystyle{\hat{\Sigma}(s,t) = \dfrac{1}{n-2} \left[ \sum_{i, s_i \in w} \left(X_i(s) - \bar{X}_w(s) \right) \left(X_i(t) - \bar{X}_w(t) \right)^\top + \sum_{i, s_i \in w^\mathsf{c}} \left(X_i(s) - \bar{X}_{w^\mathsf{c}}(s) \right) \left(X_i(t) - \bar{X}_{w^\mathsf{c}}(t) \right)^\top \right]}\)
is the pooled sample covariance matrix function.
Then Frévent et al. (2021b) proposed to use \(T_{n,\text{max}}^{(w)}\) as a
concentration index and to maximize it over the set of potential
clusters \(\mathcal{W}\): the multivariate distribution-free functional
spatial scan statistic (MDFFSS) is
\(\Lambda_{\text{MDFFSS}} = \underset{w \in \mathcal{W}}{\max} \ T_{n,\text{max}}^{(w)}\).
Finally Frévent et al. (2021b) also proposed to consider as a pointwise
test statistic the multivariate extension of the Wilcoxon rank-sum test
statistic developed by Oja and Randles (2004) and detailed in
Section 2.1. They defined the pointwise multivariate
ranks as
\(\displaystyle{R_i(t) = \dfrac{1}{n} \sum_{j=1}^{n} \text{sgn}(A_X(t)(X_i(t) - X_j(t)))}\)
where the pointwise transformation matrix \(A_X(t)\) is so that
\(\displaystyle{\dfrac{p}{n}\sum_{i=1}^{n} R_i(t) R_i(t)^\top = \dfrac{1}{n} \sum_{i=1}^{n} R_i(t)^\top R_i(t) I_p}\),
and the sgn function is the same as in Section 2.1.
Then for each time \(t\), the pointwise multivariate extension of the
Wilcoxon rank-sum test statistic is defined as
\(\displaystyle{W(t)^{(w)} = \dfrac{pn}{\sum_{i=1}^{n} R_i(t)^\top R_i(t)} \left[ \ |w| \ ||\bar{R}_w(t) ||_2^2 + |w^\mathsf{c}| \ ||\bar{R}_{w^\mathsf{c}}(t) ||_2^2 \ \right]}\)
where
\(\displaystyle{\bar{R}_{g}(t) = \dfrac{1}{|g|} \sum_{i, s_i \in g} R_i(t) \ (g \in \{ w, w^\mathsf{c} \})}\)
.
In the context of cluster detection, the null hypothesis is defined as
\(\mathcal{H}_0\):
\(\forall w \in \mathcal{W}, \ \forall t \in \mathcal{T}, \ F_{w,t} = F_{w^\mathsf{c},t}\)
where \(F_{w,t}\) and \(F_{w^\mathsf{c},t}\) correspond respectively to the
cumulative distribution functions of \(X(t)\) in \(w\) and outside \(w\). The
alternative hypothesis \(\mathcal{H}_1^{(w)}\) associated with a potential
cluster \(w\) is \(\mathcal{H}_1^{(w)}\):
\(\exists t \in \mathcal{T}, \ F_{w,t}(x) = F_{w^\mathsf{c},t}(x-\Delta_t)\),
\(\Delta_t \neq 0\).
Finally Frévent et al. (2021b) proposed to globalize the information over
the time with the quantity
\(W^{(w)} = \underset{t \in \mathcal{T}}{\sup} \ W(t)^{(w)}\) and to use
it as a concentration index to be maximized over the set of potential
clusters \(\mathcal{W}\). The multivariate rank-based functional spatial
scan statistic (MRBFSS) is then
\(\Lambda_{\text{MRBFSS}} = \underset{w \in \mathcal{W}}{\max} \ W^{(w)}.\)
Once the most likely cluster (MLC) is detected, its statistical significance must be evaluated. The distribution of the scan statistic \(\mathcal{S}\) (\(\mathcal{S}\) = \(\lambda_{\text{MG}}\), \(\lambda_{\text{MNP}}\), \(\Lambda_{\text{PFSS}}\), \(\Lambda_{\text{NPFSS}}\), \(\Lambda_{\text{DFFSS}}\), \(\Lambda_{\text{URBFSS}}\), \(\Lambda_{\text{MPFSS}}\), \(\Lambda_{\text{MDFFSS}}\) or \(\Lambda_{\text{MRBFSS}}\)) is intractable under \(\mathcal{H}_0\) due to the overlapping nature of \(\mathcal{W}\). Then we choose to obtain a large set of simulated datasets by randomly permuting the observations \(X_i\) in the spatial locations. This technique was already used in spatial scan statistics (Kulldorff et al. 2009; Cucala et al. 2017).
Let \(M\) denote the number of random permutations of the original dataset
and \(\mathcal{S}^{(1)},\dots,\mathcal{S}^{(M)}\) be the observed scan
statistics on the simulated datasets. According to Dwass (1957) the p-value
for \(\mathcal{S}\) observed in the real data is estimated by
\[\hat{p} = \dfrac{1 + \sum_{m=1}^M \mathbb{1}_{\mathcal{S}^{(m)} \ge \mathcal{S}}}{M+1}.\]
Finally, the MLC is considered to be statistically significant if the
associated \(\hat{p}\) is less than the type I error.
According to Cucala et al. (2019) the MNP method tends to present a better power and higher true positive rates for non-Gaussian data than the MG one. Although the false positive rates are often higher for this approach than the MG one, it remains moderate. The same conclusions are true for the UG (Kulldorff et al. 2009) and the UNP (Jung and Cho 2015) which are their particular counterparts in the case of a single variable. In the functional framework, the approaches that present the best results are the DFFSS and the URBFSS in the univariate context and the MDFFSS and the MRBFSS in the multivariate one (Frévent et al. 2021a,2021b). The URBFSS and the MRBFSS tend to show higher powers and higher true positive rates although they detect more false positives than the DFFSS and the MDFFSS respectively. Table 1 summarizes the methods and their performances. The symbols \(\checkmark\) and X indicate respectively a high and a low performance on the criterion. If there is no symbol it means that for this criterion the approach offers medium performances. The terminology “localized clusters in time” in the functional cases refers to aggregations of sites that take higher or lower values for the process only in a small interval of time (an interval of five days over a study period of one month for example). Table 2 gives an idea of the computation time of the different scanning methods proposed by the package. It should be noted that the computation time of the different spatial scan statistics methods is dependent on the size of the datasets, and in particular a function of the number of sites, the number of observation times and the number of variables considered.
Gaussian distribution | Non-Gaussian distribution | ||||||
Method | Power | True positive rate | False positive rate | Power | True positive rate | False positive rate | |
Univariate data | Univariate data | ||||||
UG\(^a\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | X | X | \(\checkmark\) | |
UNP\(^a\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
Multivariate data | Multivariate data | ||||||
MG\(^b\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | X | X | \(\checkmark\) | |
MNP\(^b\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
Functional univariate data | Functional univariate data | ||||||
Non localized clusters in time | Non localized clusters in time | ||||||
PFSS\(^c\) | \(\checkmark\) | X | \(\checkmark\) | ||||
DFFSS\(^d\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |
NPFSS\(^e\) | \(\checkmark\) | \(\checkmark\) | |||||
URBFSS\(^f\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
Localized clusters in time | Localized clusters in time | ||||||
PFSS\(^c\) | X | X | X | X | |||
DFFSS\(^d\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |
NPFSS\(^e\) | X | X | |||||
URBFSS\(^f\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |
Functional multivariate data | Functional multivariate data | ||||||
Non localized clusters in time | Non localized clusters in time | ||||||
MPFSS\(^c\) | \(\checkmark\) | X | X | \(\checkmark\) | |||
MDFFSS\(^d\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
NPFSS\(^e\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | ||||
MRBFSS\(^f\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
Localized clusters in time | Localized clusters in time | ||||||
MPFSS\(^c\) | X | X | X | X | \(\checkmark\) | ||
MDFFSS\(^d\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
NPFSS\(^e\) | X | X | |||||
MRBFSS\(^f\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
\(^a\) The Univariate Gaussian (UG) and the Univariate Nonparametric (UNP)
spatial scan statistics
\(^b\) The Multivariate Gaussian (MG) and the Multivariate Nonparametric
(MNP) spatial scan statistics
\(^c\) The Parametric Functional (PFSS) and the Multivariate Parametric
Functional (MPFSS) spatial scan statistics
\(^d\) The Distribution-free Functional (DFFSS) and the Multivariate
Distribution-free Functional (MDFFSS) spatial Scan Statistics
\(^e\) The Nonparametric Functional (NPFSS) spatial Scan Statistic
\(^f\) The Univariate Rank-based Functional (URBFSS) and the Multivariate
Rank-based Functional (MRBFSS) spatial Scan Statistics
Method | Computation time (in s) | |||
Mean | Standard deviation | Minimum | Maximum | |
Univariate data | ||||
UG\(^a\) | 4.31 | 0.18 | 4.01 | 4.77 |
UNP\(^a\) | 3.07 | 0.12 | 2.77 | 3.5 |
Multivariate data | ||||
MG\(^b\) | 253.04 | 32.93 | 231.34 | 310.91 |
MNP\(^b\) | 14.6 | 1.44 | 13.48 | 17.77 |
Functional univariate data | ||||
PFSS\(^c\) | 113.51 | 8.08 | 109.47 | 132.58 |
DFFSS\(^d\) | 55.99 | 4.93 | 52.52 | 66.76 |
NPFSS\(^e\) | 12.91 | 1.23 | 11.74 | 15.6 |
URBFSS\(^f\) | 17.93 | 1.37 | 16.93 | 22.53 |
Functional multivariate data | ||||
MPFSS\(^c\) | 72.52 | 9.21 | 65.23 | 100.17 |
MDFFSS\(^d\) | 182.95 | 21.54 | 166.27 | 227.84 |
NPFSS\(^e\) | 22.95 | 1.7 | 21.77 | 28 |
MRBFSS\(^f\) | 244.04 | 21.83 | 234.05 | 305.56 |
\(^a\) The Univariate Gaussian (UG) and the Univariate Nonparametric (UNP)
spatial scan statistics
\(^b\) The Multivariate Gaussian (MG) and the Multivariate Nonparametric
(MNP) spatial scan statistics
\(^c\) The Parametric Functional (PFSS) and the Multivariate Parametric
Functional (MPFSS) spatial scan statistics
\(^d\) The Distribution-free Functional (DFFSS) and the Multivariate
Distribution-free Functional (MDFFSS) spatial Scan Statistics
\(^e\) The Nonparametric Functional (NPFSS) spatial Scan Statistic
\(^f\) The Univariate Rank-based Functional (URBFSS) and the Multivariate
Rank-based Functional (MRBFSS) spatial Scan Statistics
The package HDSpatialScan provides a function SpatialScan
to compute
all the spatial scan statistics. The user chooses the method to apply by
specifying the method
argument: "MG"
and "MNP"
apply respectively
the parametric and nonparametric spatial scan statistics approaches on
multivariate data. Their univariate counterparts (when \(p=1\)) can be
computed with "UG"
and "UNP"
respectively. Then "PFSS"
, "DFFSS"
and "URBFSS"
apply the parametric, the distribution-free and the new
rank-based functional approaches on univariate functional data, and
"MPFSS"
, "MDFFSS"
, and "MRBFSS"
are their multivariate
counterparts. Finally "NPFSS"
applies the nonparametric spatial scan
statistic for functional data developed by Smida et al. (2022) on both
univariate and multivariate functional data.
Depending on the type of approach (univariate, multivariate, functional
univariate or functional multivariate), the data must be formatted in a
specific way. For univariate approaches, the data must be a vector in
which each element corresponds to a site. If the data is individual and
many individuals share the same site, the data can remain in an
individual format with one element of the vector per individual. Then
for real-valued multivariate methods or functional univariate methods,
the data must be a matrix in which each row corresponds to a site (or an
individual) and each column corresponds to a variable or an observation
time in the functional framework. For multivariate functional methods
the data must be a list in which each element is a matrix corresponding
to a site (or an individual). In the matrices, the rows correspond to
the variables and the columns to the observation times. Note that the
observation times must be the same for each site or individual and they
must be equally spaced for the methods "NPFSS"
, "PFSS"
and
"MPFSS"
. However if it is not the case of the raw data, they can be
easily transformed by smoothing the data (Ramsay and Silverman 2005), by using for
example the R package fda
(Ramsay et al. 2020).
The most important parameter is the method
argument which has already
been presented previously and allows to choose the spatial scan
statistics to be applied. Note that you can choose one or more methods.
Supplying "MPFSS"
automatically computes the four strategies for the
multivariate parametric functional spatial scan statistic (the
Lawley-Hotelling trace (LH), the Roy’s largest root (R), the Pillai’s
trace (P) and the Wilks’ lambda (W)). If you only want the
Lawley-Hotelling trace for example, you can simply supply "MPFSS-LH"
.
Although the Lawley-Hotelling trace test is the most used statistic
(Oja and Randles 2004), it should be noted that all these methods usually provide very
similar results.
The other arguments are data
, sites_coord
, system
, mini
, maxi
,
type_minimaxi
, mini_post
, maxi_post
, type_minimaxi_post
,
sites_areas
, MC
, typeI
, nbCPU
, variable_names
and times
.
Note that nbCPU
will be ignored for the methods "UG"
and "UNP"
,
variable_names
is ignored for the univariate and univariate functional
scan statistics and times
is ignored for non-functional scan
statistics.
The argument data
, is the data vector, matrix or list on which the
approaches must be applied. MC
and typeI
correspond respectively to
the number of permutations of the data while computing the statistical
significance of the clusters and the type I error i.e. a cluster is
declared significant if its estimated p-value is below this threshold.
The arguments sites_coord
and system
are respectively a matrix of
two columns corresponding to the coordinates of each site or individual,
and to the system of coordinates (“Euclidean” or “WGS84”).
The sites_areas
argument is optional and corresponds to the areas of
the sites (or the site of each individual if the data is individual).
The argument nbCPU
permits to do parallelization and the arguments
mini
, maxi
, type_minimaxi
, mini_post
, maxi_post
,
type_minimaxi_post
are described further below.
variable_names
is simply the names of the variables (in the same order
as in the data) for multivariate or multivariate functional scan
statistics and times
corresponds to the times of observation, they
must be numeric.
The clusters are computed automatically as circular clusters, so we need
to define a minimum and a maximum size for these clusters. That is what
we call “a priori filtering” and this allows to control the
computation time. Three types of a priori filtering are possible
through the argument type_minimaxi
: "sites/indiv"
(the filtering is
applied on the number of sites or individuals in the potential clusters,
it is the default value), "area"
(it is applied on the area of the
clusters and is available only if sites_areas
is provided), or
"radius"
(the radius of the clusters).
The arguments mini
and maxi
are then respectively the minimum number
of sites/individuals, or the minimal area or radius and the maximum
number of sites/individuals, or the maximal area or radius. For the
radius it is specified in km if system
is "WGS84"
or in the user
units if system
is "Euclidean"
.
It should be noted that this filtering can bias the p-values obtained
for the clusters. In order to perform a correct statistical inference,
Kulldorff and Nagarwalla (1995) recommended to consider a maximum size of half the study
region. Thus the default setting is to consider potential clusters
comprising at least one site and at most 50% of the sites
(Equation (1)). If you want to select clusters according
to size (number of sites or individuals), area or radius, it is better
to select them a posteriori among the detected clusters and if you
really want to decrease the computation time we recommend to increase
the number of CPU (with the argument nb_CPU
). Changing the default
settings can allow the user to investigate whether there appear to be
clusters in a relatively quick first step, although the inference is
biased, before applying the scan procedure with the default settings for
the a priori filtering (50% of the studied region).
Sometimes after that the p-value of each potential cluster has been
computing, the user may want to retrieve only the significant clusters
that satisfy a certain size, area, or radius criteria. That is what we
call a posteriori filtering. The corresponding arguments are
mini_post
, maxi_post
and type_minimaxi_post
and their definitions
are the same as mini
, maxi
and type_minimaxi
. If the user only
wants to obtain clusters meeting size criteria, this a posteriori
approach must be prioritized over the a priori approach which gives
biased results and must therefore be used with great care.
The function SpatialScan
returns a list of object of class
"ResScanOutput"
which is composed of many elements. The element
sites_clusters
is a list in which each element corresponds to a
significant cluster and contains the index of the sites (or the
individuals) included in this cluster. The clusters are listed in their
order of detection. The secondary clusters are defined according to
Kulldorff (1997): they correspond to potential clusters that also
present large values for the concentration index. Their p-values are
calculated as if they were the most likely cluster themselves which is a
bit conservative since the secondary clusters are compared with the most
likely cluster of the permutations (Kulldorff 1997). Finally, only
clusters that are significant at the typeI
threshold and that do not
overlap with a more likely cluster are returned, and pval_clusters
corresponds to the associated p-values. The element centres_clusters
corresponds to the coordinates of the centres of each detected cluster
and radius_clusters
is the radius of the clusters in km if system
is
“WGS84” or in the user units otherwise. areas_clusters
corresponds to
the areas of the clusters (in the same units as sites_areas
). Finally
the system of coordinates, the coordinates of the sites, the data and
the name of the scan procedure are recalled respectively in the elements
system
, sites_coord
, data
and method
.
Depending on the type of the method (univariate, multivariate,
univariate functional or multivariate functional) the objects of class
"ResScanOutput"
are also of class "ResScanOutputUni"
,
"ResScanOutputMulti"
, "ResScanOutputUniFunct"
or
"ResScanOutputMultiFunct"
. The objects of class "ResScanOutputMulti"
and "ResScanOutputMultiFunct"
also include the element
variable_names
, and the objects of class "ResScanOutputUniFunct"
and
"ResScanOutputMultiFunct"
include the element time
.
It is possible to plot the detected clusters by using the classical
plot
function. Depending on the type
parameter, the package
HDSpatialScan provides three different types of plot.
The first one, "map"
, allows the user to plot a map of the sites and
draws the circles corresponding to the circular clusters. The second
one, "map2"
, plots the clusters in colors. For these two types of plot
the argument spobject
which is the spatial object corresponding to the
sites, must be provided. If you do not have this object you can use the
third type "schema"
which simply draws a schema of the sites and the
clusters, with the argument system_conv
which allows to correctly
project the coordinates. It must be entered as in the PROJ documentation
(PROJ contributors 2021).
One may also want to get some features of one’s clusters.
The function summary
allows to get a summary of the clusters, either
the mean and the standard deviation of each of the variables (if many)
if the argument type_summ
is "param"
, or the 25th percentiles, the
medians and the 75th percentiles if the argument type_summ
is
"nparam"
. This function also provides the p-values, the radius and the
area if available (only if sites_areas
is provided) for each cluster
detected.
Other interesting functions are plotCurves
that allows to display
cluster curves (only in the functional case), and plotSummary
which
displays the average (if type = "mean"
) or the median (if
type = "median"
) curves in the clusters, outside and the global mean
or median curves in the functional case. For the multivariate
non-functional framework it displays a spider chart of means or medians
for each variable inside the cluster, outside, or in all the area. Note
that all these functions take an argument only.MLC
which allows to
only plot or summarize the most likely cluster (by setting
only.MLC = TRUE
). Finally the print
function shows the scan
procedure used as well as the number of clusters detected and their
p-value.
To show the simplicity of use of the package, we will apply the different approaches on the environmental data provided in the package. It should be noted that the codes presented in this section represent a total computation time of about one hour on a regular laptop, using 7 cores.
We considered data provided by the French national air quality
forecasting platform PREV’AIR which is available in the package
HDSpatialScan. This lattice data consists in the daily concentrations
(from May 1, 2020 to June 25, 2020) in \(\mu g.m^{-3}\) of four pollutants
for each of the 169 cantons (administrative subdivisions of France) of
the Nord-Pas-de-Calais (a region in northern France) characterized by
spatial polygons and located by their center of gravity
\(s_1, \dots, s_{169}\): nitrogen dioxide (\(\text{NO}_2\)), ozone
(\(\text{O}_3\)) and fine particles \(\text{PM}_{10}\) and \(\text{PM}_{2.5}\)
corresponding respectively to particles whose diameter is less than
\(10 \mu m\) and \(2.5 \mu m\). The package HDSpatialScan provides the
full data: fmulti_data
but also some reduced data for the univariate
functional case which consists in considering only the \(\text{NO}_2\)
concentrations (funi_data
), and for the multivariate non-functional
framework (multi_data
) which corresponds to the temporal mean
concentrations of the four pollutants over the study period.
The first step is to load the data:
library(HDSpatialScan)
data("map_sites")
data("multi_data")
data("funi_data")
data("fmulti_data")
The second step is to visualize the pollutants daily concentration curves in each canton and the spatial distributions of the temporal mean concentrations for each pollutant over the studied time period (Figures 5 and 6). This step allows us to see if sites seem to aggregate and therefore if launching a cluster detection is relevant, and if a temporal variation of the concentrations is visible, in which case a functional method will be more relevant than a multivariate approach summarizing each curve by its mean.
The maps in Figure 6 show a spatial heterogeneity of the average concentration for each pollutant. Thus spatial scan statistics seem to be suitable to highlight the presence of cantons-level spatial clusters of pollutants concentrations. Moreover since the curves in Figure 5 show a marked temporal variability during the period from May 1, 2020 to June 25, 2020 a functional approach is more appropriate. However for sake of completeness we will also perform a multivariate spatial scan statistic approach anyway. Since small clusters of pollution are more relevant for interpretation because the sources of the pollutants are very localized, we will consider an a posteriori filtering of maximum radius equal to 10 km.
First we will investigate a multivariate spatial scan statistic. In this
example the temporal component of the multivariate functional data was
suppressed by averaging the components over the time and we looked for
spatial clusters of the combination of the different air pollutants.
This will pick up areas of multiple exposure to pollutants or, on the
contrary, areas with little pollution. We first checked the normality of
each variable, which has been done by using a histogram and a qqplot.
Since the distribution of the pollutants temporal mean concentrations is
non-normal we decide to apply the MNP scan procedure. Here the system of
coordinates is “WGS84”, it must be filled with the argument system
. As
explained in Section 3.1, (Kulldorff and Nagarwalla 1995)
recommended to consider a maximum size of half the study region for the
potential clusters so we use this a priori filtering with the
parameters mini
, maxi
and type_minimaxi
: the potential clusters
are circular and they contain between 1 and 50% of the sites. Then as
noticed in Section 4.1, we will apply an a
posteriori filtering of maximum radius equal to 10 km (arguments
mini_post
, maxi_post
and type_minimaxi_post
). Here we only want to
consider the significant clusters at the 5% threshold. Thus we leave the
typeI
parameter at its default value (0.05). However it should be
noted that it is possible to obtain all the clusters (the MLC and the
secondary clusters (Kulldorff 1997)) by setting the typeI
value at
1.
library(sp)
<- coordinates(map_sites)
coords <- SpatialScan(method = "MNP", data = multi_data, sites_coord = coords,
res_mnp + system = "WGS84", mini = 1, maxi = nrow(coords)/2, type_minimaxi = "sites/indiv",
+ mini_post = 0, maxi_post = 10, type_minimaxi_post = "radius",
+ nbCPU = 7, MC = 99, variable_names = c("NO2", "O3", "PM10", "PM2.5"))$MNP
Once the scan procedure is completed, the plot function can be used. For
brevity, we only focus on the MLC and for the sake of completeness we
will show the use of the three possible visualizations of the clusters.
Since we have a spatial object map_sites
we can use the types "map"
and "map2"
. However for sake of completeness we also show the use of
"schema"
which allows to display the clusters otherwise
(Figure 7). For the latter, since the system of the
coordinates is “WGS84”, the plot function requires to complete the
parameter system_conv
which allows to correctly project the points.
Here we choose the EPSG code 2154 corresponding to the Lambert 93
projection since the data is located in metropolitan France.
plot(x = res_mnp, type = "map", spobject = map_sites, only.MLC = TRUE)
plot(x = res_mnp, type = "map2", spobject = map_sites, only.MLC = TRUE)
plot(x = res_mnp, type = "schema", system_conv = "+init=epsg:2154", only.MLC = TRUE)
Finally users may want to get some summarized characteristics, such as
the quantiles of the variables. This can be achieved by using the
function summary
with the argument type_summ
equal to "nparam"
(for the quantiles):
summary(res_mnp, type_summ = "nparam", only.MLC = TRUE)
## $basic_summary
## Cluster 1
## p-value 0.001
## Radius 9.999
##
## $complete_summary
## Overall Inside cluster 1 Outside cluster 1
## Number of sites 169.000 12.000 157.000
## Q25 NO2 8.673 11.327 8.635
## Median NO2 9.183 11.721 9.075
## Q75 NO2 9.848 12.382 9.692
## Q25 O3 66.778 67.527 66.721
## Median O3 67.895 67.609 67.961
## Q75 O3 68.564 67.922 68.658
## Q25 PM10 16.397 17.483 16.205
## Median PM10 16.970 17.877 16.933
## Q75 PM10 17.372 17.962 17.266
## Q25 PM2.5 9.132 10.584 9.113
## Median PM2.5 9.833 10.678 9.790
## Q75 PM2.5 10.213 10.919 10.107
The user can also use the function plotSummary
to display the spider
chart corresponding to the detected cluster
(Figure 8).
plotSummary(res_mnp, type = "median", only.MLC = TRUE)
The MLC is located in the area of Lille. The summary and Figure 8 show that it is a cluster of overpollution (except for the pollutant \(\text{O}_3\)). This cluster is especially characterized by high concentrations of \(\text{NO}_2\) and \(\text{PM}_{2.5}\) which indicates pollution from road traffic and from the residential sector (auxiliary heating in particular). As the adverse health effects of air pollution (and their potential synergistic effect) are well established, such a result could inform local stakeholders about immediate interventions around the area of Lille to reduce the air pollution levels.
We have obtained some first results however the curves on Figure 5 present a marked temporal variability during the study period. Thus it could be interesting to apply functional spatial scan statistics.
Here we only consider the pollutant \(\text{NO}_2\). Applying a spatial
scan statistic for univariate functional data will thus allow to
highlight areas where the \(\text{NO}_2\) concentration curves are
abnormally high or, on the contrary abnormally low. We choose to use the
URBFSS scan procedure since it often presents higher powers and true
positive rates than the other univariate functional methods as its
multivariate counterpart MRBFSS (Frévent et al. 2021b). As mentioned in
Section 4.2 we decide to use the set of potential
clusters a priori in the Equation (1) which corresponds
to the recommended approach of (Kulldorff and Nagarwalla 1995), and to the default
values of the parameters mini
, maxi
and type_minimaxi
in the scan
functions. We also set a maximum radius equal to 10 km a posteriori.
<- SpatialScan(method = "URBFSS", data = funi_data, sites_coord = coords,
res_urbfss + system = "WGS84", mini = 1, maxi = nrow(coords)/2, type_minimaxi = "sites/indiv",
+ mini_post = 0, maxi_post = 10, type_minimaxi_post = "radius",
+ nbCPU = 7, MC = 99)$URBFSS
plot(res_urbfss, type = "map2", spobject = map_sites, only.MLC = TRUE)
Again the MLC is located in the area of Lille (Figure 9).
For functional data another function is provided to give some
characteristics of the clusters: we can visualize the curves in the
cluster by adding the curve of the global median with the function
plotCurves
. The function plotSummary
allows to visualize the median
curves inside and outside the cluster (Figure 10): this
is a cluster of overexposure to \(\text{NO}_2\), which indicates
traffic-related air pollution. Since exposure to pollution impacts
health negatively, these results can be used to intervene to reduce air
pollution.
plotCurves(res_urbfss, add_median = TRUE, only.MLC = TRUE)
plotSummary(res_urbfss, type = "median", only.MLC = TRUE)
Now we consider the four pollutants together. To detect spatial clusters of the combination of the four pollutants considering all available information on the time period, we apply a spatial scan statistic for multivariate functional data. It will identify geographical areas in which one or more of the pollutant concentration curves are abnormally high or abnormally low. For the same reason that we have previously chosen to apply the URBFSS scan procedure, we use the MRBFSS in this context, with the same restrictions a priori and a posteriori as for the MNP and the URBFSS scan approaches.
<- SpatialScan(method = "MRBFSS", data = fmulti_data, sites_coord = coords,
res_mrbfss + system = "WGS84", mini = 1, maxi = nrow(coords)/2, type_minimaxi = "sites/indiv",
+ mini_post = 0, maxi_post = 10, type_minimaxi_post = "radius",
+ nbCPU = 7, MC = 99, variable_names = c("NO2", "O3", "PM10", "PM2.5"))$MRBFSS
plot(res_mrbfss, type = "map2", spobject = map_sites, only.MLC = TRUE)
The detected cluster is exactly the same as before and is therefore located in the urban area of Lille (Figure 11).
Again we will display the curves in the cluster by adding the curve of the global median (Figure 12), as well as the median curves inside and outside the cluster which show that this is a cluster of high concentrations of \(\text{NO}_2\), \(\text{PM}_{10}\) and \(\text{PM}_{2.5}\) (Figure 13). As mentioned in Section 4.2, in environmental science it is well-known that \(\text{NO}_2\) and \(\text{PM}_{2.5}\) are more frequent in urban areas due to road traffic and population density so this is consistent with the cluster observed here. As the adverse health effects of air pollution and the combined effects of air pollutants are well established, this result could enable interventions by local authorities around the Lille area to reduce air pollution.
plotCurves(res_mrbfss, add_median = TRUE, only.MLC = TRUE)
plotSummary(res_mrbfss, type = "median", only.MLC = TRUE)
In this article we presented the HDSpatialScan package. It makes it
very easy to apply the existing scan statistics developed for
multivariate data or functional data (univariate or multivariate), and
the new rank-based scan statistic for univariate functional data
presented in the Section 2. The potential clusters
considered are of variable size and circular. In further updates of the
package HDSpatialScan other shapes of scanning window such as
elliptical or rectangular shapes will be implemented. Our package also
allows to easily plot and summarize the detected clusters. Then examples
of applications of the functions of the package have been shown.
HDSpatialScan presents the advantage that all the scan procedures are
applied using the same function SpatialScan
and it uses the classical
R functions plot
, print
and summary
which makes it very quick to
get started.
HDSpatialScan, rsatscan, SpatialEpi, DCluster, DClusterm, SpatialEpiApp, rflexscan, graphscan, scanstatistics, fda
Epidemiology, FunctionalData, Spatial
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Frévent, et al., "The R Package HDSpatialScan for the Detection of Clusters of Multivariate and Functional Data using Spatial Scan Statistics", The R Journal, 2022
BibTeX citation
@article{RJ-2022-045, author = {Frévent, Camille and Ahmed, Mohamed-Salem and Soula, Julien and Cucala, Lionel and Smida, Zaineb and Dabo-Niang, Sophie and Genin, Michaël}, title = {The R Package HDSpatialScan for the Detection of Clusters of Multivariate and Functional Data using Spatial Scan Statistics}, journal = {The R Journal}, year = {2022}, note = {https://rjournal.github.io/}, volume = {14}, issue = {3}, issn = {2073-4859}, pages = {95-120} }