A deeper understanding of a distribution support, being able to determine regions of a certain (possibly high) probability content is an important task in several research fields. Package HDiR for R is designed for exact computation of directional (circular and spherical) highest density regions and density level sets when the density is fully known. Otherwise, HDiR implements nonparametric plug-in methods based on different kernel density estimates for reconstructing this kind of sets. Additionally, it also allows the computation and plug-in estimation of level sets for general functions (not necessarily a density). Some exploratory tools, such as suitably adapted distances and scatterplots, are also implemented. Two original datasets and spherical density models are used for illustrating HDiR functionalities.
When analyzing a data distribution, it is often the case that for a deeper understanding of the modelling problem, it is interesting to determine regions on the density support exceeding a certain threshold on the density function values. These regions are known as density level sets and, if the density is unknown, such a task can be accomplished from a set estimation perspective. Set estimation deals with the problem of reconstructing a set (or estimating any of its features such as its boundary or its volume) from a random sample of points intimately related to it. Since (Hartigan 1975) establishes the notion of clusters as connected components of a density level set, the reconstruction of this particular type of sets has been widely considered in the literature (mainly for densities supported on an Euclidean space). There are only very few contributions where density level set theory has been extended to more general domains such as the unit sphere or manifolds. (Cuevas et al. 2006) consider the estimation of level sets for general functions (not necessarily a density) such as regression curves, providing some consistency theoretical results and showing a density level set on the sphere for illustration. More recently, the reconstruction of density level sets on manifolds is studied in (Cholaquidis et al. 2022), who also presents some simulations illustrating the performance of their approach on the torus and on the sphere.
Let
Despite a practitioner may be interested in determining this type of
regions, the value of the level
To sum up, given a value of
The contents of this paper, describing the contributions in
HDiR, mainly focus on
computation and plug-in estimation of highest density regions (HDRs) and
density level sets in the circle and the sphere. Although general level
sets can be also analysed using HDiR, we will not formally detail
aspects on their computation and on their plug-in reconstruction given
that they can be seen as a direct generalisation of those introduced for
density level sets by replacing the density by the general function
under study. Therefore, with the objective of showing the capabilities
of the HDiR package for exact computation of directional HDRs and
density level sets when
This section provides a brief background on the design of plug-in tools included in HDiR for directional (circular and spherical) HDR and density level set estimation. Following (Cuevas et al. 2006), if a nonparametric estimator is available for a general function, this methodology may be directly extended for reconstructing the corresponding level sets.
Although there are other nonparametric alternative routes for level set
estimation, the plug-in approach has received considerable attention in
the Euclidean literature (see (Tsybakov 1997),
(Baı́llo 2003), (Mason and Polonik 2009), (Rigollet et al. 2009),
(Mammen and Polonik 2013) or (Chen et al. 2017)). This is with no doubt a
natural methodology, which can be generalized to the directional setting
as in (Saavedra-Nieves and Crujeiras 2021b). Given that level set estimation is a
simpler problem than HDR reconstruction, we will restrict to this last
setting in what follows. Given a random sample
Following (Bai et al. 1989), package HDiR also enables to use any kernel function (not necessarily the von Mises-Fisher density implemented by default). An example where an uniform kernel is considered will be presented later. Even more generally, HDiR would allow the user to define different density estimators that the one introduced in ((4)). See, for instance, (Pelletier 2005).
As for the concentration parameter
For the circular and the spherical densities shown in Figure 1,
now Figure 2 contains the HDR plug-in estimators (bluish
colours) for
If the target is the reconstruction of a HDR or a density level set, the
Hausdorff metric is a suitable error criterion in the directional
setting (see (Cuevas et al. 2006) and (Cholaquidis et al. 2022)). If
For a given
An alternative approach is implemented in the internal function
sphere.integration
of HDiR. Specifically, the proposed numerical
integration procedure on the sphere requires the definition of a
triangular mesh, such as the ones depicted in Figure 3,
obtained from the projection over the sphere of triangular meshes on an
embedded icosaedrum. This type of mesh guarantees that there is not a
prevailing direction. For computing the corresponding spherical
integral, the Cartesian coordinates of the mesh vertices are transformed
into spherical coordinates and standard quadrature formulae are applied
in each triangle over the plane formed by the azimuthal and polar angles
(see (Strang and Fix 1973)).
Package HDiR additionally includes a computationally feasible approach
for estimating
This estimation method presents a lower computational complexity than
numerical integration algorithms. Furthermore, it involves a statistical
approximation. Therefore, it is possible to establish confidence
intervals in order to quantify uncertainty in estimates of
The plug-in reconstruction of the directional HDRs in
((3)) also involves the calculation of the kernel density
estimator in ((4)) that is known to be heavily
dependent on the selection of
The closed expression of the Hausdorff distance between the boundaries
of the HDR and its plug-in reconstruction,
This section presents an overview of the structure of the package. HDiR ((Saavedra-Nieves and Crujeiras 2021a)) is an easy-to-use toolbox that R practitioners can use for computation or plug-in estimation of directional highest density regions and general level sets defined on the circle and sphere. The methods included in the package facilitate both data exploration and nonparametric estimation of the target regions. Functions in this library automatize the required operations for the computation of this kind of sets. First, we will describe the real data sets included in the package. Then, the functions available in HDiR are detailed. Of course, there exist several libraries in the CRAN repository of R dealing with plug-in estimation of Euclidean level sets and HDRs. In particular, the library pdfCluster ((Azzalini et al. 2014)) provides a routine to estimate the probability density function by kernel methods from a set of linear data with arbitrary dimension. The main focus is on cluster analysis via kernel density estimation according to the approach by (Hartigan 1975). For modal clustering, package LPCM ((Einbeck and Evers 2019)) implements the mean-shift algorithm and Modalclust ((Cheng and Ray 2014)) performs the method for mode seeking introduced in (Li et al. 2007). There are also other packages that do not solve the task of estimate HDRs directly, but they usually allow to compute the linear kernel density estimator and, therefore, address HDRs graphical representation (not necessarily with an appropriate estimate). A brief summary of the capabilities of these libraries are provided below.
Other packages such as sm ((Bowman and Azzalini 2018)) and ks ((Duong 2007)) also include tools for kernel density estimation allowing for graphical displays of density contours in the two- and three-dimensional Euclidean spaces. Moreover, there are many libraries in the CRAN repository for directional data analysis but, as far as we know, none of them solves the problem of level set or HDR reconstruction. In this section, we would like to highlight those packages including tools for kernel density estimation, both for circular and directional data:
Note also that there are other packages including tools for circular/directional data analysis. For instance, CircStats ((Lund and Agostinelli 2012)) is a companion to (Jammalamadaka and Sengupta 2001), although functions implemented in this package are also available in circular. CircNNTSR ((Fernández-Durán and Gregorio-Domı́nguez 2013)) provides an alternative estimation method for circular distributions based on nonnegative trigonometric sums. isocir ((Barragán et al. 2013)) implements some routines for analyzing angular data subjected to order constraints on a unit circle. Finally, movMF ((Hornik and Grün 2014)) is focused on mixtures of von Mises distributions, allowing to draw random samples from these models and to proceed with parameter estimation, by using an expectation-maximization algorithm.
Specifically, the goal of HDiR package is to provide tools for
directional (circular and spherical) general level sets and HDRs exact
computation also including their plug-in estimation. This library
implements the first specific bandwidth selector devised for directional
HDRs proposed in (Saavedra-Nieves and Crujeiras 2021b), but it also allows
directly user-defined bandwidth selection and to use the existing
directional bandwidth selection methods devised for kernel density
estimation. Additionally, two alternative methods for estimating the
threshold
Dataset | Description |
---|---|
earthquakes |
Geographical coordinates (latitude and longitude) of earthquakes |
of magnitude greater than or equal to 2.5 degrees between Octo- | |
ber 2004 and April 2020 | |
sandhoppers |
Orientation of two sandhoppers species, Talitrus saltator and Ta- |
lorchestia brito under different natural conditions | |
Function | Description |
circ.boot.bw |
Circular bootstrap bandwidth for HDRs estimation |
circ.distances |
Euclidean and Hausdorff distances between two sets of points on |
the unit circle | |
circ.hdr |
Computation of HDRs and general level sets for a given circular |
real-valued function | |
circ.plugin.hdr |
Circular plug-in estimation of HDRs and level sets and confiden- |
ce regions | |
circ.scatterplot |
Circular scatterplot for plug-in HDRs |
dspheremix |
Density functions for mixtures of spherical von Mises-Fisher |
rspheremix |
Random generation functions for mixtures of spherical von Mises- |
Fisher | |
sphere.boot.bw |
Spherical bootstrap bandwidth for HDRs estimation |
sphere.distances |
Euclidean and Hausdorff distances between two sets of points on |
the unit sphere | |
sphere.hdr |
Computation of HDRs and general level sets for a given spherical |
real-valued density | |
sphere.plugin.hdr |
Spherical plug-in estimation of HDRs and level sets |
sphere.scatterplot |
Spherical scatterplot for plug-in HDRs |
A complete description of the HDiR package capabilities is provided in this section. The complete list of functions, illustrative density models (density functions and random sample generation) and the two novel datasets available in HDiR, with a brief description, can be seen in Table 1.
The package HDiR includes a circular and a spherical datasets, used
for the illustration of the different functions. The first dataset,
sandhoppers
, contains the orientation angles (in radians between
The second dataset, earthquakes
, contains the geographical coordinates
(latitude and longitude) of earthquakes of magnitude greater than or
equal to 2.5 degrees on the Richter scale registered on Earth between
1st October 2004 and 9th April 2020. It can be downloaded from the
website of the European-Mediterranean Seismological Centre (EMSC)earthquakes
, contains the geographical coordinates (latitude and
longitude) of earthquakes of magnitude greater than or equal to 2.5
degrees on the Richter scale registered on Earth between 1st October
2004 and 9th April 2020. It can be downloaded from the website of the
European-Mediterranean Seismological Centre (EMSC)
Functions dspheremix
and rspheremix
allow to compute density
functions and to generate data from the spherical distributions
introduced in (Saavedra-Nieves and Crujeiras 2021b). These densities represent a
variety of complex structures showing multimodality and/or asymetry. Any
user of package HDiR could use them for simulations or even for
illustration purposes.
Function dspheremix
computes the density function of 9 different
spherical distributions that can be written as finite mixtures of
spherical von Mises-Fisher. Function rspheremix
is designed for random
data generation from these 9 spherical models. Both functions have an
argument called model
which allows to specify a model (a number
between dspheremix
and
rspheremix
are x
and n
, respectively. x
represents a matrix
whose rows collect to points on the unit sphere (in Cartesian
coordinates) and n
denotes the number of observations to be randomly
generated.
Specifically, model number 9 corresponds to the spherical density shown
in Figure 1. For instance, the evaluation of this density on
the north pole
> data <- rbind(c(1, 0, 0), c(0, 0, 1))
> dspheremix(x = data, model = 9)
[1] 0.0009079986 7.0233299246
Output of this example with dspheremix
is a numeric vector containing
the density values on both poles. Additionally, set.seed(1)
as in the rest
of examples throughout this work, by:
> rspheremix(n = 100, model = 9)
[,1] [,2] [,3]
[1,] 0.254793394 -0.186993591 0.948743233
[2,] 0.227755936 0.896600223 0.379783194
[3,] -0.227024808 0.516581111 0.825592934
[4,] 0.125075316 0.960536966 -0.248444967
Output of function rspheremix
is a matrix of dimension
Functions circ.hdr
and sphere.hdr
must be considered when the
objective is to compute theoretical density level sets or HDRs from a
fully known circular and spherical density
The basic arguments of function circ.hdr
that the user must provide
are the circular (not necessarily a density) function f
and, depending
on the set to be computed (a level set or a HDR), level
or tau
must
be indicated. It is worth to mention that level
represents the value
of 1-tau
, the probability coverage
required for HDR computation in ((2)). Note that
tau
must be specified only when f
is a density. Otherwise, fixing
the probability content of the level set makes no sense. Additionally, a
graphical display is generated with different plot arguments (col
,
lty
, shrink
, plot.hdr=FALSE
(by default plot.hdr=TRUE
).
If level
is specified, the output is a list with two components:
levelset
, a matrix where each row contains the boundaries (in radians)
of each connected component of the level set and level
, the input
level
or a character indicating if the level set is equal to the empty
set or the support distribution. If tau
is provided, the output is
also a list with the next components: hdr
, a matrix where each row
contains the boundaries (in radians) of each connected component of the
HDR; prob.content
, probability coverage 1-tau
and level
, threshold
of the HDR computed by numerical integration methods.
An example with the code lines in order to computing a level set (second code line) and a HDR (third code line) for the circular density represented in Figure 1 is given below. This circular density is the model 13 implemented in the package NPCirc. Therefore, it is necessary to install this library before executing the following code.
> f <- function(x){return(dcircmix(x, 13))}
> circ.hdr(f, level = 0.35)
$levelset
[,1] [,2]
[1,] 0.3301974 0.6698291
[2,] 2.8271189 3.1730400
[3,] 4.9089351 5.0913298
$level
[1] 0.35
> circ.hdr(f, tau = 0.5)
$hdr
[,1] [,2]
[1,] 0.2232764 0.7767501
[2,] 2.7201978 3.2799611
[3,] 4.8523298 5.1479351
$prob.content
[1] 0.5
$level
[1] 0.3024789
From the outputs obtained, some conclusions on the number of connected
components can be extracted. HDR computed when hdr
of the obtained list. Density level set with threshold levelset
also shows the
existence of three connected components.
As for function sphere.hdr
, argument f
is now a spherical
real-valued function. Again, f
may not be a density. The other basic
arguments level
, tau
and plot.hdr
coincide with the usage
description for function circ.hdr
. Additionally, the user can specify
two parameters related to the estimated boundary or to the numerical
integration possibilities on the unit sphere to calculate the HDRs
threshold. In particular, nborder
indicates the maximum number of
boundary points to be represented and tol
, the tolerance parameter
used to determinate the boundary. Two extra parameters control the
numerical integration procedure, when required. Argument mesh
indicates the number of vertices on each edge of the embedded icosaedrum
(reproducing the meshes in Figure 3). Possible values of this
argument are 10, 20 and 40, corresponding with 2000, 8000 and 32000
triangular cells on the sphere, respectively. Quadrature formulae on the
triangles are possible with different degrees, controlled by deg
, with
values ranging from 0 up to 6.
An example with the code lines in order to compute a level set (second line) and a HDR (third line) for the spherical density represented in Figure 1 is presented in what follows:
> f <- function(x){return(dspheremix(x, model = 9))}
> sphere.hdr(f, level = 0.1, mesh = 10, deg = 3)
> sphere.hdr(f, tau = 0.5, mesh = 10, deg = 3)
Outputs are similar to those presented for function circ.hdr
. Again,
levelset
and hdr
are matrices of rows of points (in Cartesian
coordinates) on the level set and HDR boundaries, respectively.
Moreover, it is worth to mention that execution time of sphere.hdr
is
considerably higher when tau
is set instead level
because, in this
case, threshold estimation via numerical integration methods is
required.
The HDiR package contains the implementation of density plug-in methods in order to estimate HDRs. Furthermore, it also enables plug-in estimation of general level sets.
Function circ.plugin.hdr
allows to reconstruct density level sets or
HDRs from the kernel estimator described in
((4)). The arguments tau
, level
and
plot.hdr
have basically the same description for function circ.hdr
.
The argument sample
denotes a numeric vector of angles (in radians)
corresponding to the sample of points bw
. Its value could be directly established by the user. Following
(Oliveira et al. 2014), it could be also chosen by using the classical functions
bw.rt
, bw.CV
, bw.pi
or bw.boot
in NPCirc (by default
bw=bw.CV(circular(sample))
providing a cross-validation bandwidth).
The previous options are designed for density estimation. An appropriate
bandwidth for HDR estimation can be obtained using circ.boot.bw
. The
argument tau.method
is a character value selecting the rule to
estimate the HDRs threshold. This must be one of "quantile"
or
"trapezoidal"
. The default option estimates the threshold using the
quantile method proposed in (Hyndman 1996); the second one,
using the trapezoidal rule for numerical integration. The confidence for
limits on HDR are established from conf
that is a numeric probability
that takes the value conf=0.95
by default. Finally, plot.hdrconf
is
a logical string. If plot.hdr=TRUE
and plot.hdrconf=TRUE
(default
options), the confidence region for the estimated HDR is added to the
estimation graphical representation. The argument boot
is a logical
string. If TRUE
, confidence regions are not computed. Its name is due
to this option is only used by function circ.boot.bw
for reducing the
execution time. Default boot=FALSE
.
If level
is specified, the output is a list with four components:
levelset
, a matrix where each row contains the boundary (in radians)
of a connected component of the level set or a character indicating if
the HDR is equal to the empty set or the support distribution;
prob.content
, the empirical probability coverage of the set; level
indicates the level of the level set and bw
, the value of the
smoothing parameter. If tau
is provided, the output is a list with the
next components: hdr
, a matrix where each row contains the boundary
(in radians) of a connected component of the level set; prob.content
,
the probability coverage 1-tau
; level
, the estimated threshold;
bw
, the numeric value of the smoothing parameter used; hdr.lo
and
hdr.hi
, HDRs corresponding to lower and upper confidence limits,
respectively; threshold.lo
and threshold.hi
the corresponding
thresholds.
For example, the circular confidence regions in Figure 2 can be obtained from the next code lines:
> sample <- rcircmix(500, 13)
> circ.plugin.hdr(sample, tau = 0.5, plot.hdrconf = TRUE, k = 2, col = "blue")
$hdr
[,1] [,2]
[1,] 0.1478027 0.6761185
[2,] 2.6761715 3.2736716
[3,] 4.9403824 5.1542246
$prob.content
[1] 0.5
$level
50%
0.2952482
$bw
[1] 64.62809
$hdr.lo
[,1] [,2]
[1,] 0.1226448 0.7327238
[2,] 2.6447241 3.3114085
[3,] 4.9089351 5.1793825
$level.lo
50%
0.2762859
$hdr.hi
[,1] [,2]
[1,] 0.179250 0.6320922
[2,] 2.713908 3.2422243
[3,] 4.984409 5.1164877
$level.hi
50%
0.3142105
Specifically, hdr.lo
and hdr.hi
in the output list contain the
matrices whose rows correspond to the boundaries (in radians) of the
connected components of lower and upper confidence regions,
respectively. For this example, both regions have three connected
components. Additionally, level.lo
and level.hi
contain the
thresholds of both confidence sets.
The specific bandwidth for circular HDRs estimation described in
(Saavedra-Nieves and Crujeiras 2021b) can be computed from function
circ.boot.bw
. As in the previous circular functions described, the
argument sample
is a numeric vector of angles (in radians)
representing tau
corresponds to the probability
coverage 1-tau
of the HDR to be reconstructed. The pilot smoothing
parameter used is bw
. Default
bw=bw.CV(circular(sample), upper = 100)
. As before, its value could be
chosen by using the classical functions bw.rt
, bw.CV
, bw.pi
or
bw.boot
in NPCirc. The number of bootstrap resamples is denoted by
B
(by default B=50
) and upper
is the numerical upper value for
bounding the optimization procedure (by default 1.5bw
). The output of
this function is a single numeric value corresponding to the selected
smoothing parameter.
The following code lines show how to determine both bandwidths for the circular sample previously generated. Output shows that cross-validation selector takes a larger value than the proposal in (Saavedra-Nieves and Crujeiras 2021b).
> bw.CV(sample, upper = 100); circ.boot.bw(sample, tau = 0.8, B = 2)
[1] 64.62809
[1] 37.06194
Function sphere.plugin.hdr
is designed to estimate spherical HDRs or
density level sets from the kernel estimator described in
((4)). The arguments tau
, level
, plot.hdr
,
nborder
, tol
, mesh
and deg
have the same description as for
function sphere.hdr
. The pilot smoothing parameter used is bw
that,
by default, is bw="none"
selecting a cross-validation bandwidth.
Although other options are possible. For instance, bw
can be a numeric
value o also bw="rot"
allows to consider the rule of thumb suggested
by (Garcı́a–Portugués 2013). The value of bw
could be also selected directly
by the user. The argument ngrid
sets the resolution of the density
calculation (by default ngrid=500
).
If level
is provided, the output is also a list with four components:
levelset
, a matrix of rows of points ( on the HDR boundary;
prob.content
, the empirical probability coverage of the set 1-tau
;
level
, the level of the HDR and bw
, the value of the smoothing
parameter. If tau
is an input, the output of sphere.plugin.hdr
is a
list with the following components: hdr
, a matrix of rows of points on
the HDR boundary; prob.content
, probability coverage 1-tau
and
level
, threshold or level associated to the probability content
1-tau
. The threshold
The spherical HDRs estimators in Figure 2 can be reproduced through the next code lines:
> sample <- rspheremix(500, model = 9)
> sphere.plugin.hdr(sample, tau = 0.5, nborder = 2000)
The first specific bandwidth for spherical HDRs estimation described in
(Saavedra-Nieves and Crujeiras 2021b) can be computed from function
sphere.boot.bw
. As in the previous spherical functions described, the
argument sample
is a matrix whose rows represent points on the unit
sphere (in Cartesian coordinates) and tau
corresponds to the
probability coverage 1-tau
of the HDR to be reconstructed. The pilot
smoothing parameter bw
(default bw="none"
) is chosen using
cross-validation, although it may be set to a numeric value or
bw="rot"
, allowing to select the rule of thumb suggested by
(Garcı́a–Portugués 2013). The argument B
denotes again the number of
bootstrap resamples that (default B=50
) and upper
is the numerical
upper value for bounding the optimization procedure (default 1.5bw
).
The output of this function is a single numeric value corresponding to
the selected smoothing parameter.
The following code lines contain a simulated example where the
cross-validation bandwidth and the proposal in
(Saavedra-Nieves and Crujeiras 2021b) provide HDR estimations which look quite
different for the spherical model
> sample <- rspheremix(500, model = 8)
> bw.boot <- sphere.boot.bw(sample, bw = "rot", tau = 0.8, B = 2)
> sphere.plugin.hdr(sample, bw = bw.boot, tau = 0.8)
> sphere.plugin.hdr(sample, bw = "none", tau = 0.8)
Finally, it is important to note that function sphere.plugin.hdr
for
reconstructing spherical HDR’s calls vmf.kerncontour
in package
Directional to compute the density on a grid on the sphere. Most of
the computational work in this function is in estimating the density
using vmf.kerncontour
. Hence, the speed of this function depends
largely on the speed of vmf.kerncontour
. A similar situation occurs
for function sphere.boot.bw
where function sphere.plugin.hdr
is
called
Density estimators different from the one introduced in
((4)) could be naturally considered for plug-in
estimation of HDRs or level sets. Functions circ.hdr
and sphere.hdr
in package HDiR allow to consider this option in the circular and
spherical settings, respectively.
Next, an example with the code lines in order to determine a spherical HDR plug-in reconstruction (sixth line) from the kernel density estimator in (Bai et al. 1989) with uniform kernel is shown.
> f <- function(x){
sample <- rspheremix(500, model = 3)
return(kde_dir(x, data = sample, h = 0.4,
L = function(x) dunif(x)))
}
> sphere.hdr(f, level = 0.3)
$levelset
[,1] [,2] [,3]
[1,] 0.3587511132 -0.159961736 0.9196249
[2,] -0.4523490796 0.077542650 0.8884635
[3,] -0.4588831000 0.060463844 0.8864369
[4,] 0.2455354599 -0.291602658 0.9244892
$level
[1] 0.3
An spherical density estimator with uniform kernel is available in
package DirStats. Before level set plug-in estimation, it is necessary
to install this library in order to define the kernel estimator, in this
example, from a sample of size levelset
. Note that only the first four points
are printed in the example. The value of the threshold level
.
Furthermore, if the considered density estimator for plug-in estimation
is also a density function, argument tau
in circ.hdr
and
sphere.hdr
could be used.
A generalisation of the approach in (Cuevas et al. 2006), for general level
sets, to the directional setting can be performed in practice with
HDiR. Again, functions circ.hdr
and sphere.hdr
address this
problem for circular and sperical data, respectively.
Next, an example with the code lines in order to obtain the plug-in
estimator of a regression curve (eighth line) with circular explanatory
(x
) variable and linear response (y
). In this case, the regression
curve is estimated through the Nadaraya-Watson estimator implemented in
NPCirc. Here, it is computed from a sample of size x
and y
(lines from 1 to 7).
> f <- function(t){
n <- 100
x <- runif(n, 0, 2*pi)
y <- sin(x)+0.5*rnorm(n)
return(kern.reg.circ.lin(circular(x), y, t, bw = 10, method = "NW")$y)
}
> circ.hdr(f, level = 0.5, plot.hdr = FALSE)
$levelset
[,1] [,2]
[1,] 0.4748553 2.757935
$level
[1] 0.5
Output in levelset
contains the boundary (in radians) of the only
connected component for the reconstructed regression level set.
This section introduces a brief background on the design of two exploratory tools included in HDiR: distances between sets and circular/spherical scatterplots.
Distances between sets are a useful tool when the target is the
reconstruction of a set. In particular, the Hausdorff distance can be
seen as a suitable error criterion also in the directional setting.
Additionally, it could be also used for measuring the distances between
modes or clusters of two different populations. Figure 5
(first column) represents, through a black dashed line, the Hausdorff
distance between
Function circ.distances
computes the Euclidean and Hausdorff distances
between two sets of points in x
and y
, two
numeric vectors of angles (in radians) determining both sets of points.
The output is a list with two components: dE
, a numeric value
corresponding to the Euclidean distance, and dH
, another numeric value
corresponding to the Hausdorff distance.
Specifically, if x
and y
correspond to two HDRs boundaries, this
function returns the distances between the circular HDRs frontiers. In
particular, for the example in Figure 5 (left), the
distances between
> sample <- rcircmix(100, 13)
> f <- function(x){return(dcircmix(x, 13))}
> circ.distances(as.numeric(circ.hdr(f, tau = 0.5)$hdr),
+ as.numeric(circ.plugin.hdr(sample, tau = 0.5)$hdr))
$dE
[1] 0.04402277
$dH
[1] 1.37933
The results obtained show that the Euclidean distance is considerably
smaller than the Hausdorff distance that, as we mention before, takes
the value
Function sphere.distances
also determines the Euclidean and Hausdorff
distances but, in this case, between two sets of points on x
and y
are two matrices whose rows represent points on
the unit sphere (in Cartesian coordinates). The output of this function
has the same organization as the output of circ.distances
and it also
allows to compute distances between spherical HDRs frontiers.
Distances between
> sample = rspheremix(1000, model = 9)
> x <- sphere.plugin.hdr(sample, tau = 0.8, plot.hdr = FALSE)$hdr
> y <- sphere.plugin.hdr(sample, tau = 0.5, plot.hdr = FALSE)$hdr
> sphere.distances(x, y)
$dE
[1] 0.08600028
$dH
[1] 0.258705
The performance of the specific bandwidth for HDR estimation introduced
in (Saavedra-Nieves and Crujeiras 2021b) can be also illustrated through the
consideration of the Hausdorff distance in the example shown in Figure
4. Specifically, the value of the Hausdorff distance between
the theoretical HDR and the reconstruction computed from the bandwidth
proposed in (Saavedra-Nieves and Crujeiras 2021b) is
Additionally, scatterplots are useful to identify the estimated
directional HDRs in which sample points fall. This graphical tool is
computed as follows. Given several values
Function circ.scatterplot
produces a circular scatterplot with points
coloured according to the HDRs in which they fall. Apart from the
argument tau
that represents a numeric vector of probabilities and
plot.density
that is a logical string indicating if the kernel density
estimator is added to the scatterplot (default plot.density=TRUE
), the
other inputs (sample
, bw
and tau.method
) have the same description
for circular functions. The output is a scatterplot and also a list
where the number of components is equal to the number of estimated HDR
or, equivalently, to the length of tau
vector. Each component contains
the sample points in each HDR from the smallest value of tau
to the
largest one.
Next code lines allow to obtain a circular scatterplot computed from a
circular sample of size
> sample<- rcircmix(100, model = 13)
> circ.scatterplot(sample, tau = c(0.2, 0.5, 0.8))
Spherical scatterplots can be represented from function
sphere.scatterplot
. Again, apart from tau
that is a vector of
probabilities, the description of the remaining parameters coincides
with the rest of spherical functions. The output provides a scatterplot
and, as in the circular case, a list where the number of components is
equal to the number of estimated HDR containing the corresponding sample
points from the smallest value of tau
to the biggest one.
As an illustration, the spherical scatterplot shown in Figure 5 (third column) could be computed from the next code lines:
> sample <- rspheremix(1000, model = 9)
> sphere.scatterplot(sample, tau = c(0.2, 0.5, 0.8))
Datasets sandhoppers
and earthquakes
included in HDiR are used
next to illustrate briefly the usage of the set of functions previously
described in the circular and spherical settings, respectively.
Figure 6 shows the estimated HDRs established in
((3)), when
> data(sandhoppers)
> attach(sandhoppers)
> britoF <- angle[(species == "brito")&(time == "morning")&(sex == "F")
+ &(month == "October")]
> circ.plugin.hdr(sample = britoF, tau = 0.8, plot.hdrconf = FALSE)
> britoM <- angle[(species == "brito")&(time == "morning")&(sex == "M")
+ &(month == "October")]
> circ.plugin.hdr(sample = britoM, tau = 0.8, plot.hdrconf = FALSE)
According to Figure 6, no remarkable differences exist
between the HDRs reconstructions for males using a cross-validation
bandwidth (center) and the proposal
> bw.CV(britoM); circ.boot.bw(britoM, tau = 0.8)
> bw.CV(britoF); circ.boot.bw(britoF, tau = 0.8)
As an example with the dataset earthquakes
in Figure 7, we
show the estimated HDR defined in ((3)) for
> data(earthquakes)
> hdr <- as.data.frame(euclid.inv(sphere.plugin.hdr(euclid(earthquakes), tau = 0.8,
+ plot.hdr = FALSE)$hdr))
> world <- map_data("world")
> g.earthquakes <- ggplot()+
> geom_map(data = world, map = world, mapping = aes(map_id = region),
+ color = "grey90", fill = "grey80")+
> geom_point(data = earthquakes, mapping = aes(x = Longitude,
+ y = Latitude), color = "red", alpha = 0.2, size = 0.75, stroke = 0)+
> geom_point(data = hdr, mapping = aes(x = Long, y = Lat),
+ color = "darkblue", size = 1)+
> scale_y_continuous(breaks = NULL, limits = c(-90, 90))+
> scale_x_continuous(breaks = NULL, limits = c(-180, 180))+
> coord_map("mercator")
> g.earthquakes
The value of the bandwidth proposed in (Saavedra-Nieves and Crujeiras 2021b) for
earthquakes
dataset with tau=0.8
and B=5
bootstrap resamples is
> sphere.boot.bw(euclid(earthquakes), tau = 0.8, B = 5)
Once the HDRs estimation has been performed for different values of
> hdr1 <- sphere.plugin.hdr(euclid(earthquakes), tau = 0.8, plot.hdr = FALSE)$hdr
> hdr2 <- sphere.plugin.hdr(euclid(earthquakes), bw = 0.09, tau = 0.8,
+ plot.hdr = FALSE)$hdr
> sphere.distances(hdr1, hdr2)
Apart from distances between HDRs, scatterplots are another powerful
exploratory tool implemented in HDiR. For the sandhoppers
dataset,
Figure 8 shows the circular scatterplots for
> circ.scatterplot(britoF, tau = c(0.2, 0.5, 0.8))
> circ.scatterplot(britoM, tau = c(0.2, 0.5, 0.8))
Spherical scatterplots for earthquakes
dataset when euclid
allows to transforms the data to geographical
coordinates (longitude and latitude) on Cartesian coordinates. Remark
that the smoothing parameter is selected by using the rule of thumb
proposed in (Garcı́a–Portugués 2013).
> sphere.scatterplot(euclid(earthquakes), tau = c(0.2, 0.5, 0.8), bw = "rot",
+ nborder = 1500)
HDiR has been mainly developed for facilitating the reconstruction of directional (circular and spherical) HDRs and density level sets, following a nonparametric plug-in approach. However, it also allows to solve the computation and the plug-in estimation of level sets for general real-valued functions, such as a regression curve. As consequence, plug-in reconstruction of HDRs could be performed by considering a different density estimator than the one implemented by default in HDiR.
The implemented tools are accessible for the scientific community, enabling their usage in practical problems such as the exploration of modes or the approximation of the distribution effective support. As previously noted, level set computation is also useful for determining distribution clusters, a task that can be accomplished by the identification of the connected components from a plug-in level set estimator.
Up to the authors’ knowledge, HDiR is the only statistical package that allows to estimate (circular and spherical) HDRs and general level sets. For HDRs reconstruction, HDiR also implements the first specific selector for HDRs estimation in this context, proposed in (Saavedra-Nieves and Crujeiras 2021b). Additionally, it offers graphical exploratory tools such as HDRs scatterplots that allow to visualize HDRs of a distribution taking into account different probability contents. Similarities or discrepancies between them could be measured through the Hausdorff distance also implemented in HDiR.
Future extensions of the HDiR package could include the estimation of level sets and HDRs in other supports, involving a circular or a spherical component, such as the torus or the cylinder. In addition, new specific bandwidths for HDR estimation could be implemented. A variety of bandwidths selectors emerge from the consideration of different distances in ((5)). Finally, cluster definition in (Hartigan 1975) deserves to be exploited in the directional setting, for instance, by implementing cluster trees for hyperspherical data.
P. Saavedra-Nieves and R.M. Crujeiras acknowledge the financial support of Ministerio de Ciencia e Innovación of the Spanish government under grants PID2020-118101GB-I00 and PID2020-116587GB-I00 and ERDF. Authors also thank Prof. Felicita Scapini for providing the sandhoppers data (collected under the support of the European Project ERB ICI8-CT98-0270), Prof. Andrés Prieto for his help with spherical numerical integration. The authors also acknowlege the constructive comments of the AE and the reviewer, which have improved the contents of the paper and the package.
HDiR, SphericalCubature, pdfCluster, LPCM, Modalclust, denpro, hdrcde, lsbs, sm, ks, circular, CircStats, Directional, DirStats, NPCirc, CircNNTSR, isocir, movMF, ggplot2, maps, mapproj
Cluster, Distributions, Environmetrics, NaturalLanguageProcessing, NumericalMathematics, Phylogenetics, Spatial, TeachingStatistics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Saavedra-Nieves & Crujeiras, "HDiR: An R Package for Computation and Nonparametric Plug-in Estimation of Directional Highest Density Regions and General Level Sets", The R Journal, 2022
BibTeX citation
@article{RJ-2022-046, author = {Saavedra-Nieves, Paula and Crujeiras, Rosa M.}, title = {HDiR: An R Package for Computation and Nonparametric Plug-in Estimation of Directional Highest Density Regions and General Level Sets}, journal = {The R Journal}, year = {2022}, note = {https://rjournal.github.io/}, volume = {14}, issue = {3}, issn = {2073-4859}, pages = {121-141} }