A deeper understanding of a distribution support, being able to determine regions of a certain (possibly high) probability content is an important task in several research fields. Package *HDiR* for R is designed for exact computation of directional (circular and spherical) highest density regions and density level sets when the density is fully known. Otherwise, *HDiR* implements nonparametric plug-in methods based on different kernel density estimates for reconstructing this kind of sets. Additionally, it also allows the computation and plug-in estimation of level sets for general functions (not necessarily a density). Some exploratory tools, such as suitably adapted distances and scatterplots, are also implemented. Two original datasets and spherical density models are used for illustrating *HDiR* functionalities.

When analyzing a data distribution, it is often the case that for a deeper understanding of the modelling problem, it is interesting to determine regions on the density support exceeding a certain threshold on the density function values. These regions are known as density level sets and, if the density is unknown, such a task can be accomplished from a set estimation perspective. Set estimation deals with the problem of reconstructing a set (or estimating any of its features such as its boundary or its volume) from a random sample of points intimately related to it. Since (Hartigan 1975) establishes the notion of clusters as connected components of a density level set, the reconstruction of this particular type of sets has been widely considered in the literature (mainly for densities supported on an Euclidean space). There are only very few contributions where density level set theory has been extended to more general domains such as the unit sphere or manifolds. (Cuevas et al. 2006) consider the estimation of level sets for general functions (not necessarily a density) such as regression curves, providing some consistency theoretical results and showing a density level set on the sphere for illustration. More recently, the reconstruction of density level sets on manifolds is studied in (Cholaquidis et al. 2022), who also presents some simulations illustrating the performance of their approach on the torus and on the sphere.

Let \(X\) be a random vector taking values on a \(d-\)dimensional unit sphere \(S^{d-1}\) with density \(f\) and \(t>0\), the goal of (directional) density level set estimation is to reconstruct the set \[\label{G(t)} G_f(t)=\{x\in S^{d-1}:f(x)\geq t\}. \tag{1}\] from a random sample of points \(\mathcal{X}_n= \{X_1,\cdots,X_n\}\) of \(X\) when \(f\) is unknown. As an illustration, some (theoretical) level sets are shown in Figure 1 by representing \(G_f(t)\) for a circular (left) and a spherical density (right) when three different values of the level \(t\) are chosen. The threshold \(t\) is represented through a dotted line for the circular case. Note that, if large values of \(t\) are considered, \(G_f(t)\) coincides with the greatest modes of the circular/spherical distribution. However, for small values of \(t\), the level set \(G_f(t)\) is practically equal to the support of the distribution. Therefore, cluster definition via connected components in (Hartigan 1975) is clearly related to the notion of mode. Note also that the computation of the number of modes considering the values of a density over a certain range of values for the level \(t\), would enable the construction of a directional cluster tree. (Azzalini and Torelli 2007) already present this statistical tool for Euclidean data. Moreover, the association between clusters and modes is the basis of modal clustering methodology (see (Menardi 2016) for a review on this topic). Most modal clustering algorithms are based on the application of a mode-seeking numerical method to the sample points and assigning the same cluster to those data that are iteratively shifted to the same limit value. Examples of such procedures include the mean shift algorithm that has been already studied in \(S^{d-1}\) (see, for instance, (Chang-Chien et al. 2010) and (Yang et al. 2014)).

Despite a practitioner may be interested in determining this type of regions, the value of the level \(t\) could be (in principle) unknown in real situations. In practice, it is quite common to assume that the set in ((1)) must satisfy a probability content previously established. Following (Box and Tiao 1973), (Hyndman 1996) and, more recently, (Azzalini and Torelli 2007), (Saavedra-Nieves and Crujeiras 2021b) generalize the definition of HDRs from the Euclidean to the directional setting, providing a plug-in estimation method. Specifically, HDRs are a kind of density level sets where the set probability content is fixed instead of the level \(t\). The estimation of HDRs involves further complexities given that the threshold must be computed from the previously fixed probability content. Formally, given \(\tau\in(0,1)\), the \(100(1 - \tau)\)% HDR is the subset \[\label{conjuntonivel2} L(f_\tau)=\{x\in S^{d-1}:f(x)\geq f_\tau\}, \tag{2}\] where \(f_\tau\) can be seen as the largest constant such that \[\mathbb{P}(X\in L(f_\tau))\geq 1-\tau,\] with respect to the distribution induced by \(f\). Figure 1 also shows the HDR \(L(f_\tau)\) for a circular and a spherical densities with three different values of \(\tau\). Note that, if large values of \(\tau\) are considered, \(L(f_\tau)\) is equal to the greatest modes and the most distinct clusters can be easily identified. However, for small values of \(\tau\), \(L(f_\tau)\) is almost equal to the support of the distribution.

To sum up, given a value of \(t\), the computation of the level set established in ((1)) (and of its connected components) is a quite simple mathematical task when \(f\) is known. Under this assumption and taking a fixed \(\tau\in (0,1)\), determining the HDR introduced in ((2)) presents a similar complexity but, in this case, it is additionally necessary to determine the threshold \(f_\tau\). In particular, numerical integration methods can be applied to solve that problem. However, when the density \(f\) is assumed to be unknown and a random sample \(\mathcal{X}_n\in S^{d-1}\) generated from \(f\) is the only available information to reconstruct the set, nonparametric set estimation techniques such as plug-in methods must be considered in order to reconstruct the connected components of the set. Perhaps due to its practical importance, Euclidean HDRs plug-in algorithms based on kernel smoothing have been widely studied even solving the problem of selecting an appropriate smoothing parameter specifically devised for the HDR reconstruction (see (Baı́llo and Cuevas 2006), (Samworth and Wand 2010) or (Casa et al. 2020)). In the directional setting, given that a proper definition of the HDR \(L(f_\tau)\) was not available, no work on this area had been carried out until the recent contribution by (Saavedra-Nieves and Crujeiras 2021b).

The contents of this paper, describing the contributions in
*HDiR*, mainly focus on
computation and plug-in estimation of highest density regions (HDRs) and
density level sets in the circle and the sphere. Although general level
sets can be also analysed using *HDiR*, we will not formally detail
aspects on their computation and on their plug-in reconstruction given
that they can be seen as a direct generalisation of those introduced for
density level sets by replacing the density by the general function
under study. Therefore, with the objective of showing the capabilities
of the *HDiR* package for exact computation of directional HDRs and
density level sets when \(f\) is known and for plug-in estimation
otherwise, this paper is organized as follows. First, a basic overview
on nonparametric plug-in estimation methods is given. Initially, the
classical directional kernel density estimator is briefly introduced, as
it is the key tool for plug-in reconstruction and exploratory methods.
Then, the problems of threshold estimation (with known and unknown
density) and specific bandwidth selection for HDRs are considered.
Circular confidence regions for HDRs are also established. Next, the
reader will find a guided tour across *HDiR* package, illustrating its
use with simulated examples first and with two real data examples later.
Following the perspective in (Cuevas et al. 2006), *HDiR* also allows the
computation and plug-in estimation of general level sets. A
reconstruction example of a (circular) regression level set is detailed.
Moreover, distances between sets and circular/spherical scatterplots are
also described as exploratory tools. Finally, some discussion is
provided, considering on the possible extensions of the package.

This section provides a brief background on the design of plug-in tools
included in *HDiR* for directional (circular and spherical) HDR and
density level set estimation. Following (Cuevas et al. 2006), if a
nonparametric estimator is available for a general function, this
methodology may be directly extended for reconstructing the
corresponding level sets.

Although there are other nonparametric alternative routes for level set
estimation, the plug-in approach has received considerable attention in
the Euclidean literature (see (Tsybakov 1997),
(Baı́llo 2003), (Mason and Polonik 2009), (Rigollet et al. 2009),
(Mammen and Polonik 2013) or (Chen et al. 2017)). This is with no doubt a
natural methodology, which can be generalized to the directional setting
as in (Saavedra-Nieves and Crujeiras 2021b). Given that level set estimation is a
simpler problem than HDR reconstruction, we will restrict to this last
setting in what follows. Given a random sample
\(\mathcal{X}_n\in S^{d-1}\) of the unknown directional density \(f\),
plug-in methods reconstruct the \(100(1 - \tau)\)% HDR namely \(L(f_\tau)\)
in ((2)) as
\[\label{Ltauest}
\hat{L}(\hat{f}_\tau)=\{x\in S^{d-1}:f_n(x)\geq \hat{f}_\tau\}, \tag{3}\]
where \(\hat{f}_\tau\) is an estimator of the threshold \(f_\tau\) and \(f_n\)
denotes a nonparametric directional density estimator. Package *HDiR*
implements the kernel density estimator provided in (Bai et al. 1989)
(\(d>2\)). From \(\mathcal{X}_n\), it is defined at a point \(x\in S^{d-1}\)
as
\[\label{estimacionnucleo}
f_n(x)= \frac{1}{n} \sum_{i=1}^n K_{vM}(x;X_i;1/h^2), \tag{4}\]
where \(K_{vM}\) denotes the von Mises-Fisher kernel density and
\(1/h^2 > 0\) is the concentration parameter.

Following (Bai et al. 1989), package *HDiR* also enables to use any
kernel function (not necessarily the von Mises-Fisher density
implemented by default). An example where an uniform kernel is
considered will be presented later. Even more generally, *HDiR* would
allow the user to define different density estimators that the one
introduced in ((4)). See, for instance,
(Pelletier 2005).

As for the concentration parameter \(1/h^2\), it plays an analogous role
to the bandwidth in the Euclidean case. For small values of \(1/h^2\), the
density estimator is oversmoothed. The opposite effect is obtained as
\(1/h^2\) increases. Hence, the choice of \(h\) is a crucial issue. For
simplicity, in what follows, we refer to \(h\) as bandwidth parameter.
Many approaches for selecting \(h\) in practice, in circular and even
directional settings, have been proposed in the literature (see
(Taylor 2008), (Oliveira et al. 2012), (Hall et al. 1987),
(Di Marzio et al. 2011) or (Garcı́a–Portugués 2013)). All these existing proposals
designed for density estimation are implemented in the package *NPCirc*
and their aim is to minimize some error criterion on the target density.
However, such a bandwidth selector may not be adequate for HDRs or level
set estimation. As far as we know, such a tool was not available in the
directional setting until the selector by (Saavedra-Nieves and Crujeiras 2021b).
It is also available in package *HDiR*. Different plug-in estimators for
HDRs emerge from the consideration of all these bandwith selectors.

For the circular and the spherical densities shown in Figure 1, now Figure 2 contains the HDR plug-in estimators (bluish colours) for \(\tau=0.5\) computed using cross-validation bandwidths and samples of sizes \(n=100\) and \(n=500\), respectively. Although the theoretical circular HDR is composed by three connected components (see Figure 1), the plug-in estimator is able to detect only the two biggest clusters when \(n=100\). In order to assess the agreement of a given estimate with the theoretical target, distances between sets are the usual tools to measure the discrepancies between the theoretical sets and the corresponding empirical reconstructions. One of the most common distances in the Euclidean setting is the Hausdorff distance between the boundaries of both sets.

If the target is the reconstruction of a HDR or a density level set, the
Hausdorff metric is a suitable error criterion in the directional
setting (see (Cuevas et al. 2006) and (Cholaquidis et al. 2022)). If \(A\) and
\(B\) are non-empty compact sets in the \(d-\)dimensional Euclidean space,
the Hausdorff distance between \(A\) and \(B\) is defined as follows
\[d_H(A,B)=\max\left\{\sup_{x\in A}d_E\left(\{x\},B\right),\sup_{y\in B}d_E\left(\{y\},A\right)\right\},\]
where \(d_E(\{x\},B)=\inf_{y\in B}\{d_E(x,y)\}\) being \(d_E(x,y)\) the
Euclidean distance between two points. However, this metric \(d_H\) is not
completely successful in detecting shape-related differences. For
instance, two sets can be very close in Hausdorff distance and still
show quite different shapes. This typically happens where the boundaries
\(\partial A\) and \(\partial B\) are far apart, no matter the proximity of
\(A\) and \(B\). So, a natural way to reinforce the notion of visual
proximity between two sets provided by Hausdorff distance is to account
also for the proximity of their respective boundaries. In particular,
Hausdorff distance between the boundaries of the theoretical HDR and its
plug-in reconstruction is a measure of the estimation error. *HDiR*
allows to compute Euclidean and Hausdorff distances between the
frontiers of two arbitrary sets on the circle and on the sphere.

For a given \(\tau\in (0,1)\), determining the set \(L(f_\tau)\) in
((2)) and its plug-in estimator
\(\hat{L}(\hat{f}_{\tau})\) in ((3)) involve the exact
computation and the estimation of the threshold \(f_\tau\), respectively.
As in the Euclidean setting, both tasks require the use of numerical
integration methods. Specifically, *HDiR* uses the classical trapezoidal
rule in the circular setting. However, for the spherical case, the
computational cost becomes a major issue due to the complexity of the
numerical integration algorithms considered on high dimensional spaces.
It should be noted that package
*SphericalCubature*
includes some functions for solving numerical integration over spheres.
However, it does not provide sufficiently accurate solutions for our
problem.

An alternative approach is implemented in the internal function
`sphere.integration`

of *HDiR*. Specifically, the proposed numerical
integration procedure on the sphere requires the definition of a
triangular mesh, such as the ones depicted in Figure 3,
obtained from the projection over the sphere of triangular meshes on an
embedded icosaedrum. This type of mesh guarantees that there is not a
prevailing direction. For computing the corresponding spherical
integral, the Cartesian coordinates of the mesh vertices are transformed
into spherical coordinates and standard quadrature formulae are applied
in each triangle over the plane formed by the azimuthal and polar angles
(see (Strang and Fix 1973)).