HDiR: An R Package for Computation and Nonparametric Plug-in Estimation of Directional Highest Density Regions and General Level Sets

A deeper understanding of a distribution support, being able to determine regions of a certain (possibly high) probability content is an important task in several research fields. Package HDiR for R is designed for exact computation of directional (circular and spherical) highest density regions and density level sets when the density is fully known. Otherwise, HDiR implements nonparametric plug-in methods based on different kernel density estimates for reconstructing this kind of sets. Additionally, it also allows the computation and plug-in estimation of level sets for general functions (not necessarily a density). Some exploratory tools, such as suitably adapted distances and scatterplots, are also implemented. Two original datasets and spherical density models are used for illustrating HDiR functionalities.

Paula Saavedra-Nieves (CITMAga, Galician Centre for Mathematical Research and Technology) , Rosa M. Crujeiras (CITMAga, Galician Centre for Mathematical Research and Technology)
2022-12-20

1 An overview on directional general level sets and highest density regions

When analyzing a data distribution, it is often the case that for a deeper understanding of the modelling problem, it is interesting to determine regions on the density support exceeding a certain threshold on the density function values. These regions are known as density level sets and, if the density is unknown, such a task can be accomplished from a set estimation perspective. Set estimation deals with the problem of reconstructing a set (or estimating any of its features such as its boundary or its volume) from a random sample of points intimately related to it. Since (Hartigan 1975) establishes the notion of clusters as connected components of a density level set, the reconstruction of this particular type of sets has been widely considered in the literature (mainly for densities supported on an Euclidean space). There are only very few contributions where density level set theory has been extended to more general domains such as the unit sphere or manifolds. (Cuevas et al. 2006) consider the estimation of level sets for general functions (not necessarily a density) such as regression curves, providing some consistency theoretical results and showing a density level set on the sphere for illustration. More recently, the reconstruction of density level sets on manifolds is studied in (Cholaquidis et al. 2022), who also presents some simulations illustrating the performance of their approach on the torus and on the sphere.

Let \(X\) be a random vector taking values on a \(d-\)dimensional unit sphere \(S^{d-1}\) with density \(f\) and \(t>0\), the goal of (directional) density level set estimation is to reconstruct the set \[\label{G(t)} G_f(t)=\{x\in S^{d-1}:f(x)\geq t\}. \tag{1}\] from a random sample of points \(\mathcal{X}_n= \{X_1,\cdots,X_n\}\) of \(X\) when \(f\) is unknown. As an illustration, some (theoretical) level sets are shown in Figure 1 by representing \(G_f(t)\) for a circular (left) and a spherical density (right) when three different values of the level \(t\) are chosen. The threshold \(t\) is represented through a dotted line for the circular case. Note that, if large values of \(t\) are considered, \(G_f(t)\) coincides with the greatest modes of the circular/spherical distribution. However, for small values of \(t\), the level set \(G_f(t)\) is practically equal to the support of the distribution. Therefore, cluster definition via connected components in (Hartigan 1975) is clearly related to the notion of mode. Note also that the computation of the number of modes considering the values of a density over a certain range of values for the level \(t\), would enable the construction of a directional cluster tree. (Azzalini and Torelli 2007) already present this statistical tool for Euclidean data. Moreover, the association between clusters and modes is the basis of modal clustering methodology (see (Menardi 2016) for a review on this topic). Most modal clustering algorithms are based on the application of a mode-seeking numerical method to the sample points and assigning the same cluster to those data that are iteratively shifted to the same limit value. Examples of such procedures include the mean shift algorithm that has been already studied in \(S^{d-1}\) (see, for instance, (Chang-Chien et al. 2010) and (Yang et al. 2014)).

graphic without alt text
Figure 1: For a circular density (left) and a spherical density (right), level set \(G_f(t)\) for \(t=t_1\), \(t=t_2\) and \(t=t_3\) verifying \(0<t_1<t_2<t_3\). Equivalently, HDR \(L(f_\tau)\) for \(\tau=\tau_1=0.2\), \(\tau=\tau_2=0.5\) and \(\tau=\tau_3=0.8\).

Despite a practitioner may be interested in determining this type of regions, the value of the level \(t\) could be (in principle) unknown in real situations. In practice, it is quite common to assume that the set in ((1)) must satisfy a probability content previously established. Following (Box and Tiao 1973), (Hyndman 1996) and, more recently, (Azzalini and Torelli 2007), (Saavedra-Nieves and Crujeiras 2021b) generalize the definition of HDRs from the Euclidean to the directional setting, providing a plug-in estimation method. Specifically, HDRs are a kind of density level sets where the set probability content is fixed instead of the level \(t\). The estimation of HDRs involves further complexities given that the threshold must be computed from the previously fixed probability content. Formally, given \(\tau\in(0,1)\), the \(100(1 - \tau)\)% HDR is the subset \[\label{conjuntonivel2} L(f_\tau)=\{x\in S^{d-1}:f(x)\geq f_\tau\}, \tag{2}\] where \(f_\tau\) can be seen as the largest constant such that \[\mathbb{P}(X\in L(f_\tau))\geq 1-\tau,\] with respect to the distribution induced by \(f\). Figure 1 also shows the HDR \(L(f_\tau)\) for a circular and a spherical densities with three different values of \(\tau\). Note that, if large values of \(\tau\) are considered, \(L(f_\tau)\) is equal to the greatest modes and the most distinct clusters can be easily identified. However, for small values of \(\tau\), \(L(f_\tau)\) is almost equal to the support of the distribution.

To sum up, given a value of \(t\), the computation of the level set established in ((1)) (and of its connected components) is a quite simple mathematical task when \(f\) is known. Under this assumption and taking a fixed \(\tau\in (0,1)\), determining the HDR introduced in ((2)) presents a similar complexity but, in this case, it is additionally necessary to determine the threshold \(f_\tau\). In particular, numerical integration methods can be applied to solve that problem. However, when the density \(f\) is assumed to be unknown and a random sample \(\mathcal{X}_n\in S^{d-1}\) generated from \(f\) is the only available information to reconstruct the set, nonparametric set estimation techniques such as plug-in methods must be considered in order to reconstruct the connected components of the set. Perhaps due to its practical importance, Euclidean HDRs plug-in algorithms based on kernel smoothing have been widely studied even solving the problem of selecting an appropriate smoothing parameter specifically devised for the HDR reconstruction (see (Baı́llo and Cuevas 2006), (Samworth and Wand 2010) or (Casa et al. 2020)). In the directional setting, given that a proper definition of the HDR \(L(f_\tau)\) was not available, no work on this area had been carried out until the recent contribution by (Saavedra-Nieves and Crujeiras 2021b).

The contents of this paper, describing the contributions in HDiR, mainly focus on computation and plug-in estimation of highest density regions (HDRs) and density level sets in the circle and the sphere. Although general level sets can be also analysed using HDiR, we will not formally detail aspects on their computation and on their plug-in reconstruction given that they can be seen as a direct generalisation of those introduced for density level sets by replacing the density by the general function under study. Therefore, with the objective of showing the capabilities of the HDiR package for exact computation of directional HDRs and density level sets when \(f\) is known and for plug-in estimation otherwise, this paper is organized as follows. First, a basic overview on nonparametric plug-in estimation methods is given. Initially, the classical directional kernel density estimator is briefly introduced, as it is the key tool for plug-in reconstruction and exploratory methods. Then, the problems of threshold estimation (with known and unknown density) and specific bandwidth selection for HDRs are considered. Circular confidence regions for HDRs are also established. Next, the reader will find a guided tour across HDiR package, illustrating its use with simulated examples first and with two real data examples later. Following the perspective in (Cuevas et al. 2006), HDiR also allows the computation and plug-in estimation of general level sets. A reconstruction example of a (circular) regression level set is detailed. Moreover, distances between sets and circular/spherical scatterplots are also described as exploratory tools. Finally, some discussion is provided, considering on the possible extensions of the package.

2 Plug-in estimation methods

This section provides a brief background on the design of plug-in tools included in HDiR for directional (circular and spherical) HDR and density level set estimation. Following (Cuevas et al. 2006), if a nonparametric estimator is available for a general function, this methodology may be directly extended for reconstructing the corresponding level sets.

Plug-in estimation methods for HDRs and level sets

Although there are other nonparametric alternative routes for level set estimation, the plug-in approach has received considerable attention in the Euclidean literature (see (Tsybakov 1997), (Baı́llo 2003), (Mason and Polonik 2009), (Rigollet et al. 2009), (Mammen and Polonik 2013) or (Chen et al. 2017)). This is with no doubt a natural methodology, which can be generalized to the directional setting as in (Saavedra-Nieves and Crujeiras 2021b). Given that level set estimation is a simpler problem than HDR reconstruction, we will restrict to this last setting in what follows. Given a random sample \(\mathcal{X}_n\in S^{d-1}\) of the unknown directional density \(f\), plug-in methods reconstruct the \(100(1 - \tau)\)% HDR namely \(L(f_\tau)\) in ((2)) as \[\label{Ltauest} \hat{L}(\hat{f}_\tau)=\{x\in S^{d-1}:f_n(x)\geq \hat{f}_\tau\}, \tag{3}\] where \(\hat{f}_\tau\) is an estimator of the threshold \(f_\tau\) and \(f_n\) denotes a nonparametric directional density estimator. Package HDiR implements the kernel density estimator provided in (Bai et al. 1989) (\(d>2\)). From \(\mathcal{X}_n\), it is defined at a point \(x\in S^{d-1}\) as \[\label{estimacionnucleo} f_n(x)= \frac{1}{n} \sum_{i=1}^n K_{vM}(x;X_i;1/h^2), \tag{4}\] where \(K_{vM}\) denotes the von Mises-Fisher kernel density and \(1/h^2 > 0\) is the concentration parameter.

Following (Bai et al. 1989), package HDiR also enables to use any kernel function (not necessarily the von Mises-Fisher density implemented by default). An example where an uniform kernel is considered will be presented later. Even more generally, HDiR would allow the user to define different density estimators that the one introduced in ((4)). See, for instance, (Pelletier 2005).

As for the concentration parameter \(1/h^2\), it plays an analogous role to the bandwidth in the Euclidean case. For small values of \(1/h^2\), the density estimator is oversmoothed. The opposite effect is obtained as \(1/h^2\) increases. Hence, the choice of \(h\) is a crucial issue. For simplicity, in what follows, we refer to \(h\) as bandwidth parameter. Many approaches for selecting \(h\) in practice, in circular and even directional settings, have been proposed in the literature (see (Taylor 2008), (Oliveira et al. 2012), (Hall et al. 1987), (Di Marzio et al. 2011) or (Garcı́a–Portugués 2013)). All these existing proposals designed for density estimation are implemented in the package NPCirc and their aim is to minimize some error criterion on the target density. However, such a bandwidth selector may not be adequate for HDRs or level set estimation. As far as we know, such a tool was not available in the directional setting until the selector by (Saavedra-Nieves and Crujeiras 2021b). It is also available in package HDiR. Different plug-in estimators for HDRs emerge from the consideration of all these bandwith selectors.

For the circular and the spherical densities shown in Figure 1, now Figure 2 contains the HDR plug-in estimators (bluish colours) for \(\tau=0.5\) computed using cross-validation bandwidths and samples of sizes \(n=100\) and \(n=500\), respectively. Although the theoretical circular HDR is composed by three connected components (see Figure 1), the plug-in estimator is able to detect only the two biggest clusters when \(n=100\). In order to assess the agreement of a given estimate with the theoretical target, distances between sets are the usual tools to measure the discrepancies between the theoretical sets and the corresponding empirical reconstructions. One of the most common distances in the Euclidean setting is the Hausdorff distance between the boundaries of both sets.

graphic without alt text
Figure 2: For the circular density shown in Figure 1 (first row), \(\hat{L}(\hat{f}_\tau)\) (bluish colour) for \(\tau=\tau_2=0.5\) computed from \(\mathcal{X}_{100}\) (first column) and \(\mathcal{X}_{500}\) (second column). Additionally, confidence regions are represented (dark red colour) for the second estimation. For the spherical density shown in Figure 1 (second row), \(\hat{L}(\hat{f}_\tau)\) (bluish colour) for \(\tau=\tau_2=0.5\) computed from \(\mathcal{X}_{100}\) (first column) and \(\mathcal{X}_{500}\) (second column).

If the target is the reconstruction of a HDR or a density level set, the Hausdorff metric is a suitable error criterion in the directional setting (see (Cuevas et al. 2006) and (Cholaquidis et al. 2022)). If \(A\) and \(B\) are non-empty compact sets in the \(d-\)dimensional Euclidean space, the Hausdorff distance between \(A\) and \(B\) is defined as follows \[d_H(A,B)=\max\left\{\sup_{x\in A}d_E\left(\{x\},B\right),\sup_{y\in B}d_E\left(\{y\},A\right)\right\},\] where \(d_E(\{x\},B)=\inf_{y\in B}\{d_E(x,y)\}\) being \(d_E(x,y)\) the Euclidean distance between two points. However, this metric \(d_H\) is not completely successful in detecting shape-related differences. For instance, two sets can be very close in Hausdorff distance and still show quite different shapes. This typically happens where the boundaries \(\partial A\) and \(\partial B\) are far apart, no matter the proximity of \(A\) and \(B\). So, a natural way to reinforce the notion of visual proximity between two sets provided by Hausdorff distance is to account also for the proximity of their respective boundaries. In particular, Hausdorff distance between the boundaries of the theoretical HDR and its plug-in reconstruction is a measure of the estimation error. HDiR allows to compute Euclidean and Hausdorff distances between the frontiers of two arbitrary sets on the circle and on the sphere.

Threshold estimation and confidence regions for HDRs

For a given \(\tau\in (0,1)\), determining the set \(L(f_\tau)\) in ((2)) and its plug-in estimator \(\hat{L}(\hat{f}_{\tau})\) in ((3)) involve the exact computation and the estimation of the threshold \(f_\tau\), respectively. As in the Euclidean setting, both tasks require the use of numerical integration methods. Specifically, HDiR uses the classical trapezoidal rule in the circular setting. However, for the spherical case, the computational cost becomes a major issue due to the complexity of the numerical integration algorithms considered on high dimensional spaces. It should be noted that package SphericalCubature includes some functions for solving numerical integration over spheres. However, it does not provide sufficiently accurate solutions for our problem.

An alternative approach is implemented in the internal function sphere.integration of HDiR. Specifically, the proposed numerical integration procedure on the sphere requires the definition of a triangular mesh, such as the ones depicted in Figure 3, obtained from the projection over the sphere of triangular meshes on an embedded icosaedrum. This type of mesh guarantees that there is not a prevailing direction. For computing the corresponding spherical integral, the Cartesian coordinates of the mesh vertices are transformed into spherical coordinates and standard quadrature formulae are applied in each triangle over the plane formed by the azimuthal and polar angles (see (Strang and Fix 1973)).