In many data analyses, the dimensionality of the observed data is high while its intrinsic dimension remains quite low. Estimating the intrinsic dimension of an observed dataset is an essential preliminary step for dimensionality reduction, manifold learning, and visualization. This paper introduces an R package, named ider, that implements eight intrinsic dimension estimation methods, including a recently proposed method based on a second-order expansion of a probability mass function and a generalized linear model. The usage of each function in the package is explained with datasets generated using a function that is also included in the package.
The assumption that the intrinsic dimension is low even when the apparent dimension is high, that is, that the data distribution is constrained to a low-dimensional manifold, underlies many machine learning and data analysis methods, such as dimension reduction and visualization (Cook and Yin 2001; Kokiopoulou and Saad 2007). Without a good estimate of the intrinsic dimension, dimensionality reduction is no more than a risky bet, insofar as one does not know to what extent the dimensionality can be reduced. We may overlook important information by projecting the original data onto a subspace of too low a dimension, while analyzing unnecessarily high-dimensional data wastes computational resources and time. When we use visualization techniques to gain insight into data, it is essential to understand whether the data at hand can be safely visualized in low dimensions, and to what extent the original information will be preserved. Several methods for intrinsic dimension estimation (IDE) have been proposed, and they can be roughly divided into two categories:
projection-based methods and
distance-based methods.
The former category of IDE methods basically involves two steps. First, the given dataset is partitioned. Then, in each partition, principal component analysis (PCA) or another procedure for finding a dominant subspace is performed. This approach is generally easy to implement and suitable for exploratory data analysis (Fukunaga and Olsen 1971; Verveer and Duin 1995; Kambhatla and Leen 1997; Bruske and Sommer 1998). However, the estimated dimension is heavily influenced by how the data space is partitioned. Moreover, it is unclear how the threshold for the eigenvalues obtained by PCA should be determined. This class of methods is useful for exploratory analysis with human interaction and trial-and-error iteration, but it is unsuitable for plugging into a pipeline for automated data analysis, and we do not consider this sort of method in this paper.
The package ider implements various methods for estimating the intrinsic dimension from a set of observed data using a distance-based approach (Pettis et al. 1979; Grassberger and Procaccia 1983; Kégl 2002; Hein and Audibert 2005; Levina and Bickel 2005; Fan et al. 2009; Gupta and Huang 2010; Eriksson and Crovella 2012; Hino et al. 2017). The implemented algorithms work with either a data matrix or a distance matrix. There are a large number of distance-based IDE methods. Among them, methods based on the fractal dimension (Mandelbrot 1977) are well studied in the fields of both mathematics and physics. The proposed package ider implements the following fractal dimension-based methods:
corint: the correlation integral (Grassberger and Procaccia 1983)
convU: the kernel version of the correlation integral (Hein and Audibert 2005)
packG, packT: capacity dimension-based methods with packing number estimation (a greedy method (Kégl 2002) and a tree-based method (Eriksson and Crovella 2012))
mada: first-order local dimension estimation (Farahmand et al. 2007)
side: second-order local dimension estimation (Hino et al. 2017)
There are several other distance-based methods, such as one based on a maximum likelihood estimate under a Poisson model (Levina and Bickel 2005), which approximates the distribution of distances from an inspection point to the other points in a given dataset. This method is implemented in our package as the function lbmle. A similar but distinct approach utilizing nearest-neighbor information (Pettis et al. 1979) is implemented as the function nni.
The proposed package also provides a data-generating function gendata, which generates several famous artificial datasets that are often used as benchmarks for IDE and manifold learning.
In fractal analysis, the Euclidean concept of dimension is replaced with the notion of a fractal dimension, which characterizes how a given shape or dataset occupies its ambient space. There are many different definitions of the fractal dimension, from both mathematical and physical perspectives. Well-known fractal dimensions include the correlation dimension and the capacity dimension. There are already some R packages for estimating the fractal dimension, such as fractal, nonlinearTseries, and tseriesChaos. In fractal and nonlinearTseries, estimators of the correlation dimension and its generalizations are implemented, and in tseriesChaos, the method of false nearest neighbors (Kennel et al. 1992) is implemented. These packages focus on estimating the embedding dimension of a time series in order to characterize its chaotic properties. To complement the above-mentioned packages, we implemented several fractal dimension estimators for vector-valued observations.
For a set of observed data, the correlation integral measures the proportion of pairs of points lying within a given distance of each other; the correlation dimension is defined by the growth rate of this proportion as the distance decreases to zero.
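As an illustration of the idea, the correlation dimension can be approximated by the slope of the log correlation integral between two radii. The following is a minimal sketch in base R, not the ider implementation; the function name and radii are chosen only for this example:

```r
# Estimate the correlation dimension as the slope of log C(r) vs. log r
# between two radii r1 < r2 (a from-scratch sketch, not ider's corint).
corr_dim_sketch <- function(x, r1, r2) {
  d <- as.matrix(dist(x))               # pairwise Euclidean distances
  n <- nrow(d)
  # C(r): fraction of ordered point pairs closer than r (diagonal removed)
  C <- function(r) (sum(d < r) - n) / (n * (n - 1))
  (log(C(r2)) - log(C(r1))) / (log(r2) - log(r1))
}

set.seed(1)
theta <- runif(500, 0, 2 * pi)
x <- cbind(cos(theta), sin(theta))      # unit circle: intrinsic dimension 1
corr_dim_sketch(x, r1 = 0.1, r2 = 0.2)  # roughly 1
```

Because the slope is taken between two finite radii rather than in the limit of vanishing radius, the choice of radii matters, which is why corint exposes the neighborhood parameters k1 and k2.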
In the ider package, the classical correlation dimension estimator proposed by Grassberger and Procaccia (1983) is performed using the function corint as follows:
> set.seed(123)
> x <- gendata(DataName='SwissRoll', n=300)
> estcorint <- corint(x=x, k1=5, k2=10)
> print(estcorint)
[1] 1.963088
where gendata is used to generate the famous 'SwissRoll' dataset with an ambient dimension of three and an intrinsic dimension of two. As observed, the correlation integral method of Grassberger and Procaccia (1983) works well for this dataset. The kernel-based correlation dimension estimator is performed using the function convU as follows:
> set.seed(123)
> x <- gendata(DataName='SwissRoll', n=300)
> estconvU <- convU(x=x, maxDim=5)
> print(estconvU)
[1] 2
The method proposed by Hein and Audibert (2005) attempts to find the possible intrinsic dimension one by one, up to maxDim; consequently, the estimated dimension can only be a natural number. All IDE functions in ider support both vector-valued data matrices and distance matrices as input. This is useful in cases where only a distance matrix is available, and in cases where the original data object cannot be represented by a finite, fixed-dimensional vector. It is also useful for very high-dimensional data, for which retaining only the distance matrix saves memory. To indicate that the input is a distance matrix, we set the parameter DM to TRUE as follows:
> set.seed(123)
> x <- gendata(DataName='SwissRoll', n=300)
> estcorint <- corint(x=dist(x), DM=TRUE, k1=5, k2=10)
> print(estcorint)
[1] 1.963088
The distance matrix can be either a matrix object or a dist object.
The capacity dimension, also called the box-counting dimension, characterizes how the number of balls of a given radius needed to cover the dataset grows as the radius shrinks. Because the covering number is hard to compute, it is commonly replaced by the packing number, which yields the same dimension in the limit (Kégl 2002). In the ider package, capacity dimension estimation is based on the packing number with a greedy approximation, and it is performed using the function pack as follows:
> set.seed(123)
> x <- gendata(DataName='SwissRoll', n=300)
> estpackG <- pack(x=x, greedy=TRUE) ## estimate the packing number by the greedy method
> print(estpackG)
[1] 2.289935
whereas the hierarchical clustering-based method is performed as
> estpackC <- pack(x=x, greedy=FALSE) ## estimate the packing number by clustering
> print(estpackC)
[1] 2.393657
Packing-number-based methods require two radii at which the packing number is evaluated; the capacity dimension is then estimated from the ratio of the difference of the log packing numbers to the difference of the log radii.
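The two-radius recipe can be sketched in base R as follows; the greedy packing routine and the radii below are my own illustrative choices, not ider's internals:

```r
# Greedy approximation of the packing number: scan the points in order and
# keep a point only if it is farther than r from every point kept so far.
packing_number <- function(x, r) {
  centers <- x[1, , drop = FALSE]
  for (i in 2:nrow(x)) {
    di <- sqrt(rowSums((centers - matrix(x[i, ], nrow(centers),
                                         ncol(x), byrow = TRUE))^2))
    if (min(di) > r) centers <- rbind(centers, x[i, ])
  }
  nrow(centers)
}

# Capacity dimension from packing numbers at two radii r1 < r2
capdim_sketch <- function(x, r1, r2) {
  -(log(packing_number(x, r2)) - log(packing_number(x, r1))) /
    (log(r2) - log(r1))
}

set.seed(1)
x <- cbind(runif(2000), runif(2000))    # uniform on the unit square (2D)
capdim_sketch(x, r1 = 0.05, r2 = 0.1)   # roughly 2
```

Halving the radius should roughly quadruple the packing number of a two-dimensional set, which is exactly the log-ratio the estimator computes.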
The former two fractal dimensions, viz., the correlation dimension and the capacity dimension, are designed to estimate a global intrinsic dimension. Any global IDE method can be converted into a local method by running it on a neighborhood of a point. However, we now introduce two inherently local fractal dimension estimators. The relationship between the global and local fractal dimensions is discussed in (Hino et al. 2017).
The function mada implements the first-order local dimension estimator of Farahmand et al. (2007), which estimates the dimension around an inspection point from the distances to its nearest neighbors. The usage of mada is shown below:
> set.seed(123)
> tmp <- gendata(DataName='ldbl', n=300)
> x <- tmp$x
> estmada <- mada(x=x, local=TRUE)
> estmada[c(which(tmp$tDim==1)[1], which(tmp$tDim==2)[1], which(tmp$tDim==3)[1])]
[1] 1.113473 2.545525 2.207250
This sample code estimates the local intrinsic dimension at every point in the 'ldbl' dataset, and then prints the estimates at one point from each of the regions whose true dimensions are one, two, and three, respectively.
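A first-order local estimate of this kind can be sketched in a few lines of base R; this is an illustrative implementation assuming the nearest-neighbor distance-ratio form of the estimator, not ider's mada itself:

```r
# First-order local dimension sketch: in a d-dimensional region the k-th
# nearest-neighbor distance grows like k^(1/d), so comparing the distances
# to the k-th and ceiling(k/2)-th neighbors gives
#   d ~ log(2) / log(r_k / r_{k/2}).
local_dim_sketch <- function(x, i, k = 30) {
  d <- sqrt(colSums((t(x) - x[i, ])^2))   # distances from point i to all points
  r <- sort(d)[-1]                        # drop the zero self-distance
  log(2) / log(r[k] / r[ceiling(k / 2)])
}

set.seed(1)
u <- matrix(runif(9000), ncol = 3)        # 3000 points, uniform in the unit cube
mean(sapply(1:100, function(i) local_dim_sketch(u, i)))  # roughly 3
```

Individual point estimates are noisy, which is why averaging (or another combination rule) over inspection points is useful in practice.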
In (Hino et al. 2017), accurate local IDE methods based on a higher-order expansion of the probability mass function and Poisson regression modeling are proposed. The probability mass of a ball of radius ε centered at an inspection point is expanded by a second-order Taylor series in ε, in which the local intrinsic dimension appears as the leading exponent of ε; the proof is detailed in (Hino et al. 2017). However, there is no need to know the exact form of the second-order expansion. By approximating the probability mass by the number of observed points falling within the ball, the local intrinsic dimension at the inspection point can be estimated by maximizing the likelihood of a Poisson-type generalized linear model.
In the package ider, two IDE algorithms are implemented based on the maximization of this Poisson log-likelihood. The first method assumes that the intrinsic dimension is a natural number: the log-likelihood is maximized for each candidate dimension in turn, and the dimension attaining the highest likelihood is adopted. The second method treats the log-likelihood as a function of both the regression coefficients and a continuous dimension parameter, and maximizes it jointly, so the resulting estimate is not restricted to natural numbers.
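A rough sketch of the continuous variant in base R is given below; this simplified model, in which nested ball counts are treated as Poisson responses regressed on the log radius with a second-order correction term, is my own illustration rather than ider's side:

```r
# Sketch: the count of points within radius eps of an inspection point grows
# roughly like C * eps^d * (1 + c * eps^2), so a Poisson regression of counts
# on log(eps) with an eps^2 correction has d as the coefficient of log(eps).
# (Simplification: nested counts are not independent Poisson draws.)
set.seed(1)
x <- rbind(c(0.5, 0.5), matrix(runif(4000), ncol = 2))  # inspection point + 2000 points
d0 <- sqrt(colSums((t(x) - x[1, ])^2))   # distances from the inspection point
eps <- seq(0.02, 0.2, by = 0.02)
counts <- sapply(eps, function(e) sum(d0 <= e) - 1)     # exclude the point itself
fit <- glm(counts ~ log(eps) + I(eps^2), family = poisson)
unname(coef(fit)["log(eps)"])            # estimated local dimension, near 2
```

Dropping the I(eps^2) term recovers a purely first-order fit; the correction term is what absorbs the curvature effect that the second-order expansion models.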
In the ider package, second-order local IDE with discrete dimensions is performed using the function side (Second-order Intrinsic Dimension Estimator) as follows:
> set.seed(123)
> tmp <- gendata(DataName='ldbl', n=300)
> x <- tmp$x
> idx <- c(sample(which(tmp$tDim==1)[1:10], 3), sample(which(tmp$tDim==2)[1:30], 3))
> estside <- side(x=x[1:100,], local=TRUE, method='disc')
> print(estside[idx]) ## estimated discrete local intrinsic dimensions by side
[1] 1 1 1 3 1 2
An example using the method 'cont' is as follows:
> estside <- side(x=x[1:100,], local=TRUE, method='cont')
> print(estside[idx]) ## estimated continuous local intrinsic dimensions by side
[1] 1.020254 1.338089 1.000000 2.126269 3.360426 2.074643
It is seen that the obtained estimates are not natural numbers.
The local dimension estimates are easily aggregated into a global estimate by taking the average, median, or vote of the local estimates; this is realized by setting the argument local = FALSE in mada or side. The functions mada and side have an argument comb to specify how the local estimates are combined. When comb='average', the local estimates are averaged to give the global IDE; when comb='median', the median of the local estimates is adopted; and when comb='vote', a majority vote over the local estimates is adopted as the global IDE. Note that the combination method 'vote' should be used only with method='disc'.
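The three combination rules amount to the following, shown here with a hypothetical vector of discrete local estimates:

```r
# Hypothetical local dimension estimates at eight points
local_est <- c(1, 2, 2, 2, 3, 2, 1, 2)

mean(local_est)     # comb = 'average'  -> 1.875
median(local_est)   # comb = 'median'   -> 2
# comb = 'vote': the most frequent local estimate
as.integer(names(which.max(table(local_est))))   # -> 2
```

The vote rule only makes sense for the discrete estimates produced by method='disc', since continuous estimates almost never repeat.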
> set.seed(123)
> x <- gendata(DataName='SwissRoll', n=300)
> estmada <- mada(x=x, local=FALSE, comb='median')
> estmada
[1] 1.754866
> estside <- side(x=x, local=FALSE, comb='median', method='disc')
> estside
[1] 2
The package ider supports two other distance-based dimension estimation methods, namely lbmle and nni. Levina and Bickel (2005) derived a maximum likelihood estimator of the dimension by modeling the number of points falling within a ball around an inspection point as a Poisson process, which approximates the distribution of distances from the inspection point to the other points in the dataset. The estimator is implemented as the function lbmle:
> set.seed(123)
> x <- gendata(DataName='SwissRoll', n=300)
> estmle <- lbmle(x=x, k1=3, k2=5, BC=FALSE)
> print(estmle)
[1] 3.174426
It was pointed out by MacKay and Ghahramani that the above MLE has a certain bias. We can calculate the bias-corrected estimate by setting the argument BC, which stands for "bias correction", to TRUE:
> set.seed(123)
> x <- gendata(DataName='SwissRoll', n=300)
> estmle <- lbmle(x=x, k1=3, k2=5, BC=TRUE)
> print(estmle)
[1] 2.032756
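For reference, the Levina-Bickel estimator for a single neighborhood size k, together with the MacKay and Ghahramani correction of averaging the inverse estimates across points, can be sketched in base R as follows (an illustration, not ider's lbmle):

```r
# Levina-Bickel MLE for one neighborhood size k:
#   m_k(x) = [ 1/(k-1) * sum_{j<k} log( T_k(x) / T_j(x) ) ]^{-1},
# where T_j(x) is the distance from x to its j-th nearest neighbor.
# MacKay and Ghahramani's fix: average the inverses 1/m_k(x) over the
# data points, then invert, instead of averaging the m_k(x) directly.
lb_dim_sketch <- function(x, k = 10) {
  D <- as.matrix(dist(x))
  inv_m <- apply(D, 1, function(row) {
    Tj <- sort(row)[-1][1:k]             # distances to the k nearest neighbors
    mean(log(Tj[k] / Tj[1:(k - 1)]))     # 1 / m_k(x) at this point
  })
  1 / mean(inv_m)
}

set.seed(1)
theta <- runif(1000, 0, 2 * pi)
x <- cbind(cos(theta), sin(theta), sin(2 * theta))  # a closed curve in 3D (d = 1)
lb_dim_sketch(x, k = 10)                 # close to 1
```

In lbmle the estimates are additionally averaged over a range of neighborhood sizes from k1 to k2, which stabilizes the choice of k.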
Pettis et al. (1979) proposed an IDE method based on an analysis of the distribution of distances from one point to its nearest neighbors. They derived that, under a local uniformity assumption, the expected distance from a point to its k-th nearest neighbor is proportional to k^(1/d), where d is the intrinsic dimension; regressing the log mean neighbor distances on log k therefore yields an estimate of d. The estimator is implemented as the function nni in ider and used as follows:
> set.seed(123)
> x <- gendata(DataName='SwissRoll', n=300)
> estnni <- nni(x=x)
> print(estnni)
[1] 2.147266
The function nni has parameters k1 and k2 that specify the range of numbers of nearest neighbors considered, similar to lbmle. The method is based on iteratively refining the estimate, and the function nni has a parameter eps to specify the convergence threshold for stopping the iteration.
The ider package is equipped with a data-generating function gendata. It can generate nine different artificial datasets, each a low-dimensional manifold embedded in a higher-dimensional space, which are selected by setting the argument DataName to one of the following:
SwissRoll: SwissRoll data, a 2D manifold in 3D space.
NDSwissRoll: Non-deformable SwissRoll data, a 2D manifold in 3D space.
Moebius: Moebius strip, a 2D manifold in 3D space.
SphericalShell: Spherical shell, a (p-1)-dimensional manifold in p-dimensional space.
Sinusoidal: Sinusoidal data, a 1D manifold in 3D space.
Spiral: Spiral-shaped data, a 1D manifold in 2D space.
Cylinder: Cylinder-shaped data, a 2D manifold in 3D space.
SShape: S-shaped data, a 2D manifold in 3D space.
ldbl: Four subspaces, line, disc, filled ball, and line, concatenated in this order along a single axis.
The final dataset, ldbl, is used to assess the ability of local dimension estimators. The dataset comprises four sub-manifolds: a line (1D), a disc (2D), a filled ball (3D), and a line again, concatenated in this order.
The parameter n of the function gendata specifies the number of samples in a dataset. All but the SphericalShell dataset have fixed ambient and intrinsic dimensions. For the SphericalShell dataset, an arbitrary integer can be set as the ambient dimension via the argument p.
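For intuition, a SwissRoll-type dataset can be generated in a few lines of base R; this is a sketch, and gendata's exact parameterization may differ:

```r
# A 2D sheet parameterized by (u, v), rolled up into 3D space.
make_swissroll <- function(n) {
  u <- runif(n)
  v <- runif(n)
  t <- 1.5 * pi * (1 + 2 * u)            # roll angle
  cbind(t * cos(t), 10 * v, t * sin(t))  # ambient dimension 3
}

set.seed(1)
x <- make_swissroll(300)
dim(x)   # 300 x 3: ambient dimension 3, intrinsic dimension 2
```

Because the map from (u, v) to 3D is smooth and injective, the intrinsic dimension of the resulting point cloud is two regardless of the ambient coordinates.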
The realizations of each dataset are shown in Fig. 1.
Figure 1: Realizations of each of the artificial datasets generated by gendata.
The aim of this paper is to introduce the package ider and explain the standard usage of its implemented functions. Exhaustive experiments comparing IDE methods in various settings can be found in (Hino et al. 2017). In this paper, as a simple example of applying the IDE methods to a realistic problem, we consider estimating the intrinsic dimension of a set of images. We used the CMU Hand Rotation dataset.
The original CMU Hand Rotation dataset is composed of 481 images of a rotating hand. The pairwise distance matrix computed from these images is included in ider as handD:
> data(handD)
> str(handD)
Class 'dist' atomic [1:115440] 4.96 8.27 8.33 8.31 8.12 ...
..- attr(*, "Labels")= chr [1:481] "dimg" "dimg" "dimg" "dimg" ...
..- attr(*, "Size")= int 481
..- attr(*, "call")= language as.dist.default(m = handD)
..- attr(*, "Diag")= logi FALSE
..- attr(*, "Upper")= logi FALSE
> dim(as.matrix(handD))
[1] 481 481
Because the object handD is a distance matrix, we must set the argument DM to TRUE when we apply IDE methods to these data:
> lbmle(x=handD, DM=TRUE, k1=3, k2=5, BC=TRUE, p=NULL)
[1] 4.433915
> corint(x=handD, DM=TRUE, k1=3, k2=10)
[1] 2.529079
> pack(x=handD, DM=TRUE, greedy=TRUE)
[1] 3.314233
> pack(x=handD, DM=TRUE, greedy=FALSE)
[1] 2.122698
> nni(handD, DM=TRUE)
[1] 2.646178
> side(x=handD, DM=TRUE, local=FALSE, method='disc', comb='median')
[1] 3
> side(x=handD, DM=TRUE, local=FALSE, method='cont', comb='median')
[1] 2
These results suggest that, although the extrinsic dimension of the images (the number of pixels) is very high, the intrinsic dimension of the dataset is estimated to be quite low, roughly between two and five.
We measured the computational cost of each dimension estimation method on the SwissRoll dataset while varying the number of observations.
It is seen that the computational costs of packT and side grow almost quadratically. Among the various methods, lbmle and nni are computationally very efficient. When applying packT or side to a large-scale dataset, it is advisable to subsample the original dataset.
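The subsampling advice amounts to something like the following; the dataset and subsample size here are illustrative choices, and the estimator itself would be one of ider's quadratic-cost functions:

```r
# Subsample before applying a quadratic-cost estimator such as packT or side.
set.seed(1)
big <- matrix(rnorm(50000), ncol = 5)   # 10000 observations in 5 dimensions
sub <- big[sample(nrow(big), 1000), ]   # random subsample of 1000 rows
dim(sub)                                # 1000 x 5
```

With a quadratic-cost method, reducing 10000 observations to 1000 cuts the work by roughly a factor of 100, usually at a modest cost in estimation accuracy.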
In this paper, we introduced an R package, ider, that implements several IDE algorithms for vector-valued observations or distance matrices. Different IDE methods capture different aspects of a data distribution concentrated on a low-dimensional subspace, and the estimated dimensions vary accordingly. Our goal in developing ider is to provide a systematic way to select or compare different methods for analyzing a given dataset. In practice, a ground-truth intrinsic dimension is not available. In such cases, one possible approach is to apply all of the IDE methods provided in ider and check whether the estimates agree with one another. If a consensus cannot be reached, the reason why will itself be of interest, and this can help to characterize the nature of the data distribution.
It is worth noting that there is another approach to IDE based on the minimum spanning tree. In (Costa et al. 2004), a method is proposed for estimating the intrinsic dimension from the growth rate of the total edge length of a geodesic minimum spanning tree; supporting this class of graph-based estimators is a possible future extension of the package.
Finally, another important piece of future work is improving computational efficiency. The IDE methods based on the packing number and on the second-order expansion of the probability mass function are not computationally efficient. Implementation using Rcpp or parallelization would be an effective means of speeding up the computation, and we are now working on a parallel implementation.
The author would like to express their special thanks to the editor, the associate editor and anonymous reviewers whose comments led to valuable improvements of the manuscript. Part of this research was supported by JSPS KAKENHI Grant Numbers 16H02842, 25120011, 16K16108, and 17H01793.
CRAN packages used: ider, fractal, nonlinearTseries, tseriesChaos, Rcpp
CRAN Task Views implied by this article: Finance, HighPerformanceComputing, NumericalMathematics, TimeSeries
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hino, "ider: Intrinsic Dimension Estimation with R", The R Journal, 2017
BibTeX citation
@article{RJ-2017-054,
  author  = {Hino, Hideitsu},
  title   = {ider: Intrinsic Dimension Estimation with R},
  journal = {The R Journal},
  year    = {2017},
  note    = {https://rjournal.github.io/},
  volume  = {9},
  issue   = {2},
  issn    = {2073-4859},
  pages   = {329-341}
}