Several persistent homology software libraries have been implemented in R. Specifically, the Dionysus, GUDHI, and Ripser libraries have been wrapped by the TDA and TDAstats CRAN packages. These software represent powerful analysis tools that are computationally expensive and, to our knowledge, have not been formally benchmarked. Here, we analyze runtime and memory growth for the 2 R packages and the 3 underlying libraries. We find that datasets with less than 3 dimensions can be evaluated with persistent homology fastest by the GUDHI library in the TDA package. For higher-dimensional datasets, the Ripser library in the TDAstats package is the fastest. Ripser and TDAstats are also the most memory-efficient tools to calculate persistent homology.
Topological data analysis (TDA) is a broad set of methodologies that
characterizes structural features of datasets inspired by topological
principles. It has a broad range of usage, from viral evolution to
physical chemistry (Chan et al. 2013; Offroy and Duponchel 2016). Within the umbrella of TDA,
persistent homology represents an algebraic approach to understanding
the number, characteristics, and persistence of structural features in
an
There are several C++ libraries available to researchers that calculate alpha and Vietoris-Rips complexes, such as Dionysus, GUDHI, and Ripser (Maria et al. 2016; Morozov 2018; Bauer 2019). These libraries have been wrapped in R by the TDA and TDAstats packages (Wadhwa et al. 2018; Fasy et al. 2019). Although useful, calculating persistent homology for large datasets is often limited due to computational complexity (Otter et al. 2017). As a result, researchers often limit persistent homology analysis to lower dimensions. However, ignoring features in higher dimensions may cause significant information loss, underutilizing persistent homology’s capabilities. Here, we aim to benchmark two R packages - TDA and TDAstats - and enable researchers to most efficiently calculate persistent homology in R.
An
There are several different methods to construct a simplicial complex on
a given point cloud
Once
Alpha complexes provide another method to generate simplicial complexes
on the point cloud
In both methods, the boundary matrix records all simplicial complexes
for each parameter value (
Alpha complex calculations have a run time complexity of
We use readr v1.3.1 to read rectangular data (Wickham et al. 2018), ggplot2 v3.2.1 (Wickham 2016), scatterplot3d v0.3-41 (Ligges and Mächler 2003), recexcavAAR v0.3.0 (Schmid and Serbe 2017), deldir v0.1 (Turner 2020), ggtda v0.1 (Brunson et al. 2020), and magick v2.2 (Ooms 2019) to visualize data, bench v1.0.4 to collect benchmark data (Hester 2019), TDA v1.6.9 (Fasy et al. 2019) and TDAstats v0.4.1 (Wadhwa et al. 2018) to calculate persistent homology of Vietoris-Rips and alpha simplicial complices, and pryr v0.1.4 for calculations involving R objects (Wickham 2018). Median runtime calculations are shown along with the minimum and maximum of 10 iterations per benchmark. Datasets were generated by sampling functions in base R to generate points uniformly distributed over circles (dimension = 2), spheres (dimension = 3), filled squares (dimension = 2), filled cubes (dimension = 3, 4), and tori (dimension = 3). The number of points per point cloud varied from 25 to 500 along with intervals of 25 points, which were empirical limits chosen after considering available computational resources. For consistency between software libraries, the minimum and maximum simplicial complex radii were predetermined for each point cloud and provided as parameters to the TDA and TDAstats R packages. Within the TDA package, benchmark data was collected for the GUDHI (Maria et al. 2016) and Dionysus (Morozov 2018) libraries; within the TDAstats package, benchmark data was collected for the Ripser (Bauer 2019) library. As alpha complex calculation was only implemented in GUDHI, alpha complex benchmark data was naturally only collected for the single library. Measuring memory usage proved challenging since all the libraries calculating persistent homology were implemented in either C++ or Java and then wrapped in R as part of a CRAN package. Thus, memory burden was indirectly measured by using boundary matrix size as a proxy. Given that Ripser optimizes computation of persistent homology by avoiding calculation of a boundary matrix, memory use benchmarks are not provided for Ripser and, consequently, TDAstats.
Benchmark data were collected twice - once on a local machine and once on a remote computing node, each of which featured 16 GB RAM. Both datasets were compared for consistency and are publicly available at the repository linked below. Data from the remote computing node is visualized in this report. The larger point clouds required more than 16 GB of RAM to calculate persistent homology using a subset of the libraries; attempts to compute results resulted in runtime errors, and the corresponding output is missing from the corresponding figures and tables. Fully reproducible code for all numerical results and figures can be found at https://github.com/eashwarsoma/TDA-benchmark. This GitHub repository also contains instructions for generating the Supplement referenced in this report’s results. Video explanations of TDA concepts and reproducing all results in this report can be found at https://tinyurl.com/TDABench.
Computing persistent homology of a canonical torus grants quick insight into efficiency of each library (Figure 3). Dionysus exhibits the longest median runtime, and, although Ripser and GUDHI have similar runtimes for smaller point clouds, as the number of points increases Ripser eventually has a significant lead. Next, we compare library performance with multiple canonical datasets to ensure that the noted pattern generalizes.
Tori do not trivially generalize to other dimensions, but circles do. Benchmarking on a circular point cloud permits confirmation of the pattern in Figure 3 while also revealing how the libraries compare as the dataset dimension increases. Figure 4 exhibits the resulting data for a 2-dimensional circle (left panel), 3-dimensional sphere (center panel), and 4-dimensional hypersphere (right panel). When the dataset dimension equals 2, GUDHI practically matches Ripser’s performance in outpacing Dionysus. However, in the case of the 3-dimensional sphere, the pattern visualized in Figure 3 for the 3-dimensional torus is again present. By the 4th dimension, the gap between Ripser and GUDHI widens. Of note, missing points for larger datasets in Figure 4 are not plotted if and only if calculating persistent homology caused an error due to insufficient RAM. Thus, for the hypersphere, Ripser was able to calculate persistent homology for a dataset with approximately 3 times as many points as Dionysus and over 2 times as many as GUDHI. Interestingly, all curves plotted in Figure 4 grow polynomially with respect to the number of points (see Supplement for regression details).
Large data and feature dimensionality often restrict persistent homology calculations to small point clouds due to computational limits. When calculating persistent homology on a high-dimensional point cloud, as Vietoris-Rips feature dimension increases, there is a corresponding increase in runtime (Figure 5). Dionysus is clearly outmatched by GUDHI and Ripser as feature dimension increases, with the difference being clearest for larger point clouds; by feature dimension 5, Ripser outpaces GUDHI as well (Figure 5). It is unclear whether runtime for each library grows polynomially or exponentially (see Supplement for regression details).
Even with a constant feature dimension, the underlying data dimension could play a role in the runtime of persistent homology calculation. Figure 6 compares the handling of this issue by the Vietoris-Rips complex and the alpha complex. Since GUDHI is the only library implementing functionality with an alpha complex, we compare its implementations of the Vietoris-Rips and alpha complices. Due to computational limitations, an alpha complex could not be calculated for any point clouds with data dimensions exceeding 3. Two notable aspects of Figure 6 stand out. First, the alpha complex calculation clearly runs faster than the Vietoris-Rips complex calculation, a trend that becomes clearer as point cloud size increases. Second, although the Vietoris-Rips complex calculation runtime appears to be independent of the underlying data dimension, the alpha complex calculation is dependent on it. Figure 6 shows a subtle difference between data dimensions 2 and 3 as point cloud size increases. Although unconcerning for a data dimension up to 3, failure to run any alpha complex calculations with a data dimension of 4 could be cause for concern.
In addition to runtime differences, the three Vietoris-Rips homology engines differ in memory use. All three engines appeared to follow power law growth, with a linear trend on log-log plots (Figure 7). However, for nearly all combinations of point cloud dimension and shape, TDAstats used the least memory, and Dionysus used the most, with TDAstats also growing with the smallest power law exponent as the number of points increased for most point clouds. For most point clouds, runtime and memory complexity for TDAstats (Ripser) grew with a power function at least one degree less than the other engines (Figure 8).
As persistent homology calculations continue to become a more popular tool to analyze complex multidimensional data, it will be important to understand from a computational perspective which method to use. In this paper, we examined two forms of persistent homology complexes: Vietoris-Rips and alpha complexes. Both algorithms describe topological features through the generation of simplicial complexes. The advantage in saving computational time by choosing a particular algorithm depends on point cloud characteristics.
Figure 9 shows that at high point cloud sizes, GUDHI’s alpha
complex outperforms Ripser. Theoretically, alpha complexes gain
polynomial run time complexity as the number of points increases,
whereas Vietoris-Rips complexes gain exponential run time complexity
(Otter et al. 2017). Specifically, alpha complexes are
Based on the theoretical complexity and our results, alpha complexes are superior for point clouds with 3 or fewer dimensions. This advantage becomes especially clear at a high number of points. This difference in performance is clear in both runtime and memory use. Interestingly, while alpha complexes had overall less memory use, the memory use varied depending on the shape. Alpha complexes seem to require more memory for noisier data sets such as the annulus when compared to the sphere.
However, without sufficient computational resources, alpha complexes were not usable for point cloud dimensions greater than 3. If a point cloud has more than 3 dimensions, then it could undergo pre-processing with dimension reduction before using alpha complexes. Note, it is possible an algorithm will eventually be developed to enable alpha complex computation of higher-dimensional data. However, if data dimension cannot be reduced without significant information loss, then Vietoris-Rips complexes should be used. It should be stated that if a point cloud is compatible with both complexes, both analyses should be performed as there may be a variation in the persistent homology matrix. This is because alpha complexes satisfy the Nerve Theorem (Edelsbrunner and Mücke 1994), which implies that they are topologically equivalent to the true underlying topology of the dataset; in contrast, Vietoris-Rips complexes only approximate the underlying topology (Hausmann 1996). Among the tested Vietoris-Rips engines, Ripser (wrapped by TDAstats) has the fastest calculation time. GUDHI and Dionysus (wrapped by TDA) significantly fall behind as feature dimension and number of points increase.
On average, Ripser computed the persistent homology of a Vietoris-Rips complex with less memory than either GUDHI or Dionysus. Thus, when efficiency is critical, useRs would likely find TDAstats the appropriate library. However, TDA contains an entire library of features not available in TDAstats. Specifically, TDA allows kNN density estimation, kernel density estimation, density-based clustering, and dendrogram visualization, among other useful functionality. When computational resources are plenty, when point clouds are small and low-dimensional, or when the aforementioned functionality is required, TDA will likely be more appropriate than TDAstats. Both packages are hosted on CRAN.
While Vietoris-Rips complexes can handle high-dimensional data well, the calculation still significantly slows down when looking for higher dimension features. This is evidenced by the big-O polynomial growth for runtime and memory that have degree less than 4 for most 2-dimensional point clouds, but degree between 4 and 6 for most 4-dimensional point clouds (Figure 8); even higher degree complexities should be expected as point cloud dimension increases. Thus, finding high-dimensional topological features in high-dimensional point clouds remains a challenge. Methods to calculate persistent homology do exist for other simplicial complexes, such as the Delaunay complex and the Witness complex, but, to our knowledge, they are not currently implemented in CRAN packages. Future challenges would be creating and implementing algorithms that reduce the computational complexity of higher-dimensional topological feature calculations for R.
The authors thank Emily Dolson, PhD for assistance with using a high-performance computing node for data collection. JGS was supported by NIH grant R37 CA244613.
TDA, TDAstats, readr, ggplot2, scatterplot3d, recexcavAAR, deldir, magick, bench, pryr
Phylogenetics, Spatial, TeachingStatistics, WebTechnologies
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Somasundaram, et al., "Benchmarking R packages for Calculation of Persistent Homology", The R Journal, 2021
BibTeX citation
@article{RJ-2021-033, author = {Somasundaram, Eashwar V. and Brown, Shael E. and Litzler, Adam and Scott, Jacob G. and Wadhwa, Raoul R.}, title = {Benchmarking R packages for Calculation of Persistent Homology}, journal = {The R Journal}, year = {2021}, note = {https://rjournal.github.io/}, volume = {13}, issue = {1}, issn = {2073-4859}, pages = {184-193} }