C443: An R package to See a Forest for the Trees

Classification trees, well known for their ease of interpretation, are a widely used tool for solving statistical learning problems. However, researchers often end up with a forest rather than a single classification tree, which comes at a major cost: the transparency of the individual trees is lost. An important challenge, therefore, is to enjoy the benefits of forests without paying this cost. In this paper, we propose the R package C443. The C443 methodology simplifies a forest into one or a few condensed summary trees, to gain insight into its central tendency and heterogeneity. This is done by clustering the trees in the forest based on similarities between them, and by post-processing the clustering output. We elaborate on the implementation of the methodology in the package, and illustrate its use with three examples.

Aniek Sies (KU Leuven, Faculty of Psychology and Educational Sciences), Iven Van Mechelen (KU Leuven, Faculty of Psychology and Educational Sciences), Kristof Meers (KU Leuven, Faculty of Psychology and Educational Sciences)
2023-12-18
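To make the idea in the abstract concrete, the following sketch shows the general shape of such a pipeline in plain R. It does not use the C443 package's own interface; all object names are illustrative assumptions. It grows a small bagged ensemble of rpart trees, computes a prediction-based similarity between every pair of trees, and clusters the trees around medoids with cluster::pam, so that each cluster medoid acts as a condensed summary tree.

library(rpart)    # CART classification trees
library(cluster)  # pam() for k-medoids clustering

set.seed(1)
ntrees <- 25
# Grow a small bagged ensemble: one tree per bootstrap sample of iris.
trees <- lapply(seq_len(ntrees), function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  rpart(Species ~ ., data = boot, method = "class")
})

# Prediction-based similarity between two trees: the proportion of
# observations to which both trees assign the same class.
preds <- sapply(trees, function(t)
  as.character(predict(t, iris, type = "class")))
sim <- outer(seq_len(ntrees), seq_len(ntrees),
             Vectorize(function(i, j) mean(preds[, i] == preds[, j])))

# Cluster the trees on the dissimilarity 1 - similarity; the medoid of
# each cluster serves as a condensed summary tree for that cluster.
fit <- pam(as.dist(1 - sim), k = 2)
summary_trees <- trees[fit$id.med]

The package itself automates this kind of pipeline, and, as the abstract notes, offers a choice of tree similarity measures and tools for post-processing the clustering output; the sketch above is only meant to convey the underlying idea.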

Supplementary materials

Supplementary materials are available in addition to this article. They can be downloaded at RJ-2023-062.zip.


Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Sies, et al., "C443: An R package to See a Forest for the Trees", The R Journal, 2023

BibTeX citation

@article{RJ-2023-062,
  author = {Sies, Aniek and Van Mechelen, Iven and Meers, Kristof},
  title = {C443: An R package to See a Forest for the Trees},
  journal = {The R Journal},
  year = {2023},
  note = {https://doi.org/10.32614/RJ-2023-062},
  doi = {10.32614/RJ-2023-062},
  volume = {15},
  number = {3},
  issn = {2073-4859},
  pages = {59-78}
}