Identification of representative trees in random forests based on a new tree-based distance measure
Björn-Hergen Laabs (),
Ana Westenberger and
Inke R. König
Additional contact information
Björn-Hergen Laabs: University of Lübeck
Ana Westenberger: University of Lübeck
Inke R. König: University of Lübeck
Advances in Data Analysis and Classification, 2024, vol. 18, issue 2, No 6, 363-380
Abstract:
Abstract In life sciences, random forests are often used to train predictive models. However, gaining any explanatory insight into the mechanics leading to a specific outcome is rather complex, which impedes the implementation of random forests into clinical practice. By simplifying a complex ensemble of decision trees to a single most representative tree, it is assumed to be possible to observe common tree structures, the importance of specific features and variable interactions. Thus, representative trees could also help to understand interactions between genetic variants. Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. Thus, we developed a new tree-based distance measure, which incorporates more of the underlying tree structure than other metrics. We compared our new method with the existing metrics in an extensive simulation study and applied it to predict the age at onset based on a set of genetic risk factors in a clinical data set. In our simulation study we were able to show the advantages of our weighted splitting variable approach. Our real data application revealed that representative trees are not only able to replicate the results from a recent genome-wide association study, but also can give additional explanations of the genetic mechanisms. Finally, we implemented all compared distance measures in R and made them publicly available in the R package timbR ( https://github.com/imbs-hl/timbR ).
Keywords: Random forest; Representative trees; Tree distance; Interpretability; 68T20; 68T30; 62-08 (search for similar items in EconPapers)
Date: 2024
References: View references in EconPapers View complete reference list from CitEc
Citations:
Downloads: (external link)
http://link.springer.com/10.1007/s11634-023-00537-7 Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:spr:advdac:v:18:y:2024:i:2:d:10.1007_s11634-023-00537-7
Ordering information: This journal article can be ordered from
http://www.springer. ... ds/journal/11634/PS2
DOI: 10.1007/s11634-023-00537-7
Access Statistics for this article
Advances in Data Analysis and Classification is currently edited by H.-H. Bock, W. Gaul, A. Okada, M. Vichi and C. Weihs
More articles in Advances in Data Analysis and Classification from Springer, German Classification Society - Gesellschaft für Klassifikation (GfKl), Japanese Classification Society (JCS), Classification and Data Analysis Group of the Italian Statistical Society (CLADAG), International Federation of Classification Societies (IFCS)
Bibliographic data for series maintained by Sonal Shukla () and Springer Nature Abstracting and Indexing ().