Comparative Evaluation of Classifiers in the Presence of Statistical Interactions between Features in High Dimensional Data Settings
Guo Yu and Balasubramanian Raji
Additional contact information
Guo Yu: BG Medicine, Inc.
Balasubramanian Raji: University of Massachusetts – Amherst
The International Journal of Biostatistics, 2012, vol. 8, issue 1, 1-32
Abstract:
Background: A central challenge in high dimensional data settings in biomedical investigations involves the estimation of an optimal prediction algorithm to distinguish between different disease phenotypes. A significant complicating aspect in these analyses can be attributed to the presence of features that exhibit statistical interactions. Indeed, in several clinical investigations, such as genetic studies of complex diseases, it is of interest to specifically identify such features. In this paper, we compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in settings involving high dimensional datasets that include statistically interacting feature subsets. We evaluate the performance of these classifiers under conditions of varying sample size, signal-to-noise ratio and strength of statistical interactions among features. We summarize two datasets from studies in diabetes and cardiovascular disease involving gene expression, metabolomics and proteomics measurements and compare results obtained using the four classifiers.
Results: Simulation studies revealed that Prediction Analysis for Microarrays had the highest classification accuracy in the absence of noise and statistical interactions, and when feature distributions were multivariate Gaussian within each class. In the presence of statistical interactions, modest effect sizes and the absence of noise, Support Vector Machines achieved the best performance, followed closely by Random Forests. Random Forests was optimal in settings that included both significant levels of high dimensional noise features and statistical interactions between biomarker pairs. The data applications revealed similar trends in the relative performances of each classifier.
Conclusion: Random Forests had the highest classification accuracy among the four classifiers and was successful in incorporating interaction effects between features in the presence of noise in high dimensional datasets.
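To make the kind of comparison described in the abstract concrete, here is a minimal toy sketch in pure Python. It is not the authors' simulation design: the data-generating rule (class determined by the sign of the product of two Gaussian features), the sample sizes, the choice of k = 5, and all function names are assumptions made for illustration. A plain nearest-centroid rule stands in for centroid-based methods; it omits the centroid shrinkage that Prediction Analysis for Microarrays applies, and Random Forests and Support Vector Machines are left out to keep the example dependency-free.

```python
import random

def simulate(n, n_noise, rng):
    """Draw n samples: two interacting signal features plus n_noise pure-noise features."""
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(2 + n_noise)]
        # Assumed toy rule: the class depends only on the sign of x1 * x2,
        # a pure interaction, so neither feature is informative on its own.
        y.append(1 if row[0] * row[1] > 0 else 0)
        X.append(row)
    return X, y

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test point to the class whose training mean is closest (no shrinkage)."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return [min(centroids, key=lambda c: sq_dist(x, centroids[c])) for x in X_test]

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training points (k odd, binary labels 0/1)."""
    preds = []
    for x in X_test:
        nearest = sorted(range(len(X_train)), key=lambda i: sq_dist(x, X_train[i]))[:k]
        votes = sum(y_train[i] for i in nearest)
        preds.append(1 if 2 * votes > k else 0)
    return preds

def accuracy(y_true, y_pred):
    """Fraction of test labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def run_experiment(n_noise, seed=0, n_train=200, n_test=400):
    """Simulate one train/test split and return test accuracy per classifier."""
    rng = random.Random(seed)
    X_train, y_train = simulate(n_train, n_noise, rng)
    X_test, y_test = simulate(n_test, n_noise, rng)
    return {
        "nearest_centroid": accuracy(y_test, nearest_centroid_predict(X_train, y_train, X_test)),
        "knn": accuracy(y_test, knn_predict(X_train, y_train, X_test)),
    }

if __name__ == "__main__":
    for n_noise in (0, 50):
        print(n_noise, run_experiment(n_noise))
```

Under this assumed rule the two class means coincide, so the centroid rule should stay near chance, while K-Nearest Neighbors can recover the interaction when only the two signal features are present and loses accuracy once noise features dominate the distance. This illustrates only why interactions and noise affect classifiers differently; it does not reproduce the paper's findings.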
Date: 2012
Downloads: (external link)
https://doi.org/10.1515/1557-4679.1373 (text/html)
For access to full text, subscription to the journal or payment for the individual article is required.
Persistent link: https://EconPapers.repec.org/RePEc:bpj:ijbist:v:8:y:2012:i:1:n:17
Ordering information: This journal article can be ordered from
https://www.degruyter.com/journal/key/ijb/html
DOI: 10.1515/1557-4679.1373
The International Journal of Biostatistics is currently edited by Antoine Chambaz, Alan E. Hubbard and Mark J. van der Laan