The ability to classify patients based on gene-expression data varies by algorithm and performance metric
Stephen R Piccolo,
Avery Mecham,
Nathan P Golightly,
Jérémie L Johnson and
Dustin B Miller
PLOS Computational Biology, 2022, vol. 18, issue 3, 1-34
Abstract:
By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist—and most support diverse hyperparameters—so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 52 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection using nested cross validation. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms typically outperformed more sophisticated methods. 
Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.
Author summary:
When a patient is treated in a medical setting, a clinician may extract a tissue sample and use transcriptome-profiling technologies to quantify the extent to which thousands of genes are expressed in the sample. These measurements reflect biological activity that may influence disease development, progression, and/or treatment responses. Patterns that differ between patients in distinct groups (for example, patients who do or do not have a disease or do or do not respond to a treatment) may be used to classify future patients into these groups. This study is a large-scale benchmark comparison of algorithms that can be used to perform such classifications. Additionally, we evaluated feature-selection algorithms, which can be used to identify which variables (genes and/or patient characteristics) are most relevant for classification. Through a series of analyses that build on each other, we show that classification performance varies considerably, depending on which algorithms are used, whether feature selection is used, which settings are used when executing the algorithms, and which metrics are used to evaluate the algorithms’ performance. Researchers can use these findings as a resource for deciding which algorithms and settings to prioritize when deriving transcriptome-based biomarkers in future efforts.
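The abstract mentions hyperparameter optimization and feature selection under nested cross validation, scored with more than one metric. The following is a minimal sketch of that general setup, not the authors' pipeline. It assumes scikit-learn as the library, uses synthetic data in place of a real gene-expression dataset, and picks an ANOVA F-test filter (a univariate feature-selection method) and an RBF-kernel support vector machine (a kernel-based classifier) purely for illustration. The grid values, fold counts, and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix (assumption, not real data):
# 120 "patients" x 500 "genes", with a binary class variable.
X, y = make_classification(
    n_samples=120, n_features=500, n_informative=15, random_state=0
)

# Scaling, univariate feature selection, and the classifier live in one
# pipeline so that each step is refit on the training folds only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", SVC(kernel="rbf")),
])

# Illustrative grid (assumption): number of retained features and SVM cost.
param_grid = {"select__k": [10, 50, 100], "clf__C": [0.1, 1.0, 10.0]}

# Inner folds choose hyperparameters; outer folds estimate performance.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=inner_cv)

# Report two metrics on the outer folds, since rankings can depend on the metric.
results = cross_validate(
    search, X, y, cv=outer_cv, scoring=["roc_auc", "accuracy"]
)
mean_auroc = results["test_roc_auc"].mean()
mean_accuracy = results["test_accuracy"].mean()
print(f"Mean outer-fold AUROC:    {mean_auroc:.3f}")
print(f"Mean outer-fold accuracy: {mean_accuracy:.3f}")
```

Because the search object is itself passed to the outer loop, the features and hyperparameters are chosen without ever seeing the outer test fold, which is what keeps the outer estimate from being optimistically biased.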
Date: 2022
Downloads:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009926 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 09926&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1009926
DOI: 10.1371/journal.pcbi.1009926
More articles in PLOS Computational Biology from Public Library of Science