Comparative Analysis of Machine Learning Techniques for Imbalanced Genetic Data
Arshmeet Kaur and
Morteza Sarmadi
Additional contact information
Arshmeet Kaur: Evergreen Valley College
Morteza Sarmadi: Gilead Sciences
Annals of Data Science, 2025, vol. 12, issue 5, No 5, 1553-1575
Abstract:
Advancements in genome sequencing technologies have significantly increased the availability of genomic data. The use of machine learning models to predict the pathogenicity or clinical significance of genetic mutations is crucial. However, genetic datasets often feature imbalanced target variables and high-cardinality, skewed predictor variables, and these attributes complicate machine learning modeling. This study addresses these challenges in both regression and classification tasks. We systematically explored the impact of various data preprocessing techniques, feature selection methods, and model choices on the performance of machine learning models trained on imbalanced genetic data, and we evaluated performance using fivefold cross-validation. Our key findings demonstrate that the regression models are robust to outliers and skew in predictor and target variables. Similarly, in classification tasks, class-imbalanced target variables and skewed predictors minimally impact model performance. Among the models tested, random forest was the most effective for both imbalanced regression and classification tasks. Our key contributions are as follows. First, we address a significant research gap by focusing on imbalanced regression, a problem that is sparsely explored compared to class-imbalanced classification. Second, we identify the techniques that improve prediction performance and provide practical insights into handling genetic data. Third, we provide a foundation for future research to further optimize machine learning approaches in genomics. This study uses a genetic dataset as a case study, but our findings are applicable to imbalanced data in other fields.
Keywords: Machine learning; Genetic mutations; CADD_PHRED; SIFT; PolyPhen
Date: 2025
Downloads:
http://link.springer.com/10.1007/s40745-024-00575-8 Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:aodasc:v:12:y:2025:i:5:d:10.1007_s40745-024-00575-8
Ordering information: This journal article can be ordered from
https://www.springer ... gement/journal/40745
DOI: 10.1007/s40745-024-00575-8
Annals of Data Science is currently edited by Yong Shi
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.