Cancer Diagnosis by Gene-Environment Interactions via Combination of SMOTE-Tomek and Overlapped Group Screening Approaches with Application to Imbalanced TCGA Clinical and Genomic Data
Jie-Huei Wang,
Cheng-Yu Liu,
You-Ruei Min,
Zih-Han Wu and
Po-Lin Hou
Additional contact information
Jie-Huei Wang: Department of Mathematics, National Chung Cheng University, Chiayi 62102, Taiwan
Cheng-Yu Liu: Department of Mathematics, National Chung Cheng University, Chiayi 62102, Taiwan
You-Ruei Min: Department of Statistics, Feng Chia University, Taichung 40724, Taiwan
Zih-Han Wu: Department of Mathematics, National Chung Cheng University, Chiayi 62102, Taiwan
Po-Lin Hou: Department of Mathematics, National Chung Cheng University, Chiayi 62102, Taiwan
Mathematics, 2024, vol. 12, issue 14, 1-24
Abstract:
Cancer development involves intricate interactions among multiple biomarkers, such as gene-environment interactions. Using microarray gene expression profiles for cancer classification is expected to be effective and has drawn considerable interest in bioinformatics and computational biology. Because of the characteristics of genomic data, high-dimensional interactions and noise interference arise during analysis. When building cancer diagnosis models, we also often face model-fitting errors caused by class imbalance in the data. To mitigate these issues, we first apply the SMOTE-Tomek procedure to correct the imbalance. We then use the overlapping group screening method together with a binary logistic regression model to integrate gene pathway information, which facilitates the identification of significant biomarkers associated with imbalanced clinical outcomes (cancer or normal). Simulation studies across different imbalance rates and gene structures validate the effectiveness of our proposed method, which surpasses common machine learning techniques in classification accuracy. We also demonstrate that, across various imbalance rates, prediction performance is better with SMOTE-Tomek treatment than with no imbalance treatment or with SMOTE treatment alone. In the real-world application, we integrate clinical and gene expression data with prior pathway information. We apply SMOTE-Tomek and our proposed method to identify critical biomarkers and gene-environment interactions linked to the imbalanced binary outcomes (cancer or normal) of patients in The Cancer Genome Atlas (TCGA) datasets for lung adenocarcinoma and breast invasive carcinoma. Our proposed method consistently achieves satisfactory classification accuracy. In addition, we identify biomarkers indicative of gene-environment interactions relevant to cancer and provide the corresponding odds ratio estimates. Finally, for high-dimensional imbalanced data, we recommend paying attention to the order in which balancing and feature screening are performed in order to achieve good prediction results.
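As an illustration of the balancing step named in the abstract, the following is a minimal sketch of the general SMOTE-Tomek idea using only the Python standard library. It is not the authors' implementation: the toy two-feature data, the function names (smote, remove_tomek_links, nearest), the neighbour count k=3, and the choice to drop both members of each Tomek link are all assumptions made for this example. The overlapping group screening and logistic regression steps are not shown.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(idx, X, candidates, k=1):
    """Indices of the k candidates closest to X[idx], excluding idx itself."""
    others = [j for j in candidates if j != idx]
    return sorted(others, key=lambda j: euclid(X[idx], X[j]))[:k]


def smote(X, y, minority=1, k=3, rng=None):
    """Add synthetic minority points until both classes have equal size."""
    rng = rng or random.Random(0)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(min_idx)) - len(min_idx)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        i = rng.choice(min_idx)                   # a random minority sample
        j = rng.choice(nearest(i, X, min_idx, k)) # one of its minority neighbours
        gap = rng.random()                        # position along the segment i -> j
        X_new.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_new.append(minority)
    return X_new, y_new


def remove_tomek_links(X, y):
    """Drop both members of each Tomek link: a pair of mutual nearest
    neighbours that carry different class labels."""
    all_idx = range(len(X))
    nn = [nearest(i, X, all_idx, 1)[0] for i in all_idx]
    drop = {i for i in all_idx if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in all_idx if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 40 "normal" (0) and 8 "cancer" (1) samples, 2 features.
rng = random.Random(42)
X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(40)]
X += [[rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)] for _ in range(8)]
y = [0] * 40 + [1] * 8

X_sm, y_sm = smote(X, y, minority=1, k=3, rng=rng)  # oversample the minority class
X_bal, y_bal = remove_tomek_links(X_sm, y_sm)       # clean the class boundary

print("before:", y.count(0), y.count(1))
print("after SMOTE:", y_sm.count(0), y_sm.count(1))
print("after Tomek removal:", y_bal.count(0), y_bal.count(1))
```

In this sketch the oversampling comes first and the Tomek-link cleaning second, so the cleaning also removes synthetic points that land next to the opposite class. A maintained implementation of the same combination is available as SMOTETomek in the imbalanced-learn package.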
Keywords: binary logistic regression; cancer diagnostic; gene-environment interaction; joint modeling; overlapping group screening; SMOTE-Tomek; TCGA
JEL-codes: C
Date: 2024
Citations: 1 (in EconPapers)
Downloads: (external link)
https://www.mdpi.com/2227-7390/12/14/2209/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/14/2209/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:14:p:2209-:d:1435252
Mathematics is currently edited by Ms. Emma He
Bibliographic data for series maintained by MDPI Indexing Manager.