A hybrid feature extraction framework combining PCA and mutual information for gene expression based lung cancer classification
Syed Naseer Ahmad Shah,
Kaartik Issar and
Rafat Parveen
PLOS ONE, 2026, vol. 21, issue 2, 1-28
Abstract:
Lung cancer remains a leading cause of cancer-related mortality worldwide, with early and accurate diagnosis posing a critical challenge for improving patient outcomes. Gene expression data provide crucial insights for lung cancer classification by revealing underlying biological mechanisms. However, the high dimensionality of such data presents challenges, including computational complexity and overfitting risks. This study proposes a hybrid feature extraction framework combining Principal Component Analysis (PCA) and Mutual Information (MI) to address these issues. PCA reduces dimensionality by capturing key variance patterns, while MI selects features highly relevant to the target class, ensuring an informative and concise feature set. Gene expression datasets from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) were integrated, focusing on common genes. The hybrid PCA-MI framework was applied to rank genes, and the selected features were used to train a Convolutional Neural Network (CNN) for lung cancer classification. The genes ranked by the hybrid model were further analysed using protein-protein interaction (PPI) networks to identify hub genes, enhancing biological interpretability. The proposed framework was benchmarked against ten other feature extraction methods, including Lasso, Random Forest, Autoencoder, and PCA alone. The CNN classifier achieved superior performance with the PCA-MI features, attaining 98% accuracy and 98% precision. Training and validation curves demonstrated stable learning behaviour, and confusion matrix analysis confirmed robust predictions. Hub gene identification through PPI analysis validated the biological significance of the ranked genes. This study presents a robust framework for lung cancer classification by leveraging the strengths of PCA and MI, integrating deep learning and PPI analysis to address high-dimensional data challenges, and setting a foundation for future research in multi-omics data integration and enhanced diagnostic strategies.
Date: 2026
References: Add references at CitEc
Citations:
Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0342160 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 42160&type=printable (application/pdf)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0342160
DOI: 10.1371/journal.pone.0342160
Access Statistics for this article
More articles in PLOS ONE from Public Library of Science
Bibliographic data for series maintained by plosone ().