Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data

Aboubacry Gaye, Abdou Ka Diongue, Seydou Nourou Sylla, Maryam Diarra, Amadou Diallo, Cheikh Talla and Cheikh Loucoubar
Additional contact information
Aboubacry Gaye: Laboratory for Studies and Research in Statistics and Development, Gaston Berger University of Saint Louis
Seydou Nourou Sylla: Information and Communication Technologies for Development, Alioune Diop University of Bambey
Maryam Diarra: Epidemiology, Clinical Research and Data Science Unit, Institut Pasteur de Dakar
Amadou Diallo: Epidemiology, Clinical Research and Data Science Unit, Institut Pasteur de Dakar
Cheikh Talla: Epidemiology, Clinical Research and Data Science Unit, Institut Pasteur de Dakar
Cheikh Loucoubar: Epidemiology, Clinical Research and Data Science Unit, Institut Pasteur de Dakar

Journal of Classification, 2024, vol. 41, issue 1, No 8, 158-169

Abstract: This work addresses supervised classification of high-dimensional, highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling with an extension of principal component analysis (PCA) that incorporates conditional class moment estimates in the low-dimensional projection. Block partitioning handles the high correlation in the data by grouping variables into blocks so that correlation within a block is maximized and correlation between variables in different blocks is minimized. The extended PCA then performs supervised low-dimensional projection and clustering. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), the method shows good clustering and prediction performance. SNPs are a type of genetic variation that represents a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and can act in isolation or simultaneously to affect a phenotype. In this regard, studying the contribution of genetics to infectious disease phenotypes is crucial. The classical statistical models currently used in genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest in complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. Then, we use supervised approaches, in particular the approach that extends PCA by incorporating conditional class moment estimates in the low-dimensional projection, to identify the SNPs that determine malaria episodes. Experimental results obtained on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach achieves high accuracy in predicting malaria episodes.
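
The general shape of the pipeline described in the abstract (group correlated variables into blocks, reduce each block to a low-dimensional summary, then classify) can be illustrated with a minimal sketch. This is not the authors' method: the interval-graph block partition is replaced here by a simple threshold on the correlation between adjacent columns, the extended PCA is replaced by an ordinary first principal component per block, and the data are synthetic. The function names adjacent_blocks, block_scores and fit_lda, the threshold of 0.5, and the equal-prior assumption in the discriminant are all hypothetical choices made for illustration.

```python
import numpy as np

def adjacent_blocks(X, threshold=0.5):
    """Split columns into contiguous blocks (simplified stand-in for an
    LD block partition): a new block starts whenever the absolute Pearson
    correlation between neighbouring columns falls below `threshold`.
    Assumes no column is constant (non-zero standard deviation)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Correlation between column k and column k + 1, for every k.
    adj_corr = (Z[:, :-1] * Z[:, 1:]).mean(axis=0)
    blocks, current = [], [0]
    for j, r in enumerate(adj_corr, start=1):
        if abs(r) >= threshold:
            current.append(j)
        else:
            blocks.append(current)
            current = [j]
    blocks.append(current)
    return blocks

def block_scores(X, blocks):
    """Summarise each block by its first principal component score
    (plain PCA, not the supervised extension named in the abstract)."""
    feats = []
    for cols in blocks:
        Zb = X[:, cols] - X[:, cols].mean(axis=0)
        # First right singular vector = leading principal direction.
        _, _, vt = np.linalg.svd(Zb, full_matrices=False)
        feats.append(Zb @ vt[0])
    return np.column_stack(feats)

def fit_lda(F, y):
    """Two-class linear discriminant with a pooled covariance matrix.
    Returns (w, b); F @ w + b > 0 predicts class 1 (equal priors assumed)."""
    m0, m1 = F[y == 0].mean(axis=0), F[y == 1].mean(axis=0)
    R0, R1 = F[y == 0] - m0, F[y == 1] - m1
    S = (R0.T @ R0 + R1.T @ R1) / (len(y) - 2)
    S += 1e-6 * np.eye(S.shape[0])  # small ridge for numerical stability
    w = np.linalg.solve(S, m1 - m0)
    b = -0.5 * w @ (m0 + m1)
    return w, b

# Synthetic data: 3 latent factors, each copied into 4 noisy columns,
# so the 12 columns form 3 highly correlated blocks. Only the first
# factor differs between the two classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
latent = rng.normal(size=(200, 3))
latent[:, 0] += 3.0 * y
X = np.repeat(latent, 4, axis=1) + 0.3 * rng.normal(size=(200, 12))

blocks = adjacent_blocks(X)
F = block_scores(X, blocks)
w, b = fit_lda(F, y)
pred = (F @ w + b > 0).astype(int)
acc = (pred == y).mean()  # training accuracy on the synthetic data
print(blocks, round(float(acc), 3))
```

The sketch computes only neighbouring-column correlations rather than a full correlation matrix, since a full matrix would be impractical at the scale of hundreds of thousands of SNPs; whether the paper's partition works this way is not stated in the abstract.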

Keywords: Supervised dimension reduction; Correlation blocks; High-dimensional supervised classification; Genomic data
Date: 2024
Downloads: (external link)
http://link.springer.com/10.1007/s00357-024-09463-5 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:jclass:v:41:y:2024:i:1:d:10.1007_s00357-024-09463-5

Ordering information: This journal article can be ordered from
http://www.springer. ... hods/journal/357/PS2

DOI: 10.1007/s00357-024-09463-5

Journal of Classification is currently edited by Douglas Steinley

More articles in Journal of Classification from Springer, The Classification Society
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-04-06
Handle: RePEc:spr:jclass:v:41:y:2024:i:1:d:10.1007_s00357-024-09463-5