Comparing the performance of linear and nonlinear principal components in the context of high-dimensional genomic data integration

Shofiqul, Islam; Sonia, Anand; Jemila, Hamid; Lehana, Thabane; Joseph, Beyene

Islam Shofiqul, Anand Sonia, Hamid Jemila, Thabane Lehana and Beyene Joseph
Additional contact information
Islam Shofiqul: Population Health Research Institute, McMaster University and Hamilton Health Sciences, Hamilton, Ontario, Canada
Anand Sonia: Population Health Research Institute, McMaster University and Hamilton Health Sciences, Hamilton, Ontario, Canada
Hamid Jemila: Department of Medicine, McMaster University, 1280 Main Street West, Hamilton, Ontario L8S 4K1, Canada
Thabane Lehana: Population Health Research Institute, McMaster University and Hamilton Health Sciences, Hamilton, Ontario, Canada
Beyene Joseph: Department of Medicine, McMaster University, 1280 Main Street West, Hamilton, Ontario L8S 4K1, Canada

Statistical Applications in Genetics and Molecular Biology, 2017, vol. 16, issue 3, 199-216

Abstract: Linear principal component analysis (PCA) is a widely used approach to reduce the dimension of gene or miRNA expression data sets. This method relies on a linearity assumption, which often fails to capture the patterns and relationships inherent in the data. A nonlinear approach such as kernel PCA might therefore be preferable. We develop a copula-based simulation algorithm that takes into account the degree of dependence and nonlinearity observed in these data sets. Using this algorithm, we conduct an extensive simulation to compare the performance of linear and kernel PCA for data integration and death classification. We also compare these methods using a real data set with gene and miRNA expression of lung cancer patients. In this setting, the first few kernel principal components perform poorly compared with the linear principal components. Reducing dimensions using linear PCA and classifying with a logistic regression model seems to be adequate for this purpose. Integrating information from multiple data sets using either of these two approaches leads to improved classification accuracy for the outcome.
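
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not specify. It assumes a Gaussian copula with an exchangeable correlation, gamma marginals, a made-up outcome model, five retained components, and an RBF kernel. All function names and parameter values (simulate_gamma_copula, auc_for, rho, shape, scale) are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def simulate_gamma_copula(n, p, rho, shape, scale, rng):
    """Draw n samples of p gamma-distributed features whose dependence
    comes from a Gaussian copula with exchangeable correlation rho
    (an assumed dependence structure, not taken from the paper)."""
    corr = np.full((p, p), rho)
    np.fill_diagonal(corr, 1.0)
    z = rng.multivariate_normal(np.zeros(p), corr, size=n)
    u = stats.norm.cdf(z)  # uniform marginals, dependence preserved
    return stats.gamma.ppf(u, a=shape, scale=scale)  # gamma marginals


def auc_for(reducer, X, y):
    """Hold-out AUC of: standardize -> reduce dimension -> logistic regression."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )
    model = make_pipeline(
        StandardScaler(), reducer, LogisticRegression(max_iter=1000)
    )
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])


rng = np.random.default_rng(0)
n = 300
genes = simulate_gamma_copula(n, 40, rho=0.4, shape=2.0, scale=1.0, rng=rng)
mirna = simulate_gamma_copula(n, 20, rho=0.3, shape=2.0, scale=1.0, rng=rng)

# Hypothetical outcome model: the binary outcome depends nonlinearly (via logs)
# on a few features from each block.
signal = np.log(genes[:, :5]).sum(axis=1) + np.log(mirna[:, :3]).sum(axis=1)
prob = 1.0 / (1.0 + np.exp(-(signal - signal.mean())))
y = rng.binomial(1, prob)

# "Integration" here is simple column-wise concatenation of the two blocks.
integrated = np.hstack([genes, mirna])

results = {
    "linear PCA, genes only": auc_for(PCA(n_components=5), genes, y),
    "linear PCA, integrated": auc_for(PCA(n_components=5), integrated, y),
    "kernel PCA, genes only": auc_for(
        KernelPCA(n_components=5, kernel="rbf"), genes, y
    ),
    "kernel PCA, integrated": auc_for(
        KernelPCA(n_components=5, kernel="rbf"), integrated, y
    ),
}

for name, auc in results.items():
    print(f"{name}: AUC = {auc:.3f}")
```

The AUC values from a single simulated data set depend heavily on the assumed outcome model and kernel settings, so they should not be read as reproducing the paper's findings. A fuller comparison would repeat the simulation many times and vary the dependence and nonlinearity parameters.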

Keywords: AUC; Copula; Gamma distribution; Kernel PCA; principal component
Date: 2017

Persistent link: https://EconPapers.repec.org/RePEc:bpj:sagmbi:v:16:y:2017:i:3:p:199-216:n:3

DOI: 10.1515/sagmb-2016-0066

Statistical Applications in Genetics and Molecular Biology is currently edited by Michael P. H. Stumpf