Identifying patterns differing between high-dimensional datasets with generalized contrastive PCA
Eliezyer Fermino de Oliveira,
Pranjal Garg,
Jens Hjerling-Leffler,
Renata Batista-Brito and
Lucas Sjulson
PLOS Computational Biology, 2025, vol. 21, issue 2, 1-23
Abstract:
High-dimensional data have become ubiquitous in the biological sciences, and it is often desirable to compare two datasets collected under different experimental conditions to extract low-dimensional patterns enriched in one condition. However, traditional dimensionality reduction techniques cannot accomplish this because they operate on only one dataset. Contrastive principal component analysis (cPCA) has been proposed to address this problem, but it has seen little adoption because it requires tuning a hyperparameter resulting in multiple solutions, with no way of knowing which is correct. Moreover, cPCA uses foreground and background conditions that are treated differently, making it ill-suited to compare two experimental conditions symmetrically. Here we describe the development of generalized contrastive PCA (gcPCA), a flexible hyperparameter-free approach that solves these problems. We first provide analyses explaining why cPCA requires a hyperparameter and how gcPCA avoids this requirement. We then describe an open-source gcPCA toolbox containing Python and MATLAB implementations of several variants of gcPCA tailored for different scenarios. Finally, we demonstrate the utility of gcPCA in analyzing diverse high-dimensional biological data, revealing unsupervised detection of hippocampal replay in neurophysiological recordings and heterogeneity of type II diabetes in single-cell RNA sequencing data. As a fast, robust, and easy-to-use comparison method, gcPCA provides a valuable resource facilitating the analysis of diverse high-dimensional datasets to gain new insights into complex biological phenomena.
Author summary:
Technological advances in the biological sciences have led to the proliferation of large, complex datasets for which analysis is challenging. Analyses for these datasets rely heavily on dimensionality reduction techniques, which extract reduced-complexity representations of the data that are easier to analyze and interpret. However, these techniques typically operate on only one dataset, and many biological experiments involve comparing two datasets collected under different conditions. Contrastive principal components analysis (cPCA) was previously developed for this purpose, but it has limitations that have precluded its widespread adoption. Here we introduce generalized contrastive principal components analysis (gcPCA), a method that overcomes these limitations. We first explain the mathematical basis of gcPCA, then describe an open-source gcPCA toolbox with implementations in Python and MATLAB. Finally, we demonstrate the utility of gcPCA in analyzing diverse biological datasets, highlighting its versatility as a tool to compare experimental data collected under two different conditions.
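The contrast drawn above between a hyperparameter-dependent and a hyperparameter-free comparison can be illustrated with a minimal sketch. This is not the gcPCA toolbox API, and the record does not state the gcPCA objective. The first function uses the standard cPCA form (eigenvectors of C_fg - alpha * C_bg). The second assumes one plausible symmetric normalization, maximizing x'(C_a - C_b)x / x'(C_a + C_b)x, solved as a generalized eigenproblem. The function names, the normalization, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.linalg import eigh


def cpca_directions(fg, bg, alpha, n_components=2):
    """cPCA sketch: top eigenvectors of C_fg - alpha * C_bg.

    The result depends on alpha, which the user has to choose.
    """
    c_fg = np.cov(fg, rowvar=False)
    c_bg = np.cov(bg, rowvar=False)
    evals, evecs = np.linalg.eigh(c_fg - alpha * c_bg)
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], evecs[:, order]


def normalized_contrast_directions(a, b, n_components=2):
    """Assumed hyperparameter-free sketch (not the toolbox implementation).

    Solves (C_a - C_b) x = lambda (C_a + C_b) x, which maximizes the ratio
    x'(C_a - C_b)x / x'(C_a + C_b)x. Requires C_a + C_b to be positive
    definite (more samples than features, no degenerate dimensions).
    """
    c_a = np.cov(a, rowvar=False)
    c_b = np.cov(b, rowvar=False)
    evals, evecs = eigh(c_a - c_b, c_a + c_b)
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], evecs[:, order]


# Synthetic example (hypothetical data): feature 0 has extra variance in
# condition A, feature 1 has extra variance in condition B.
rng = np.random.default_rng(0)
cond_a = rng.normal(size=(2000, 5))
cond_b = rng.normal(size=(2000, 5))
cond_a[:, 0] *= 3.0
cond_b[:, 1] *= 3.0

cpca_vals, cpca_vecs = cpca_directions(cond_a, cond_b, alpha=1.0)
vals, vecs = normalized_contrast_directions(cond_a, cond_b)
```

Under this assumed normalization the eigenvalues are bounded in [-1, 1], so positive values mark directions enriched in condition A and negative values mark directions enriched in condition B. Swapping the two inputs only flips the sign, which treats the two conditions symmetrically, whereas the cPCA result changes with the choice of `alpha`.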
Date: 2025
Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012747 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 12747&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1012747
DOI: 10.1371/journal.pcbi.1012747
More articles in PLOS Computational Biology from Public Library of Science