A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC

Jason Bennett, Mikhail Pomaznoy, Akul Singhania and Bjoern Peters

PLOS Computational Biology, 2021, vol. 17, issue 10, 1-18

Abstract: Recent technological advances have made the gathering of comprehensive gene expression datasets a commodity. This has shifted the limiting step of transcriptomic studies from the accumulation of data to their analysis and interpretation. The main problem in analyzing transcriptomics data is that the number of independent samples is typically much lower than the number of genes measured (>14,000). To address this, it would be desirable to reduce the gathered data's dimensionality without losing information. Clustering genes into discrete modules is one of the most commonly used tools to accomplish this task. While there are multiple clustering approaches, there is a lack of informative metrics available to evaluate the resultant clusters' biological quality. Here we present a metric that incorporates known ground truth gene sets to quantify the biological quality of gene clusters derived from standard clustering techniques. The GECO (Ground truth Evaluation of Clustering Outcomes) metric demonstrates that quantitative and repeatable scoring of gene clusters is not only possible but computationally lightweight and robust. Unlike current methods, it allows direct comparison between gene clusters generated by different clustering techniques. It also reveals that current cluster analysis techniques often underestimate the number of clusters that should be formed from a dataset, which leads to fewer clusters of lower quality. As a test case, we applied GECO combined with k-means clustering to derive an optimal set of co-expressed gene modules from PBMC, which we show to be superior to modules previously generated from whole blood.
Overall, GECO provides a rational metric to test and compare different clustering approaches to analyze high-dimensional transcriptomic data.

Author summary: Next-generation sequencing has spurred the creation of many techniques that attempt to distill large datasets down to a more manageable size without losing valuable information, more simply referred to as dimensionality reduction. We have sought to contribute to this effort by focusing not directly on dimensionality reduction but on interpreting the results of the most common technique used for dimensionality reduction of sequencing data: gene clustering. While methods to generate gene clusters have been well explored, the evaluation of cluster quality has not, i.e., answering the question "Have we made biologically significant clusters?" We have developed a metric that can be used to answer this question. Our metric incorporates prior biological knowledge about the data to determine whether the clustering process was optimal, by looking at how genes are grouped in gene clusters and determining whether the groupings make sense biologically. Our metric can also be used to provide a discrete range of values that indicate how to generate clusters with the highest potential biological information content. This metric can be utilized by any -omics level study to generate study-specific gene clusters while reducing the time spent validating gene clusters and improving confidence in the resultant clusters.
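
The abstract does not state the GECO formula, so the sketch below is not GECO itself. It only illustrates the general idea of scoring a gene clustering against known ground truth gene sets, using a standard pair-counting Jaccard index as a stand-in. The function name `pair_jaccard`, the gene labels `g1` to `g6`, and the sets `setA` and `setB` are hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def pair_jaccard(cluster_of, ground_truth_sets):
    """Stand-in score (NOT the GECO metric): Jaccard index over gene pairs.

    cluster_of: dict mapping gene label -> cluster id.
    ground_truth_sets: dict mapping set name -> set of gene labels.
    A pair counts as "true" if both genes share at least one ground truth set,
    and as "clustered" if both genes were assigned to the same cluster.
    """
    genes = sorted(cluster_of)

    # Pairs of clustered genes that co-occur in some ground truth set.
    truth_pairs = set()
    for members in ground_truth_sets.values():
        present = sorted(g for g in members if g in cluster_of)
        truth_pairs.update(combinations(present, 2))

    # Pairs of genes that the clustering placed together.
    cluster_pairs = {
        (a, b) for a, b in combinations(genes, 2)
        if cluster_of[a] == cluster_of[b]
    }

    union = truth_pairs | cluster_pairs
    return len(truth_pairs & cluster_pairs) / len(union) if union else 0.0


# Hypothetical toy data: two ground truth sets and one unannotated gene.
truth = {"setA": {"g1", "g2", "g3"}, "setB": {"g4", "g5"}}

matching = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1, "g6": 2}
merged = {g: 0 for g in matching}                  # too few clusters
split = {g: i for i, g in enumerate(matching)}     # every gene alone

for name, clustering in [("matching", matching), ("merged", merged), ("split", split)]:
    print(name, round(pair_jaccard(clustering, truth), 3))
```

In this toy case the clustering that mirrors the ground truth sets scores highest, while merging everything into one cluster or splitting every gene apart scores lower. A score with this property allows clusterings produced by different methods, or with different numbers of clusters, to be compared on the same scale.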

Date: 2021

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009459 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 09459&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1009459

DOI: 10.1371/journal.pcbi.1009459

More articles in PLOS Computational Biology from Public Library of Science

 
Handle: RePEc:plo:pcbi00:1009459