Transcriptome diversity is a systematic source of variation in RNA-sequencing data

García-Nieto, Pablo E; Wang, Ban; Fraser, Hunter B

PLOS Computational Biology, 2022, vol. 18, issue 3, 1-20

Abstract: RNA sequencing has been widely used as an essential tool to probe gene expression. While standard practices have been established to analyze RNA-seq data, it is still challenging to interpret and remove artifactual signals. Several biological and technical factors such as sex, age, batches, and sequencing technology have been found to bias gene expression estimates. Probabilistic estimation of expression residuals (PEER), which infers broad variance components in gene expression measurements, has been used to account for some systematic effects, but it has remained challenging to interpret these PEER factors. Here we show that transcriptome diversity, a simple metric based on Shannon entropy, explains a large portion of variability in gene expression and is the strongest known factor encoded in PEER factors. We then show that transcriptome diversity has significant associations with multiple technical and biological variables across diverse organisms and datasets. In sum, transcriptome diversity provides a simple explanation for a major source of variation in both gene expression estimates and PEER covariates.

Author summary: Although the cells in every individual organism have nearly identical DNA sequences, they differ substantially in their function; for instance, neurons are very different from muscle cells. This is in large part because different genes are transcribed from DNA into RNA, a key step in the process known as gene expression. The measurement of RNA levels is an important tool in studying biology, but is complicated by many potentially confounding factors. To account for this, computational methods can correct for unknown confounders, but these do not provide any information about what these confounders are. Here we show that transcriptome diversity, a simple metric based on Shannon entropy, explains a large portion of variability in both gene expression measurements and the confounding factors detected by a leading method. This prevalent factor provides a simple explanation for a primary source of variation in gene expression estimates.
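
The abstract describes transcriptome diversity only as "a simple metric based on Shannon entropy". As an illustration, the sketch below computes the Shannon entropy of one sample's relative expression values. It assumes the input is a list of non-negative expression values for a single sample; the function name `transcriptome_diversity`, the use of base-2 logarithms, and the absence of any further normalization are assumptions made here for illustration, and the article's exact definition may differ.

```python
import math


def transcriptome_diversity(expression):
    """Shannon entropy (in bits) of one sample's expression profile.

    `expression` holds non-negative expression values, one per gene.
    Hypothetical helper: the article's exact normalization may differ.
    """
    total = sum(expression)
    if total <= 0:
        raise ValueError("expression values must sum to a positive number")
    entropy = 0.0
    for value in expression:
        if value > 0:  # genes with zero expression contribute nothing
            p = value / total  # fraction of the sample's total expression
            entropy -= p * math.log2(p)
    return entropy


# Expression spread evenly over four genes gives the maximum, log2(4) = 2 bits.
uniform = transcriptome_diversity([10, 10, 10, 10])

# Expression dominated by a single gene gives a lower diversity value.
skewed = transcriptome_diversity([97, 1, 1, 1])
```

Under this definition, a sample whose reads are concentrated in a few highly expressed genes scores lower than a sample with evenly distributed expression, which is the sense in which the metric summarizes a whole transcriptome as a single number.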

Date: 2022
Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009939 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 09939&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1009939

DOI: 10.1371/journal.pcbi.1009939
