The accuracy of absolute differential abundance analysis from relative count data

Roche, Kimberly E; Mukherjee, Sayan

PLOS Computational Biology, 2022, vol. 18, issue 7, 1-25

Abstract: Concerns have been raised about the use of relative abundance data derived from next-generation sequencing as a proxy for absolute abundances. For example, in the differential abundance setting, compositional effects in relative abundance data may give rise to spurious differences (false positives) when considered from the absolute perspective. In practice, however, relative abundances are often transformed by renormalization strategies intended to compensate for these effects, and the scope of the practical problem remains unclear. We use simulated data to explore the consistency of differential abundance calling on renormalized relative abundances versus absolute abundances and find that, while overall consistency is high, with a median sensitivity (true positive rate) of 0.91 and specificity (1 - false positive rate) of 0.89, consistency can be much lower where there is widespread change in the abundance of features across conditions. We confirm these findings on a large number of real data sets drawn from 16S metabarcoding, expression array, bulk RNA-seq, and single-cell RNA-seq experiments, where data sets with the greatest change between experimental conditions are also those with the highest false positive rates. Finally, we evaluate the predictive utility of summary features of relative abundance data themselves. Estimates of sparsity and the prevalence of feature-level change in relative abundance data give reasonable predictions of discrepancy in differential abundance calling in simulated data and can provide useful bounds for worst-case outcomes in real data.

Author summary: Molecular sequence counting is a near-ubiquitous method for taking “snapshots” of the state of biological systems at the molecular level and is applied to problems as diverse as profiling gene expression and characterizing bacterial community composition. However, concerns exist about the interpretation of these data, given that they are relative counts. In particular, some feature-level differences between samples may be technical, not biological, stemming from compositional effects. Here, we quantify the accuracy of estimates of sample-sample differences made from relative versus “absolute” molecular count data, using a comprehensive simulation strategy and published experimental data. We find the accuracy of difference estimation is high in at least 50% of simulated and real data sets but that low-accuracy outcomes are far from rare. Further, we observe similar numbers of these low-accuracy cases when using any of several popular methods for estimating differences in biological count data. Our results support the use of complementary reference measures of absolute abundance (like RNA spike-ins) for normalizing next-generation sequencing data. We briefly validate the use of these reference quantities and of stringent effect size thresholds as strategies for mitigating interpretational problems with relative count data.
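
Illustration: the compositional effect described in the abstract can be sketched with a toy calculation. The sketch below is not the simulation or any of the differential abundance methods evaluated in the paper. The four-feature counts, the helper names, and the 1.5-fold calling threshold are all assumptions chosen for illustration. It shows how a single large absolute change can make unchanged features look differential once counts are converted to proportions, and how sensitivity and specificity then score the relative calls against the absolute ones.

```python
def to_relative(counts):
    """Convert counts to proportions that sum to 1 (hypothetical helper)."""
    total = sum(counts)
    return [c / total for c in counts]


def call_changes(before, after, min_fold=1.5):
    """Flag a feature when its fold change reaches min_fold in either direction.

    A plain fold-change threshold is an illustrative stand-in for a real
    differential abundance test; the 1.5 cutoff is an assumption.
    """
    calls = []
    for b, a in zip(before, after):
        fold = a / b
        calls.append(fold >= min_fold or fold <= 1 / min_fold)
    return calls


def sensitivity_specificity(truth, calls):
    """Score calls against truth: (true positive rate, 1 - false positive rate)."""
    tp = sum(t and c for t, c in zip(truth, calls))
    fn = sum(t and not c for t, c in zip(truth, calls))
    tn = sum(not t and not c for t, c in zip(truth, calls))
    fp = sum(not t and c for t, c in zip(truth, calls))
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Made-up absolute abundances: only feature 0 changes between conditions.
absolute_a = [100, 100, 100, 100]
absolute_b = [1000, 100, 100, 100]

# Calls on absolute counts serve as the reference ("truth").
truth = call_changes(absolute_a, absolute_b)

# Calls on proportions: features 1-3 shrink from 0.25 to about 0.077
# purely because feature 0 grew, so they are flagged as well.
relative_calls = call_changes(to_relative(absolute_a), to_relative(absolute_b))

sens, spec = sensitivity_specificity(truth, relative_calls)
print(truth, relative_calls, sens, spec)
```

In this toy case the one real change is detected, but all three unchanged features are called differential on the relative scale, which is the kind of false positive the abstract attributes to compositional effects.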

Date: 2022

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010284 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 10284&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1010284

DOI: 10.1371/journal.pcbi.1010284
