Generating correlated data for omics simulation
Jianing Yang,
Gregory R Grant and
Thomas G Brooks
PLOS Computational Biology, 2025, vol. 21, issue 9, 1-16
Abstract:
Simulation of realistic omics data is a key input for benchmarking studies that help users obtain optimal computational pipelines. Omics data involves large numbers of measured features on each sample and these measures are generally correlated with each other. However, simulation too often ignores these correlations, perhaps due to the computational and statistical hurdles of including them. To alleviate this, we describe three approaches for generating omics-scale data with correlated measures which mimic real datasets. These approaches are all based on a Gaussian copula with a covariance matrix that decomposes into a diagonal part and a low-rank part. This decomposition allows for extremely efficient simulation, overcoming a hurdle for adoption of past methods. We use these approaches to demonstrate the importance of including correlation in two benchmarking applications. First, we show that variance of results from the popular DESeq2 method increases when dependence is included. Second, we demonstrate that CYCLOPS, a method for inferring circadian time of collection from transcriptomics, improves in performance when given gene-gene dependencies in some circumstances. We provide an R package, dependentsimr, that has efficient implementations of these methods and can generate dependent data with arbitrary marginal distributions, including discrete (binary, ordered categorical, Poisson, negative binomial), continuous (normal), or with an empirical distribution.
Author summary: Modern techniques, including high-throughput sequencing, produce more data than ever before. To determine the optimal computational analysis methods for these data, benchmarks are often performed using simulated data. This simulated data needs to closely match realistic data in order for benchmarking to be meaningful. An often neglected aspect of this is that measurements of different values are often correlated or dependent upon each other. Two possible reasons for this neglect are a lack of guidelines on how to produce such data and the computational expense of the methods that produce it. We describe here three related methods that are both conceptually relatively simple and highly computationally efficient. We demonstrate these methods on two applications which show how inclusion of these dependencies can affect the results of benchmarking. Lastly, we provide a software package to act as a reference implementation of these methods.
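Illustration (not part of the original record): the abstract's core idea, a Gaussian copula whose covariance is a diagonal part plus a low-rank part, can be sketched in a few lines. The sketch below is in Python with NumPy and is not the dependentsimr API; the function name `simulate_copula`, its parameters, and the toy inputs are all hypothetical. It assumes the diagonal part `diag` and the low-rank factor matrix `factors` are already given, and does not reproduce any of the paper's three approaches for obtaining them from real data. Marginals are taken here as empirical quantiles of a reference dataset.

```python
import math
import numpy as np


def simulate_copula(n_samples, diag, factors, reference, rng):
    """Hypothetical sketch: sample with copula covariance D + U U^T.

    diag      -- length-p vector, the diagonal part D (assumed given)
    factors   -- p x k matrix U, the low-rank part (assumed given)
    reference -- m x p array whose columns define empirical marginals
    """
    p, k = factors.shape
    # Latent Gaussian z = sqrt(D) * e + U w needs only p + k normal draws
    # per sample and never forms the full p x p covariance matrix.
    e = rng.standard_normal((n_samples, p))
    w = rng.standard_normal((n_samples, k))
    z = e * np.sqrt(diag) + w @ factors.T
    # Rescale each feature to unit variance so the normal CDF gives uniforms.
    z /= np.sqrt(diag + np.sum(factors**2, axis=1))
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    # Push the uniforms through each feature's empirical quantile function.
    cols = [np.quantile(reference[:, j], u[:, j]) for j in range(p)]
    return np.column_stack(cols)


# Toy inputs (made up for illustration): features 0 and 1 share one
# latent factor, feature 2 is independent of both.
rng = np.random.default_rng(0)
reference = rng.exponential(size=(200, 3))
diag = np.array([0.2, 0.2, 1.0])
factors = np.array([[0.9], [0.9], [0.0]])
sim = simulate_copula(20000, diag, factors, reference, rng)
corr = np.corrcoef(sim, rowvar=False)
```

With p features and rank k, each simulated sample costs on the order of p times k operations rather than the p squared needed with a dense covariance factor, which is one plausible reading of why the decomposition makes omics-scale simulation efficient.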
Date: 2025
Downloads:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1013392 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 13392&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1013392
DOI: 10.1371/journal.pcbi.1013392
More articles in PLOS Computational Biology from Public Library of Science