EconPapers    
Economics at your fingertips  
 

Evaluating Statistical Methods Using Plasmode Data Sets in the Age of Massive Public Databases: An Illustration Using False Discovery Rates

Gary L Gadbury, Qinfang Xiang, Lin Yang, Stephen Barnes, Grier P Page and David B Allison

PLOS Genetics, 2008, vol. 4, issue 6, 1-8

Abstract: Plasmode is a term coined several years ago to describe data sets that are derived from real data but for which some truth is known. Omic techniques, most especially microarray and genomewide association studies, have catalyzed a new zeitgeist of data sharing that is making data and data sets publicly available on an unprecedented scale. Coupling such data resources with a science of plasmode use would allow statistical methodologists to vet proposed techniques empirically (as opposed to only theoretically) and with data that are by definition realistic and representative. We illustrate the technique of empirical statistics by consideration of a common task when analyzing high dimensional data: the simultaneous testing of hundreds or thousands of hypotheses to determine which, if any, show statistical significance warranting follow-on research. The now-common practice of multiple testing in high dimensional experiment (HDE) settings has generated new methods for detecting statistically significant results. Although such methods have heretofore been subject to comparative performance analysis using simulated data, simulating data that realistically reflect data from an actual HDE remains a challenge. We describe a simulation procedure using actual data from an HDE where some truth regarding parameters of interest is known. We use the procedure to compare estimates for the proportion of true null hypotheses, the false discovery rate (FDR), and a local version of FDR obtained from 15 different statistical methods.Author Summary: Plasmode is a term used to describe a data set that has been derived from real data but for which some truth is known. Statistical methods that analyze data from high dimensional experiments (HDEs) seek to estimate quantities that are of interest to scientists, such as mean differences in gene expression levels and false discovery rates. The ability of statistical methods to accurately estimate these quantities depends on theoretical derivations or computer simulations. In computer simulations, data for which the true value of a quantity is known are often simulated from statistical models, and the ability of a statistical method to estimate this quantity is evaluated on the simulated data. However, in HDEs there are many possible statistical models to use, and which models appropriately produce data that reflect properties of real data is an open question. We propose the use of plasmodes as one answer to this question. If done carefully, plasmodes can produce data that reflect reality while maintaining the benefits of simulated data. We show one method of generating plasmodes and illustrate their use by comparing the performance of 15 statistical methods for estimating the false discovery rate in data from an HDE.

Date: 2008
References: View references in EconPapers View complete reference list from CitEc
Citations: View citations in EconPapers (1)

Downloads: (external link)
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000098 (text/html)
https://journals.plos.org/plosgenetics/article/fil ... 00098&type=printable (application/pdf)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:plo:pgen00:1000098

DOI: 10.1371/journal.pgen.1000098

Access Statistics for this article

More articles in PLOS Genetics from Public Library of Science
Bibliographic data for series maintained by plosgenetics ().

 
Page updated 2025-03-19
Handle: RePEc:plo:pgen00:1000098