Big Data, Small Sample: Edgeworth Expansions Provide a Cautionary Tale

Inna Gerlovina; Mark J. van der Laan; Alan Hubbard

Inna Gerlovina, Mark J. van der Laan and Alan Hubbard
Additional contact information
Inna Gerlovina: Division of Biostatistics, University of California, Berkeley, 101 Haviland Hall, Berkeley, CA 94720, USA
Mark J. van der Laan: University of California, Berkeley, 101 Haviland Hall, Berkeley, CA 94720, USA
Alan Hubbard: Division of Biostatistics, School of Public Health, UC Berkeley, Berkeley, CA 94720, USA

The International Journal of Biostatistics, 2017, vol. 13, issue 1, 6

Abstract: Multiple comparisons and small sample size, common characteristics of many types of “Big Data” including those that are produced by genomic studies, present specific challenges that affect the reliability of inference. Use of multiple testing procedures necessitates calculation of very small tail probabilities of a test statistic distribution. Results based on large deviation theory provide a formal condition that is necessary to guarantee error rate control given practical sample sizes, linking the number of tests and the sample size; this condition, however, is rarely satisfied. Using methods that are based on Edgeworth expansions (relying especially on the work of Peter Hall), we explore the impact of departures of sampling distributions from typical assumptions on actual error rates. Our investigation illustrates how far the actual error rates can be from the declared nominal levels, suggesting potentially widespread problems with error rate control, specifically excessive false positives. This is an important factor that contributes to the “reproducibility crisis”. We also review some other commonly used methods (such as permutation and methods based on finite sampling inequalities) in their application to multiple testing/small sample data. We point out that Edgeworth expansions, providing higher order approximations to the sampling distribution, offer a promising direction for data analysis that could improve the reliability of studies relying on large numbers of comparisons with modest sample sizes.
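
The abstract's point that multiple testing pushes inference into very small tail probabilities can be illustrated with a rough sketch. The code below is not the authors' method: it assumes a standardized sample mean with known variance, a one-sided Bonferroni threshold, and only the first (skewness) term of an Edgeworth expansion, whereas the article works with higher order expansions. The function names and the example values (10,000 tests, n = 10, skewness 2) are hypothetical choices for illustration, and a one-term expansion can itself be inaccurate this far into the tail.

```python
import math
from statistics import NormalDist


def normal_upper_tail(x):
    """Upper tail P(Z > x) of the standard normal, computed stably via erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def edgeworth_upper_tail(x, n, skewness):
    """One-term Edgeworth approximation to P(S_n > x).

    Assumption: S_n is the standardized mean of n i.i.d. observations with
    known variance; only the skewness term of order n^(-1/2) is kept.
    """
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    correction = phi * skewness * (x * x - 1.0) / (6.0 * math.sqrt(n))
    return normal_upper_tail(x) + correction


def bonferroni_critical_value(alpha, m):
    """One-sided normal critical value for m tests at family-wise level alpha."""
    return -NormalDist().inv_cdf(alpha / m)


if __name__ == "__main__":
    alpha, m = 0.05, 10_000        # hypothetical study: 10,000 comparisons
    z = bonferroni_critical_value(alpha, m)
    nominal = normal_upper_tail(z)  # per-test level the analyst declares
    approx = edgeworth_upper_tail(z, n=10, skewness=2.0)
    print("critical value:", z)
    print("nominal per-test level:", nominal)
    print("one-term Edgeworth tail:", approx)
    print("ratio approx / nominal:", approx / nominal)
```

Under these assumptions, a positively skewed distribution yields an approximate tail probability above the nominal per-test level at small n, and the gap shrinks as n grows, which is the direction of the false-positive inflation the abstract describes.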

Keywords: finite sample inference; hypothesis testing; multiple comparisons
Date: 2017
Downloads:
https://doi.org/10.1515/ijb-2017-0012 (text/html)
For access to full text, subscription to the journal or payment for the individual article is required.

Persistent link: https://EconPapers.repec.org/RePEc:bpj:ijbist:v:13:y:2017:i:1:p:6:n:14

Ordering information: This journal article can be ordered from
https://www.degruyte ... journal/key/ijb/html

DOI: 10.1515/ijb-2017-0012

The International Journal of Biostatistics is currently edited by Antoine Chambaz, Alan E. Hubbard and Mark J. van der Laan
