Adapting Prediction Error Estimates for Biased Complexity Selection in High-Dimensional Bootstrap Samples

Harald, Binder; Martin, Schumacher

Adapting Prediction Error Estimates for Biased Complexity Selection in High-Dimensional Bootstrap Samples

Binder Harald and Schumacher Martin
Additional contact information
Binder Harald: Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg
Schumacher Martin: Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg

Statistical Applications in Genetics and Molecular Biology, 2008, vol. 7, issue 1, 28

Abstract: The bootstrap is a tool that allows for efficient evaluation of prediction performance of statistical techniques without having to set aside data for validation. This is especially important for high-dimensional data, e.g., arising from microarrays, because there the number of observations is often limited. For avoiding overoptimism the statistical technique to be evaluated has to be applied to every bootstrap sample in the same manner it would be used on new data. This includes a selection of complexity, e.g., the number of boosting steps for gradient boosting algorithms. Using the latter, we demonstrate in a simulation study that complexity selection in conventional bootstrap samples, drawn with replacement, is severely biased in many scenarios. This translates into a considerable bias of prediction error estimates, often underestimating the amount of information that can be extracted from high-dimensional data. Potential remedies for this complexity selection bias, such as alternatively using a fixed level of complexity or of using sampling without replacement are investigated and it is shown that the latter works well in many settings. We focus on high-dimensional binary response data, with bootstrap .632+ estimates of the Brier score for performance evaluation, and censored time-to-event data with .632+ prediction error curve estimates. The latter, with the modified bootstrap procedure, is then applied to an example with microarray data from patients with diffuse large B-cell lymphoma.

Keywords: bootstrap; complexity bias; high-dimensional data; prediction performance (search for similar items in EconPapers)
Date: 2008
References: View references in EconPapers View complete reference list from CitEc
Citations: View citations in EconPapers (6)

Downloads: (external link)
https://doi.org/10.2202/1544-6115.1346 (text/html)
For access to full text, subscription to the journal or payment for the individual article is required.

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:bpj:sagmbi:v:7:y:2008:i:1:n:12

Ordering information: This journal article can be ordered from
https://www.degruyte ... urnal/key/sagmb/html

DOI: 10.2202/1544-6115.1346

Access Statistics for this article

Statistical Applications in Genetics and Molecular Biology is currently edited by Michael P. H. Stumpf

More articles in Statistical Applications in Genetics and Molecular Biology from De Gruyter
Bibliographic data for series maintained by Peter Golla ().