Diversity Subsampling: Custom Subsamples from Large Data Sets
Boyang Shang,
Daniel W. Apley and
Sanjay Mehrotra
Additional contact information
Boyang Shang: Industrial Engineering and Management Science, Northwestern University, Evanston, Illinois 60208
Daniel W. Apley: Industrial Engineering and Management Science, Northwestern University, Evanston, Illinois 60208
Sanjay Mehrotra: Industrial Engineering and Management Science, Northwestern University, Evanston, Illinois 60208
INFORMS Journal on Data Science, 2023, vol. 2, issue 2, 161-182
Abstract:
Subsampling from a large unlabeled (i.e., no response values are available yet) data set is useful in many supervised learning contexts to provide a global view of the data based on only a fraction of the observations. In this paper, we borrow concepts from the well-known sampling/importance resampling technique, which samples from a specified probability distribution, to develop a diversity subsampling approach that selects a subsample from the original data with no prior knowledge of its underlying probability distribution. The goal is to produce a subsample that is independently and uniformly distributed over the support of the distribution from which the data are drawn, to the maximum extent possible. We give an asymptotic performance guarantee for the proposed method and provide experimental results to show that it performs well for typical finite-size data. We also compare the proposed method with competing diversity subsampling algorithms and demonstrate numerically that subsamples selected by the proposed method are closer to a uniform sample than subsamples selected by other methods. The proposed diversity subsampling (DS) algorithm is more efficient than known methods: it takes only a few minutes to select tens of thousands of subsample points from a data set of size one million. Our DS algorithm easily generalizes to select subsamples following distributions other than uniform. We provide a Python package (FADS) that implements the proposed method.
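The idea of drawing points so that the subsample looks roughly uniform over the data's support can be illustrated with a generic inverse-density weighting scheme. The sketch below is not the authors' DS algorithm and does not use the FADS package; it is a minimal one-dimensional illustration under stated assumptions: the density is estimated with a plain histogram, each point gets a weight inversely proportional to the estimated density, and points are drawn without replacement using exponential keys (a standard form of weighted sampling without replacement). The function name `inverse_density_subsample` and its parameters `n_sub`, `n_bins`, and `seed` are hypothetical.

```python
import random
from collections import Counter


def inverse_density_subsample(data, n_sub, n_bins=20, seed=0):
    """Return indices of n_sub points from 1-D `data`, favoring sparse regions.

    Illustrative sketch only (not the DS/FADS method): weights are
    1 / (histogram count of the point's bin).
    """
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_of(x):
        # Clamp so the maximum value lands in the last bin.
        return min(int((x - lo) / width), n_bins - 1)

    counts = Counter(bin_of(x) for x in data)
    # Crude density estimate: points in crowded bins get small weights.
    weights = [1.0 / counts[bin_of(x)] for x in data]

    # Weighted sampling without replacement: give each point an
    # Exponential(1) draw divided by its weight and keep the n_sub smallest.
    keys = sorted((rng.expovariate(1.0) / w, i) for i, w in enumerate(weights))
    return [i for _, i in keys[:n_sub]]


# Usage: standard normal data, where the tails are sparse.
gen = random.Random(1)
data = [gen.gauss(0.0, 1.0) for _ in range(20000)]
idx = inverse_density_subsample(data, n_sub=1000)
sub = [data[i] for i in idx]

tail_data = sum(abs(x) > 1.5 for x in data) / len(data)
tail_sub = sum(abs(x) > 1.5 for x in sub) / len(sub)
print(f"tail fraction in data: {tail_data:.3f}, in subsample: {tail_sub:.3f}")
```

Because sparse regions receive larger weights, the subsample should contain a noticeably larger share of tail points than the original data, which is the qualitative behavior a uniform-over-support subsample would show. A histogram is used here only for brevity; it does not scale to higher dimensions.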
Keywords: diversity subsampling; custom subsampling; representative; space-filling; fully sequential
Date: 2023
Downloads: (external link)
http://dx.doi.org/10.1287/ijds.2022.00017 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:inm:orijds:v:2:y:2023:i:2:p:161-182