Diversity Subsampling: Custom Subsamples from Large Data Sets

Shang, Boyang; Apley, Daniel W.; Mehrotra, Sanjay

Boyang Shang, Daniel W. Apley and Sanjay Mehrotra
Additional contact information
Boyang Shang: Industrial Engineering and Management Science, Northwestern University, Evanston, Illinois 60208
Daniel W. Apley: Industrial Engineering and Management Science, Northwestern University, Evanston, Illinois 60208
Sanjay Mehrotra: Industrial Engineering and Management Science, Northwestern University, Evanston, Illinois 60208

INFORMS Journal on Data Science, 2023, vol. 2, issue 2, 161-182

Abstract: Subsampling from a large unlabeled (i.e., no response values are available yet) data set is useful in many supervised learning contexts to provide a global view of the data based on only a fraction of the observations. In this paper, we borrow concepts from the well-known sampling/importance resampling technique, which samples from a specified probability distribution, to develop a diversity subsampling approach that selects a subsample from the original data with no prior knowledge of its underlying probability distribution. The goal is to produce a subsample that is independently and uniformly distributed over the support of the distribution from which the data are drawn, to the maximum extent possible. We give an asymptotic performance guarantee for the proposed method and provide experimental results to show that it performs well for typical finite-size data. We also compare the proposed method with competing diversity subsampling algorithms and demonstrate numerically that subsamples selected by the proposed method are closer to a uniform sample than subsamples selected by other methods. The proposed diversity subsampling (DS) algorithm is more efficient than known methods: it takes only a few minutes to select tens of thousands of subsample points from a data set of size one million. Our DS algorithm easily generalizes to select subsamples following distributions other than uniform. We provide a Python package (FADS) that implements the proposed method.
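
Illustration: the general idea the abstract borrows from sampling/importance resampling can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' DS algorithm and not the FADS API. It assumes a plain Gaussian kernel density estimate and a single weighted draw without replacement, with weights inversely proportional to the estimated density, so that points in sparse regions are favored and the subsample is pushed toward uniform coverage of the support. The function names (kde_density, inverse_density_subsample), the bandwidth value and the toy data are all hypothetical choices made for this example.

```python
import numpy as np


def kde_density(points, bandwidth):
    """Gaussian kernel density estimate at each data point (O(n^2) memory and time)."""
    n, d = points.shape
    # Pairwise squared Euclidean distances between all points.
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-0.5 * sq_dists / bandwidth**2)
    norm = n * (2.0 * np.pi * bandwidth**2) ** (d / 2.0)
    return kernel.sum(axis=1) / norm


def inverse_density_subsample(points, m, bandwidth=0.3, seed=0):
    """Draw m distinct indices with probability proportional to 1 / estimated density.

    Assumption for this sketch: one weighted draw without replacement,
    in the spirit of importance resampling with a uniform target.
    """
    rng = np.random.default_rng(seed)
    density = kde_density(points, bandwidth)  # strictly positive (self term included)
    weights = 1.0 / density
    probs = weights / weights.sum()
    return rng.choice(len(points), size=m, replace=False, p=probs)


# Toy data: 1000 points from a 2-D standard normal (dense center, sparse tails).
data = np.random.default_rng(42).normal(size=(1000, 2))
idx = inverse_density_subsample(data, m=100)
subsample = data[idx]
```

On clustered data such as the toy sample above, the selected points tend to be more spread out than the full data set, because the dense center is down-weighted. The quadratic cost of the density estimate makes this sketch unsuitable for the million-point data sets mentioned in the abstract.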

Keywords: diversity subsampling; custom subsampling; representative; space-filling; fully sequential
Date: 2023

Downloads:
http://dx.doi.org/10.1287/ijds.2022.00017 (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:inm:orijds:v:2:y:2023:i:2:p:161-182
