A synthetic subsampling and estimation procedure for imbalanced big data
Chen Guo, Yang Liu, Yan Fan and Yukun Liu
Additional contact information
Chen Guo: East China Normal University, KLATASDS-MOE, School of Statistics
Yang Liu: Soochow University, School of Mathematical Sciences
Yan Fan: Shanghai University of International Business and Economics, School of Statistics and Data Science
Yukun Liu: East China Normal University, KLATASDS-MOE, School of Statistics
Statistical Papers, 2025, vol. 66, issue 7, No 11, 29 pages
Abstract:
Massive datasets with imbalanced binary outcomes are commonly seen in many areas. Existing optimal subsampling strategies largely overlook the binary and imbalance structure, leading to efficiency loss, and are usually built on inverse probability weighting (IPW), which is unstable if some probabilities are close to zero. In this paper, we propose a synthetic sampling and estimation procedure tailored for imbalanced big data. In the sampling stage, we derive the optimal case–control subsampling plan based on IPW. To overcome the instability of IPW for estimation, we propose a novel empirical likelihood weighting method based on a case–control sample. A real-data-based simulation study indicates that our synthetic subsampling and estimation procedure has smaller mean square error than existing estimation procedures.
Keywords: Optimal subsampling; Case–control sampling; Empirical likelihood weighting; Imbalanced data
Date: 2025
Downloads: http://link.springer.com/10.1007/s00362-025-01774-y (abstract, text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:stpapr:v:66:y:2025:i:7:d:10.1007_s00362-025-01774-y
Ordering information: This journal article can be ordered from
http://www.springer. ... business/journal/362
DOI: 10.1007/s00362-025-01774-y
Statistical Papers is currently edited by C. Müller, W. Krämer and W.G. Müller
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.