A synthetic subsampling and estimation procedure for imbalanced big data

Chen Guo, Yang Liu, Yan Fan and Yukun Liu
Additional contact information
Chen Guo: East China Normal University, KLATASDS-MOE, School of Statistics
Yang Liu: Soochow University, School of Mathematical Sciences
Yan Fan: Shanghai University of International Business and Economics, School of Statistics and Data Science
Yukun Liu: East China Normal University, KLATASDS-MOE, School of Statistics

Statistical Papers, 2025, vol. 66, issue 7, No 11, 29 pages

Abstract: Massive datasets with imbalanced binary outcomes are commonly seen in many areas. Existing optimal subsampling strategies largely overlook the binary and imbalance structure, leading to efficiency loss, and are usually built on inverse probability weighting (IPW), which is unstable if some probabilities are close to zero. In this paper, we propose a synthetic sampling and estimation procedure tailored for imbalanced big data. In the sampling stage, we derive the optimal case–control subsampling plan based on IPW. To overcome the instability of IPW for estimation, we propose a novel empirical likelihood weighting method based on a case–control sample. A real-data-based simulation study indicates that our synthetic subsampling and estimation procedure has smaller mean square error than existing estimation procedures.
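
For readers unfamiliar with the setting, the following is a minimal sketch of plain case–control subsampling followed by an IPW-weighted logistic regression, the baseline idea the abstract refers to. It is not the authors' optimal subsampling plan and does not implement their empirical likelihood weighting; the simulated data, the uniform control-sampling rate `rho_control`, and all function names are assumptions made purely for illustration.

```python
import numpy as np


def simulate(n, beta, rng):
    """Draw covariates and a rare binary outcome from a logistic model (illustrative data)."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, len(beta) - 1))])
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    y = rng.binomial(1, p)
    return X, y


def case_control_subsample(y, rho_control, rng):
    """Keep every case; keep each control independently with probability rho_control.

    Returns the boolean inclusion mask and the inclusion probability of every unit.
    """
    pi = np.where(y == 1, 1.0, rho_control)
    keep = rng.random(y.shape[0]) < pi
    return keep, pi


def weighted_logistic(X, y, w, n_iter=50, tol=1e-10):
    """Maximize the weighted logistic log-likelihood with Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


rng = np.random.default_rng(0)
beta_true = np.array([-4.0, 1.0, -0.5])  # large negative intercept gives few cases
X, y = simulate(100_000, beta_true, rng)

keep, pi = case_control_subsample(y, rho_control=0.05, rng=rng)
# IPW: each retained unit is weighted by the inverse of its inclusion probability.
beta_hat = weighted_logistic(X[keep], y[keep], 1.0 / pi[keep])

print("cases in full data:", int(y.sum()))
print("subsample size:", int(keep.sum()))
print("IPW estimate:", beta_hat)
```

In this sketch the control weights equal `1 / rho_control`, so a very small sampling rate produces very large weights; that is the instability of IPW the abstract points to when inclusion probabilities approach zero.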

Keywords: Optimal subsampling; Case–control sampling; Empirical likelihood weighting; Imbalanced data
Date: 2025

Downloads: (external link)
http://link.springer.com/10.1007/s00362-025-01774-y Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:stpapr:v:66:y:2025:i:7:d:10.1007_s00362-025-01774-y

Ordering information: This journal article can be ordered from
http://www.springer. ... business/journal/362

DOI: 10.1007/s00362-025-01774-y

Statistical Papers is currently edited by C. Müller, W. Krämer and W.G. Müller

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

Page updated 2025-11-22
Handle: RePEc:spr:stpapr:v:66:y:2025:i:7:d:10.1007_s00362-025-01774-y