EconPapers    
Economics at your fingertips  
 

Imbalanced data sampling design based on grid boundary domain for big data

Hanji He, Jianfeng He () and Liwei Zhang
Additional contact information
Hanji He: South China University of Technology
Jianfeng He: South China University of Technology
Liwei Zhang: Ping An Insurance Company of China

Computational Statistics, 2025, vol. 40, issue 1, No 2, 27-64

Abstract: Abstract The data distribution is often associated with a priori-known probability, and the occurrence probability of interest events is small, so a large amount of imbalanced data appears in sociology, economics, engineering, and various other fields. The existing over- and under-sampling methods are widely used in imbalanced data classification problems, but over-sampling leads to overfitting, and under-sampling ignores the effective information. We propose a new sampling design algorithm called the neighbor grid of boundary mixed-sampling (NGBM), which focuses on the boundary information. This paper obtains the classification boundary information through grid boundary domain identification, thereby determining the importance of the samples. Based on this premise, the synthetic minority oversampling technique is applied to the boundary grid, and random under-sampling is applied to the other grids. With the help of this mixed sampling strategy, more important classification boundary information, especially for positive sample information identification is extracted. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of the NGBM and illustrate the advantages of the proposed NGBM in the imbalanced data, as well as practical applications.

Keywords: Mass of grid cell; Mixed-resampling; Boundary domain; Random under-sampling (search for similar items in EconPapers)
Date: 2025
References: View references in EconPapers View complete reference list from CitEc
Citations:

Downloads: (external link)
http://link.springer.com/10.1007/s00180-024-01471-8 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:spr:compst:v:40:y:2025:i:1:d:10.1007_s00180-024-01471-8

Ordering information: This journal article can be ordered from
http://www.springer.com/statistics/journal/180/PS2

DOI: 10.1007/s00180-024-01471-8

Access Statistics for this article

Computational Statistics is currently edited by Wataru Sakamoto, Ricardo Cao and Jürgen Symanzik

More articles in Computational Statistics from Springer
Bibliographic data for series maintained by Sonal Shukla () and Springer Nature Abstracting and Indexing ().

 
Page updated 2025-04-12
Handle: RePEc:spr:compst:v:40:y:2025:i:1:d:10.1007_s00180-024-01471-8