Bootstrapping estimates of stability for clusters, observations and model selection

Han Yu, Brian Chapman, Arianna Di Florio, Ellen Eischen, David Gotz, Mathews Jacob and Rachael Hageman Blair
Additional contact information
Han Yu: State University of New York at Buffalo
Brian Chapman: University of Utah
Arianna Di Florio: Cardiff University School of Medicine
Ellen Eischen: University of Oregon
David Gotz: University of North Carolina at Chapel Hill
Mathews Jacob: University of Iowa
Rachael Hageman Blair: State University of New York at Buffalo

Computational Statistics, 2019, vol. 34, issue 1, No 15, 349-372

Abstract: Clustering is a challenging problem in unsupervised learning. In lieu of a gold standard, stability has become a valuable surrogate for performance and robustness. In this work, we propose a non-parametric bootstrapping approach to estimating the stability of a clustering method, which also captures the stability of the individual clusters and observations. This flexible framework enables different types of comparisons between clusterings and can be used in connection with two possible bootstrap approaches for stability. The first approach, scheme 1, can be used to assess confidence (stability) around the clustering of the original dataset based on bootstrap replications. The second approach, scheme 2, searches over the bootstrap clusterings for an optimally stable partitioning of the data. The two schemes accommodate different model assumptions that can be motivated by an investigator’s trust (or lack thereof) in the original data and by additional computational considerations. We propose a hierarchical visualization extrapolated from the stability profiles that gives insight into the separation of groups, and projected visualizations for inspecting the stability of individual observations. Our approaches show good performance in simulation and on real data. These approaches can be implemented using the R package bootcluster, which is available on the Comprehensive R Archive Network (CRAN).
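
The idea behind scheme 1 can be illustrated with a minimal Python sketch. It is not the authors' procedure and not the bootcluster API: it assumes a plain k-means with a farthest-first initialisation and scores each original cluster by its best Jaccard match in every bootstrap clustering. All function and variable names here (`kmeans`, `jaccard`, `cluster_stability`, `n_boot`) are hypothetical and chosen for illustration only.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=100):
    """Plain Lloyd's k-means; returns one label per point.

    Assumption for this sketch: farthest-first initialisation, which is a
    simplification and not necessarily what the paper or bootcluster uses.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def jaccard(a, b):
    """Jaccard coefficient of two sets of observation indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_stability(points, k, n_boot=50, seed=0):
    """Mean best-match Jaccard of each original cluster over bootstrap replicates."""
    rng = random.Random(seed)
    n = len(points)
    ref_labels = kmeans(points, k, rng)
    ref_clusters = [{i for i in range(n) if ref_labels[i] == c} for c in range(k)]
    totals = [0.0] * k
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        boot_labels = kmeans([points[i] for i in idx], k, rng)
        boot_clusters = [{i for i, lab in zip(idx, boot_labels) if lab == c} for c in range(k)]
        drawn = set(idx)
        for c, ref in enumerate(ref_clusters):
            # Compare only on observations that were actually drawn.
            totals[c] += max(jaccard(ref & drawn, b) for b in boot_clusters)
    return [t / n_boot for t in totals]


# Two well-separated synthetic groups (made-up data for illustration).
data_rng = random.Random(1)
blob_a = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(20)]
blob_b = [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(20)]
stability = cluster_stability(blob_a + blob_b, k=2)
print(stability)
```

For well-separated groups like these, each cluster's score should be close to 1; overlapping groups would be expected to score lower. Averaging the same best-match comparison per observation rather than per cluster would be one way to approximate observation-level stability, again as an assumption rather than the paper's definition.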

Keywords: Ensemble; k-means; Jaccard coefficient; Clustering; Visualization
Date: 2019

Downloads: (external link)
http://link.springer.com/10.1007/s00180-018-0830-y Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:compst:v:34:y:2019:i:1:d:10.1007_s00180-018-0830-y

Ordering information: This journal article can be ordered from
http://www.springer.com/statistics/journal/180/PS2

DOI: 10.1007/s00180-018-0830-y

Computational Statistics is currently edited by Wataru Sakamoto, Ricardo Cao and Jürgen Symanzik

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-03-20
Handle: RePEc:spr:compst:v:34:y:2019:i:1:d:10.1007_s00180-018-0830-y