EconPapers    
Economics at your fingertips  
 

Blind method for discovering number of clusters in multidimensional datasets by regression on linkage hierarchies generated from random data

Osbert C Zalay

PLOS ONE, 2020, vol. 15, issue 1, 1-28

Abstract: Determining intrinsic number of clusters in a multidimensional dataset is a commonly encountered problem in exploratory data analysis. Unsupervised clustering algorithms often rely on specification of cluster number as an input parameter. However, this is typically not known a priori. Many methods have been proposed to estimate cluster number, including statistical and information-theoretic approaches such as the gap statistic, but these methods are not always reliable when applied to non-normally distributed datasets containing outliers or noise. In this study, I propose a novel method called hierarchical linkage regression, which uses regression to estimate the intrinsic number of clusters in a multidimensional dataset. The method operates on the hypothesis that the organization of data into clusters can be inferred from the hierarchy generated by partitioning the dataset, and therefore does not directly depend on the specific values of the data or their distribution, but on their relative ranking within the partitioned set. Moreover, the technique does not require empirical data to train on, but can use synthetic data generated from random distributions to fit regression coefficients. The trained hierarchical linkage regression model is able to infer cluster number in test datasets of varying complexity and differing distributions, for image, text and numeric data, using the same regression model without retraining. The method performs favourably against other cluster number estimation techniques, and is also robust to parameter changes, as demonstrated by sensitivity analysis. The apparent robustness and generalizability of hierarchical linkage regression make it a promising tool for unsupervised exploratory data analysis and discovery.

Date: 2020
References: View references in EconPapers View complete reference list from CitEc
Citations:

Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0227788 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 27788&type=printable (application/pdf)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0227788

DOI: 10.1371/journal.pone.0227788

Access Statistics for this article

More articles in PLOS ONE from Public Library of Science
Bibliographic data for series maintained by plosone ().

 
Page updated 2025-03-19
Handle: RePEc:plo:pone00:0227788