Geometry-Inference Based Clustering Heuristic: New k-means Metric for Gaussian Data and Experimental Proof of Concept

Mohammed Zakariae El Khattabi, Mostapha El Jai, Youssef Lahmadi and Lahcen Oughdir
Additional contact information
Mohammed Zakariae El Khattabi: Ecole Nationale des Sciences Appliquées, Sidi Mohamed Ben Abdellah University
Mostapha El Jai: Euromed University of Fes
Youssef Lahmadi: Ecole Nationale des Sciences Appliquées, Sidi Mohamed Ben Abdellah University
Lahcen Oughdir: Ecole Nationale des Sciences Appliquées, Sidi Mohamed Ben Abdellah University

SN Operations Research Forum, 2024, vol. 5, issue 1, 1-26

Abstract: K-means is one of the most widely used data clustering algorithms, and a number of metrics are coupled with it to reach reasonable levels of cluster compactness and separation. In addition, efficient assignment of data to their clusters depends on an a priori selection of the optimal number of clusters, which is a crucial step of the process. The present work proposes a new clustering metric/heuristic that takes into account both the dispersion and the statistical characteristics of the data to be clustered; a Geometry-Inference based Clustering (GIC) heuristic is derived for selecting the optimal number of clusters for k-means clustering. The proposed conceptual approach introduces the ‘initial speed rate’ as the main random variable to be studied statistically, and the corresponding histograms are fitted to a set of classical probability distributions. For Gaussian datasets, the estimated distribution parameters were found to vary in two linear stages with the number of clusters $k$, and the optimal $k^{*}$ was found to match perfectly the intersection of the two linear stages. Normal and exponential distribution parameters were found to be more accurate than those of other distributions, with an excellent chi-squared (Khi2) test fit. Furthermore, the GIC algorithm is fully quantitative, so no qualitative or visual analysis is required. In contrast, the straightforward application of the GIC heuristic to non-Gaussian datasets resulted in weak clustering performance; an enhanced version of the GIC technique is therefore under development, using the notion of a geometrical data skeleton in 2D and higher-dimensional spaces.
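
Illustration (not from the article): the abstract locates the optimal number of clusters at the intersection of two linear stages of a fitted parameter plotted against k. The sketch below shows only that last step, under stated assumptions. It takes an already computed sequence of parameter values, tries every split into a left and a right part, fits a least-squares line to each, and returns the k at which the best pair of lines crosses. The function name `two_stage_breakpoint` and the exhaustive split search are assumptions made for illustration; the article's definition of the 'initial speed rate' and its fitting procedure are not given in this record and are not reproduced here.

```python
import numpy as np

def two_stage_breakpoint(ks, values):
    """Hypothetical helper: k at which two fitted line segments intersect."""
    ks = np.asarray(ks, dtype=float)
    values = np.asarray(values, dtype=float)
    if len(ks) < 4:
        raise ValueError("need at least four points (two per segment)")
    best = None
    # Try every split that leaves at least two points on each side.
    for split in range(2, len(ks) - 1):
        left = np.polyfit(ks[:split], values[:split], 1)    # slope, intercept
        right = np.polyfit(ks[split:], values[split:], 1)   # slope, intercept
        # Total squared residual of the two-segment fit for this split.
        sse = (np.sum((np.polyval(left, ks[:split]) - values[:split]) ** 2)
               + np.sum((np.polyval(right, ks[split:]) - values[split:]) ** 2))
        if best is None or sse < best[0]:
            best = (sse, left, right)
    _, (a1, b1), (a2, b2) = best
    if np.isclose(a1, a2):
        raise ValueError("segments are parallel; no intersection")
    # Solve a1*k + b1 == a2*k + b2 for k.
    return (b2 - b1) / (a1 - a2)
```

In practice the returned value would be rounded to the nearest integer k; whether this matches the article's own selection rule is an assumption.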

Keywords: k-means; Information geometry; Clustering; Machine learning; Inferential statistics; Data spread shape
Date: 2024

Downloads: (external link)
http://link.springer.com/10.1007/s43069-024-00291-2 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:snopef:v:5:y:2024:i:1:d:10.1007_s43069-024-00291-2

Ordering information: This journal article can be ordered from
https://www.springer.com/journal/43069

DOI: 10.1007/s43069-024-00291-2

SN Operations Research Forum is currently edited by Marco Lübbecke

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-04-12
Handle: RePEc:spr:snopef:v:5:y:2024:i:1:d:10.1007_s43069-024-00291-2