Variance-Based Cluster Selection Criteria in a K-Means Framework for One-Mode Dissimilarity Data

Vera, J. Fernando; Macías, Rodrigo

J. Fernando Vera and Rodrigo Macías
J. Fernando Vera: University of Granada
Rodrigo Macías: Centro de Investigación en Matemáticas, Unidad Monterrey

Psychometrika, 2017, vol. 82, issue 2, No 1, 275-294

Abstract: One of the main problems in cluster analysis is that of determining the number of groups in the data. In general, the approach taken depends on the cluster method used. For K-means, some of the most widely employed criteria are formulated in terms of the decomposition of the total point scatter, regarding a two-mode data set of N points in p dimensions, which are optimally arranged into K classes. This paper addresses the formulation of criteria to determine the number of clusters, in the general situation in which the available information for clustering is a one-mode N × N dissimilarity matrix describing the objects. In this framework, p and the coordinates of points are usually unknown, and the application of criteria originally formulated for two-mode data sets is dependent on their possible reformulation in the one-mode situation. The decomposition of the variability of the clustered objects is proposed in terms of the corresponding block-shaped partition of the dissimilarity matrix. Within-block and between-block dispersion values for the partitioned dissimilarity matrix are derived, and variance-based criteria are subsequently formulated in order to determine the number of groups in the data. A Monte Carlo experiment was carried out to study the performance of the proposed criteria. For simulated clustered points in p dimensions, greater efficiency in recovering the number of clusters is obtained when the criteria are calculated from the related Euclidean distances instead of the known two-mode data set, in general, for unequal-sized clusters and for low dimensionality situations. For simulated dissimilarity data sets, the proposed criteria always outperform the results obtained when these criteria are calculated from their original formulation, using dissimilarities instead of distances.
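To make the idea of a block-shaped decomposition concrete, here is a minimal Python sketch. It is not the paper's formulation. It relies on the standard identity for Euclidean distances that the sum of squared deviations from a centroid equals the sum of squared pairwise distances divided by the number of points, and it uses a Calinski-Harabasz-style ratio as one example of a variance-based criterion. The function names and the four-point example are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def scatter_from_dissimilarities(d, labels):
    """Return (total, within, between) scatter from a symmetric N x N matrix d.

    Assumption: the entries of d behave like Euclidean distances, so that
    pairwise squared values reproduce the usual sums of squares.
    """
    n = len(d)
    # Total scatter: squared dissimilarities over unordered pairs, divided by N.
    total = sum(d[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    within = 0.0
    for k in set(labels):
        members = [i for i in range(n) if labels[i] == k]
        # Diagonal block of the partitioned matrix belonging to cluster k.
        block = sum(d[i][j] ** 2 for i, j in combinations(members, 2))
        within += block / len(members)
    # Between-cluster scatter is what remains of the total.
    return total, within, total - within


def calinski_harabasz(d, labels):
    """Calinski-Harabasz-style ratio; needs 1 < K < N and nonzero within scatter."""
    n, k = len(d), len(set(labels))
    _, within, between = scatter_from_dissimilarities(d, labels)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical example: four points on a line at 0, 1, 10 and 11.
points = [0.0, 1.0, 10.0, 11.0]
d = [[abs(a - b) for b in points] for a in points]
labels = [0, 0, 1, 1]

total, within, between = scatter_from_dissimilarities(d, labels)
index = calinski_harabasz(d, labels)
```

For this example the total scatter is 101, the within-cluster scatter is 1 and the between-cluster scatter is 100, matching what the coordinates give directly. When the entries of d are general dissimilarities rather than Euclidean distances, the identity no longer holds exactly, which is the situation the abstract addresses.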

Keywords: dissimilarity; cluster analysis; K-means; SYNCLUS; variance-based criterion; number of clusters
Date: 2017

Downloads:
http://link.springer.com/10.1007/s11336-017-9561-1 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:psycho:v:82:y:2017:i:2:d:10.1007_s11336-017-9561-1

Ordering information: This journal article can be ordered from
http://www.springer. ... gy/journal/11336/PS2

DOI: 10.1007/s11336-017-9561-1

Psychometrika is currently edited by Irini Moustaki

More articles in Psychometrika from Springer, The Psychometric Society
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.