SLEDgeHammer: A Frequent Pattern-Based Cluster Validation Index for Categorical Data

de Aquino, Roberto Douglas G.; Verri, Filipe A. N.; de Amorim, Renato Cordeiro; Curtis, Vitor V.

Roberto Douglas G. de Aquino, Filipe A. N. Verri, Renato Cordeiro de Amorim and Vitor V. Curtis
Additional contact information
Roberto Douglas G. de Aquino: Department of Computer Systems, University of Sao Paulo, Sao Carlos 13566-590, SP, Brazil
Filipe A. N. Verri: Computer Science Division, Aeronautics Institute of Technology, Sao Jose dos Campos 12228-900, SP, Brazil
Renato Cordeiro de Amorim: School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, UK
Vitor V. Curtis: Computer Science Division, Aeronautics Institute of Technology, Sao Jose dos Campos 12228-900, SP, Brazil

Mathematics, 2025, vol. 13, issue 17, 1-31

Abstract: Cluster validation for categorical data remains a critical challenge in unsupervised learning, where traditional distance-based indices often fail to capture meaningful structures. This paper introduces SLEDgeHammer (SLEDgeH), an enhanced internal validation index that addresses these limitations through the optimized weighting of semantic descriptors derived from frequent patterns. Building upon the SLEDge framework, the proposed method systematically combines four indicators—Support, Length, Exclusivity, and Difference—using weight optimization to improve cluster discrimination, particularly in sparse and imbalanced scenarios. Unlike conventional methods, SLEDgeH does not rely on distance metrics; instead, it leverages the statistical prevalence and uniqueness of feature combinations within clusters. Through extensive experiments on 3600 synthetic categorical data sets and 18 real-world data sets, we demonstrate that SLEDgeH achieves significantly higher accuracy in identifying the optimal number of clusters and exhibits greater robustness, with lower standard deviation, compared to existing indices. Additionally, the index provides inherent interpretability by generating semantic cluster descriptions, making it a practical tool for supporting decision making in categorical data analysis.
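Illustration: the abstract describes scoring a partition by the prevalence and uniqueness of feature combinations within clusters rather than by distances. The following is a minimal Python sketch of that general idea only, not the SLEDgeH index itself. The function names, the restriction to single attribute=value items, the 0.6 support threshold, and the equal weights are all assumptions made for illustration. The Length and Difference indicators are omitted because the abstract does not define them.

```python
from collections import Counter


def frequent_items(rows, min_support):
    """Map each (column index, value) item to its support within `rows`,
    keeping only items whose support is at least `min_support`."""
    n = len(rows)
    counts = Counter((j, v) for row in rows for j, v in enumerate(row))
    return {item: c / n for item, c in counts.items() if c / n >= min_support}


def describe_clusters(data, labels, min_support=0.6):
    """Build a toy semantic description for each cluster (hypothetical helper)."""
    clusters = {}
    for row, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(row)
    descriptors = {lab: frequent_items(rows, min_support)
                   for lab, rows in clusters.items()}

    summary = {}
    for lab, items in descriptors.items():
        # An item counts as exclusive if no other cluster also has it as frequent.
        exclusive = [it for it in items
                     if all(it not in d
                            for other, d in descriptors.items() if other != lab)]
        summary[lab] = {
            "items": sorted(items),
            "support": sum(items.values()) / len(items) if items else 0.0,
            "exclusivity": len(exclusive) / len(items) if items else 0.0,
        }
    return summary


def toy_score(summary, w_support=0.5, w_exclusivity=0.5):
    """Weighted mean of the two toy indicators over clusters.
    The weights are fixed placeholders; the paper optimizes its weights."""
    per_cluster = [w_support * s["support"] + w_exclusivity * s["exclusivity"]
                   for s in summary.values()]
    return sum(per_cluster) / len(per_cluster)


# Six records with two categorical attributes (made-up example data).
data = [("red", "small"), ("red", "small"), ("red", "large"),
        ("blue", "large"), ("blue", "large"), ("blue", "small")]
good = [0, 0, 0, 1, 1, 1]   # groups records by colour
bad = [0, 1, 0, 1, 0, 1]    # mixes the colours

good_score = toy_score(describe_clusters(data, good))
bad_score = toy_score(describe_clusters(data, bad))
print(good_score > bad_score)
```

In this toy setting the partition that groups records by colour scores higher, because its frequent items have higher support inside each cluster. The "items" entry of each summary plays the role of a readable cluster description, in the spirit of the interpretability the abstract mentions.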

Keywords: cluster validation index; categorical data; frequent patterns; semantic description
JEL-codes: C
Date: 2025

Downloads:
https://www.mdpi.com/2227-7390/13/17/2832/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/17/2832/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:17:p:2832-:d:1740875

Mathematics is currently edited by Ms. Emma He

Bibliographic data for series maintained by MDPI Indexing Manager.