High-Dimensional Text Datasets Clustering Algorithm Based on Cuckoo Search and Latent Semantic Indexing
Saida Ishak Boushaki,
Nadjet Kamel and
Omar Bendjeghaba
Additional contact information
Saida Ishak Boushaki: LRIA, University of Science and Technology Houari Boumediene, Bab Ezzouar 16123, Algeria; Department of Informatics, University of M’Hamed Bougara Boumerdes, Boumerdes 35000, Algeria
Nadjet Kamel: LRIA, University of Science and Technology Houari Boumediene, Bab Ezzouar 16123, Algeria; Université Ferhat Abbas Setif 1, Sétif 19000, Algeria
Omar Bendjeghaba: LREEI, University M’Hamed Bougara, Boumerdes, Boumerdes 35000, Algeria
Journal of Information & Knowledge Management (JIKM), 2018, vol. 17, issue 03, 1-24
Abstract:
Clustering is an important data analysis technique. However, clustering high-dimensional data such as documents requires more effort to extract the rich, relevant information hidden in the multidimensional space. Recently, document clustering algorithms based on metaheuristics have demonstrated their efficiency in exploring the search space and reaching the global best solution rather than a local one. However, most of these algorithms are not practical and suffer from several limitations: they require the number of clusters to be known in advance, they are neither incremental nor extensible, and the documents are indexed by a high-dimensional, sparse matrix. To overcome these limitations, we propose in this paper a new dynamic and incremental approach (CS_LSI) for document clustering based on the recent cuckoo search (CS) optimization and latent semantic indexing (LSI). Experiments conducted on four well-known high-dimensional text datasets show the efficiency of the LSI model in reducing the dimensionality of the space with more precision and less computational time. In addition, the proposed CS_LSI determines the number of clusters automatically by employing a newly proposed index based on a significant distance measure. The latter is also used in the incremental mode and to detect outlier documents, thereby maintaining more coherent clusters. Furthermore, a comparison with conventional document clustering algorithms shows the superiority of CS_LSI in achieving high clustering quality.
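The dimensionality-reduction step mentioned in the abstract can be illustrated with a minimal sketch of LSI as a truncated SVD of a term-document matrix. This is not the authors' CS_LSI implementation: the function names `lsi_project` and `cosine_distance`, the toy matrix `TERM_DOC`, and the choice of k = 2 are all assumptions made for illustration, and the cuckoo search stage and the proposed validity index are not covered.

```python
import numpy as np


def lsi_project(term_doc, k):
    """Project documents into a k-dimensional latent space (hypothetical helper).

    term_doc has shape (n_terms, n_docs). Returns an array of shape (n_docs, k).
    """
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values in descending order.
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; each document becomes a row of V_k * S_k.
    return (vt[:k, :] * s[:k, None]).T


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Toy term counts, invented for illustration (rows: terms, columns: documents).
# Documents 0 and 1 share terms 0-1; documents 2 and 3 share terms 2-3.
TERM_DOC = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
], dtype=float)

docs_2d = lsi_project(TERM_DOC, k=2)

# Distances in the reduced space: same-topic pair versus cross-topic pair.
same_topic = cosine_distance(docs_2d[0], docs_2d[1])
cross_topic = cosine_distance(docs_2d[0], docs_2d[2])
```

In this toy case the two documents that share vocabulary end up closer in the reduced space than documents with disjoint vocabulary, which is the property a clustering stage would then exploit on the low-dimensional vectors instead of the sparse original matrix.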
Keywords: Cuckoo search optimisation; high-dimensional text clustering; number of clusters; incremental clustering; internal validity index; latent semantic indexing; document clustering; vector space model; optimisation; metaheuristic
Date: 2018
Downloads:
http://www.worldscientific.com/doi/abs/10.1142/S0219649218500338
Access to full text is restricted to subscribers
Persistent link: https://EconPapers.repec.org/RePEc:wsi:jikmxx:v:17:y:2018:i:03:n:s0219649218500338
DOI: 10.1142/S0219649218500338
Journal of Information & Knowledge Management (JIKM) is currently edited by Professor Suliman Hawamdeh