A novel text clustering model based on topic modelling and social network analysis

Babak Amiri and Ramin Karimianghadim

Chaos, Solitons & Fractals, 2024, vol. 181, issue C

Abstract: Document clustering is a well-known text-mining method that assists in the categorization and comprehension of textual data. It is vital in areas such as information retrieval, knowledge management, and marketing, which underscores the need for a highly accurate clustering model. Current document clustering models face significant hurdles. They rely on sparse, high-dimensional representations based on the bag-of-words (BOW) approach, which are computationally demanding on large datasets and fail to capture the semantic nuances of documents. They also struggle to determine the ideal number of clusters and to manage datasets with overlapping elements. To overcome these issues, this paper introduces a novel co-clustering strategy that merges community detection methods from social network analysis with advanced text analysis techniques. The proposed method transforms documents into a network structure, where each document is a node and connections (edges) are formed between the documents that are most similar. Community detection algorithms are then employed to identify clusters within this network of documents. The study explores various document representation methods, including topic modelling and sentence embedding, to provide a rich contextual understanding of the documents. An extensive evaluation examines different combinations of community detection algorithms, clustering methodologies, and document representation strategies, focusing in particular on their efficacy in handling overlapping and non-overlapping datasets. The findings demonstrate that the Element-Centric evaluation measure is effective in enabling community detection algorithms to autonomously ascertain the most suitable number of clusters, yielding promising results for both overlapping and non-overlapping datasets. The LCD model shows remarkable performance in addressing overlapping datasets. Furthermore, the research reveals that innovative document representation approaches significantly enhance the performance of the models. Additionally, the use of topic modelling in conjunction with co-clustering algorithms proves effective in clearly depicting the themes within the clusters.
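
The pipeline described in the abstract (documents as nodes, edges between the most similar documents, then community detection) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The document vectors are made-up topic proportions, the k-nearest-neighbour rule for "most similar" is an assumption, and a simple deterministic label propagation stands in for the community detection algorithms evaluated in the paper (it yields non-overlapping clusters only). The function names cosine, knn_graph and label_propagation are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_graph(vectors, k):
    """Undirected graph linking each document to its k most similar documents.

    Assumption: "most similar" is read as a k-nearest-neighbour rule.
    """
    n = len(vectors)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        ranked = sorted(
            ((cosine(vectors[i], vectors[j]), j) for j in range(n) if j != i),
            key=lambda pair: (-pair[0], pair[1]),
        )
        for _, j in ranked[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def label_propagation(adj, max_iter=100):
    """Stand-in community detection: each node adopts the most common label
    among its neighbours; ties go to the smallest label."""
    labels = {node: node for node in adj}
    for _ in range(max_iter):
        changed = False
        for node in sorted(adj):
            if not adj[node]:
                continue
            counts = {}
            for neighbour in adj[node]:
                label = labels[neighbour]
                counts[label] = counts.get(label, 0) + 1
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels

# Hypothetical two-topic proportions for six documents (illustrative data only).
doc_vectors = [
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],  # mostly topic A
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],  # mostly topic B
]

graph = knn_graph(doc_vectors, k=2)
labels = label_propagation(graph)
print(labels)
```

Note that the number of clusters is not passed in anywhere: it falls out of the graph structure, which is the property the abstract highlights for community detection. Handling overlapping datasets would require a different algorithm from the one sketched here.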

Keywords: Document clustering; Social network analysis; Topic modelling; Sentence embedding; Overlapping communities
Date: 2024

Downloads: (external link)
http://www.sciencedirect.com/science/article/pii/S096007792400184X
Full text for ScienceDirect subscribers only

Persistent link: https://EconPapers.repec.org/RePEc:eee:chsofr:v:181:y:2024:i:c:s096007792400184x

DOI: 10.1016/j.chaos.2024.114633

Chaos, Solitons & Fractals is currently edited by Stefano Boccaletti and Stelios Bekiros

More articles in Chaos, Solitons & Fractals from Elsevier
Bibliographic data for series maintained by Thayer, Thomas R.

 
Page updated 2025-03-19
Handle: RePEc:eee:chsofr:v:181:y:2024:i:c:s096007792400184x