A Hybrid Model Integrating LDA, BERT, and Clustering for Enhanced Topic Modeling
Arun Kumar Yadav,
Tushar Gupta,
Mohit Kumar and
Divakar Yadav
Additional contact information
Arun Kumar Yadav: NIT Hamirpur (HP)
Tushar Gupta: NIT Hamirpur (HP)
Mohit Kumar: NIT Hamirpur (HP)
Divakar Yadav: SOCIS, IGNOU
Quality & Quantity: International Journal of Methodology, 2025, vol. 59, issue 3, No 20, 2408 pages
Abstract:
Topic modeling is a popular machine learning technique in natural language processing for identifying themes within unstructured text. One of the most prominent methods for this purpose is Latent Dirichlet Allocation (LDA), which can automatically uncover topics from large text corpora. However, LDA alone may not always provide the best results. Using Bidirectional Encoder Representations from Transformers (BERT) embeddings in topic modeling significantly enhances the quality and coherence of discovered topics by leveraging deep contextual representations of words. Clustering is another powerful unsupervised machine learning technique frequently used for topic modeling and information extraction from unstructured text. This study introduces a hybrid approach that combines LDA with BERT for enhanced topic modeling, incorporating dimensionality-reduction-based clustering. To manage the complexity and computational load of clustering with many features, Uniform Manifold Approximation and Projection (UMAP) is used for dimensionality reduction. Experiments conducted on benchmark datasets, specifically Reuters-21578 and 20newsgroups, illustrate the effectiveness of this cluster-informed topic modeling framework. The empirical results suggest that integrating clustering with BERT-LDA for topic modeling can be highly effective, as clustering on dimensionality-reduced representations helps derive more cohesive topics. The study evaluates coherence scores of the BERT-LDA model on the 20newsgroups and Reuters datasets. On 20newsgroups, BERT-LDA improves coherence scores by nearly 59% for 10 topics, 42% for 20 topics, 11% for 50 topics, and 16% for 98 topics. Similarly, on Reuters, coherence scores improve by about 85% for 10 topics, 63% for 20 topics, 43% for 50 topics, and 41% for 98 topics. These results highlight how BERT-LDA enhances topic coherence compared to traditional models.
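For a concrete sense of the pipeline the abstract outlines, the sketch below shows one plausible reading of it: LDA document-topic distributions are concatenated with BERT document embeddings, UMAP reduces the fused representation, k-means clusters the result, and the resulting cluster word lists are scored with the c_v coherence measure. The library choices (gensim, sentence-transformers, umap-learn, scikit-learn), the all-MiniLM-L6-v2 encoder, the fusion-by-concatenation step, and all hyperparameters are assumptions for illustration, not the authors' published implementation.

    # Hedged sketch of a BERT-LDA hybrid topic model with UMAP + k-means,
    # assuming gensim, sentence-transformers, umap-learn, and scikit-learn.
    import numpy as np
    from collections import Counter
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, CoherenceModel
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    import umap

    def hybrid_topic_model(raw_docs, num_topics=10, seed=42):
        # Tokenize and build a bag-of-words corpus (minimal preprocessing;
        # the paper's exact preprocessing is not stated in the abstract).
        tokenized = [[w for w in doc.lower().split() if w.isalpha()]
                     for doc in raw_docs]
        dictionary = Dictionary(tokenized)
        dictionary.filter_extremes(no_below=5, no_above=0.5)
        corpus = [dictionary.doc2bow(toks) for toks in tokenized]

        # LDA branch: one topic-probability vector per document.
        lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
                       random_state=seed)
        lda_vecs = np.array(
            [[p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
             for bow in corpus])

        # BERT branch: contextual document embeddings (encoder is an assumption).
        bert_vecs = SentenceTransformer("all-MiniLM-L6-v2").encode(raw_docs)

        # Fuse both views, reduce dimensionality with UMAP, then cluster
        # the reduced vectors with k-means.
        fused = np.hstack([lda_vecs, bert_vecs])
        reduced = umap.UMAP(n_components=5, random_state=seed).fit_transform(fused)
        labels = KMeans(n_clusters=num_topics, n_init=10,
                        random_state=seed).fit_predict(reduced)

        # Describe each cluster by its most frequent in-vocabulary words
        # and score the word lists with the c_v coherence measure.
        topics = []
        for k in range(num_topics):
            counts = Counter(w for toks, lab in zip(tokenized, labels)
                             if lab == k for w in toks
                             if w in dictionary.token2id)
            topics.append([w for w, _ in counts.most_common(10)])
        coherence = CoherenceModel(topics=topics, texts=tokenized,
                                   dictionary=dictionary,
                                   coherence="c_v").get_coherence()
        return labels, topics, coherence

To mirror the reported comparisons, the same function can be rerun with num_topics set to 10, 20, 50, and 98 on the 20newsgroups and Reuters-21578 corpora and the coherence values compared against plain LDA.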
Keywords: Latent Dirichlet Allocation; k-means clustering; dimensionality reduction; Bidirectional Encoder Representations from Transformers (BERT)
Date: 2025
Downloads: http://link.springer.com/10.1007/s11135-025-02077-y (abstract, text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:qualqt:v:59:y:2025:i:3:d:10.1007_s11135-025-02077-y
Ordering information: this journal article can be ordered from http://www.springer.com/economics/journal/11135
DOI: 10.1007/s11135-025-02077-y
Quality & Quantity: International Journal of Methodology is currently edited by Vittorio Capecchi