Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data
Christoph Weisser,
Christoph Gerloff,
Anton Thielmann,
Andre Python,
Arik Reuter,
Thomas Kneib and
Benjamin Säfken
Additional contact information
Christoph Weisser: Georg-August-Universität Göttingen
Christoph Gerloff: Georg-August-Universität Göttingen
Anton Thielmann: Georg-August-Universität Göttingen
Andre Python: Zhejiang University
Arik Reuter: Georg-August-Universität Göttingen
Thomas Kneib: Georg-August-Universität Göttingen
Benjamin Säfken: Clausthal University of Technology
Computational Statistics, 2023, vol. 38, issue 2, No 5, 647-674
Abstract:
Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the performance of the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that standard coherence scores that are often used for the evaluation of topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.
Keywords: Topic models; Collapsed Gibbs sampler algorithm for the Dirichlet multinomial model; Gamma-Poisson mixture topic model; Latent Dirichlet allocation; Model evaluation; Pseudo-document simulation; Covid-19; Social media; Twitter
Date: 2023
Citations in EconPapers: 1
Downloads:
http://link.springer.com/10.1007/s00180-022-01246-z Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:compst:v:38:y:2023:i:2:d:10.1007_s00180-022-01246-z
Ordering information: This journal article can be ordered from
http://www.springer.com/statistics/journal/180/PS2
DOI: 10.1007/s00180-022-01246-z
Computational Statistics is currently edited by Wataru Sakamoto, Ricardo Cao and Jürgen Symanzik
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.