Topic modeling, long texts and the best number of topics. Some Problems and solutions

Sbalchiero, Stefano; Eder, Maciej

Stefano Sbalchiero and Maciej Eder
Additional contact information
Stefano Sbalchiero: University of Padova
Maciej Eder: Polish Academy of Sciences and Pedagogical University of Kraków

Quality & Quantity: International Journal of Methodology, 2020, vol. 54, issue 4, No 1, 1095-1108

Abstract: The main aim of this article is to present the results of different experiments focused on the model-fitting process in topic modeling and its accuracy when applied to long texts. The digital era has made available both enormous quantities of textual data and technological advances that have facilitated the development of techniques to automate data coding and analysis. In the field of topic modeling, different procedures have been developed to analyze larger and larger collections of texts, namely corpora, but this has posed, and continues to pose, a series of methodological questions that urgently need to be resolved. Therefore, through a series of different experiments, this article starts from the following consideration: in Latent Dirichlet Allocation (LDA), a generative probabilistic model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava, Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS), 101(Supplement 1):5228–5235, 2004), the problem of model fitting is crucial because the LDA algorithm demands that the number of topics be specified a priori. Needless to say, the number of topics to detect in a corpus is a parameter that affects the results of the analysis. Since there is a lack of experiments applied to long texts, our article tries to shed new light on the complex relationship between text length and the optimal number of topics. In the conclusions, we present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size, and we formulate it in the form of a mathematical model.
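
The abstract reports a power-law relation between the optimal number of topics and the sample size but does not state its coefficients here. As a minimal sketch, assuming the relation has the generic form k = a * n^b, the exponent and scale can be estimated by ordinary least squares on log-transformed values. The function name `fit_power_law` and the data points are hypothetical; the data are synthetic and are not taken from the article.

```python
import math

def fit_power_law(sizes, topics):
    """Estimate (a, b) in topics = a * sizes**b via least squares in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(k) for k in topics]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the regression line log(k) = log(a) + b * log(n)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic illustration only (NOT results from the article):
# points generated exactly from a = 2.0, b = 0.5.
sizes = [100, 400, 1600, 6400]
topics = [2.0 * s ** 0.5 for s in sizes]
a, b = fit_power_law(sizes, topics)
```

In practice the `topics` values would be the per-sample optima found by some model-fit criterion (the keywords mention the log-likelihood of the model); how those optima are obtained is outside this sketch.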

Keywords: Topic modeling; Latent Dirichlet Allocation; Long texts; Log-likelihood for the model; Best number of topics
Date: 2020
Citations in EconPapers: 14

Downloads: (external link)
http://link.springer.com/10.1007/s11135-020-00976-w Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:qualqt:v:54:y:2020:i:4:d:10.1007_s11135-020-00976-w

Ordering information: This journal article can be ordered from
http://www.springer.com/economics/journal/11135

DOI: 10.1007/s11135-020-00976-w

Quality & Quantity: International Journal of Methodology is currently edited by Vittorio Capecchi

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.