Mapping the unseen in practice: comparing latent Dirichlet allocation and BERTopic for navigating topic spaces

Benz, Pierre; Pradier, Carolina; Kozlowski, Diego; Shokida, Natsumi S.; Larivière, Vincent

Additional contact information
Pierre Benz: École de bibliothéconomie et des sciences de l’information, Université de Montréal
Carolina Pradier: École de bibliothéconomie et des sciences de l’information, Université de Montréal
Diego Kozlowski: École de bibliothéconomie et des sciences de l’information, Université de Montréal
Natsumi S. Shokida: École de bibliothéconomie et des sciences de l’information, Université de Montréal
Vincent Larivière: École de bibliothéconomie et des sciences de l’information, Université de Montréal

Scientometrics, 2025, vol. 130, issue 7, No 21, 3839-3870

Abstract: This article compares two widely used topic modeling techniques, latent Dirichlet allocation (LDA) and BERTopic. The former is a Bayesian probabilistic model, whereas the latter is rooted in deep learning. It remains unclear what these differences imply in practice and how they contribute to our sociological understanding of the inner workings of science. This paper compares the results obtained by applying LDA and BERTopic to the same dataset, composed of all scientific articles (n = 34,797) authored by all biology professors in Switzerland between 2008 and 2020. We propose a step-by-step demonstration, from data pre-processing to the results. We emphasize that understanding how the two techniques function is essential for interpreting their outcomes effectively and for weighing their respective strengths and weaknesses. Although they differ in their operationalization, LDA and BERTopic produce topic spaces with a similar global configuration. However, major differences are observed when focusing on specific multidimensional concepts, such as "gene". Drawing on this empirical demonstration, we stress that topic modeling offers highly valuable ground for understanding the semantic structure of scientific fields when combined with in-depth knowledge of the object under scrutiny.

Keywords: LDA; BERTopic; Topic modeling; Deep learning; Embeddings; Probabilities; Biology
Date: 2025

Downloads: (external link)
http://link.springer.com/10.1007/s11192-025-05339-6 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:scient:v:130:y:2025:i:7:d:10.1007_s11192-025-05339-6

Ordering information: This journal article can be ordered from
http://www.springer.com/economics/journal/11192

DOI: 10.1007/s11192-025-05339-6

Scientometrics is currently edited by Wolfgang Glänzel

More articles in Scientometrics from Springer, Akadémiai Kiadó
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.