EconPapers    
Economics at your fingertips  
 

Validation of scientific topic models using graph analysis and corpus metadata

Manuel A. Vázquez (), Jorge Pereira-Delgado (), Jesús Cid-Sueiro () and Jerónimo Arenas-García ()
Additional contact information
Manuel A. Vázquez: Universidad Carlos III de Madrid
Jorge Pereira-Delgado: Universidad Carlos III de Madrid
Jesús Cid-Sueiro: Universidad Carlos III de Madrid
Jerónimo Arenas-García: Universidad Carlos III de Madrid

Scientometrics, 2022, vol. 127, issue 9, No 17, 5458 pages

Abstract: Abstract Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology an innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which are focused on the quality of topics, are not a good fit in applications where the quality of the document similarity relations inferred from the topic model is especially relevant. Relying on graph analysis techniques, the aim of our work is to state a new methodology for the selection of hyperparameters which is specifically oriented to optimize the similarity metrics emanating from the topic model. In order to do this, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second metric measures the alignment between the graph derived from the LDA model and another obtained using metadata available for the corresponding corpus. Through experiments on various corpora related to STI, it is shown that the proposed metrics provide relevant indicators to select the number of topics and build persistent topic models that are consistent with the metadata. Their use, which can be extended to other topic models beyond LDA, could facilitate the systematic adoption of this kind of techniques in STI policy analysis and design.

Keywords: Topic modeling; Latent Dirichlet Allocation; Graph analysis; Semantic similarity; Model validation (search for similar items in EconPapers)
Date: 2022
References: View references in EconPapers View complete reference list from CitEc
Citations:

Downloads: (external link)
http://link.springer.com/10.1007/s11192-022-04318-5 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:spr:scient:v:127:y:2022:i:9:d:10.1007_s11192-022-04318-5

Ordering information: This journal article can be ordered from
http://www.springer.com/economics/journal/11192

DOI: 10.1007/s11192-022-04318-5

Access Statistics for this article

Scientometrics is currently edited by Wolfgang Glänzel

More articles in Scientometrics from Springer, Akadémiai Kiadó
Bibliographic data for series maintained by Sonal Shukla () and Springer Nature Abstracting and Indexing ().

 
Page updated 2025-03-20
Handle: RePEc:spr:scient:v:127:y:2022:i:9:d:10.1007_s11192-022-04318-5