Two-stage topic modelling of scientific publications: A case study of University of Nairobi, Kenya
Leacky Muchene and
Wende Safari
PLOS ONE, 2021, vol. 16, issue 1, 1-20
Abstract:
Unsupervised statistical analysis of unstructured data has gained wide acceptance especially in natural language processing and text mining domains. Topic modelling with Latent Dirichlet Allocation is one such statistical tool that has been successfully applied to synthesize collections of legal, biomedical documents and journalistic topics. We applied a novel two-stage topic modelling approach and illustrated the methodology with data from a collection of published abstracts from the University of Nairobi, Kenya. In the first stage, topic modelling with Latent Dirichlet Allocation was applied to derive the per-document topic probabilities. To more succinctly present the topics, in the second stage, hierarchical clustering with Hellinger distance was applied to derive the final clusters of topics. The analysis showed that dominant research themes in the university include: HIV and malaria research, research on agricultural and veterinary services as well as cross-cutting themes in humanities and social sciences. Further, the use of hierarchical clustering in the second stage reduces the discovered latent topics to clusters of homogeneous topics.
Date: 2021
References: View complete reference list from CitEc
Citations: View citations in EconPapers (2)
Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0243208 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 43208&type=printable (application/pdf)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0243208
DOI: 10.1371/journal.pone.0243208
Access Statistics for this article
More articles in PLOS ONE from Public Library of Science
Bibliographic data for series maintained by plosone ().