Topic Detection Based on Sentence Embeddings and Agglomerative Clustering with Markov Moment

Bodrunova, Svetlana S.; Orekhov, Andrey V.; Blekanov, Ivan S.; Lyudkevich, Nikolay S.; Tarasov, Nikita A.

Svetlana S. Bodrunova, Andrey V. Orekhov, Ivan S. Blekanov, Nikolay S. Lyudkevich and Nikita A. Tarasov
Additional contact information
Svetlana S. Bodrunova: School of Journalism and Mass Communications, Saint Petersburg State University, 7-9 Universitetskaya embankment, 199034 Saint Petersburg, Russia
Andrey V. Orekhov: School of Journalism and Mass Communications, Saint Petersburg State University, 7-9 Universitetskaya embankment, 199034 Saint Petersburg, Russia
Ivan S. Blekanov: School of Journalism and Mass Communications, Saint Petersburg State University, 7-9 Universitetskaya embankment, 199034 Saint Petersburg, Russia
Nikolay S. Lyudkevich: School of Journalism and Mass Communications, Saint Petersburg State University, 7-9 Universitetskaya embankment, 199034 Saint Petersburg, Russia
Nikita A. Tarasov: School of Journalism and Mass Communications, Saint Petersburg State University, 7-9 Universitetskaya embankment, 199034 Saint Petersburg, Russia

Future Internet, 2020, vol. 12, issue 9, 1-17

Abstract: The paper addresses the problem of optimal text classification in the automated detection of text typology. In conventional approaches to topicality-based text classification (including topic modeling), the number of clusters has to be set by the researcher, and both the optimal number of clusters and the quality of the model that determines the proximity of texts to each other remain unresolved questions. We propose a novel approach to automatically determining the optimal number of clusters that also incorporates an assessment of word proximity of texts, combined with a text encoding model based on sentence embeddings. Our approach combines Universal Sentence Encoder (USE) data pre-processing, agglomerative hierarchical clustering by Ward’s method, and the Markov stopping moment for optimal clustering. The preferred number of clusters is determined based on the “e-2” hypothesis. We set up an experiment on two datasets of real-world labeled data: News20 and BBC. The proposed model is tested against more traditional text representation methods, such as bag-of-words and word2vec, to show that it provides much better clustering quality than the baseline DBSCAN and OPTICS models with different encoding methods. We use three quality metrics to demonstrate that clustering quality does not drop as the number of clusters grows. Thus, we get close to the convergence of text clustering and text classification.
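The pipeline named in the abstract (embed texts, merge them bottom-up with Ward's method, then decide where to stop merging) can be illustrated with a small sketch. This is not the authors' implementation: the toy 2-D points stand in for USE sentence embeddings, the function names are hypothetical, and the "largest jump in merge cost" rule is a simple stand-in for the Markov stopping moment, whose definition the abstract does not give. In practice a library routine such as `scipy.cluster.hierarchy.linkage` with `method="ward"` would likely replace the naive loop.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_merge_costs(vectors):
    """Agglomerate with Ward's criterion; return the cost of each merge in order.

    Each cluster is tracked as (size, centroid). The Ward cost of merging
    clusters A and B is |A||B| / (|A| + |B|) * ||c_A - c_B||^2, i.e. the
    increase in the total within-cluster sum of squares.
    """
    clusters = {i: (1, list(v)) for i, v in enumerate(vectors)}
    next_id = len(vectors)
    costs = []
    while len(clusters) > 1:
        # Naive search over all pairs for the cheapest merge (toy sizes only).
        best = None
        for i, j in combinations(clusters, 2):
            (ni, ci), (nj, cj) = clusters[i], clusters[j]
            cost = ni * nj / (ni + nj) * sq_dist(ci, cj)
            if best is None or cost < best[0]:
                best = (cost, i, j)
        cost, i, j = best
        (ni, ci), (nj, cj) = clusters.pop(i), clusters.pop(j)
        # The merged centroid is the size-weighted mean of the two centroids.
        merged = [(ni * x + nj * y) / (ni + nj) for x, y in zip(ci, cj)]
        clusters[next_id] = (ni + nj, merged)
        next_id += 1
        costs.append(cost)
    return costs


def clusters_by_largest_jump(costs):
    """Stand-in stopping rule (an assumption, NOT the paper's Markov moment):
    stop just before the merge whose cost rises most over the previous one."""
    n = len(costs) + 1
    jumps = [costs[m] - costs[m - 1] for m in range(1, len(costs))]
    m = jumps.index(max(jumps)) + 1  # index of the first "too expensive" merge
    return n - m  # clusters remaining after m merges


# Toy "embeddings": three tight groups of five points each.
offsets = [(0, 0), (0.3, 0), (0, 0.3), (-0.3, 0), (0, -0.3)]
centers = [(0, 0), (10, 0), (5, 9)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

costs = ward_merge_costs(points)
print(clusters_by_largest_jump(costs))
```

The point of the sketch is only that the sequence of merge costs produced by agglomerative clustering carries the signal a stopping rule needs, so the number of clusters does not have to be fixed in advance.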

Keywords: text classification; text clustering; clustering of short texts; neural network algorithms; distributive semantics; sentence embeddings; least squares method; Markov moment; DBSCAN
JEL-codes: O3
Date: 2020
Citations: 4 (in EconPapers)

Downloads: (external link)
https://www.mdpi.com/1999-5903/12/9/144/pdf (application/pdf)
https://www.mdpi.com/1999-5903/12/9/144/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:12:y:2020:i:9:p:144-:d:404427

Future Internet is currently edited by Ms. Grace You

Bibliographic data for series maintained by MDPI Indexing Manager.