Computational thematics: comparing algorithms for clustering the genres of literary fiction
Oleg Sobchuk and Artjoms Šeļa
Additional contact information
Oleg Sobchuk: Max Planck Institute for Evolutionary Anthropology
Artjoms Šeļa: Polish Academy of Sciences
Palgrave Communications, 2024, vol. 11, issue 1, 1-12
Abstract:
What are the best methods of capturing thematic similarity between literary texts? Knowing the answer to this question would be useful for automatic clustering of book genres, or any other thematic grouping. This paper compares a variety of algorithms for unsupervised learning of thematic similarities between texts, which we call “computational thematics”. These algorithms belong to three steps of analysis: text pre-processing, extraction of text features, and measuring distances between the lists of features. Each of these steps includes a variety of options. We test all possible combinations of these options. Each combination of algorithms is tasked with clustering a corpus of books belonging to four pre-tagged genres of fiction. This clustering is then validated against the “ground truth” genre labels. Comparing the algorithms in this way identifies the best and the worst combinations for computational thematic analysis. To illustrate the difference between the best and the worst methods, we then cluster 5000 random novels from the HathiTrust corpus of fiction.
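The comparison described in the abstract (choose one option per step, cluster, score against genre labels, repeat for every combination) can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy corpus, the two options per step, the average-linkage clustering, and the purity score are all stand-ins chosen here, not the options, clustering method, or validation measure evaluated in the paper.

```python
from collections import Counter
from itertools import product
import math

# Toy stand-in corpus of (genre label, text) pairs; hypothetical, not data from the paper.
CORPUS = [
    ("detective", "The detective examined the body and questioned the suspect about the murder"),
    ("detective", "A murder suspect lied to the detective about the stolen evidence"),
    ("scifi", "The starship crew landed on the planet and met an alien robot"),
    ("scifi", "An alien robot repaired the starship orbiting the distant planet"),
]
STOPWORDS = {"the", "a", "an", "and", "to", "on", "about"}

# Step 1: pre-processing options (illustrative).
def lowercase(text):
    return text.lower().split()

def lowercase_no_stopwords(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

# Step 2: feature-extraction options (illustrative).
def counts(tokens):
    return dict(Counter(tokens))

def presence(tokens):
    return {w: 1 for w in set(tokens)}

# Step 3: distance options (illustrative).
def cosine(a, b):
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def manhattan(a, b):
    return sum(abs(a.get(w, 0) - b.get(w, 0)) for w in set(a) | set(b))

PREPROCESS = [lowercase, lowercase_no_stopwords]
FEATURES = [counts, presence]
DISTANCES = [cosine, manhattan]

def cluster(vectors, dist, k):
    """Average-linkage agglomerative clustering down to k clusters (a stand-in method)."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        return sum(dist(vectors[i], vectors[j]) for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = [(x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))]
        x, y = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(y)  # y > x, so index x stays valid after the pop
        clusters[x].extend(merged)
    return clusters

def purity(clusters, labels):
    """Share of texts that carry the majority genre label of their cluster."""
    majority = sum(max(Counter(labels[i] for i in c).values()) for c in clusters)
    return majority / len(labels)

def run_grid(corpus):
    """Score every (pre-processing, features, distance) combination against the labels."""
    labels = [genre for genre, _ in corpus]
    k = len(set(labels))
    results = {}
    for pre, feat, dist in product(PREPROCESS, FEATURES, DISTANCES):
        vectors = [feat(pre(text)) for _, text in corpus]
        key = (pre.__name__, feat.__name__, dist.__name__)
        results[key] = purity(cluster(vectors, dist, k), labels)
    return results

if __name__ == "__main__":
    # Print combinations from best to worst score.
    for combo, score in sorted(run_grid(CORPUS).items(), key=lambda kv: -kv[1]):
        print(f"{score:.2f}", " + ".join(combo))
```

With two options at each of the three steps, the grid contains 2 × 2 × 2 = 8 combinations; adding an option to any step multiplies the number of runs, which is why an exhaustive comparison grows quickly.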
Date: 2024
Downloads: (external link)
http://link.springer.com/10.1057/s41599-024-02933-6 Abstract (text/html)
Access to full text is restricted to subscribers.
Persistent link: https://EconPapers.repec.org/RePEc:pal:palcom:v:11:y:2024:i:1:d:10.1057_s41599-024-02933-6
Ordering information: This journal article can be ordered from
https://www.nature.com/palcomms/about
DOI: 10.1057/s41599-024-02933-6
More articles in Palgrave Communications from Palgrave Macmillan