A Language and Its Holes: The First-Order Homology of the Large-Scale Geometrical Structure of a Natural Language
Vasilii A. Gromov,
Quynh Nhu Dang and
Asel S. Erbolova
Complexity, 2025, vol. 2025, 1-15
Abstract:
The present paper employs topological data analysis methods to reveal ‘holes’ (stable persistent homologies) in the semantic spaces of words, bigrams, and trigrams of the English and Russian languages, and to ascertain their boundaries. Furthermore, the paper selects those holes that belong to the large-scale (coarse-grained) structure of the language and are not just local inhomogeneities of the sample; there appear to be around a dozen of them for each language. The boundaries of these holes delineate ‘blind spots’ of the respective language: regions of the semantic space that contain no words, bigrams, or trigrams of the language, that is, regions of concepts that the language cannot see through its lens. The secondary goal of the paper is to solve the bot-detection problem in its strong statement, in which the classifiers are trained on one set of bots and tested on another. To this end, we estimate the average distances from the words, bigrams, and trigrams of a text to the boundary of the nearest ‘hole’, for texts both written by humans and generated by bots, and construct classifiers. The classifiers show comparatively good results: the average accuracy amounts to 0.8.
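The distance feature described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each hole boundary is available as a finite list of points in the embedding space (for example, the vertices of a representative cycle), that distances are Euclidean, and that the distance to a boundary can be approximated by the distance to its closest listed point. The function name `mean_distance_to_nearest_hole` and the toy coordinates are hypothetical.

```python
import math


def mean_distance_to_nearest_hole(embeddings, hole_boundaries):
    """Average distance from each embedding to the closest hole-boundary point.

    embeddings      -- list of vectors, one per word/bigram/trigram of a text
    hole_boundaries -- list of holes, each given as a list of boundary points
                       (assumption: boundaries are finite point sets)
    """
    # Pool the boundary points of all holes: the nearest hole is the one
    # that owns the closest boundary point.
    boundary_points = [p for boundary in hole_boundaries for p in boundary]
    if not embeddings or not boundary_points:
        raise ValueError("need at least one embedding and one boundary point")

    # For each embedding, take the Euclidean distance to the closest point.
    nearest = [
        min(math.dist(vector, point) for point in boundary_points)
        for vector in embeddings
    ]
    return sum(nearest) / len(nearest)


# Toy example in two dimensions (made-up coordinates, not data from the paper).
text_embeddings = [(0.0, 0.0), (3.0, 4.0)]
holes = [
    [(0.0, 1.0), (1.0, 0.0)],  # boundary points of a first hole
    [(3.0, 0.0)],              # boundary points of a second hole
]
feature = mean_distance_to_nearest_hole(text_embeddings, holes)
print(feature)  # 2.5
```

Under these assumptions, one such value per text (computed separately for words, bigrams, and trigrams) would serve as a classifier input; in the strong statement of the problem, the classifier would be fitted on texts from one set of bots and evaluated on texts from a different set.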
Date: 2025
Downloads: (external link)
http://downloads.hindawi.com/journals/complexity/2025/9659172.pdf (application/pdf)
http://downloads.hindawi.com/journals/complexity/2025/9659172.xml (application/xml)
Persistent link: https://EconPapers.repec.org/RePEc:hin:complx:9659172
DOI: 10.1155/cplx/9659172