Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution
Mihailo Škorić,
Ranka Stanković,
Milica Ikonić Nešić,
Joanna Byszuk and
Maciej Eder
Additional contact information
Mihailo Škorić: Faculty of Mining and Geology, University of Belgrade, Djusina 7, 11120 Belgrade, Serbia
Ranka Stanković: Faculty of Mining and Geology, University of Belgrade, Djusina 7, 11120 Belgrade, Serbia
Milica Ikonić Nešić: Faculty of Philology, University of Belgrade, Studentski Trg 3, 11000 Belgrade, Serbia
Joanna Byszuk: Institute of Polish Language, Polish Academy of Sciences, al. Mickiewicza 31, 31-120 Kraków, Poland
Maciej Eder: Institute of Polish Language, Polish Academy of Sciences, al. Mickiewicza 31, 31-120 Kraków, Poland
Mathematics, 2022, vol. 10, issue 5, 1-27
Abstract:
This paper explores the effectiveness of parallel stylometric document embeddings for the authorship attribution task by testing a novel approach on literary texts in seven languages, totaling 7051 unique 10,000-token chunks from 700 PoS- and lemma-annotated documents. From these documents we produced four document embedding models using the Stylo R package (word-based, lemma-based, PoS-trigram-based, and PoS-mask-based) and one document embedding model using mBERT for each of the seven languages. We then derived further embeddings as the average, product, minimum, maximum, and l2-norm of these document embedding matrices, and tested each derivation both with and without the mBERT-based document embeddings for each language. Finally, we trained several perceptrons on portions of the dataset to obtain adequate weights for a weighted-combination approach. We evaluated standalone (two baseline) and composite embeddings for classification accuracy, precision, recall, and weighted-average and macro-averaged F1-score, and compared them with one another. For each language, most of our composition methods outperform the baselines (with a couple of methods outperforming all baselines for all languages), with or without mBERT inputs, which were found to have no significant positive impact on the results of our methods.
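The elementwise derivations named in the abstract (average, product, minimum, maximum, and l2-norm of parallel embedding vectors) can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the function name, the toy 4-dimensional vectors, and the model labels are assumptions for demonstration only.

```python
import math

def combine(vectors, mode):
    """Combine equal-length embedding vectors elementwise.

    `vectors` is a list of same-length lists of floats, one per
    embedding model; `mode` selects the elementwise operation.
    """
    dims = list(zip(*vectors))  # group the i-th component of every vector
    if mode == "average":
        return [sum(d) / len(vectors) for d in dims]
    if mode == "product":
        out = []
        for d in dims:
            p = 1.0
            for x in d:
                p *= x
            out.append(p)
        return out
    if mode == "minimum":
        return [min(d) for d in dims]
    if mode == "maximum":
        return [max(d) for d in dims]
    if mode == "l2":
        # l2-norm taken across the parallel models, per dimension
        return [math.sqrt(sum(x * x for x in d)) for d in dims]
    raise ValueError(f"unknown mode: {mode}")

# Toy 4-dimensional embeddings of one document from two stylometric
# models (illustrative values only).
word_based = [0.2, 0.4, 0.1, 0.3]
lemma_based = [0.1, 0.5, 0.2, 0.2]

composite = combine([word_based, lemma_based], "average")
```

In the study such operations are applied over full document embedding matrices per language; the sketch above shows only the per-document, per-dimension principle.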
Keywords: document embeddings; authorship attribution; language modelling; parallel architectures; stylometry; language processing pipelines
JEL-codes: C
Date: 2022
Downloads: (external link)
https://www.mdpi.com/2227-7390/10/5/838/pdf (application/pdf)
https://www.mdpi.com/2227-7390/10/5/838/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:10:y:2022:i:5:p:838-:d:765407