Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks

Romanov, Aleksandr; Kurtukova, Anna; Shelupanov, Alexander; Fedotova, Anastasia; Goncharov, Valery

Additional contact information
Aleksandr Romanov: Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia
Anna Kurtukova: Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia
Alexander Shelupanov: Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia
Anastasia Fedotova: Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia
Valery Goncharov: Department of Automation and Robotics, The National Research Tomsk Polytechnic University, 634050 Tomsk, Russia

Future Internet, 2020, vol. 13, issue 1, 1-16

Abstract: The article explores approaches to determining the author of a natural-language text and the advantages and disadvantages of these approaches. The problem matters because of the active digitalization of society and the shift of most everyday activities online. Authorship identification methods are particularly useful for information security and forensics. For example, such methods can be used to identify the authors of suicide notes and other texts subjected to forensic examination. Another area of application is plagiarism detection, which is relevant both to intellectual property protection in the digital space and to education. The article describes identifying the author of a Russian-language text using a support vector machine (SVM) and deep neural network architectures: long short-term memory (LSTM), convolutional neural networks (CNN) with attention, and Transformer. The results show that all the considered algorithms are suitable for solving the authorship identification problem, but SVM shows the best accuracy. The average accuracy of SVM reaches 96%. This is due to carefully chosen parameters and a feature space that includes statistical and semantic features (including those extracted through aspect analysis). Deep neural networks are inferior to SVM in accuracy and reach only 93%. The study also evaluates how attacks on the method affect the models’ accuracy. Experiments show that the SVM-based methods are not robust to deliberate text anonymization. In comparison, the loss in accuracy of deep neural networks does not exceed 20%. The Transformer architecture is the most effective for anonymized texts and achieves 81% accuracy.

Keywords: authorship; text mining; machine learning; attribution; neural networks; deep learning; forensic intelligence
JEL-codes: O3
Date: 2020

Downloads: (external link)
https://www.mdpi.com/1999-5903/13/1/3/pdf (application/pdf)
https://www.mdpi.com/1999-5903/13/1/3/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:13:y:2020:i:1:p:3-:d:468370

Future Internet is currently edited by Ms. Grace You

Bibliographic data for series maintained by MDPI Indexing Manager.