Principal component analysis for authorship attribution
Jamak Amir,
Savatić Alen and
Can Mehmet
Additional contact information
Jamak Amir: Faculty of Engineering and Natural Sciences, International University of Sarajevo, Sarajevo, Bosnia and Herzegovina
Savatić Alen: Faculty of Engineering and Natural Sciences, International University of Sarajevo, Sarajevo, Bosnia and Herzegovina
Can Mehmet: Faculty of Engineering and Natural Sciences, International University of Sarajevo, Sarajevo, Bosnia and Herzegovina
Business Systems Research, 2012, vol. 3, issue 2, 49-56
Abstract:
Background: To identify the authors of texts with statistical tools, one must first decide which features to use as author characteristics and then extract those features from the texts. The extracted features are mostly counts of so-called function words. Objectives: The extracted data are processed further and compressed into a representation with fewer features, in such a way that the compressed data remain effective discriminators. In this case, the feature space has lower dimensionality than the text itself. Methods/Approach: In this paper, data collected by counting words and characters in around a thousand paragraphs of each sample book underwent a principal component analysis performed using neural networks. Once the analysis was complete, the first principal component was used to distinguish the books written by a given author. Results: The results show that every author leaves a unique signature in written text, which can be discovered by analyzing counts of short words per paragraph. Conclusions: This article demonstrates that authorship can be traced by applying principal component analysis to counts of short words per paragraph. The methodology could also be used for other purposes, such as fraud detection in auditing.
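The pipeline described in the abstract (count short words per paragraph, run a neural-network PCA, keep the first principal component) can be illustrated with a minimal sketch. Several parts are assumptions not confirmed by the abstract: the feature vector is taken to be the number of words of length 1, 2, and 3 in each paragraph, and the neural-network PCA is taken to be Oja's rule for a single linear neuron. The function names `short_word_counts`, `center`, `first_component_oja`, and `project` are hypothetical.

```python
import re


def short_word_counts(paragraph, max_len=3):
    """Count words of length 1..max_len in one paragraph (assumed feature set)."""
    counts = [0] * max_len
    for word in re.findall(r"[A-Za-z]+", paragraph):
        if len(word) <= max_len:
            counts[len(word) - 1] += 1
    return counts


def center(rows):
    """Subtract the per-feature mean so that PCA sees zero-mean data."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_component_oja(rows, epochs=200, lr=0.01):
    """Estimate the first principal component with Oja's rule.

    Assumption: the abstract only says "neural networks"; Oja's rule is
    one standard choice. `rows` must already be centered, and `lr` must
    be small relative to the variance of the data for the update to
    remain stable.
    """
    dim = len(rows[0])
    w = [1.0 / dim ** 0.5] * dim  # start from a unit-length weight vector
    for _ in range(epochs):
        for x in rows:
            y = sum(wi * xi for wi, xi in zip(w, x))  # neuron output
            # Hebbian term y*x with Oja's decay term -y^2*w
            w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    norm = sum(wi * wi for wi in w) ** 0.5
    return [wi / norm for wi in w]


def project(rows, w):
    """Score each centered feature vector along the component w."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in rows]
```

In use, each paragraph of a book would be passed through `short_word_counts`, the resulting rows centered, and the scores from `project` compared across books, for example by their per-book mean. Raw counts from real text may need rescaling or a smaller `lr` before Oja's rule converges.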
Keywords: principal components; authorship attribution; stylometry; text categorization; function words; classification task; stylistic features; syntactic characteristics
Date: 2012
Downloads: (external link)
https://doi.org/10.2478/v10305-012-0012-2 (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:bit:bsrysr:v:3:y:2012:i:2:p:49-56
DOI: 10.2478/v10305-012-0012-2
Business Systems Research is currently edited by Mirjana Pejić Bach
More articles in Business Systems Research from Sciendo
Bibliographic data for series maintained by Peter Golla.