Classifying AI vs. Human Content: Integrating BERT and Linguistic Features for Enhanced Classification

Yadav, Abhishek; Mc, Shunmuga Priya

Abhishek Yadav and Shunmuga Priya Mc
Additional contact information
Abhishek Yadav: Amrita School of Physical Sciences, Amrita Vishwa Vidyapeetham
Shunmuga Priya Mc: Amrita Vishwa Vidyapeetham

SN Operations Research Forum, 2025, vol. 6, issue 2, 1-12

Abstract: This study advances the detection of AI-generated content through a novel methodology integrating BERT (bidirectional encoder representations from transformers) with comprehensive linguistic features. The research evaluates three distinct frameworks: utilizing BERT’s last hidden layer outputs independently, combining BERT outputs with its predictions, and a hybrid approach incorporating both BERT-derived features and linguistic markers including readability scores, lexical diversity measures, and structural patterns. Experiments across these frameworks employ logistic regression, random forest, and XGBoost classifiers, with the hybrid XGBoost approach achieving superior accuracy of 83.57% on test data. To enhance transparency and understanding, the study implements LIME (local interpretable model-agnostic explanations), revealing key influential factors in classification decisions, notably BERT encodings and specific linguistic features such as Yule’s Characteristic K, Flesch Reading Ease, and Gunning Fog Index. The integration of machine learning with traditional linguistic analysis demonstrates significant advantages over single-method approaches, particularly in handling diverse writing styles and content types. The findings demonstrate that combining deep learning architectures with traditional linguistic analysis yields more robust AI content detection systems, contributing significantly to digital content verification capabilities and authenticity assessment tools. This research addresses the growing challenge of distinguishing between human and AI-generated content in various domains, including academic writing, news articles, and online content, offering practical applications for content moderation, plagiarism detection, and digital authenticity verification systems.
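
The three linguistic features the abstract names (Yule's Characteristic K, Flesch Reading Ease, and Gunning Fog Index) have standard textbook definitions, and the sketch below computes them with the Python standard library only. It is an illustration, not the authors' feature extraction, which the abstract does not describe. The tokenizer, the sentence splitter, and the vowel-group syllable counter are rough assumptions made here, and all function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercased word tokens (assumed tokenizer: letters, optional apostrophe part)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def count_sentences(text):
    """Count runs of terminal punctuation; never returns less than 1."""
    return max(1, len(re.findall(r"[.!?]+", text)))


def count_syllables(word):
    """Crude heuristic (an assumption): vowel groups, minus a trailing silent 'e'."""
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(1, n)


def yules_k(tokens):
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2, V_i = types seen i times."""
    n = len(tokens)
    if n == 0:
        return 0.0
    spectrum = Counter(Counter(tokens).values())  # i -> number of types with count i
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def flesch_reading_ease(text):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    syllables = sum(count_syllables(t) for t in tokens)
    words_per_sentence = len(tokens) / count_sentences(text)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * (syllables / len(tokens))


def gunning_fog(text):
    """0.4 * (words / sentences + 100 * complex / words); complex = 3+ syllables."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    complex_words = sum(1 for t in tokens if count_syllables(t) >= 3)
    words_per_sentence = len(tokens) / count_sentences(text)
    return 0.4 * (words_per_sentence + 100.0 * complex_words / len(tokens))


def linguistic_features(text):
    """Bundle the three scores into one feature dictionary."""
    return {
        "yules_k": yules_k(tokenize(text)),
        "flesch_reading_ease": flesch_reading_ease(text),
        "gunning_fog": gunning_fog(text),
    }


if __name__ == "__main__":
    print(linguistic_features("The cat sat. The dog ran."))
```

In a hybrid setup like the one the abstract outlines, a dictionary of this kind would presumably be turned into a numeric vector and joined with the BERT-derived features before being passed to a classifier; how the authors actually do this is not stated here.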

Keywords: Generative AI; Detection; Classification; Linguistic features; BERT
Date: 2025

Downloads: (external link)
http://link.springer.com/10.1007/s43069-025-00486-1 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:snopef:v:6:y:2025:i:2:d:10.1007_s43069-025-00486-1

Ordering information: This journal article can be ordered from
https://www.springer.com/journal/43069

DOI: 10.1007/s43069-025-00486-1

SN Operations Research Forum is currently edited by Marco Lübbecke

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.