Distinguishing Human- and AI-Generated Image Descriptions Using CLIP Similarity and Transformer-Based Classification
Daniela Onita,
Matei-Vasile Căpîlnaș and
Adriana Baciu (Birlutiu)
Additional contact information
Daniela Onita, Matei-Vasile Căpîlnaș, and Adriana Baciu (Birlutiu): Department of Computer Science and Engineering, “1 Decembrie 1918” University of Alba Iulia, 5, Gabriel Bethlen, 515900 Alba Iulia, Romania
Mathematics, 2025, vol. 13, issue 19, 1-19
Abstract:
Recent advances in vision-language models such as BLIP-2 have made AI-generated image descriptions increasingly fluent and difficult to distinguish from human-authored text. This paper investigates whether the two can still be reliably told apart, introducing a novel bilingual dataset of English and Romanian captions for the task. The English subset was derived from the T4SA dataset; AI-generated captions were produced with BLIP-2 and translated into Romanian using MarianMT, while human-written Romanian captions were collected through manual annotation. We analyze the problem from two perspectives: (i) semantic alignment, measured with CLIP similarity, and (ii) supervised classification with both traditional and transformer-based models. Our results show that BERT achieves over 95% cross-validation accuracy (F1 = 0.95, ROC AUC = 0.99) in distinguishing AI-generated from human-written texts, while simpler classifiers such as Logistic Regression also reach competitive scores (F1 ≈ 0.88). Beyond classification, semantic and linguistic analyses reveal systematic cross-lingual differences: English captions are significantly longer and more verbose, whereas Romanian texts, though often more concise, align more closely with the visual content. Romanian was chosen as a representative low-resource language; studying these differences in such a setting offers insights into multilingual AI detection and into the challenges of vision-language modeling. These findings underline the contribution of this work: a publicly available bilingual dataset and the first systematic comparison of human- and AI-generated captions in a high- and a low-resource language.
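To make the first analysis step concrete, the sketch below shows how image–caption semantic alignment can be scored with CLIP, in the spirit of the paper's alignment analysis. This is a minimal illustration, not the authors' released code: the checkpoint name (openai/clip-vit-base-patch32) and the helper clip_similarity are assumptions.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint, not specified by the paper
    model = CLIPModel.from_pretrained(MODEL_NAME)
    processor = CLIPProcessor.from_pretrained(MODEL_NAME)

    def clip_similarity(image_path: str, caption: str) -> float:
        # Cosine similarity between the CLIP image and text embeddings.
        image = Image.open(image_path).convert("RGB")
        inputs = processor(text=[caption], images=image,
                           return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        return float((img @ txt.T).item())

    # Usage: compare a human-written and a BLIP-2-style caption for the same image.
    # clip_similarity("photo.jpg", "Friends laughing at a street market")
    # clip_similarity("photo.jpg", "a group of people standing in a street")

A higher score indicates tighter image–text alignment; the paper reports that the (often more concise) Romanian captions tend to align more closely with the visual content.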
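For the second step, a minimal sketch of the Logistic Regression baseline mentioned in the abstract, assuming TF-IDF features; the record does not specify the authors' exact feature pipeline, so the vectorizer settings and toy captions are placeholders.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy placeholder captions; label 1 = AI-generated (e.g., BLIP-2), 0 = human-written.
    texts = ["a man riding a bicycle down a city street",
             "Caught my neighbour racing his old bike past the bakery again!"]
    labels = [1, 0]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)

    # With the real bilingual dataset, evaluate as in the paper (cross-validated F1):
    # from sklearn.model_selection import cross_val_score
    # scores = cross_val_score(clf, all_texts, all_labels, cv=5, scoring="f1")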
Keywords: human vs. AI text generation; AI-generated text detection; image–text alignment; transformer-based models; multilingual natural language processing
JEL-codes: C
Date: 2025
Downloads:
https://www.mdpi.com/2227-7390/13/19/3228/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/19/3228/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:19:p:3228-:d:1766935