EconPapers

Can adversarial attacks by large language models be attributed?

Manuel Cebrian, Andres Abeliuk and Jan Arne Telle

PLOS Complex Systems, 2026, vol. 3, issue 2, 1-21

Abstract: Attributing outputs from Large Language Models (LLMs) in adversarial settings—such as cyberattacks and disinformation campaigns—presents significant challenges that are likely to grow in importance. We approach this attribution problem from both a theoretical and empirical perspective, drawing on formal language theory (identification in the limit) and data-driven analysis of the expanding LLM ecosystem. By modeling an LLM’s set of possible outputs as a formal language, we analyze whether finite samples of text can uniquely pinpoint the originating model. Our results show that under mild assumptions of overlapping capabilities among models, certain classes of LLMs are fundamentally non-identifiable from their outputs alone. We delineate four regimes of theoretical identifiability: (1) an infinite class of deterministic (discrete) LLM languages is not identifiable (Gold’s classical result from 1967); (2) an infinite class of probabilistic LLMs is also not identifiable (by extension of the deterministic case); (3) a finite class of deterministic LLMs is identifiable (consistent with Angluin’s tell-tale criterion); and (4) even a finite class of probabilistic LLMs can be non-identifiable (we provide a new counterexample establishing this negative result). Complementing these theoretical insights, we quantify the explosion in the number of plausible model origins (hypothesis space) for a given output in recent years. Even under conservative assumptions (each open-source model fine-tuned on at most one new dataset), the count of distinct candidate models doubles approximately every 0.5 years, and allowing multi-dataset fine-tuning combinations yields doubling times as short as 0.28 years. This combinatorial growth, alongside the extraordinary computational cost of brute-force likelihood attribution across all models and potential users, renders exhaustive attribution infeasible in practice. Our findings highlight an urgent need for new strategies and proactive governance to mitigate risks posed by un-attributable, adversarial use of LLMs as their influence continues to expand.

Author summary: When AI-generated attacks—from disinformation to cyberattacks—occur, can we reliably trace them back to their originating language model? This paper establishes theoretical limits, showing that in realistic settings, attributing outputs to specific large language models is provably impossible, even with unlimited data. Empirically, we quantify the explosive growth in the number of plausible model origins, demonstrating how quickly attribution becomes infeasible in practice. These combined results have stark implications for cybersecurity, misinformation mitigation, and AI governance.
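
A minimal sketch of the counting behind the doubling-time figures, assuming a deliberately simple model of the ecosystem: each base model is either used as released or fine-tuned on at most one dataset (the conservative case), or on any combination of up to a few datasets (the combinatorial case). The helper functions, model/dataset counts, and the max_combo parameter below are hypothetical illustrations chosen for this sketch, not figures or code from the article.

```python
# Illustrative sketch: size of the candidate-model "hypothesis space" under
# two fine-tuning assumptions, and the doubling time implied by exponential
# growth between two snapshots of the ecosystem. All numbers are invented.
from math import comb, log

def candidates_single_dataset(n_models: int, n_datasets: int) -> int:
    """Each base model is used as-is or fine-tuned on at most one dataset."""
    return n_models * (1 + n_datasets)

def candidates_multi_dataset(n_models: int, n_datasets: int, max_combo: int) -> int:
    """Each base model may be fine-tuned on any subset of up to max_combo datasets."""
    return n_models * sum(comb(n_datasets, k) for k in range(max_combo + 1))

def doubling_time(count_then: float, count_now: float, years_elapsed: float) -> float:
    """Doubling time implied by exponential growth between two observations."""
    growth_rate = log(count_now / count_then) / years_elapsed
    return log(2) / growth_rate

# Hypothetical ecosystem snapshots one year apart.
early = candidates_single_dataset(n_models=50, n_datasets=200)
late = candidates_single_dataset(n_models=120, n_datasets=600)
print(f"single-dataset doubling time: {doubling_time(early, late, 1.0):.2f} years")

early_c = candidates_multi_dataset(n_models=50, n_datasets=200, max_combo=2)
late_c = candidates_multi_dataset(n_models=120, n_datasets=600, max_combo=2)
print(f"multi-dataset doubling time:  {doubling_time(early_c, late_c, 1.0):.2f} years")
```

With two snapshots, the implied doubling time is ln 2 divided by the fitted exponential growth rate; allowing richer fine-tuning combinations enlarges the hypothesis space faster and so shortens the doubling time, mirroring the 0.5-year versus 0.28-year contrast reported in the abstract.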

Date: 2026

Downloads: (external link)
https://journals.plos.org/complexsystems/article?id=10.1371/journal.pcsy.0000085 (text/html)
https://journals.plos.org/complexsystems/article/f ... 00085&type=printable (application/pdf)


Persistent link: https://EconPapers.repec.org/RePEc:plo:pcsy00:0000085

DOI: 10.1371/journal.pcsy.0000085


More articles in PLOS Complex Systems from Public Library of Science
Bibliographic data for series maintained by complexsystem.

 
Page updated 2026-03-08
Handle: RePEc:plo:pcsy00:0000085