TIBW: Task-Independent Backdoor Watermarking with Fine-Tuning Resilience for Pre-Trained Language Models

Weichuan Mo, Kongyang Chen and Yatie Xiao
Additional contact information
Weichuan Mo: School of Artificial Intelligence, Guangzhou University, Guangzhou 510006, China
Kongyang Chen: School of Artificial Intelligence, Guangzhou University, Guangzhou 510006, China
Yatie Xiao: School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China

Mathematics, 2025, vol. 13, issue 2, 1-14

Abstract: Pre-trained language models such as BERT, GPT-3, and T5 have made significant advancements in natural language processing (NLP). However, their widespread adoption raises concerns about intellectual property (IP) protection, as unauthorized use can undermine innovation. Watermarking has emerged as a promising solution for model ownership verification, but its application to NLP models presents unique challenges, particularly in ensuring robustness against fine-tuning and preventing interference with downstream tasks. This paper presents a novel watermarking scheme, TIBW (Task-Independent Backdoor Watermarking), that embeds robust, task-independent backdoor watermarks into pre-trained language models. By implementing a Trigger–Target Word Pair Search Algorithm that selects trigger–target word pairs with maximal semantic dissimilarity, our approach ensures that the watermark remains effective even after extensive fine-tuning. Additionally, we introduce Parameter Relationship Embedding (PRE) to subtly modify the model’s embedding layer, reinforcing the association between trigger and target words without degrading the model performance. We also design a comprehensive watermark verification process that evaluates task behavior consistency, quantified by the Watermark Embedding Success Rate (WESR). Our experiments across five benchmark NLP tasks demonstrate that the proposed watermarking method maintains a near-baseline performance on clean inputs while achieving a high WESR, outperforming existing baselines in both robustness and stealthiness. Furthermore, the watermark persists reliably even after additional fine-tuning, highlighting its resilience against potential watermark removal attempts. This work provides a secure and reliable IP protection mechanism for NLP models, ensuring watermark integrity across diverse applications.
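The pair-search and verification steps described in the abstract can be made concrete with a small Python sketch (not the authors' implementation): it ranks candidate trigger–target word pairs by cosine similarity of their word embeddings, keeps the most dissimilar pair, and computes a WESR-style score as the fraction of trigger-stamped inputs mapped to the target. The vocab_embeddings lookup, the candidate word lists, and the exact-match criterion are illustrative assumptions; in practice the vectors would come from the pre-trained model's input embedding layer.

import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def search_trigger_target_pair(vocab_embeddings, candidate_triggers, candidate_targets):
    # Return the candidate (trigger, target) pair whose embeddings are most
    # semantically dissimilar, i.e. the pair with the lowest cosine similarity.
    best_pair, best_sim = None, float("inf")
    for trig in candidate_triggers:
        for tgt in candidate_targets:
            sim = cosine_similarity(vocab_embeddings[trig], vocab_embeddings[tgt])
            if sim < best_sim:
                best_sim, best_pair = sim, (trig, tgt)
    return best_pair

def watermark_embedding_success_rate(predictions, target_label):
    # WESR-style metric: fraction of trigger-stamped inputs whose prediction
    # matches the watermark target.
    if not predictions:
        return 0.0
    return sum(1 for p in predictions if p == target_label) / len(predictions)

# Toy usage with made-up 2-D "embeddings"; real use would read much higher-
# dimensional vectors from the model's embedding matrix.
if __name__ == "__main__":
    emb = {
        "cf": np.array([0.95, 0.05]),
        "movie": np.array([0.90, 0.15]),
        "negative": np.array([0.05, 0.95]),
    }
    print(search_trigger_target_pair(emb, ["cf"], ["movie", "negative"]))
    print(watermark_embedding_success_rate(["negative", "negative", "positive"], "negative"))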

Keywords: pre-trained language model; backdoor; watermarking; fine-tuning
JEL-codes: C
Date: 2025

Downloads: (external link)
https://www.mdpi.com/2227-7390/13/2/272/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/2/272/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:2:p:272-:d:1567853

Handle: RePEc:gam:jmathe:v:13:y:2025:i:2:p:272-:d:1567853