Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models
Longbin Jin,
Hyuntaek Jung,
Hyo Jin Jon and
Eun Yi Kim
Additional contact information
Longbin Jin: Artificial Intelligence & Computer Vision Lab., Konkuk University, Seoul 05029, Republic of Korea
Hyuntaek Jung: Artificial Intelligence & Computer Vision Lab., Konkuk University, Seoul 05029, Republic of Korea
Hyo Jin Jon: Artificial Intelligence & Computer Vision Lab., Konkuk University, Seoul 05029, Republic of Korea
Eun Yi Kim: Artificial Intelligence & Computer Vision Lab., Konkuk University, Seoul 05029, Republic of Korea
Mathematics, 2025, vol. 13, issue 9, 1-15
Abstract:
Large-scale Visual-Language Models have demonstrated powerful adaptability in video recognition tasks. However, existing methods typically rely on fine-tuning or text prompt tuning. In this paper, we propose a visual-only prompting method that employs watermark and trademark prompts to bridge the distribution gap between spatiotemporal video data and image-trained Visual-Language Models. Our watermark prompts, produced by a trainable prompt generator, are customized for each video clip. Unlike conventional visual prompts, which often appear as visible noise, watermark prompts are intentionally designed to be imperceptible, ensuring they are not misinterpreted as an adversarial attack. The trademark prompts, bespoke to each video domain, establish the identity of a specific video type. Integrating watermark prompts into video frames and prepending trademark prompts to the per-frame embeddings significantly boosts the capability of the Visual-Language Model to understand video. Notably, our approach improves the adaptability of the CLIP model to various video action recognition datasets, achieving performance gains of 16.8%, 18.4%, and 13.8% on HMDB-51, UCF-101, and the egocentric dataset EPIC-Kitchens-100, respectively. Additionally, our visual-only prompting method achieves performance competitive with existing fine-tuning and adaptation methods while requiring fewer learnable parameters. Moreover, through extensive ablation studies, we identify the optimal balance between imperceptibility and adaptability. Code will be made available.
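The abstract describes two components: a per-clip watermark prompt added imperceptibly to the pixels, and domain-level trademark prompts prepended to the per-frame embeddings of a frozen CLIP image encoder. The following PyTorch sketch illustrates that pipeline under stated assumptions; the generator architecture, the epsilon bound, the token count, and all names (WatermarkPromptGenerator, TrademarkPrompts, epsilon) are illustrative choices, since the abstract does not specify them.

    import torch
    import torch.nn as nn

    class WatermarkPromptGenerator(nn.Module):
        # Hypothetical per-clip generator: maps frames to a bounded additive
        # perturbation. The tanh output scaled by epsilon keeps the watermark
        # imperceptible rather than noise-like; epsilon is an assumed knob
        # trading imperceptibility against adaptability.
        def __init__(self, epsilon=4 / 255):
            super().__init__()
            self.epsilon = epsilon
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 3, kernel_size=3, padding=1), nn.Tanh(),
            )

        def forward(self, frames):
            # frames: (B*T, 3, H, W) in [0, 1]
            delta = self.epsilon * self.net(frames)
            return (frames + delta).clamp(0.0, 1.0)

    class TrademarkPrompts(nn.Module):
        # Learnable domain-specific tokens prepended to per-frame embeddings.
        def __init__(self, num_tokens=4, dim=512):
            super().__init__()
            self.tokens = nn.Parameter(0.02 * torch.randn(num_tokens, dim))

        def forward(self, frame_embeds):
            # frame_embeds: (B, T, D) -> (B, num_tokens + T, D)
            b = frame_embeds.size(0)
            return torch.cat([self.tokens.expand(b, -1, -1), frame_embeds], dim=1)

    # Sketch of a forward pass with a frozen CLIP image encoder; encode_image,
    # temporal_pool, and text_features stand in for components not detailed
    # in the abstract:
    #   frames = video.flatten(0, 1)                    # (B*T, 3, H, W)
    #   embeds = clip.encode_image(generator(frames))   # (B*T, D)
    #   embeds = embeds.view(B, T, -1)
    #   logits = temporal_pool(trademark(embeds)) @ text_features.t()

In this reading, only the prompt generator and the trademark tokens would be trained while both CLIP encoders stay frozen, which is consistent with the abstract's claim of fewer learnable parameters than fine-tuning.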
Keywords: visual prompt; video recognition; visual-language model
JEL-codes: C
Date: 2025
Downloads:
https://www.mdpi.com/2227-7390/13/9/1365/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/9/1365/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:9:p:1365-:d:1639696
Mathematics is currently edited by Ms. Emma He
More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.