Towards Human-Interactive Controllable Video Captioning with Efficient Modeling
Yoonseok Heo,
Taehoon Kim,
Seunghwan Kim,
Jungyun Seo and
Juae Kim
Additional contact information
Yoonseok Heo: Department of Computer Science and Engineering, Sogang University, Seoul 04107, Republic of Korea
Taehoon Kim: LG AI Research, Seoul 07796, Republic of Korea
Seunghwan Kim: LG AI Research, Seoul 07796, Republic of Korea
Jungyun Seo: LG AI Research, Seoul 07796, Republic of Korea
Juae Kim: Department of English Linguistics and Language Technology, Division of Language & AI, Hankuk University of Foreign Studies, Seoul 02450, Republic of Korea
Mathematics, 2024, vol. 12, issue 13, 1-14
Abstract:
Video captioning is the task of describing the visual scene of a given video in natural language. Several lines of research have focused on developing large-scale models in a transfer learning paradigm, with a major challenge being the tradeoff between scalability and performance in limited environments. To address this problem, we propose a simple yet effective encoder–decoder-based video captioning model that integrates transformers and CLIP, both widely adopted in the vision and language domains, together with appropriate temporal feature embedding modules. Taking this proposal a step further, we also address the challenge of human-interactive video captioning, where captions are tailored to the specific information desired by humans. To design a human-interactive environment, we assume that a human offers an object or action in the video as a short prompt; in turn, the system provides a detailed explanation regarding the prompt. We embed human prompts with an LSTM-based prompt encoder and leverage soft prompting to tune the model effectively. We extensively evaluated our model on benchmark datasets, demonstrating comparable results; in particular, on the MSR-VTT dataset we achieve state-of-the-art performance with a 4% improvement. In addition, we show the potential of human-interactive video captioning through quantitative and qualitative analysis.
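The pipeline the abstract describes (CLIP frame features with temporal position embeddings feeding a transformer encoder-decoder, plus an LSTM prompt encoder whose hidden states are prepended to the visual sequence as soft prompts) can be sketched in PyTorch as below. All module names, dimensions, and the number of soft-prompt vectors are illustrative assumptions for exposition, not the authors' implementation:

```python
import torch
import torch.nn as nn

class PromptedVideoCaptioner(nn.Module):
    """Hypothetical sketch of a prompt-conditioned video captioner:
    projected CLIP frame features plus temporal embeddings form the
    encoder input; an LSTM encodes a short human prompt, and its last
    few hidden states are prepended as soft prompts."""

    def __init__(self, feat_dim=512, d_model=256, vocab_size=1000,
                 max_frames=32, n_soft=4):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, d_model)        # project CLIP features
        self.temporal_emb = nn.Embedding(max_frames, d_model)  # temporal position embedding
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.prompt_lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.n_soft = n_soft
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, prompt_ids, caption_ids):
        # frame_feats: (B, T, feat_dim) pre-extracted CLIP frame features
        B, T, _ = frame_feats.shape
        pos = torch.arange(T, device=frame_feats.device)
        vis = self.visual_proj(frame_feats) + self.temporal_emb(pos)

        # LSTM prompt encoder: keep the last n_soft hidden states as soft prompts
        prompt_h, _ = self.prompt_lstm(self.word_emb(prompt_ids))
        soft = prompt_h[:, -self.n_soft:, :]

        # Prepend soft prompts to the visual sequence before the encoder
        src = torch.cat([soft, vis], dim=1)

        # Teacher-forced decoding with a causal mask over caption tokens
        tgt = self.word_emb(caption_ids)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(caption_ids.size(1))
        dec = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out(dec)  # per-token vocabulary logits

model = PromptedVideoCaptioner()
frames = torch.randn(2, 8, 512)           # 2 clips, 8 frames of CLIP features
prompt = torch.randint(0, 1000, (2, 5))   # short human prompt, e.g. an object or action
caption = torch.randint(0, 1000, (2, 7))  # caption tokens for teacher forcing
logits = model(frames, prompt, caption)   # -> torch.Size([2, 7, 1000])
```

Prepending soft prompts to the encoder input lets the prompt steer cross-attention without changing the decoder; at inference, the caption would be generated token by token instead of teacher-forced.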
Keywords: video captioning; controllable video captioning; human-interactive; multimodal representation learning (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2024
References: View complete reference list from CitEc
Downloads: (external link)
https://www.mdpi.com/2227-7390/12/13/2037/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/13/2037/ (text/html)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:13:p:2037-:d:1426092
Access Statistics for this article
Mathematics is currently edited by Ms. Emma He
More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.