EconPapers    

Towards Human-Interactive Controllable Video Captioning with Efficient Modeling

Yoonseok Heo, Taehoon Kim, Seunghwan Kim, Jungyun Seo and Juae Kim ()
Additional contact information
Yoonseok Heo: Department of Computer Science and Engineering, Sogang University, Seoul 04107, Republic of Korea
Taehoon Kim: LG AI Research, Seoul 07796, Republic of Korea
Seunghwan Kim: LG AI Research, Seoul 07796, Republic of Korea
Jungyun Seo: LG AI Research, Seoul 07796, Republic of Korea
Juae Kim: Department of English Linguistics and Language Technology, Division of Language & AI, Hankuk University of Foreign Studies, Seoul 02450, Republic of Korea

Mathematics, 2024, vol. 12, issue 13, 1-14

Abstract: Video captioning is the task of describing the visual scene of a given video in natural language. Several lines of research have focused on developing large-scale models in a transfer learning paradigm, with the major challenge being the tradeoff between scalability and performance in limited environments. To address this problem, we propose a simple yet effective encoder–decoder-based video captioning model integrating transformers and CLIP, both of which are widely adopted in the vision and language domains, together with appropriate temporal feature embedding modules. Taking this proposal a step further, we also address the challenge of human-interactive video captioning, where the captions are tailored to specific information desired by humans. To design a human-interactive environment, we assume that a human offers an object or action in the video as a short prompt; in turn, the system provides a detailed explanation regarding the prompt. We embed human prompts with an LSTM-based prompt encoder and leverage soft prompting to tune the model effectively. We extensively evaluated our model on benchmark datasets, demonstrating comparable results, particularly on the MSR-VTT dataset, where we achieve state-of-the-art performance with a 4% improvement. In addition, we also show the potential for human-interactive video captioning through quantitative and qualitative analysis.
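The abstract's pipeline (frozen CLIP frame features, a temporal feature embedding, and an LSTM-encoded human prompt prepended to the visual sequence via soft prompting) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, the toy single-layer LSTM, and all weights here are hypothetical placeholders, and the transformer encoder–decoder itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512          # CLIP feature dimension (assumption)
T = 8            # number of sampled video frames (assumption)
P = 4            # number of soft-prompt tokens (assumption)

# Per-frame visual features, standing in for frozen CLIP image-encoder outputs.
frame_feats = rng.normal(size=(T, D))

# Temporal feature embedding: a learned per-position vector added to each frame.
temporal_emb = rng.normal(size=(T, D)) * 0.02
frames = frame_feats + temporal_emb

def lstm_prompt_encoder(prompt_tokens, W, U, b):
    """Toy single-layer LSTM over prompt-token embeddings (hypothetical weights).
    Returns one hidden state per token, used as the soft-prompt vectors."""
    H = b.shape[0] // 4
    h, c = np.zeros(H), np.zeros(H)
    states = []
    for x in prompt_tokens:
        z = W @ x + U @ h + b
        i, f, g, o = np.split(z, 4)
        i, f, o = 1/(1+np.exp(-i)), 1/(1+np.exp(-f)), 1/(1+np.exp(-o))
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
        states.append(h)
    return np.stack(states)

# Hypothetical short prompt, e.g. token embeddings for an object word like "dog".
prompt_tokens = rng.normal(size=(P, D))
W = rng.normal(size=(4 * D, D)) * 0.02
U = rng.normal(size=(4 * D, D)) * 0.02
b = np.zeros(4 * D)
soft_prompts = lstm_prompt_encoder(prompt_tokens, W, U, b)

# Soft prompting: prepend the prompt states to the visual sequence that would be
# fed to the transformer encoder-decoder.
encoder_input = np.concatenate([soft_prompts, frames], axis=0)
print(encoder_input.shape)  # (P + T, D) = (12, 512)
```

The key design point reflected here is that the prompt conditions the captioner through extra input vectors rather than through changes to the backbone, which is what makes soft prompting a parameter-efficient way to tune the model.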

Keywords: video captioning; controllable video captioning; human-interactive; multimodal representation learning (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2024
References: View complete reference list from CitEc
Citations:

Downloads: (external link)
https://www.mdpi.com/2227-7390/12/13/2037/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/13/2037/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:13:p:2037-:d:1426092

Access Statistics for this article

Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().

Page updated 2025-03-19
Handle: RePEc:gam:jmathe:v:12:y:2024:i:13:p:2037-:d:1426092