Multimodal Feature Learning for Video Captioning

Sujin Lee and Incheol Kim

Mathematical Problems in Engineering, 2018, vol. 2018, 1-8

Abstract:

Video captioning is the task of generating a natural language sentence that describes the content of an input video clip. This study proposes a deep neural network model for effective video captioning. In addition to visual features, the proposed model also learns semantic features that effectively describe the video content. In our model, visual features of the input video are extracted with convolutional neural networks such as C3D and ResNet, while semantic features are obtained with a recurrent neural network such as an LSTM. The model further includes an attention-based caption generation network that produces natural language captions from the multimodal video feature sequences. Experiments on two large benchmark datasets, Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT), demonstrate the performance of the proposed model.
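To make the described architecture concrete, the sketch below (in PyTorch) shows one plausible form of an attention-based caption decoder operating over a multimodal feature sequence built by concatenating visual and semantic features. This is not the authors' code: the module names, dimensions, and the additive (Bahdanau-style) attention formulation are illustrative assumptions only.

# Minimal sketch of an attention-based caption decoder over multimodal
# video features. All names and dimensions are assumptions, not the
# paper's actual implementation.
import torch
import torch.nn as nn


class AttentionCaptionDecoder(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, vocab_size=10000, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive attention over the per-frame multimodal features
        self.attn_feat = nn.Linear(feat_dim, hidden_dim)
        self.attn_hid = nn.Linear(hidden_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (batch, frames, feat_dim) multimodal feature sequence
        # captions: (batch, seq_len) token indices (teacher forcing)
        batch, frames, _ = feats.shape
        h = feats.new_zeros(batch, self.lstm.hidden_size)
        c = feats.new_zeros(batch, self.lstm.hidden_size)
        proj = self.attn_feat(feats)  # precompute feature projections
        logits = []
        for t in range(captions.size(1)):
            # Attention weights over frames, conditioned on the hidden state
            scores = self.attn_score(torch.tanh(proj + self.attn_hid(h).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)   # (batch, frames, 1)
            context = (alpha * feats).sum(dim=1)   # (batch, feat_dim)
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (batch, seq_len, vocab_size)


# Example: hypothetical per-frame visual features (e.g., from ResNet/C3D)
# concatenated with semantic features along the channel axis.
visual = torch.randn(2, 20, 256)
semantic = torch.randn(2, 20, 256)
feats = torch.cat([visual, semantic], dim=-1)  # (2, 20, 512)
tokens = torch.randint(0, 10000, (2, 12))
decoder = AttentionCaptionDecoder()
print(decoder(feats, tokens).shape)  # torch.Size([2, 12, 10000])

The channel-wise concatenation used here is only one way to fuse the two modalities; the paper's fusion scheme may differ.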

Date: 2018

Downloads: (external link)
http://downloads.hindawi.com/journals/MPE/2018/3125879.pdf (application/pdf)
http://downloads.hindawi.com/journals/MPE/2018/3125879.xml (text/xml)

Persistent link: https://EconPapers.repec.org/RePEc:hin:jnlmpe:3125879

DOI: 10.1155/2018/3125879

More articles in Mathematical Problems in Engineering from Hindawi
Bibliographic data for this series is maintained by Mohamed Abdelhakeem.

Handle: RePEc:hin:jnlmpe:3125879