Multi-Level Cross-Modal Semantic Alignment Network for Video–Text Retrieval

Fudong Nian, Ling Ding, Yuxia Hu and Yanhong Gu
Additional contact information
Fudong Nian: School of Advanced Manufacturing Engineering, Hefei University, Hefei 230601, China
Ling Ding: School of Advanced Manufacturing Engineering, Hefei University, Hefei 230601, China
Yuxia Hu: Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling, Anhui Jianzhu University, Hefei 230601, China
Yanhong Gu: School of Advanced Manufacturing Engineering, Hefei University, Hefei 230601, China

Mathematics, 2022, vol. 10, issue 18, 1-19

Abstract: This paper strives to improve the performance of video–text retrieval. To date, many algorithms have been proposed to advance the similarity measure of video–text retrieval from a single global semantic to multi-level semantics. However, these methods may suffer from the following limitations: (1) they largely ignore relationship semantics, so the modeled semantic levels are insufficient; (2) constraining the real-valued features of different modalities to lie in the same space through feature-distance measurement alone is incomplete; (3) they fail to handle the heavily imbalanced distributions of attribute labels across semantic levels. To overcome these limitations, this paper proposes a novel multi-level cross-modal semantic alignment network (MCSAN) for video–text retrieval that jointly models video–text similarity at the global, entity, action and relationship semantic levels in a unified deep model. Specifically, both video and text are first decomposed into global, entity, action and relationship semantic levels by carefully designed spatial–temporal semantic learning structures. Then, we utilize KLDivLoss and a cross-modal parameter-shared attribute projection layer as statistical constraints to ensure that representations from different modalities at different semantic levels are projected into a common semantic space. In addition, a novel focal binary cross-entropy (FBCE) loss function is presented, which is the first effort to model the imbalanced attribute distribution problem for video–text retrieval. MCSAN effectively exploits the complementary information among the four semantic levels. Extensive experiments on two challenging video–text retrieval datasets, MSR-VTT and VATEX, show the viability of our method.
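For readers unfamiliar with the components named in the abstract, the PyTorch sketch below illustrates the general ideas behind a parameter-shared attribute projection layer, a KLDivLoss alignment constraint, and a focal variant of binary cross-entropy for imbalanced attribute labels. All class and function names, dimensions, and hyperparameters here are illustrative assumptions; the focal term follows the standard focal-loss formulation of Lin et al. rather than the paper's exact FBCE definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedAttributeProjection(nn.Module):
    """Hypothetical parameter-shared projection: one linear layer maps both
    video and text features of a semantic level to attribute logits, so the
    two modalities are scored against the same attribute vocabulary."""

    def __init__(self, feat_dim: int, num_attributes: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_attributes)  # shared by both modalities

    def forward(self, video_feat: torch.Tensor, text_feat: torch.Tensor):
        return self.proj(video_feat), self.proj(text_feat)


def kl_alignment_loss(video_logits: torch.Tensor, text_logits: torch.Tensor) -> torch.Tensor:
    """Statistical constraint: pull the video attribute distribution toward the
    text attribute distribution. PyTorch's KLDivLoss / F.kl_div expects
    log-probabilities as input and probabilities as target."""
    log_p_video = F.log_softmax(video_logits, dim=-1)
    p_text = F.softmax(text_logits, dim=-1)
    return F.kl_div(log_p_video, p_text, reduction="batchmean")


def focal_bce_loss(logits: torch.Tensor, targets: torch.Tensor,
                   gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Focal reweighting of binary cross-entropy for imbalanced multi-label
    attribute targets. gamma=2.0 and alpha=0.25 follow the focal-loss
    convention and are assumptions, not the paper's reported settings."""
    # Per-element BCE, left unreduced so each term can be reweighted below.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    probs = torch.sigmoid(logits)
    # p_t: predicted probability assigned to the true label of each element.
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
    # alpha_t balances positives vs. negatives; (1 - p_t)^gamma damps easy terms.
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()


# Toy usage: 4 clips/captions, 128-d level features, 50 attribute labels.
proj = SharedAttributeProjection(feat_dim=128, num_attributes=50)
v_logits, t_logits = proj(torch.randn(4, 128), torch.randn(4, 128))
labels = torch.randint(0, 2, (4, 50)).float()
loss = kl_alignment_loss(v_logits, t_logits) + focal_bce_loss(v_logits, labels)
```

The design intuition is that sharing the projection weights forces both modalities to express attributes in the same basis, while the focal factor down-weights the many easy negative attribute labels that dominate an imbalanced vocabulary.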

Keywords: video–text retrieval; multi-level space learning; cross-modal similarity calculation
JEL-codes: C
Date: 2022

Downloads:
https://www.mdpi.com/2227-7390/10/18/3346/pdf (application/pdf)
https://www.mdpi.com/2227-7390/10/18/3346/ (text/html)



Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:10:y:2022:i:18:p:3346-:d:915697


Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.

 
Handle: RePEc:gam:jmathe:v:10:y:2022:i:18:p:3346-:d:915697