An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval
Liu He,
Shuyan Liu,
Ran An,
Yudong Zhuo and
Jian Tao
Additional contact information
All authors: Department of Big Data Research and Application Technology, China Aero-Polytechnology Establishment, Beijing 100028, China
Mathematics, 2023, vol. 11, issue 10, 1-17
Abstract:
Remote sensing cross-modal text-image retrieval (RSCTIR) has recently attracted extensive attention due to its advantages of fast extraction of remote sensing image information and flexible human–computer interaction. Traditional RSCTIR methods mainly focus on improving uni-modal feature extraction separately, and most rely on pre-trained object detectors to obtain better local feature representations, which not only lacks multi-modal interaction information but also introduces a training gap between the pre-trained object detector and the retrieval task. In this paper, we propose an end-to-end RSCTIR framework based on vision-language fusion (EnVLF), consisting of two uni-modal (vision and language) encoders and a multi-modal encoder that can be optimized by multitask training. Specifically, to achieve an end-to-end training process, we introduce a vision transformer module for image local features instead of a pre-trained object detector. Through semantic alignment of visual and text features, the vision transformer module matches the performance of pre-trained object detectors on image local features. In addition, the trained multi-modal encoder can further improve the top-one and top-five ranking performance by re-ranking the initial retrieval results. Experiments on the common RSICD and RSITMD datasets demonstrate that EnVLF obtains state-of-the-art retrieval performance.
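The pipeline the abstract describes, two uni-modal encoders for a fast coarse search plus a multi-modal fusion encoder that re-ranks the top candidates, can be illustrated with the minimal PyTorch sketch below. This is not the authors' implementation: all module names, feature dimensions, the simplified linear stand-ins for the vision transformer and text encoder, and the single cross-attention layer with an image-text matching head are assumptions made purely for illustration.

```python
# Minimal sketch (not the authors' code) of a dual-encoder retrieval step
# followed by multi-modal re-ranking. Shapes and module choices are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualEncoderRetriever(nn.Module):
    def __init__(self, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        # Stand-ins for the uni-modal encoders; in practice these would be a
        # ViT over image patches and a BERT-style text encoder.
        self.vision_encoder = nn.Sequential(nn.Linear(768, hidden_dim), nn.GELU())
        self.text_encoder = nn.Sequential(nn.Linear(768, hidden_dim), nn.GELU())
        self.vision_proj = nn.Linear(hidden_dim, embed_dim)
        self.text_proj = nn.Linear(hidden_dim, embed_dim)
        # Fusion encoder: text tokens attend to image tokens (cross-attention),
        # followed by a binary image-text matching (ITM) head.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.itm_head = nn.Linear(hidden_dim, 2)

    def embed(self, image_feats, text_feats):
        """Global, L2-normalised embeddings for the coarse similarity search."""
        v = F.normalize(self.vision_proj(self.vision_encoder(image_feats).mean(1)), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(text_feats).mean(1)), dim=-1)
        return v, t

    def itm_score(self, image_feats, text_feats):
        """Matching probability from the fusion encoder, used only for re-ranking."""
        img = self.vision_encoder(image_feats)
        txt = self.text_encoder(text_feats)
        fused, _ = self.cross_attn(query=txt, key=img, value=img)
        logits = self.itm_head(fused.mean(1))
        return logits.softmax(-1)[:, 1]  # probability that the pair matches


def retrieve(model, query_text, image_bank, top_k=5):
    """Coarse ranking by cosine similarity, then ITM re-ranking of the top-k."""
    with torch.no_grad():
        v, t = model.embed(image_bank, query_text)       # (N, D), (1, D)
        sims = (t @ v.T).squeeze(0)                      # cosine similarities
        cand = sims.topk(top_k).indices                  # coarse top-k images
        rerank = model.itm_score(image_bank[cand],
                                 query_text.expand(top_k, -1, -1))
        return cand[rerank.argsort(descending=True)]     # refined ranking


# Toy usage with random "pre-extracted" patch/token features (shapes assumed).
model = DualEncoderRetriever()
images = torch.randn(100, 49, 768)   # 100 images, 49 patch tokens each
query = torch.randn(1, 32, 768)      # 1 caption, 32 text tokens
print(retrieve(model, query, images, top_k=5))
```

Under these assumptions, the cheap embedding comparison scans the whole image bank, while the more expensive fusion encoder only rescores the handful of shortlisted candidates, which is why it mainly affects the top-one and top-five metrics.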
Keywords: remote sensing cross-modal text-image retrieval; vision-language fusion; multi-modal learning; multitask optimization
JEL-codes: C
Date: 2023
Downloads:
https://www.mdpi.com/2227-7390/11/10/2279/pdf (application/pdf)
https://www.mdpi.com/2227-7390/11/10/2279/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:11:y:2023:i:10:p:2279-:d:1146321