Revamping Image-Recipe Cross-Modal Retrieval with Dual Cross Attention Encoders
Wenhao Liu,
Simiao Yuan,
Zhen Wang,
Xinyi Chang,
Limeng Gao and
Zhenrui Zhang
Additional contact information
Wenhao Liu: School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
Simiao Yuan: Zibo Medical Emergency Command Center, Zibo 255000, China
Zhen Wang: School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
Xinyi Chang: School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
Limeng Gao: School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
Zhenrui Zhang: School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
Mathematics, 2024, vol. 12, issue 20, 1-18
Abstract:
The image-recipe cross-modal retrieval task, which retrieves relevant recipes given food images and vice versa, is attracting widespread attention. The task poses two main challenges. First, a recipe’s components (words in a sentence, sentences in an entity, and entities in a recipe) carry different degrees of importance. If all components share the same weight, the recipe embeddings cannot attend more strongly to the important components, which consequently contribute less to retrieval. Second, food images are strongly local in nature: only the food regions matter, yet enhancing the discriminative local region features in food images remains difficult. To address these two problems, we propose a novel framework named Dual Cross Attention Encoders for Cross-modal Food Retrieval (DCA-Food). The framework consists of a hierarchical cross attention recipe encoder (HCARE) and a cross attention image encoder (CAIE). HCARE comprises three types of cross attention modules that capture the important words in a sentence, the important sentences in an entity, and the important entities in a recipe, respectively. CAIE extracts global and local region features and then computes cross attention between them to enhance the discriminative local features in food images. We conduct ablation studies to validate our design choices. Our proposed approach outperforms existing approaches by a large margin on the Recipe1M dataset; specifically, we improve R@1 by +2.7 and +1.9 on the 1k and 10k testing sets, respectively.
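To make the image-side mechanism concrete, the following is a minimal, hypothetical sketch of cross attention between a global image feature and local region features, as described in the abstract. This is not the authors' released code: the class name CrossAttentionImageEncoder, the dimensions, and the residual fusion are assumptions, and standard multi-head attention stands in for whatever attention variant the paper actually uses.

```python
# Hypothetical sketch of the CAIE idea (not the authors' code): the pooled
# global feature queries the local region features via cross attention, so
# discriminative regions receive higher weight in the final embedding.
import torch
import torch.nn as nn

class CrossAttentionImageEncoder(nn.Module):
    def __init__(self, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross attention: queries come from one stream (global feature),
        # keys and values from the other (local region features).
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, global_feat: torch.Tensor,
                local_feats: torch.Tensor) -> torch.Tensor:
        # global_feat: (batch, embed_dim) pooled image embedding
        # local_feats: (batch, num_regions, embed_dim) region embeddings
        q = global_feat.unsqueeze(1)  # (batch, 1, embed_dim)
        attended, _ = self.cross_attn(q, local_feats, local_feats)
        # Residual fusion of the globally guided local summary.
        return self.norm(q + attended).squeeze(1)  # (batch, embed_dim)

# Usage with dummy tensors:
enc = CrossAttentionImageEncoder()
g = torch.randn(4, 512)        # global features for a batch of 4 images
r = torch.randn(4, 49, 512)    # e.g., a 7x7 grid of local region features
img_emb = enc(g, r)            # (4, 512) retrieval embedding
```

The same query/key-value pattern could, in principle, be stacked three times on the text side (words within a sentence, sentences within an entity, entities within a recipe) to mirror the hierarchical design of HCARE, though the paper's exact module layout may differ.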
Keywords: image-recipe cross-modal retrieval; cross attention; recipe encoder; image encoder
JEL-codes: C
Date: 2024
Downloads:
https://www.mdpi.com/2227-7390/12/20/3181/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/20/3181/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:20:p:3181-:d:1496606