A multimodal transformer-based visual question answering method integrating local and global information

Huang, Cuiyang; Hu, Zihan

PLOS ONE, 2025, vol. 20, issue 7, 1-22

Abstract: Current visual question answering (VQA) models are limited in their multimodal feature fusion capabilities and often give inadequate consideration to local information. To address these limitations, this study proposes a multimodal Transformer VQA network based on local and global information integration (LGMTNet). LGMTNet applies attention to local features within the context of global features, enabling it to capture both broad and detailed image information simultaneously. It constructs a deep encoder-decoder module that directs image feature attention based on the question context, thereby enhancing visual-language feature fusion. A multimodal representation module is then designed to focus on essential question terms, reducing linguistic noise and extracting multimodal features. Finally, a feature aggregation module concatenates multimodal and question features to deepen question comprehension. Experimental results demonstrate that LGMTNet effectively focuses on local image features, integrates multimodal knowledge, and enhances feature fusion capabilities.
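
The abstract does not give implementation details, so the following is only an illustrative sketch of the general idea of question-guided attention over local and global image features followed by concatenation with the question features. It is not the authors' LGMTNet. The function names (`attend`, `fuse`), the single-query scaled dot-product attention, and the choice to treat the global feature as one extra token next to the local region features are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


def fuse(question_vec, local_feats, global_feat):
    """Hypothetical fusion step (an assumption, not the paper's module).

    The question vector attends over the local region features plus the
    global feature, which is appended as one extra token. The attended
    image feature is then concatenated with the question vector.
    """
    tokens = local_feats + [global_feat]
    attended, weights = attend(question_vec, tokens, tokens)
    return attended + question_vec, weights


# Toy inputs: two local regions and one global feature, all 2-dimensional.
question = [1.0, 0.0]
local_regions = [[1.0, 0.0], [0.0, 1.0]]
global_feature = [0.5, 0.5]
fused, attn = fuse(question, local_regions, global_feature)
```

In this toy setup the attention weights indicate how strongly the question focuses on each local region relative to the global token, and the fused vector has the image and question dimensions stacked together.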

Date: 2025

Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0324757 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 24757&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0324757

DOI: 10.1371/journal.pone.0324757
