A Simple Framework for Scene Graph Reasoning with Semantic Understanding of Complex Sentence Structure

Heo, Yoonseok; Kang, Sangwoo

A Simple Framework for Scene Graph Reasoning with Semantic Understanding of Complex Sentence Structure

Yoonseok Heo and Sangwoo Kang ()
Additional contact information
Yoonseok Heo: Department of Computer Science and Engineering, Sogang University, Seoul 04107, Republic of Korea
Sangwoo Kang: School of Computing, Gachon University, Seongnam 13120, Republic of Korea

Mathematics, 2023, vol. 11, issue 17, 1-15

Abstract: A rapidly expanding multimedia environment in recent years has led to an explosive increase in demand for multimodality that can communicate with humans in various ways. Even though the convergence of vision and language intelligence has shed light on the remarkable success over the last few years, there is still a caveat: it is unknown whether they truly understand the semantics of the image. More specifically, how they correctly capture relationships between objects represented within the image is still regarded as a black box. In order to testify whether such relationships are well understood, this work mainly focuses on the Graph-structured visual Question Answering (GQA) task which evaluates the understanding of an image by reasoning a scene graph describing the structural characteristics of an image in the form of natural language together with the image. Unlike the existing approaches that have been accompanied by an additional encoder for scene graphs, we propose a simple yet effective framework using pre-trained multimodal transformers for scene graph reasoning. Inspired by the fact that a scene graph can be regarded as a set of sentences describing two related objects with a relationship, we fuse them into the framework separately from the question. In addition, we propose a multi-task learning method that utilizes evaluating the grammatical validity of questions as an auxiliary task to better understand a question with complex structures. This utilizes the semantic role labels of the question to randomly shuffle the sentence structure of the question. We have conducted extensive experiments to evaluate the effectiveness in terms of task capabilities, ablation studies, and generalization.

Keywords: multimodal deep learning; scene graph reasoning; multimodal transformer; multi-task learning (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2023
References: View references in EconPapers View complete reference list from CitEc
Citations: View citations in EconPapers (1)

Downloads: (external link)
https://www.mdpi.com/2227-7390/11/17/3751/pdf (application/pdf)
https://www.mdpi.com/2227-7390/11/17/3751/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:11:y:2023:i:17:p:3751-:d:1230049

Access Statistics for this article

Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().