Spatial Position Reasoning of Image Entities Based on Location Words

Xingguo Qin, Ya Zhou and Jun Li
Additional contact information
Xingguo Qin: School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
Ya Zhou: School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
Jun Li: School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China

Mathematics, 2024, vol. 12, issue 24, 1-14

Abstract: Spatial position reasoning simulates the perception and comprehension capabilities of artificial intelligence, particularly in multimodal models that fuse images with language. Recent visual–language models have made significant progress on multimodal reasoning tasks; in particular, contrastive learning models built on the Contrastive Language–Image Pre-training (CLIP) framework have attracted substantial interest. Current contrastive learning models focus mainly on the nouns and verbs in image descriptions, while spatial locatives receive comparatively little attention. Yet locative prepositions encode the positional relations between entities in an image, information that is essential to the reasoning ability of image–language models. This paper introduces a spatial position reasoning model founded on spatial locative terms. The model concentrates on the spatial prepositions in image descriptions, uses them to model the positional relations between entities in images, evaluates and verifies those spatial relations, and aligns the result with the image–text description. The model extends CLIP, probing the semantic characteristics of spatial prepositions and highlighting their guiding role in visual language models. Experiments on open datasets show that the proposed model captures the correspondence between spatial indicators in images and in text, and that incorporating spatial position terms raises average predictive accuracy by approximately three percentage points.
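
To make the CLIP-based setup concrete, the sketch below scores one image against captions that differ only in their locative preposition, so the caption the model prefers reveals which spatial relation it associates with the image. This is an illustrative probe, not the authors' implementation: the checkpoint name, image path, and captions are assumptions, and the paper's preposition-specific enhancement is not reproduced here.

```python
# Minimal sketch: probe a stock CLIP model with captions that differ
# only in the locative preposition. Not the authors' method; the
# checkpoint, image file, and captions below are illustrative.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat_and_box.jpg")  # hypothetical example image
captions = [
    "a cat on a box",
    "a cat under a box",
    "a cat beside a box",
]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds temperature-scaled image-text similarities;
# a softmax over the captions gives a relative preference.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
```

A model that reasons over locative prepositions should concentrate probability on the caption whose preposition matches the image; stock CLIP often spreads it almost uniformly across such preposition-swapped captions, which is the gap the proposed model targets.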

Keywords: visual–spatial reasoning; locative preposition; contrastive learning; image–text retrieval
JEL-codes: C
Date: 2024

Downloads:
https://www.mdpi.com/2227-7390/12/24/3940/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/24/3940/ (text/html)


Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:24:p:3940-:d:1543986

Handle: RePEc:gam:jmathe:v:12:y:2024:i:24:p:3940-:d:1543986