Self-Organizing Memory Based on Adaptive Resonance Theory for Vision and Language Navigation
Wansen Wu,
Yue Hu,
Kai Xu,
Long Qin and
Quanjun Yin
Additional contact information
Wansen Wu, Yue Hu, Kai Xu, Long Qin and Quanjun Yin: College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
Mathematics, 2023, vol. 11, issue 19, 1-19
Abstract:
Vision and Language Navigation (VLN) is a task in which an agent must understand natural language instructions to reach a target location in a real-scene environment. To improve models' ability for long-horizon planning, emerging research focuses on extending models with different types of memory structures, mainly topological maps or a hidden state vector. However, a fixed-length hidden state vector is often insufficient to capture long-term temporal context. In comparison, topological maps have been shown to be beneficial for many robotic navigation tasks. We therefore focus on building a feasible and effective topological map representation and using it to improve navigation performance and generalization across seen and unseen environments. This paper presents a Self-organizing Memory based on Adaptive Resonance Theory (SMART) module for incremental topological mapping and a framework that uses the SMART module to guide navigation. Based on fusion adaptive resonance theory networks, the SMART module extracts salient scenes from historical observations and builds a topological map of the environmental layout. It provides a compact spatial representation and supports the discovery of novel shortcuts through inference, while remaining explainable in terms of cognitive science. Furthermore, given a language instruction and on top of the topological map, we propose a vision–language alignment framework for navigational decision-making. Notably, the framework uses three off-the-shelf pre-trained models to perform landmark extraction, node–landmark matching, and low-level control, without any fine-tuning on human-annotated datasets. We validate our approach with the Habitat simulator on VLN-CE tasks, which provide a photo-realistic environment for an embodied agent in a continuous action space. The experimental results demonstrate that our approach achieves performance comparable to the supervised baseline.
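The core mechanism behind the SMART module's incremental mapping is the adaptive resonance theory match-and-learn cycle: an observation either resonates with an existing category (a stored "salient scene" node) or, if no category passes the vigilance test, spawns a new node. The sketch below illustrates this cycle with a standard fuzzy-ART formulation in NumPy; the class name, parameter values, and single-channel input are illustrative assumptions, not the paper's fusion-ART implementation.

```python
import numpy as np

def fuzzy_and(a, b):
    # Element-wise fuzzy AND (minimum), the standard fuzzy-ART operator.
    return np.minimum(a, b)

class FuzzyART:
    """Minimal single-channel fuzzy ART; each learned category can be read
    as one node of an incrementally built topological map (an assumption
    for illustration, not the paper's exact fusion-ART architecture)."""

    def __init__(self, vigilance=0.75, alpha=0.001, beta=1.0):
        self.rho = vigilance   # vigilance threshold for the match test
        self.alpha = alpha     # choice parameter (avoids division by zero)
        self.beta = beta       # learning rate (1.0 = fast learning)
        self.weights = []      # one weight vector per category/node

    def learn(self, x):
        """Return the index of the resonating node, creating a new node
        when no existing category passes the vigilance test."""
        x = np.asarray(x, dtype=float)
        # Choice function T_j = |x ^ w_j| / (alpha + |w_j|) for each node.
        scores = [fuzzy_and(x, w).sum() / (self.alpha + w.sum())
                  for w in self.weights]
        # Search categories from best to worst choice score.
        for j in np.argsort(scores)[::-1]:
            match = fuzzy_and(x, self.weights[j]).sum() / x.sum()
            if match >= self.rho:  # vigilance test passes: resonance
                self.weights[j] = (self.beta * fuzzy_and(x, self.weights[j])
                                   + (1 - self.beta) * self.weights[j])
                return int(j)
        # Mismatch reset everywhere: commit a new category (new map node).
        self.weights.append(x.copy())
        return len(self.weights) - 1
```

With fast learning (beta = 1.0) and a vigilance of 0.8, two similar observation vectors resonate with the same node while a dissimilar one creates a second node, which is exactly the behavior that lets the memory stay compact over long trajectories.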
Keywords: vision and language navigation; adaptive resonance theory; mapping
JEL-codes: C
Date: 2023
Downloads:
https://www.mdpi.com/2227-7390/11/19/4192/pdf (application/pdf)
https://www.mdpi.com/2227-7390/11/19/4192/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:11:y:2023:i:19:p:4192-:d:1255073
Mathematics is currently edited by Ms. Emma He