EconPapers    
 

Exploring Spatial-Based Position Encoding for Image Captioning

Xiaobao Yang, Shuai He, Junsheng Wu, Yang Yang, Zhiqiang Hou and Sugang Ma
Additional contact information
Xiaobao Yang: School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
Shuai He: School of Computer Science & Technology, Xi’an University of Posts and Telecommunications, Xi’an 710061, China
Junsheng Wu: School of Software, Northwestern Polytechnical University, Xi’an 710072, China
Yang Yang: School of Computer Science & Technology, Xi’an University of Posts and Telecommunications, Xi’an 710061, China
Zhiqiang Hou: School of Computer Science & Technology, Xi’an University of Posts and Telecommunications, Xi’an 710061, China
Sugang Ma: School of Computer Science & Technology, Xi’an University of Posts and Telecommunications, Xi’an 710061, China

Mathematics, 2023, vol. 11, issue 21, 1-16

Abstract: Image captioning, at the intersection of computer vision and natural language processing, has become a hot topic in artificial intelligence research. Most recent image captioning models adopt an “encoder + decoder” architecture, in which the encoder extracts visual features while the decoder generates the descriptive sentence word by word. However, the visual features must be flattened into sequence form before being forwarded to the decoder, which discards the 2D spatial position information of the image. This limitation is particularly pronounced in the Transformer architecture, since it is inherently not position-aware. Therefore, in this paper, we propose a simple coordinate-based spatial position encoding method (CSPE) to remedy this deficiency. CSPE first creates 2D position coordinates for each feature pixel and then encodes them by row and by column separately, via trainable or hard encoding, effectively strengthening the position representation of visual features and enriching the generated descriptions. In addition, to reduce the time cost, we also explore a diagonal-based spatial position encoding (DSPE) approach; compared with CSPE, DSPE is slightly inferior in performance but faster to compute. Extensive experiments on the MS COCO 2014 dataset demonstrate that CSPE and DSPE significantly enhance the spatial position representation of visual features. In particular, CSPE improves the BLEU-4 and CIDEr metrics by 1.6% and 5.7%, respectively, over a baseline model without sequence-based position encoding, and also outperforms current sequence-based position encoding approaches by a significant margin. Finally, the robustness and plug-and-play ability of the proposed method are validated on a medical caption generation model.
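The row/column scheme sketched in the abstract can be illustrated in a few lines. The sketch below is a hypothetical reconstruction of the "hard encoding" variant only (a fixed sinusoidal table, not the trainable one), assuming the feature map is an H×W×D grid and that the first D/2 channels carry the row encoding and the last D/2 the column encoding; all function names and the channel split are illustrative assumptions, not taken from the paper.

```python
import math

def sinusoidal_1d(n_pos, dim):
    """Standard fixed sinusoidal position table of shape (n_pos, dim)."""
    table = []
    for pos in range(n_pos):
        row = []
        for i in range(dim):
            angle = pos / (10000.0 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def cspe_hard(features):
    """Add a coordinate-based spatial position encoding to an H x W x D grid.

    Each pixel's encoding is the concatenation of a 1D encoding of its
    row index and a 1D encoding of its column index, so 2D position
    survives the later flattening into a sequence.
    """
    H, W = len(features), len(features[0])
    D = len(features[0][0])
    row_pe = sinusoidal_1d(H, D // 2)  # shared by all pixels in a row
    col_pe = sinusoidal_1d(W, D // 2)  # shared by all pixels in a column
    out = []
    for r in range(H):
        out_row = []
        for c in range(W):
            pe = row_pe[r] + col_pe[c]  # concatenate the two halves -> length D
            out_row.append([f + p for f, p in zip(features[r][c], pe)])
        out.append(out_row)
    return out
```

Two pixels in the same row then share the first half of their encoding and two pixels in the same column share the second half, which is exactly the 2D information a plain sequence-based (1D) position encoding cannot express.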

Keywords: position encoding; image captioning; transformer (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2023
References: View complete reference list from CitEc

Downloads: (external link)
https://www.mdpi.com/2227-7390/11/21/4550/pdf (application/pdf)
https://www.mdpi.com/2227-7390/11/21/4550/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.


Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:11:y:2023:i:21:p:4550-:d:1274227

Access Statistics for this article

Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.

 
Page updated 2025-03-19
Handle: RePEc:gam:jmathe:v:11:y:2023:i:21:p:4550-:d:1274227