Redesigning Embedding Layers for Queries, Keys, and Values in Cross-Covariance Image Transformers

Jaesin Ahn, Jiuk Hong, Jeongwoo Ju and Heechul Jung ()
Additional contact information
Jaesin Ahn: Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea
Jiuk Hong: Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea
Jeongwoo Ju: Captos Co., Ltd., Yangsan 50652, Republic of Korea
Heechul Jung: Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea

Mathematics, 2023, vol. 11, issue 8, 1-16

Abstract: Several vision transformer variants attempt to reduce the quadratic time complexity in the number of tokens to linear. Cross-covariance image transformers (XCiT) are one such technique. However, despite these efforts, increasing the token dimension still results in quadratic growth in time complexity, and the dimension is a key parameter for achieving strong generalization performance. In this paper, a novel method is proposed to improve the generalization performance of XCiT models without increasing the token dimension. We redesigned the embedding layers of queries, keys, and values as separate non-linear embedding (SNE), partially-shared non-linear embedding (P-SNE), and fully-shared non-linear embedding (F-SNE). The proposed structure, at different model sizes, achieved 71.4%, 77.8%, and 82.1% on ImageNet-1k, compared with 69.9%, 77.1%, and 82.0% for the original XCiT models, namely XCiT-N12, XCiT-T12, and XCiT-S12, respectively. Additionally, the proposed model achieved 94.8% on average in transfer learning experiments on CIFAR-10, CIFAR-100, Stanford Cars, and STL-10, which is superior to the baseline XCiT-S12 model (94.5%). In particular, the proposed models demonstrated considerable improvements over the original XCiT models on the out-of-distribution detection task.
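The abstract's key ideas — attention computed over feature dimensions (so cost is linear in the token count) and a shared non-linear Q/K/V embedding — can be illustrated with a minimal sketch. This is not the authors' implementation: the single weight matrix `W`, the tanh-approximated GELU, and the softmax axis are illustrative assumptions standing in for the paper's F-SNE (fully-shared non-linear embedding) variant.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=0, eps=1e-6):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def xca_fully_shared(tokens, W, tau=1.0):
    """Cross-covariance attention with a fully-shared non-linear
    Q/K/V embedding (a sketch of the F-SNE idea: one non-linear
    projection reused for queries, keys, and values).

    tokens: (N, d) array of N tokens with dimension d.
    W: (d, d) shared projection weights (hypothetical; the paper's
       actual embedding layers may differ).
    """
    # Shared non-linear embedding: linear map + tanh-approximated GELU.
    h = tokens @ W
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    q = k = v = h  # fully shared: Q, K, and V are the same projection

    # Column-wise L2 normalization, as in cross-covariance attention.
    qn = l2_normalize(q, axis=0)
    kn = l2_normalize(k, axis=0)

    # Attention over the d x d feature covariance, not the N x N token
    # covariance, so the cost grows linearly with the token count N.
    attn = softmax((kn.T @ qn) / tau, axis=-1)  # (d, d)
    return v @ attn  # (N, d)

rng = np.random.default_rng(0)
N, d = 8, 4
x = rng.normal(size=(N, d))
W = rng.normal(size=(d, d))
out = xca_fully_shared(x, W)
print(out.shape)  # (8, 4)
```

The (d, d) attention map makes concrete why the paper worries about the token dimension: doubling d quadruples the attention cost even though the cost in N stays linear.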

Keywords: vision transformer; Q/K/V embedding; shared embedding; non-linear embedding; image classification
JEL-codes: C
Date: 2023
References: View complete reference list from CitEc

Downloads: (external link)
https://www.mdpi.com/2227-7390/11/8/1933/pdf (application/pdf)
https://www.mdpi.com/2227-7390/11/8/1933/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.


Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:11:y:2023:i:8:p:1933-:d:1127798


Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().

Page updated 2025-03-19
Handle: RePEc:gam:jmathe:v:11:y:2023:i:8:p:1933-:d:1127798