3D-ShuffleViT: An Efficient Video Action Recognition Network with Deep Integration of Self-Attention and Convolution
Yinghui Wang,
Anlei Zhu,
Haomiao Ma,
Lingyu Ai,
Wei Song and
Shaojie Zhang
Additional contact information
Yinghui Wang: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
Anlei Zhu: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
Haomiao Ma: School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
Lingyu Ai: School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China
Wei Song: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
Shaojie Zhang: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
Mathematics, 2023, vol. 11, issue 18, 1-18
Abstract:
Compared with traditional methods, action recognition models based on 3D convolutional deep neural networks capture spatio-temporal features more accurately and therefore achieve higher accuracy. However, the large number of parameters and the computational requirements of 3D models make them difficult to deploy on mobile devices with limited computing power. To obtain an efficient video action recognition model, we analyzed and compared the principles of classic lightweight networks and propose the 3D-ShuffleViT network. By deeply integrating the self-attention mechanism with convolution, we introduce an efficient ACISA module that further enhances the performance of the proposed model. The result is exceptional performance in both context-sensitive and context-independent action recognition at a reduced deployment cost. Notably, our 3D-ShuffleViT network, at a computational cost of only 6% of that of SlowFast-ResNet101, achieved 98% of the latter's Top-1 accuracy on the EgoGesture dataset; on the same CPU (Intel i5-8300H), its speed was 2.5 times that of the latter. Moreover, when deployed on edge devices, the proposed network achieved the best balance between accuracy and speed among lightweight networks of the same order.
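The abstract's core idea, fusing a convolutional branch (local spatio-temporal features) with a self-attention branch (global context), can be illustrated with a toy sketch. The `acisa_like_block` function below is a hypothetical stand-in, not the paper's actual ACISA design: it pairs a naive single-channel 3D convolution with scaled dot-product self-attention over the flattened spatio-temporal tokens and sums the two outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv3d_same(x, w):
    """Naive single-channel 3D convolution (cross-correlation form)
    with zero padding so the output keeps the input size.
    x: (T, H, W) feature map, w: (kt, kh, kw) kernel."""
    kt, kh, kw = w.shape
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    xp = np.pad(x, ((pt, pt), (ph, ph), (pw, pw)))  # zero-pad each axis
    out = np.zeros_like(x)
    T, H, W = x.shape
    for t in range(T):
        for i in range(H):
            for j in range(W):
                out[t, i, j] = np.sum(xp[t:t + kt, i:i + kh, j:j + kw] * w)
    return out

def self_attention(tokens, Wq, Wk, Wv):
    """Scaled dot-product self-attention. tokens: (N, d)."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def acisa_like_block(x, w, Wq, Wk, Wv):
    """Hypothetical attention+convolution fusion (illustrative only;
    the real ACISA module is specified in the paper)."""
    conv_out = conv3d_same(x, w)          # local spatio-temporal features
    tokens = x.reshape(-1, 1)             # (T*H*W, 1) tokens, d = 1
    attn_out = self_attention(tokens, Wq, Wk, Wv).reshape(x.shape)  # global context
    return conv_out + attn_out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 5))        # tiny clip: T=4, H=W=5, one channel
w = rng.standard_normal((3, 3, 3)) / 27.0
Wq = Wk = Wv = np.ones((1, 1))            # trivial 1-D projections for the sketch
y = acisa_like_block(x, w, Wq, Wk, Wv)
print(y.shape)                            # same (T, H, W) shape as the input
```

A real lightweight design would use grouped/shuffled channels and learned multi-head projections; this sketch only shows why summing the two branches lets one block capture both local and global structure.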
Keywords: lightweight networks; 3D convolution; self-attention; edge device
JEL-codes: C
Date: 2023
Downloads:
https://www.mdpi.com/2227-7390/11/18/3848/pdf (application/pdf)
https://www.mdpi.com/2227-7390/11/18/3848/ (text/html)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:11:y:2023:i:18:p:3848-:d:1235675
Mathematics is currently edited by Ms. Emma He
Bibliographic data for series maintained by MDPI Indexing Manager.