Interactive Learning of a Dual Convolution Neural Network for Multi-Modal Action Recognition

Qingxia Li, Dali Gao, Qieshi Zhang, Wenhong Wei and Ziliang Ren
Additional contact information
Qingxia Li: School of Computer and Information, Dongguan City College, Dongguan 523419, China
Dali Gao: School of Mathematics and Computer Science, Quanzhou Normal University, Quanzhou 362000, China
Qieshi Zhang: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
Wenhong Wei: School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China
Ziliang Ren: School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China

Mathematics, 2022, vol. 10, issue 21, 1-15

Abstract: RGB and depth modalities contain abundant, complementary information, and convolutional neural networks (ConvNets) based on multi-modal data have made notable progress in action recognition. However, a single-stream network can hardly learn interactive features across modalities, which limits recognition performance. Inspired by multi-stream learning mechanisms and spatial-temporal information representation methods, we construct dynamic images with the rank pooling method and design an interactive learning dual ConvNet (ILD-ConvNet) with a multiplexer module to improve action recognition performance. The dynamic images constructed by rank pooling capture the spatial-temporal information of entire RGB videos. We extend this method to depth sequences to obtain richer multi-modal spatial-temporal information as the inputs of the ConvNets. In addition, we design a dual ILD-ConvNet with multiplexer modules to jointly learn the interactive features of the two streams from the RGB and depth modalities. The proposed recognition framework has been tested on two benchmark multi-modal datasets, NTU RGB + D 120 and PKU-MMD. With a temporal segmentation mechanism, the proposed ILD-ConvNet achieves accuracies of 86.9% (Cross-Subject, C-Sub) and 89.4% (Cross-Setup, C-Set) on NTU RGB + D 120, and 92.0% (Cross-Subject, C-Sub) and 93.1% (Cross-View, C-View) on PKU-MMD, which are comparable with the state of the art. The experimental results show that the proposed ILD-ConvNet with a multiplexer module can extract interactive features from different modalities to enhance action recognition performance.
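The abstract does not spell out the dynamic-image construction, but rank pooling in this setting is commonly implemented as the approximate rank pooling of Bilen et al. (2016), which collapses a video into one image via a fixed linear weighting over time. The sketch below is a minimal NumPy illustration under that assumption; the function name and the output normalization are illustrative choices, not taken from the paper.

import numpy as np

def approximate_rank_pooling(frames):
    # frames: array of shape (T, H, W, C), one RGB or depth video clip.
    # Approximate rank pooling weights (Bilen et al., 2016):
    # alpha_t = 2t - T - 1, so later frames receive larger weights.
    T = frames.shape[0]
    alpha = 2 * np.arange(1, T + 1) - T - 1
    # Weighted sum over the temporal axis yields one (H, W, C) image.
    di = np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))
    # Rescale to [0, 255] so the result can feed a standard ConvNet.
    di = (di - di.min()) / (di.max() - di.min() + 1e-8)
    return (255.0 * di).astype(np.uint8)

Applying this per modality gives one RGB dynamic image and one depth dynamic image per clip (or per temporal segment, under the paper's temporal segmentation mechanism), which then feed the two streams. The multiplexer module that couples the streams is likewise not detailed in this record; the following PyTorch sketch shows one plausible, purely hypothetical realization in which each stream's feature maps are refreshed from the concatenation of both streams via 1x1 convolutions.

import torch
import torch.nn as nn

class MultiplexerBlock(nn.Module):
    # Hypothetical cross-stream interaction block; the paper's actual
    # multiplexer design is not specified in this record.
    def __init__(self, channels):
        super().__init__()
        self.to_rgb = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.to_depth = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):
        # Concatenate both streams' feature maps along the channel axis,
        # then project back so each stream sees the other's information.
        fused = torch.cat([rgb_feat, depth_feat], dim=1)
        return self.to_rgb(fused), self.to_depth(fused)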

Keywords: convolutional neural network; rank pooling; feature interactive learning; action recognition
JEL-codes: C
Date: 2022

Downloads: (external link)
https://www.mdpi.com/2227-7390/10/21/3923/pdf (application/pdf)
https://www.mdpi.com/2227-7390/10/21/3923/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:10:y:2022:i:21:p:3923-:d:950357

Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.

 
Handle: RePEc:gam:jmathe:v:10:y:2022:i:21:p:3923-:d:950357