CAG-MoE: Multimodal Emotion Recognition with Cross-Attention Gated Mixture of Experts
Axel Gedeon Mengara Mengara and
Yeon-kug Moon
Additional contact information
Axel Gedeon Mengara Mengara: Department of Artificial Intelligence and Data Science, Sejong University, 209 Neungdong-ro, Gwangjin District, Seoul 05006, Republic of Korea
Yeon-kug Moon: Department of Artificial Intelligence and Data Science, Sejong University, 209 Neungdong-ro, Gwangjin District, Seoul 05006, Republic of Korea
Mathematics, 2025, vol. 13, issue 12, 1-37
Abstract:
Multimodal emotion recognition faces substantial challenges due to the inherent heterogeneity of data sources, each with its own temporal resolution, noise characteristics, and potential for incompleteness. For example, physiological signals, audio features, and textual data capture complementary yet distinct aspects of emotion, requiring specialized processing to extract meaningful cues. These challenges include aligning disparate modalities, handling varying levels of noise and missing data, and effectively fusing features without diluting critical contextual information. In this work, we propose a novel Mixture of Experts (MoE) framework that addresses these challenges by integrating specialized transformer-based sub-expert networks, a dynamic gating mechanism with sparse Top-k activation, and a cross-modal attention module. Each modality is processed by multiple dedicated sub-experts designed to capture intricate temporal and contextual patterns, while the dynamic gating network selectively weights the contributions of the most relevant experts. Our cross-modal attention module further enhances the integration by facilitating precise exchange of information among modalities, thereby reinforcing robustness in the presence of noisy or incomplete data. Additionally, an auxiliary diversity loss encourages expert specialization, ensuring the fused representation remains highly discriminative. Extensive theoretical analysis and rigorous experiments on benchmark datasets—the Korean Emotion Multimodal Database (KEMDy20) and the ASCERTAIN dataset—demonstrate that our approach significantly outperforms state-of-the-art methods in emotion recognition, setting new performance baselines in affective computing.
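For readers who want a concrete picture of the components named in the abstract, the following is a minimal PyTorch-style sketch of per-modality transformer sub-experts, a sparse Top-k gating network, a cross-modal attention module, and an auxiliary diversity penalty. It is an illustration only: the class names, layer sizes, number of experts, gating inputs, and the exact form of the diversity loss are assumptions and do not reproduce the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of the components named in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseTopKGate(nn.Module):
    """Scores experts and keeps only the Top-k, renormalized to sum to 1."""
    def __init__(self, dim, num_experts, k=2):
        super().__init__()
        self.proj = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                          # x: (batch, dim)
        logits = self.proj(x)                      # (batch, num_experts)
        topk_val, topk_idx = logits.topk(self.k, dim=-1)
        masked = torch.full_like(logits, float('-inf')).scatter(-1, topk_idx, topk_val)
        return F.softmax(masked, dim=-1)           # sparse weights over experts


class ModalityMoE(nn.Module):
    """Several transformer-encoder sub-experts for one modality, mixed by the gate."""
    def __init__(self, dim, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
                num_layers=1)
            for _ in range(num_experts)
        ])
        self.gate = SparseTopKGate(dim, num_experts, k)

    def forward(self, x):                          # x: (batch, seq, dim)
        weights = self.gate(x.mean(dim=1))         # gate on a pooled sequence summary
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, seq, dim)
        fused = (weights[:, :, None, None] * outs).sum(dim=1)     # weighted expert mix
        return fused, weights


def diversity_loss(weights):
    """One possible diversity penalty: discourage experts from being used identically."""
    w = F.normalize(weights, dim=0)                # (batch, E), unit-norm per expert
    gram = w.t() @ w                               # (E, E) expert-usage similarity
    off_diag = gram - torch.diag(torch.diag(gram))
    return off_diag.pow(2).mean()


class CrossModalAttention(nn.Module):
    """One modality's tokens attend over another's (e.g., audio queries, text keys/values)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_mod, context_mod):
        out, _ = self.attn(query_mod, context_mod, context_mod)
        return out


if __name__ == "__main__":
    dim, batch, seq = 64, 8, 20
    audio, text = torch.randn(batch, seq, dim), torch.randn(batch, seq, dim)
    moe_audio, moe_text = ModalityMoE(dim), ModalityMoE(dim)
    xattn = CrossModalAttention(dim)
    fa, wa = moe_audio(audio)
    ft, wt = moe_text(text)
    fused = xattn(fa, ft).mean(dim=1)              # pooled cross-modal representation
    logits = nn.Linear(dim, 7)(fused)              # e.g., 7 emotion classes (illustrative)
    aux = diversity_loss(wa) + diversity_loss(wt)
    print(logits.shape, aux.item())
```

The sketch gates each modality's experts on a pooled sequence representation and lets the fused audio stream attend over the text stream; the paper's actual gating inputs, fusion order, and loss weighting may differ.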
Keywords: multimodal emotion recognition; deep learning; multimodal fusion; transformers; mixture of experts
JEL-codes: C
Date: 2025
Downloads: (external link)
https://www.mdpi.com/2227-7390/13/12/1907/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/12/1907/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:12:p:1907-:d:1673997