Beyond Spurious Cues: Adaptive Multi-Modal Fusion via Mixture-of-Experts for Robust Sarcasm Detection

Zhao, Guilong; Zhao, Yixia; Yin, Xiangrong; Lin, Lei; Zhu, Jizhao

Guilong Zhao, Yixia Zhao, Xiangrong Yin, Lei Lin and Jizhao Zhu
Additional contact information
Guilong Zhao: Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
Yixia Zhao: Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
Xiangrong Yin: Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
Lei Lin: Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
Jizhao Zhu: School of Computer Science, Shenyang Aerospace University, Shenyang 110136, China

Mathematics, 2025, vol. 13, issue 20, 1-22

Abstract: Sarcasm is a complex emotional expression often marked by semantic contrast and incongruity between textual and visual modalities. In recent years, multi-modal sarcasm detection (MMSD) has emerged as a vital task in affective computing. However, existing models frequently rely on superficial spurious cues, such as emojis or hashtags, during training and inference, which limits their ability to capture deeper semantic inconsistencies and undermines generalization to real-world scenarios. To tackle these challenges, we propose Multi-Modal Mixture-of-Experts (MM-MoE), a novel framework that integrates diverse expert modules through a global dynamic gating mechanism for adaptive cross-modal interaction and selective semantic fusion. This architecture allows the model to better capture modality-level incongruity. Furthermore, we introduce MMSD3.0 and MMSD4.0, two cross-dataset evaluation benchmarks derived from two open-source benchmark datasets, MMSD and MMSD2.0, to assess model robustness under varying distributions of spurious cues. Extensive experiments demonstrate that MM-MoE achieves strong performance and generalization ability, consistently outperforming state-of-the-art baselines when encountering superficial spurious correlations.
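
The gating idea mentioned in the abstract can be illustrated with a minimal sketch of a generic softmax-gated mixture of experts over concatenated text and image features. This is not the authors' MM-MoE implementation: the linear experts, the random weights, the feature sizes, and the names `moe_fuse`, `linear`, and `random_matrix` are all assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a (rows x cols) weight matrix by a length-cols vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def moe_fuse(text_feat, image_feat, expert_weights, gate_weights):
    """Fuse two modality vectors with a softmax-gated mixture of linear experts.

    Assumption: every expert sees the concatenated features, and a single
    global gate assigns one weight per expert.
    """
    joint = text_feat + image_feat  # list concatenation of both modalities
    gate = softmax(linear(gate_weights, joint))
    expert_outputs = [linear(w, joint) for w in expert_weights]
    dim = len(expert_outputs[0])
    # Weighted sum of expert outputs, one output dimension at a time.
    fused = [
        sum(g * out[d] for g, out in zip(gate, expert_outputs))
        for d in range(dim)
    ]
    return fused, gate


rng = random.Random(0)
text_feat = [rng.uniform(-1, 1) for _ in range(4)]   # toy text embedding
image_feat = [rng.uniform(-1, 1) for _ in range(4)]  # toy image embedding
num_experts, out_dim = 3, 2
expert_weights = [random_matrix(out_dim, 8, rng) for _ in range(num_experts)]
gate_weights = random_matrix(num_experts, 8, rng)

fused, gate = moe_fuse(text_feat, image_feat, expert_weights, gate_weights)
print("gate weights:", gate)
print("fused vector:", fused)
```

Because the gate weights are computed from the input itself, different text-image pairs can emphasize different experts, which is the general property a dynamic gating mechanism relies on.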

Keywords: multi-modal sarcasm detection; Mixture-of-Experts; cross-modal fusion; spurious cues; robust multi-modal modeling
JEL-codes: C
Date: 2025

Downloads: (external link)
https://www.mdpi.com/2227-7390/13/20/3250/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/20/3250/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:20:p:3250-:d:1768617

Mathematics is currently edited by Ms. Emma He

Bibliographic data for series maintained by MDPI Indexing Manager.