Bridging Modalities: An Analysis of Cross-Modal Wasserstein Adversarial Translation Networks and Their Theoretical Foundations
Joseph Tafataona Mtetwa,
Kingsley A. Ogudo and
Sameerchand Pudaruth
Additional contact information
Joseph Tafataona Mtetwa: Department of Electrical and Electronics Engineering, University of Johannesburg, Johannesburg 2006, South Africa
Kingsley A. Ogudo: Department of Electrical and Electronics Engineering, University of Johannesburg, Johannesburg 2006, South Africa
Sameerchand Pudaruth: ICT Department, University of Mauritius, Reduit 80837, Mauritius
Mathematics, 2025, vol. 13, issue 16, 1-32
Abstract:
What if machines could seamlessly translate between the visual richness of images and the semantic depth of language with mathematical precision? This paper presents a theoretical and empirical analysis of five novel cross-modal Wasserstein adversarial translation networks that challenge conventional approaches to cross-modal understanding. Unlike traditional generative models that rely on stochastic noise, our frameworks learn deterministic translation mappings that preserve semantic fidelity across modalities through rigorous mathematical foundations. We systematically examine: (1) cross-modality-consistent dual-critic networks; (2) Wasserstein cycle consistency; (3) multi-scale Wasserstein distance; (4) regularization through modality invariance; and (5) the Wasserstein information bottleneck. Each approach employs adversarial training with Wasserstein distances to establish theoretically grounded translation functions between heterogeneous data representations. Through mathematical analysis—including information-theoretic frameworks, differential geometry, and convergence guarantees—we establish the theoretical foundations underlying cross-modal translation. Our empirical evaluation across the MS-COCO, Flickr30K, and Conceptual Captions datasets, including comparisons with transformer-based baselines, reveals that our proposed multi-scale Wasserstein cycle-consistent (MS-WCC) framework achieves substantial performance gains—a 12.1% average improvement in FID scores and an 8.0% enhancement in cross-modal translation accuracy—compared to state-of-the-art methods, while maintaining superior computational efficiency. These results demonstrate that principled mathematical approaches to cross-modal translation can significantly advance machine understanding of multimodal data, opening new possibilities for applications requiring seamless communication between visual and textual domains.
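To make the objective family referred to in the abstract concrete, the following is a minimal sketch of a Wasserstein cycle-consistency objective of the kind described; the translators G and F, the weight \lambda, and the choice of L1 norms are illustrative assumptions, not the paper's exact formulation. Under Kantorovich-Rubinstein duality, the Wasserstein-1 distance between distributions P and Q is

\[ W_1(P, Q) \;=\; \sup_{\lVert f \rVert_L \le 1} \; \mathbb{E}_{x \sim P}\big[f(x)\big] \;-\; \mathbb{E}_{y \sim Q}\big[f(y)\big], \]

and a dual-critic translation objective with cycle consistency between an image distribution P_X and a text-embedding distribution P_Y can be written as

\[ \mathcal{L}(G, F) \;=\; W_1\!\big(G_{\#}P_X,\, P_Y\big) \;+\; W_1\!\big(F_{\#}P_Y,\, P_X\big) \;+\; \lambda \Big( \mathbb{E}_{x \sim P_X}\big[\lVert F(G(x)) - x \rVert_1\big] \;+\; \mathbb{E}_{y \sim P_Y}\big[\lVert G(F(y)) - y \rVert_1\big] \Big), \]

where G: X -> Y and F: Y -> X are deterministic translators, G_{\#}P_X denotes the pushforward of P_X under G, and each W_1 term is estimated adversarially with a 1-Lipschitz critic, one per modality.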
Keywords: cross-modal translation; Wasserstein adversarial training; multi-modal learning; cycle consistency; information bottleneck
JEL-codes: C
Date: 2025
Downloads:
https://www.mdpi.com/2227-7390/13/16/2545/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/16/2545/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:16:p:2545-:d:1720716