Harmonizer: A Universal Signal Tokenization Framework for Multimodal Large Language Models

Amin Amiri, Alireza Ghaffarnia, Nafiseh Ghaffar Nia, Dalei Wu and Yu Liang
Additional contact information
Amin Amiri: Department of Computer Science and Engineering, University of Tennessee at Chattanooga (UTC), 615 McCallie Ave, Chattanooga, TN 37377, USA
Alireza Ghaffarnia: Department of Computer Science and Engineering, University of Tennessee at Chattanooga (UTC), 615 McCallie Ave, Chattanooga, TN 37377, USA
Nafiseh Ghaffar Nia: Department of Electrical and Computer Engineering, Northwestern University, 633 Clark Street, Evanston, IL 60208, USA
Dalei Wu: Department of Computer Science and Engineering, University of Tennessee at Chattanooga (UTC), 615 McCallie Ave, Chattanooga, TN 37377, USA
Yu Liang: Department of Computer Science and Engineering, University of Tennessee at Chattanooga (UTC), 615 McCallie Ave, Chattanooga, TN 37377, USA

Mathematics, 2025, vol. 13, issue 11, 1-44

Abstract: This paper introduces Harmonizer, a universal framework designed for tokenizing heterogeneous input signals, including text, audio, and video, to enable seamless integration into multimodal large language models (LLMs). Harmonizer employs a unified approach to convert diverse, non-linguistic signals into discrete tokens via its FusionQuantizer architecture, built on FluxFormer, to efficiently capture essential signal features while minimizing complexity. We enhance features through STFT-based spectral decomposition, Hilbert transform analytic signal extraction, and SCLAHE spectrogram contrast optimization, and train using a composite loss function to produce reliable embeddings and construct a robust vector vocabulary. Experimental validation on music datasets such as E-GMD v1.0.0, Maestro v3.0.0, and GTZAN demonstrates high fidelity across 288 s of vocal signals (MSE = 0.0037, CC = 0.9282, Cosine Sim. = 0.9278, DTW = 12.12, MFCC Sim. = 0.9997, Spectral Conv. = 0.2485). Preliminary tests on text reconstruction and UCF-101 video clips further confirm Harmonizer’s applicability across discrete and spatiotemporal modalities. Rooted in the universality of wave phenomena and Fourier theory, Harmonizer offers a physics-inspired, modality-agnostic fusion mechanism via wave superposition and interference principles. In summary, Harmonizer integrates natural language processing and signal processing into a coherent tokenization paradigm for efficient, interpretable multimodal learning.
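Three of the fidelity metrics quoted in the abstract (MSE, correlation coefficient, and cosine similarity) have standard textbook definitions. The following is a minimal sketch of those definitions on a toy signal, not the authors' evaluation code: the function names, the synthetic sine wave, and the 0.9 attenuation factor are all illustrative assumptions, and the abstract does not specify how the reported values were computed.

```python
import math

# Illustrative helpers (hypothetical names, not taken from the paper).

def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def pearson_cc(x, y):
    """Pearson correlation coefficient (CC) between two sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def cosine_similarity(x, y):
    """Cosine of the angle between two sequences viewed as vectors."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Toy data (assumption): a 5-cycle sine wave and an attenuated copy
# standing in for an original signal and its reconstruction.
original = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
reconstructed = [0.9 * v for v in original]

print("MSE:", mse(original, reconstructed))
print("CC:", pearson_cc(original, reconstructed))
print("Cosine similarity:", cosine_similarity(original, reconstructed))
```

In this toy case the reconstruction is a scaled copy of the original, so CC and cosine similarity are both essentially 1 while the MSE is nonzero. This shows why the abstract reports the metrics together: the correlation-style measures are insensitive to amplitude scaling, whereas MSE is not.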

Keywords: tokenization; multimodal LLM; STFT; Hilbert transform; SCLAHE
JEL-codes: C
Date: 2025

Downloads: (external link)
https://www.mdpi.com/2227-7390/13/11/1819/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/11/1819/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:11:p:1819-:d:1667602

Mathematics is currently edited by Ms. Emma He

Bibliographic data for series maintained by MDPI Indexing Manager.

 
Page updated 2025-05-30
Handle: RePEc:gam:jmathe:v:13:y:2025:i:11:p:1819-:d:1667602