Harmonizer: A Universal Signal Tokenization Framework for Multimodal Large Language Models
Amin Amiri,
Alireza Ghaffarnia,
Nafiseh Ghaffar Nia,
Dalei Wu and
Yu Liang
Additional contact information
Amin Amiri: Department of Computer Science and Engineering, University of Tennessee at Chattanooga (UTC), 615 McCallie Ave, Chattanooga, TN 37377, USA
Alireza Ghaffarnia: Department of Computer Science and Engineering, University of Tennessee at Chattanooga (UTC), 615 McCallie Ave, Chattanooga, TN 37377, USA
Nafiseh Ghaffar Nia: Department of Electrical and Computer Engineering, Northwestern University, 633 Clark Street, Evanston, IL 60208, USA
Dalei Wu: Department of Computer Science and Engineering, University of Tennessee at Chattanooga (UTC), 615 McCallie Ave, Chattanooga, TN 37377, USA
Yu Liang: Department of Computer Science and Engineering, University of Tennessee at Chattanooga (UTC), 615 McCallie Ave, Chattanooga, TN 37377, USA
Mathematics, 2025, vol. 13, issue 11, 1-44
Abstract:
This paper introduces Harmonizer, a universal framework designed for tokenizing heterogeneous input signals, including text, audio, and video, to enable seamless integration into multimodal large language models (LLMs). Harmonizer employs a unified approach to convert diverse, non-linguistic signals into discrete tokens via its FusionQuantizer architecture, built on FluxFormer, to efficiently capture essential signal features while minimizing complexity. We enhance features through STFT-based spectral decomposition, Hilbert transform analytic signal extraction, and SCLAHE spectrogram contrast optimization, and train using a composite loss function to produce reliable embeddings and construct a robust vector vocabulary. Experimental validation on music datasets such as E-GMD v1.0.0, Maestro v3.0.0, and GTZAN demonstrates high fidelity across 288 s of vocal signals (MSE = 0.0037, CC = 0.9282, Cosine Sim. = 0.9278, DTW = 12.12, MFCC Sim. = 0.9997, Spectral Conv. = 0.2485). Preliminary tests on text reconstruction and UCF-101 video clips further confirm Harmonizer’s applicability across discrete and spatiotemporal modalities. Rooted in the universality of wave phenomena and Fourier theory, Harmonizer offers a physics-inspired, modality-agnostic fusion mechanism via wave superposition and interference principles. In summary, Harmonizer integrates natural language processing and signal processing into a coherent tokenization paradigm for efficient, interpretable multimodal learning.
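Three of the fidelity metrics reported above (MSE, correlation coefficient, and cosine similarity) can be illustrated with a minimal sketch. It assumes the standard textbook definitions applied to two equal-length sample sequences; the abstract does not state the paper's exact definitions or preprocessing, and the function names and the toy sine signal are hypothetical, chosen only for illustration.

```python
import math


def _check_lengths(x, y):
    # Both signals must be non-empty and sample-aligned.
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("signals must be non-empty and of equal length")


def mse(x, y):
    """Mean squared error between two sample sequences."""
    _check_lengths(x, y)
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def pearson_cc(x, y):
    """Pearson correlation coefficient (assumed meaning of 'CC')."""
    _check_lengths(x, y)
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def cosine_similarity(x, y):
    """Cosine of the angle between the two signals viewed as vectors."""
    _check_lengths(x, y)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Toy data (not from the paper): a 5-cycle sine wave and an attenuated copy
# standing in for an original signal and its reconstruction.
original = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
reconstructed = [0.9 * v for v in original]

print("MSE:", mse(original, reconstructed))
print("CC:", pearson_cc(original, reconstructed))
print("Cosine similarity:", cosine_similarity(original, reconstructed))
```

In this toy case the reconstruction is a scaled copy of the original, so the correlation coefficient and cosine similarity are both close to 1 while the MSE is nonzero. This shows why the amplitude-sensitive and shape-sensitive metrics are reported side by side.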
Keywords: tokenization; multimodal LLM; STFT; Hilbert transform; SCLAHE
JEL-codes: C
Date: 2025
Downloads:
https://www.mdpi.com/2227-7390/13/11/1819/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/11/1819/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:11:p:1819-:d:1667602
Mathematics is currently edited by Ms. Emma He