Frame and Utterance Emotional Alignment for Speech Emotion Recognition

Byun, Seounghoon; Lee, Seok-Pil

Seounghoon Byun and Seok-Pil Lee
Additional contact information
Seounghoon Byun: Department of Computer Science, Graduate School, Sangmyung University, Seoul 03016, Republic of Korea
Seok-Pil Lee: Department of Computer Science, Sangmyung University, Seoul 03016, Republic of Korea

Future Internet, 2025, vol. 17, issue 11, 1-14

Abstract: Speech Emotion Recognition (SER) is important for applications such as Human–Computer Interaction (HCI) and emotion-aware services. Traditional SER models rely on utterance-level labels and aggregate frame-level representations through pooling operations. However, emotional states can vary across frames within an utterance, which makes it difficult for models to learn consistent and robust representations. To address this issue, we propose two auxiliary loss functions: Emotional Attention Loss (EAL) and Frame-to-Utterance Alignment Loss (FUAL). The proposed approach uses a self-attention pooling mechanism based on a classification (CLS) token, in which the CLS token summarizes the entire utterance sequence. EAL encourages frames of the same emotion to align closely with the CLS token while separating frames of different classes, and FUAL enforces consistency between frame-level and utterance-level predictions to stabilize training. Model training proceeds in two stages: Stage 1 fine-tunes the wav2vec 2.0 backbone with Cross-Entropy (CE) loss to obtain stable frame embeddings, and Stage 2 jointly optimizes CE, EAL, and FUAL within the CLS-based pooling framework. Experiments on the IEMOCAP four-class dataset demonstrate that our method consistently outperforms baseline models, showing that the proposed losses effectively address representation inconsistencies and improve SER performance. This work advances Artificial Intelligence by improving the ability of models to understand human emotions through speech.
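
Illustrative sketch (not from the paper): the abstract names CLS-based attention pooling and a frame-to-utterance consistency objective but gives no formulas, so the following Python is only a rough guess at the general idea. It assumes a single attention head with no learned projections, and it uses a mean KL divergence as a stand-in for a consistency term; the actual EAL and FUAL definitions may differ. The names cls_attention_pool and consistency_penalty are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_attention_pool(cls_query, frames):
    """Pool frame embeddings with scaled dot-product attention from one CLS query.

    Assumption: one head, no learned query/key/value projections.
    Returns the pooled utterance vector and the per-frame attention weights.
    """
    d = len(cls_query)
    scores = [
        sum(q * h for q, h in zip(cls_query, frame)) / math.sqrt(d)
        for frame in frames
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(d)
    ]
    return pooled, weights


def consistency_penalty(utt_probs, frame_probs, eps=1e-12):
    """Mean KL(utterance || frame) over all frames.

    Hypothetical stand-in for a frame-to-utterance consistency term;
    it is zero when every frame prediction equals the utterance prediction.
    """
    total = 0.0
    for fp in frame_probs:
        total += sum(
            u * math.log((u + eps) / (f + eps))
            for u, f in zip(utt_probs, fp)
        )
    return total / len(frame_probs)


# Toy input: three 2-dimensional frame embeddings and a CLS query.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cls_query = [2.0, 0.0]
pooled, weights = cls_attention_pool(cls_query, frames)
```

In this toy input the first and third frames score equally against the CLS query and receive more weight than the second, so the pooled vector leans toward the frames most similar to the query. In a Stage 2 style setup as described in the abstract, a penalty of this kind would be added to the CE loss with some weighting, which is also an assumption here.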

Keywords: speech emotion recognition; self-supervised learning; frame-level emotion alignment; attention; artificial intelligence
JEL-codes: O3
Date: 2025

Downloads:
https://www.mdpi.com/1999-5903/17/11/509/pdf (application/pdf)
https://www.mdpi.com/1999-5903/17/11/509/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:17:y:2025:i:11:p:509-:d:1787745

Future Internet is currently edited by Ms. Grace You

Bibliographic data for series maintained by MDPI Indexing Manager.