Enhancing Dysarthric Speech for Improved Clinical Communication: A Deep Learning Approach
A. P. Yeshwanth Balaji,
T. R. Eshwanth Karti,
K. Nithish Ariyha,
J. Vikash and
G. Jyothish Lal
Additional contact information
A. P. Yeshwanth Balaji: Amrita Vishwa Vidyapeetham
T. R. Eshwanth Karti: Amrita Vishwa Vidyapeetham
K. Nithish Ariyha: Amrita Vishwa Vidyapeetham
J. Vikash: Amrita Vishwa Vidyapeetham
G. Jyothish Lal: Amrita Vishwa Vidyapeetham
A chapter in Machine Learning and Deep Learning Modeling and Algorithms with Applications in Medical and Health Care, 2025, pp 1-22 from Springer
Abstract:
Dysarthric speech poses significant challenges to modern speech processing systems due to its inherently low intelligibility, irregular prosody, and atypical articulation patterns. Traditional methods that rely on intermediate automatic speech recognition (ASR) stages often perform poorly under such conditions, especially when speech is severely degraded. In this work, we propose a fully end-to-end enhancement pipeline that directly improves dysarthric speech quality and intelligibility using GAN-based models, bypassing the limitations of transcription-based systems. We employ a MelSEGAN architecture coupled with a SepFormer to address spectral and temporal distortions in the speech signal. Through a comparative analysis of preprocessing strategies, we find that dynamic time warping (DTW) in conjunction with variational mode decomposition (VMD) offers more stable and intelligible outputs than conventional voice activity detection (VAD), particularly in cases of temporally misaligned or fragmented speech. DTW not only enables better convergence during training but also yields clearer formant structures and reduced background artifacts in the enhanced speech. Further, we extend our pipeline with Model-Agnostic Meta-Learning (MAML) to improve speaker-specific adaptation. The MAML-augmented models demonstrate superior generalization and refinement of harmonic features, especially when paired with DTW-based preprocessing. Additionally, we investigate an alternative enhancement path that combines a UNet-based encoder-decoder with a HiFi-GAN vocoder. Early qualitative assessments suggest that this hybrid model produces higher naturalness and improved intelligibility, offering a promising direction for future development. Overall, our findings highlight the importance of robust temporal preprocessing and adaptive learning strategies in building effective enhancement systems for disordered speech scenarios.
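The DTW-based temporal alignment the abstract credits with stabilizing training can be illustrated with a minimal sketch. The following is a toy implementation of classic DTW over one-dimensional feature sequences; the feature representation, function name, and shapes are assumptions for illustration only, not the authors' actual pipeline (which operates on speech features and combines DTW with VMD):

```python
import numpy as np

def dtw_align(ref, deg):
    """Dynamic time warping between two 1-D feature sequences.

    Returns (total_cost, warping_path), where the path is a list of
    (ref_index, deg_index) pairs. Toy illustration only: real dysarthric
    speech enhancement would align multi-dimensional spectral features.
    """
    n, m = len(ref), len(deg)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - deg[j - 1])           # local distance
            cost[i, j] = d + min(cost[i - 1, j - 1],    # match
                                 cost[i - 1, j],        # insertion
                                 cost[i, j - 1])        # deletion
    # Backtrack the optimal warping path from the end of both sequences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]

# A temporally stretched copy (as in slowed dysarthric articulation)
# aligns to the reference with zero cost.
ref = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
deg = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0])
dist, path = dtw_align(ref, deg)
```

Because DTW permits one reference frame to map to several degraded frames, temporally stretched or fragmented utterances can be aligned to a canonical reference before enhancement, which is the property the chapter exploits in place of VAD-based trimming.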
Keywords: Dysarthric speech enhancement; Deep learning; Generative adversarial networks; Meta-learning; Dynamic time warping
Date: 2025
Persistent link: https://EconPapers.repec.org/RePEc:spr:ssrchp:978-3-031-98728-1_1
Ordering information: This item can be ordered from
http://www.springer.com/9783031987281
DOI: 10.1007/978-3-031-98728-1_1
More chapters in Springer Series in Reliability Engineering from Springer