Unsupervised Representation Learning with Task-Agnostic Feature Masking for Robust End-to-End Speech Recognition
June-Woo Kim,
Hoon Chung and
Ho-Young Jung
Additional contact information
June-Woo Kim: Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea
Hoon Chung: Electronics and Telecommunications Research Institute, Daejeon 34129, Republic of Korea
Ho-Young Jung: Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea
Mathematics, 2023, vol. 11, issue 3, 1-17
Abstract:
Unsupervised learning-based approaches for training speech vector representations (SVR) have recently been widely applied. While pretrained SVR models excel in relatively clean automatic speech recognition (ASR) tasks, such as those recorded in laboratory environments, they are still insufficient for practical applications with various types of noise, intonation, and dialects. To cope with this problem, we present a novel unsupervised SVR learning method for practical end-to-end ASR models. Our approach involves designing a speech feature masking method to stabilize SVR model learning and improve the performance of the ASR model in a downstream task. By introducing a noise masking strategy into diverse combinations of the time and frequency regions of the spectrogram, the SVR model becomes a robust representation extractor for the ASR model in practical scenarios. In pretraining experiments, we train the SVR model using approximately 18,000 h of Korean speech datasets that include diverse speakers and were recorded in environments with various amounts of noise. The weights of the pretrained SVR extractor are then frozen, and the extracted speech representations are used for ASR model training in a downstream task. The experimental results show that the ASR model using our proposed SVR extractor significantly outperforms conventional methods.
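As a rough illustration of what noise masking over time and frequency regions of a spectrogram can look like, the sketch below overwrites randomly chosen frame spans and frequency bands with Gaussian noise. It is not the authors' implementation: the function name `mask_with_noise`, the number and width of the masks, the Gaussian noise model, and the 80 x 300 stand-in spectrogram are all assumptions made for this example, and plain Python lists are used only to keep it dependency-free.

```python
import random


def mask_with_noise(spec, rng, n_time_masks=2, n_freq_masks=2,
                    max_time_width=20, max_freq_width=8, noise_std=1.0):
    """Return (masked, mask) for a spectrogram given as a list of
    frequency rows, each a list of frame values.

    All parameter names and defaults are hypothetical, chosen for
    illustration rather than taken from the paper.
    """
    n_freq, n_frames = len(spec), len(spec[0])
    mask = [[False] * n_frames for _ in range(n_freq)]

    # Time masks: a span of consecutive frames across all frequency bins.
    for _ in range(n_time_masks):
        width = min(rng.randint(1, max_time_width), n_frames)
        start = rng.randint(0, n_frames - width)
        for f in range(n_freq):
            for t in range(start, start + width):
                mask[f][t] = True

    # Frequency masks: a band of consecutive bins across all frames.
    for _ in range(n_freq_masks):
        width = min(rng.randint(1, max_freq_width), n_freq)
        start = rng.randint(0, n_freq - width)
        for f in range(start, start + width):
            mask[f] = [True] * n_frames

    # Masked cells receive Gaussian noise; all other cells are copied.
    masked = [[rng.gauss(0.0, noise_std) if mask[f][t] else spec[f][t]
               for t in range(n_frames)]
              for f in range(n_freq)]
    return masked, mask


rng = random.Random(0)
# Stand-in for a log-mel spectrogram: 80 frequency bins x 300 frames.
spec = [[rng.gauss(0.0, 1.0) for _ in range(300)] for _ in range(80)]
masked, mask = mask_with_noise(spec, rng)
```

In a pretraining setup of the kind the abstract describes, the model would see `masked` as input, while `mask` records which regions were corrupted. How the paper combines the regions and defines its training objective is not stated in this record.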
Keywords: speech vector representation; representation learning; unsupervised learning; feature representation extractor; speech recognition; deep learning; neural network; speech processing
JEL-codes: C
Date: 2023
Downloads:
https://www.mdpi.com/2227-7390/11/3/622/pdf (application/pdf)
https://www.mdpi.com/2227-7390/11/3/622/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:11:y:2023:i:3:p:622-:d:1047352
Mathematics is currently edited by Ms. Emma He