Assessment of Multiple ASR Architectures on Common Speech Datasets: Analyzing Training Duration, Framework, and Word Error Rate in the Development of a Swahili ASR Model

Nyambo, Devotha G.; Mdegela, Lawrence N.; Kihara, Wangoru

Additional contact information
Devotha G. Nyambo: The Nelson Mandela African Institution of Science and Technology
Lawrence N. Mdegela: The Nelson Mandela African Institution of Science and Technology
Wangoru Kihara: Badili Innovations Ltd

A chapter in Advancement in Embedded and Mobile Systems, 2026, pp 385-402 from Springer

Abstract: This paper analyzes several Automatic Speech Recognition (ASR) architectures, focusing on their performance on Swahili speech data. The analysis highlights the role of training duration and the advantages of model re-use in improving ASR efficiency, and it underscores the value of fine-tuning pre-trained ASR models: aligning these models with Swahili’s accents and phonetic characteristics significantly improves transcription accuracy. We assess the Coqui STT and TensorFlow implementations and observe that longer training is associated with a lower Word Error Rate (WER). However, the pre-trained Swahili ASR model from the Coqui STT library leaves room for improvement: despite 475 hours of training, it achieves a WER of 0.39, which illustrates the trade-off between performance and training duration. We also evaluate six pre-trained models, among which facebook/Wav2Vec2-XLS-R-1B, facebook/wav2vec2-large-xls-r-53, and facebook/Wav2Vec2-XLS-R-300M emerge as promising candidates for Swahili ASR development, with training times ranging from 10.7 to 76.3 hours. Overall, the work aims to optimize Swahili ASR models by exploring various architectures, hyperparameters, and training methodologies. To this end, we identify three robust candidates for Swahili ASR development: the pre-trained ASR model from the NeMo library (stt_rw_conformer_ctc_large.nemo), facebook/Wav2Vec2-XLS-R-300M, and facebook/wav2vec2-large-xls-r-53. In future work, these models will be explored further with a custom dataset from local communities.
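The abstract reports results as Word Error Rate without defining it. As a reading aid, the sketch below shows the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the ASR output, divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative assumptions; the chapter's own evaluation code and text normalization are not described in this record.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (not from the chapter): tokens are split on
    whitespace, with no case folding or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of three reference words.
example_wer = word_error_rate("habari ya asubuhi", "habari za asubuhi")
```

Under this definition, a WER of 0.39 corresponds to roughly 39 word-level errors per 100 reference words. Because insertions are counted, the value can exceed 1.0 when the output is much longer than the reference.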

Keywords: Common speech dataset; Automatic Speech Recognition; Swahili ASR; NLP
Date: 2026

Persistent link: https://EconPapers.repec.org/RePEc:spr:prochp:978-3-031-99219-3_26

Ordering information: This item can be ordered from
http://www.springer.com/9783031992193

DOI: 10.1007/978-3-031-99219-3_26

Series: Progress in IS, Springer