Lightweight End-to-End Diacritical Arabic Speech Recognition Using CTC-Transformer with Relative Positional Encoding
Haifa Alaqel and
Khalil El Hindi
Additional contact information
Haifa Alaqel: Department of Computer Science, College of Computer and Information Science, King Saud University, Riyadh 11451, Saudi Arabia
Khalil El Hindi: Department of Computer Science, College of Computer and Information Science, King Saud University, Riyadh 11451, Saudi Arabia
Mathematics, 2025, vol. 13, issue 20, 1-26
Abstract:
Arabic automatic speech recognition (ASR) faces distinct challenges due to its complex morphology, dialectal variations, and the presence of diacritical marks that strongly influence pronunciation and meaning. This study introduces a lightweight approach for diacritical Arabic ASR that employs a Transformer encoder architecture enhanced with Relative Positional Encoding (RPE) and Connectionist Temporal Classification (CTC) loss, eliminating the need for a conventional decoder. A two-stage training process was applied: initial pretraining on Modern Standard Arabic (MSA), followed by progressive three-phase fine-tuning on diacritical Arabic datasets. The proposed model achieves a WER of 22.01% on the SASSC dataset, improving over traditional systems (best 28.4% WER) while using only ≈14 M parameters. In comparison, XLSR-Large (300 M parameters) achieves a WER of 12.17% but requires over 20× more parameters and substantially higher training and inference costs. Although XLSR attains lower error rates, the proposed model is far more practical for resource-constrained environments, offering reduced complexity, faster training, and lower memory usage while maintaining competitive accuracy. These results show that encoder-only Transformers with RPE, combined with CTC training and systematic architectural optimization, can effectively model Arabic phonetic structure while maintaining computational efficiency. This work establishes a new benchmark for resource-efficient diacritical Arabic ASR, making the technology more accessible for real-world deployment.
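The decoder-free design described in the abstract rests on CTC: the encoder emits a per-frame distribution over characters plus a blank symbol, and CTC collapses repeated labels and removes blanks to recover the transcript. The sketch below illustrates that collapse rule with greedy (best-path) decoding; the frame IDs and the blank index are illustrative assumptions, not values from the paper.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Standard CTC best-path collapse: merge consecutive repeats,
    then drop blank symbols."""
    out = []
    prev = None
    for label in frame_ids:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Toy per-frame argmax IDs from a hypothetical encoder (0 = blank).
frames = [0, 2, 2, 0, 1, 1, 1, 0, 3, 3]
print(ctc_greedy_decode(frames))  # → [2, 1, 3]

# A blank between identical labels preserves a genuine repetition,
# which is how CTC distinguishes doubled characters.
print(ctc_greedy_decode([1, 0, 1]))  # → [1, 1]
```

Because CTC needs only per-frame outputs and this collapse rule, the model can drop the autoregressive decoder entirely, which is a large part of the parameter savings the abstract reports.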
Keywords: transformer encoder; relative positional encoding; connectionist temporal classification; modern standard Arabic speech recognition; transfer learning; log mel-spectrogram (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2025
Downloads:
https://www.mdpi.com/2227-7390/13/20/3352/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/20/3352/ (text/html)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:20:p:3352-:d:1776235
Mathematics is currently edited by Ms. Emma He
More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.