
Lightweight End-to-End Diacritical Arabic Speech Recognition Using CTC-Transformer with Relative Positional Encoding

Haifa Alaqel and Khalil El Hindi
Additional contact information
Haifa Alaqel: Department of Computer Science, College of Computer and Information Science, King Saud University, Riyadh 11451, Saudi Arabia
Khalil El Hindi: Department of Computer Science, College of Computer and Information Science, King Saud University, Riyadh 11451, Saudi Arabia

Mathematics, 2025, vol. 13, issue 20, 1-26

Abstract: Arabic automatic speech recognition (ASR) faces distinct challenges due to its complex morphology, dialectal variations, and the presence of diacritical marks that strongly influence pronunciation and meaning. This study introduces a lightweight approach for diacritical Arabic ASR that employs a Transformer encoder architecture enhanced with Relative Positional Encoding (RPE) and Connectionist Temporal Classification (CTC) loss, eliminating the need for a conventional decoder. A two-stage training process was applied: initial pretraining on Modern Standard Arabic (MSA), followed by progressive three-phase fine-tuning on diacritical Arabic datasets. The proposed model achieves a WER of 22.01% on the SASSC dataset, improving over traditional systems (best 28.4% WER) while using only ≈14 M parameters. In comparison, XLSR-Large (300 M parameters) achieves a WER of 12.17% but requires over 20× more parameters and substantially higher training and inference costs. Although XLSR attains lower error rates, the proposed model is far more practical for resource-constrained environments, offering reduced complexity, faster training, and lower memory usage while maintaining competitive accuracy. These results show that encoder-only Transformers with RPE, combined with CTC training and systematic architectural optimization, can effectively model Arabic phonetic structure while maintaining computational efficiency. This work establishes a new benchmark for resource-efficient diacritical Arabic ASR, making the technology more accessible for real-world deployment.
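Illustrative sketch: the abstract describes an encoder-only Transformer acoustic model with relative positional encoding, trained with CTC loss on log-mel features. The PyTorch fragment below is a rough, hedged illustration of that general design, not the authors' implementation. It realizes relative positional encoding as a T5-style learned, clipped relative-position bias added to the attention scores, which is one common formulation; the hyperparameters (model width, number of layers and heads, vocabulary size, maximum relative distance) and the CTC blank convention are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of an encoder-only Transformer ASR model with a learned
# relative-position bias and a CTC output head. All sizes are illustrative
# assumptions, not the configuration reported in the article.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelPosSelfAttention(nn.Module):
    """Multi-head self-attention with a learned, clipped relative-position bias."""
    def __init__(self, d_model, n_heads, max_rel_dist=128):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # one scalar bias per head for each clipped relative distance
        self.rel_bias = nn.Embedding(2 * max_rel_dist + 1, n_heads)
        self.max_rel_dist = max_rel_dist

    def forward(self, x):                              # x: (B, T, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5     # (B, H, T, T)
        # relative distances between frame positions, clipped to the bias range
        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        bias = self.rel_bias(rel + self.max_rel_dist)              # (T, T, H)
        scores = scores + bias.permute(2, 0, 1).unsqueeze(0)
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(out)


class EncoderBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.attn = RelPosSelfAttention(d_model, n_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        x = x + self.drop(self.attn(self.norm1(x)))
        x = x + self.drop(self.ff(self.norm2(x)))
        return x


class CTCTransformerASR(nn.Module):
    """Log-mel frames -> encoder-only Transformer -> per-frame character logits for CTC."""
    def __init__(self, n_mels=80, d_model=256, n_layers=8, n_heads=4, vocab_size=60):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, d_model)
        self.blocks = nn.ModuleList(EncoderBlock(d_model, n_heads) for _ in range(n_layers))
        self.head = nn.Linear(d_model, vocab_size)     # index 0 reserved for the CTC blank

    def forward(self, mels):                           # mels: (B, T, n_mels)
        x = self.input_proj(mels)
        for block in self.blocks:
            x = block(x)
        return self.head(x).log_softmax(dim=-1)        # (B, T, vocab) log-probabilities


if __name__ == "__main__":
    model = CTCTransformerASR()
    mels = torch.randn(2, 200, 80)                     # two dummy utterances, 200 frames each
    log_probs = model(mels)                            # (2, 200, 60)
    targets = torch.randint(1, 60, (2, 30))            # dummy character-label transcripts
    loss = F.ctc_loss(log_probs.transpose(0, 1),       # CTC expects (T, B, vocab)
                      targets,
                      input_lengths=torch.full((2,), 200),
                      target_lengths=torch.full((2,), 30),
                      blank=0)
    print(loss.item())
```

In this kind of setup, greedy or beam-search collapsing of the per-frame CTC outputs yields the character sequence directly, which is what lets the model dispense with a conventional attention decoder while keeping the parameter count small.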

Keywords: transformer encoder; relative positional encoding; connectionist temporal classification; modern standard Arabic speech recognition; transfer learning; log mel-spectrogram
JEL-codes: C
Date: 2025

Downloads:
https://www.mdpi.com/2227-7390/13/20/3352/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/20/3352/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:20:p:3352-:d:1776235

Handle: RePEc:gam:jmathe:v:13:y:2025:i:20:p:3352-:d:1776235