Adaptive Transformer-Based Deep Learning Framework for Continuous Sign Language Recognition and Translation

Yahia Said, Sahbi Boubaker, Saleh M. Altowaijri, Ahmed A. Alsheikhy and Mohamed Atri
Additional contact information
Yahia Said: Center for Scientific Research and Entrepreneurship, Northern Border University, Arar 73213, Saudi Arabia
Sahbi Boubaker: Department of Computer & Network Engineering, College of Computer Science and Engineering, University of Jeddah, Jeddah 21959, Saudi Arabia
Saleh M. Altowaijri: Department of Information Systems, Faculty of Computing and Information Technology, Northern Border University, Rafha 91911, Saudi Arabia
Ahmed A. Alsheikhy: Department of Electrical Engineering, College of Engineering, Northern Border University, Arar 91431, Saudi Arabia
Mohamed Atri: College of Computer Sciences, King Khalid University, Abha 62529, Saudi Arabia

Mathematics, 2025, vol. 13, issue 6, 1-23

Abstract: Sign language recognition and translation remain pivotal for facilitating communication between the deaf and hearing communities. However, end-to-end sign language translation (SLT) faces major challenges, including weak temporal correspondence between sign language (SL) video frames and gloss annotations, and the complexity of sequence alignment between long SL videos and natural language sentences. In this paper, we propose an Adaptive Transformer (ADTR)-based deep learning framework that enhances SL video processing for robust and efficient SLT. The proposed model incorporates three novel modules to optimize feature representation: Adaptive Masking (AM), Local Clip Self-Attention (LCSA), and Adaptive Fusion (AF). The AM module dynamically removes redundant video frame representations, improving temporal alignment, while the LCSA module learns hierarchical representations at both the local clip and full-video levels using a refined self-attention mechanism. The AF module then fuses multi-scale temporal and spatial features to enhance model robustness. Unlike conventional SLT models, our framework eliminates the reliance on gloss annotations, enabling direct translation from SL video sequences to spoken-language text. The proposed method was evaluated on the ArabSign dataset, demonstrating state-of-the-art performance in translation accuracy, processing efficiency, and real-time applicability. These results confirm that ADTR is an effective and scalable deep learning solution for continuous sign language recognition, positioning it as a promising AI-driven approach for real-world assistive applications.
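The abstract describes the three ADTR modules only at a high level, so the PyTorch sketch below is an illustrative reading rather than the authors' implementation: the class names mirror the abstract, but every signature, the keep_ratio and clip_len hyperparameters, the learned top-k frame scoring used for Adaptive Masking, and the sigmoid-gated blend used for Adaptive Fusion are assumptions made for illustration.

```python
# Illustrative sketch of the three ADTR modules named in the abstract.
# Not the published implementation: all shapes, hyperparameters, and
# mechanisms below are assumptions chosen to make the ideas concrete.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveMasking(nn.Module):
    """Drop redundant frame embeddings by keeping the top-k scored frames."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learned per-frame relevance score
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, frames, dim)
        k = max(1, int(x.size(1) * self.keep_ratio))
        scores = self.score(x).squeeze(-1)                       # (batch, frames)
        idx = scores.topk(k, dim=1).indices.sort(dim=1).values   # keep temporal order
        return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))


class LocalClipSelfAttention(nn.Module):
    """Self-attention within fixed-length clips, then across the whole video."""

    def __init__(self, dim: int, clip_len: int = 8, heads: int = 4):
        super().__init__()
        self.clip_len = clip_len
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, frames, dim)
        b, t, d = x.shape
        pad = (-t) % self.clip_len                    # pad to a multiple of clip_len
        x = F.pad(x, (0, 0, 0, pad))
        clips = x.reshape(-1, self.clip_len, d)       # (batch * n_clips, clip_len, dim)
        local, _ = self.local_attn(clips, clips, clips)
        x = local.reshape(b, -1, d)[:, :t]            # restore original length
        out, _ = self.global_attn(x, x, x)            # full-video attention
        return out


class AdaptiveFusion(nn.Module):
    """Sigmoid-gated blend of two same-shaped feature streams."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([a, b], dim=-1))
        return g * a + (1 - g) * b


if __name__ == "__main__":
    video = torch.randn(2, 32, 64)                 # (batch, frames, feature dim)
    masked = AdaptiveMasking(64)(video)            # (2, 16, 64)
    attended = LocalClipSelfAttention(64)(masked)  # (2, 16, 64)
    fused = AdaptiveFusion(64)(masked, attended)   # (2, 16, 64)
    print(fused.shape)
```

The gloss-free translation step itself (fused video features to spoken-language text) would sit on top of these modules; it is omitted here because the abstract gives no details of the decoder.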

Keywords: sign language translation (SLT); deep learning; adaptive transformer; self-attention mechanism; multimodal feature fusion; computer vision; natural language processing (NLP); deaf communication assistance
JEL-codes: C
Date: 2025

Downloads: (external link)
https://www.mdpi.com/2227-7390/13/6/909/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/6/909/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:6:p:909-:d:1608156

Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.

Handle: RePEc:gam:jmathe:v:13:y:2025:i:6:p:909-:d:1608156