TAT-SARNet: A Transformer-Attentive Two-Stream Soccer Action Recognition Network with Multi-Dimensional Feature Fusion and Hierarchical Temporal Classification

Alqarafi, Abdulrahman; Almogadwy, Bassam

TAT-SARNet: A Transformer-Attentive Two-Stream Soccer Action Recognition Network with Multi-Dimensional Feature Fusion and Hierarchical Temporal Classification

Abdulrahman Alqarafi () and Bassam Almogadwy
Additional contact information
Abdulrahman Alqarafi: College of Computer Science and Engineering, Taibah University, Madinah 42353, Saudi Arabia
Bassam Almogadwy: College of Computer Science and Engineering, Taibah University, Madinah 42353, Saudi Arabia

Mathematics, 2025, vol. 13, issue 18, 1-26

Abstract: (1) Background: Soccer action recognition (SAR) is essential in modern sports analytics, supporting automated performance evaluation, tactical strategy analysis, and detailed player behavior modeling. Although recent advances in deep learning and computer vision have enhanced SAR capabilities, many existing methods remain limited to coarse-grained classifications, grouping actions into broad categories such as attacking, defending, or goalkeeping. These models often fall short in capturing fine-grained distinctions, contextual nuances, and long-range temporal dependencies. Transformer-based approaches offer potential improvements but are typically constrained by the need for large-scale datasets and high computational demands, limiting their practical applicability. Moreover, current SAR systems frequently encounter difficulties in handling occlusions, background clutter, and variable camera angles, which contribute to misclassifications and reduced accuracy. (2) Methods: To overcome these challenges, we propose TAT-SARNet, a structured framework designed for accurate and fine-grained SAR. The model begins by applying Sparse Dilated Attention (SDA) to emphasize relevant spatial dependencies while mitigating background noise. Refined spatial features are then processed through the Split-Stream Feature Processing Module (SSFPM), which separately extracts appearance-based (RGB) and motion-based (optical flow) features using ResNet and 3D CNNs. These features are temporally refined by the Multi-Granular Temporal Processing (MGTP) module, which integrates ResIncept Patch Consolidation (RIPC) and Progressive Scale Construction Module (PSCM) to capture both short- and long-range temporal patterns. The output is then fused via the Context-Guided Dual Transformer (CGDT), which models spatiotemporal interactions through a Bi-Transformer Connector (BTC) and Channel–Spatial Attention Block (CSAB); (3) Results: Finally, the Cascaded Temporal Classification (CTC) module maps these features to fine-grained action categories, enabling robust recognition even under challenging conditions such as occlusions and rapid movements. (4) Conclusions: This end-to-end architecture ensures high precision in complex real-world soccer scenarios.

Keywords: soccer action recognition; dual contextual transformer; feature fusion; feature processing module (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2025
References: View complete reference list from CitEc
Citations:

Downloads: (external link)
https://www.mdpi.com/2227-7390/13/18/3011/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/18/3011/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:18:p:3011-:d:1751848

Access Statistics for this article

Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().