Multimodal Fall Detection Using Spatial–Temporal Attention and Bi-LSTM-Based Feature Fusion

Jungpil Shin, Abu Saleh Musa Miah, Rei Egawa, Najmul Hassan, Koki Hirooka and Yoichi Tomioka
Additional contact information
Jungpil Shin: School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Japan
Abu Saleh Musa Miah: School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Japan
Rei Egawa: School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Japan
Najmul Hassan: School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Japan
Koki Hirooka: School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Japan
Yoichi Tomioka: School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Japan

Future Internet, 2025, vol. 17, issue 4, 1-22

Abstract: Human falls are a significant healthcare concern, particularly among the elderly, due to their links to muscle weakness, cardiovascular issues, and locomotive syndrome. Accurate fall detection is crucial for timely intervention and injury prevention, which has led many researchers to work on developing effective detection systems. However, existing unimodal systems that rely solely on skeleton or sensor data face challenges such as poor robustness, computational inefficiency, and sensitivity to environmental conditions. While some multimodal approaches have been proposed, they often struggle to capture long-range dependencies effectively. To address these challenges, we propose a multimodal fall detection framework that integrates skeleton and sensor data. The system uses a Graph-based Spatial-Temporal Convolutional and Attention Neural Network (GSTCAN) to capture spatial and temporal relationships from skeleton and motion information in stream-1, while a Bi-LSTM with Channel Attention (CA) processes sensor data in stream-2, extracting both spatial and temporal features. The GSTCAN model uses AlphaPose for skeleton extraction, calculates motion between consecutive frames, and applies a graph convolutional network (GCN) with a CA mechanism to focus on relevant features while suppressing noise. In parallel, the Bi-LSTM with CA processes inertial signals, with the Bi-LSTM capturing long-range temporal dependencies and CA refining feature representations. The features from both branches are fused and passed through a fully connected layer for classification, providing a comprehensive understanding of human motion. The proposed system was evaluated on the Fall Up and UR Fall datasets, achieving classification accuracies of 99.09% and 99.32%, respectively, surpassing existing methods. This robust and efficient system demonstrates strong potential for accurate fall detection and continuous healthcare monitoring.
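
A minimal sketch may clarify the two-stream design the abstract describes. The PyTorch code below is an assumption-laden illustration, not the authors' implementation: the channel attention block is assumed to be squeeze-and-excitation style, the inertial input is assumed to carry six channels (accelerometer plus gyroscope), and the GSTCAN skeleton stream is stood in for by a precomputed feature vector. All names (ChannelAttention, SensorBranch, FusionFallDetector, joint_motion) and dimensions are hypothetical.

    import torch
    import torch.nn as nn

    def joint_motion(skeleton):
        # Frame-to-frame joint displacement, mirroring the abstract's
        # "motion between consecutive frames" over AlphaPose keypoints.
        # skeleton: (batch, time, joints, 2); first frame is zero-padded.
        motion = skeleton[:, 1:] - skeleton[:, :-1]
        return torch.cat([torch.zeros_like(motion[:, :1]), motion], dim=1)

    class ChannelAttention(nn.Module):
        # Squeeze-and-excitation style re-weighting (assumed CA variant).
        def __init__(self, channels, reduction=4):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, x):                    # x: (batch, time, channels)
            weights = self.fc(x.mean(dim=1))     # squeeze over time
            return x * weights.unsqueeze(1)      # re-weight channels

    class SensorBranch(nn.Module):
        # Stream-2: Bi-LSTM over inertial signals, refined by CA.
        def __init__(self, in_channels=6, hidden=64):
            super().__init__()
            self.bilstm = nn.LSTM(in_channels, hidden,
                                  batch_first=True, bidirectional=True)
            self.ca = ChannelAttention(2 * hidden)

        def forward(self, x):                    # x: (batch, time, in_channels)
            h, _ = self.bilstm(x)                # (batch, time, 2 * hidden)
            return self.ca(h).mean(dim=1)        # temporal pooling

    class FusionFallDetector(nn.Module):
        # Late fusion of the two streams, then a fully connected classifier.
        def __init__(self, skel_dim=256, sensor_dim=128, num_classes=2):
            super().__init__()
            self.sensor_branch = SensorBranch(hidden=sensor_dim // 2)
            self.classifier = nn.Linear(skel_dim + sensor_dim, num_classes)

        def forward(self, skel_feat, sensor_seq):
            # skel_feat: (batch, skel_dim) features from the GSTCAN stream
            # sensor_seq: (batch, time, 6) inertial channels
            fused = torch.cat([skel_feat, self.sensor_branch(sensor_seq)], dim=1)
            return self.classifier(fused)

    model = FusionFallDetector()
    logits = model(torch.randn(8, 256), torch.randn(8, 100, 6))
    print(logits.shape)  # torch.Size([8, 2])

Concatenation followed by a single fully connected layer is the simplest reading of "fused and passed through a fully connected layer"; the paper's actual fusion and pooling choices may differ.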

Keywords: ageing people; AlphaPose; body pose detection; channel attention; graph convolutional network (GCN); human fall detection; multimodal
JEL-codes: O3
Date: 2025

Downloads: (external link)
https://www.mdpi.com/1999-5903/17/4/173/pdf (application/pdf)
https://www.mdpi.com/1999-5903/17/4/173/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:17:y:2025:i:4:p:173-:d:1635077

Future Internet is currently edited by Ms. Grace You


 
Handle: RePEc:gam:jftint:v:17:y:2025:i:4:p:173-:d:1635077