Multi-Scale Audio Spectrogram Transformer for Classroom Teaching Interaction Recognition

Fan Liu and Jiandong Fang
Additional contact information
Fan Liu: College of Information Engineering, Inner Mongolia University of Technology, Hohhot 010080, China
Jiandong Fang: College of Information Engineering, Inner Mongolia University of Technology, Hohhot 010080, China

Future Internet, 2023, vol. 15, issue 2, 1-19

Abstract: Classroom interactivity is an important metric for assessing classroom quality, but identifying it from classroom image data is hampered by the interference of complex teaching scenes, whereas the audio recorded in the classroom carries rich cues of student–teacher interaction. This study proposes a multi-scale audio spectrogram transformer (MAST) for speech scene classification and constructs a classroom interaction audio dataset to recognize teacher–student interaction during classroom teaching. First, the raw speech signal is sampled and pre-processed into a multi-channel spectrogram, which provides a richer feature representation than a single-channel spectrogram. Second, to efficiently capture the long-range global context of the audio spectrogram, the audio features are modeled globally by MAST's multi-head self-attention mechanism, and the feature resolution is progressively reduced during feature extraction to enrich the hierarchical features while lowering model complexity. Finally, a time-frequency enrichment module maps the final output to a class feature map, enabling accurate audio category recognition. MAST is evaluated on public environmental audio datasets and on the self-built classroom audio interaction dataset. Compared with previous state-of-the-art methods, its accuracy improves by 3% on AudioSet and 5% on ESC-50, and it reaches 92.1% on the self-built classroom audio interaction dataset. These results demonstrate the effectiveness of MAST both for general audio classification and in the smart-classroom domain.
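The abstract gives no implementation details, so the following is only an illustrative sketch of the two core steps it describes, not the authors' code: the multi-channel layout (log-mel spectrogram plus first- and second-order deltas), the token dimensions, and the pooling-based downsampling are all assumptions chosen to mirror the described multi-channel input, global multi-head self-attention, and progressive resolution reduction.

# Minimal sketch (assumed design, not the paper's implementation) of
# (1) building a multi-channel spectrogram from a waveform and
# (2) a transformer block that attends globally and then halves the
#     token resolution, as the abstract describes.
import numpy as np
import librosa
import torch
import torch.nn as nn


def multi_channel_spectrogram(wav, sr=16000, n_mels=128):
    """Stack a log-mel spectrogram with its first- and second-order
    deltas to form a 3-channel input (an assumed channel layout)."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    delta = librosa.feature.delta(log_mel)
    delta2 = librosa.feature.delta(log_mel, order=2)
    return np.stack([log_mel, delta, delta2], axis=0)  # (3, n_mels, frames)


class DownsamplingAttentionBlock(nn.Module):
    """Multi-head self-attention over spectrogram tokens, followed by a
    strided pooling step that halves the number of tokens."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)

    def forward(self, x):            # x: (batch, tokens, dim)
        h = self.norm(x)
        h, _ = self.attn(h, h, h)    # global context across all tokens
        x = x + h                    # residual connection
        return self.pool(x.transpose(1, 2)).transpose(1, 2)  # halve tokens


if __name__ == "__main__":
    wav = np.random.randn(16000).astype(np.float32)   # 1 s of dummy audio
    spec = multi_channel_spectrogram(wav)              # (3, 128, frames)
    tokens = torch.randn(1, 64, 192)                   # fake patch embeddings
    out = DownsamplingAttentionBlock(dim=192)(tokens)
    print(spec.shape, out.shape)                       # out keeps 32 tokens

Stacking several such blocks would yield the kind of multi-scale hierarchy the abstract alludes to, with layer-level features enriched as the token resolution shrinks; the exact block count and the time-frequency enrichment module are not specified in this listing.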

Keywords: audio classification; classroom interaction recognition; multi-channel features; transformer; enrichment module
JEL-codes: O3
Date: 2023

Downloads: (external link)
https://www.mdpi.com/1999-5903/15/2/65/pdf (application/pdf)
https://www.mdpi.com/1999-5903/15/2/65/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:15:y:2023:i:2:p:65-:d:1055890

Future Internet is currently edited by Ms. Grace You

More articles in Future Internet from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.

 
Handle: RePEc:gam:jftint:v:15:y:2023:i:2:p:65-:d:1055890