Multi-Angle Lipreading with Angle Classification-Based Feature Extraction and Its Application to Audio-Visual Speech Recognition

Isobe, Shinnosuke; Tamura, Satoshi; Hayamizu, Satoru; Gotoh, Yuuto; Nose, Masaki

Shinnosuke Isobe, Satoshi Tamura, Satoru Hayamizu, Yuuto Gotoh and Masaki Nose
Additional contact information
Shinnosuke Isobe: Graduate School of Natural Science and Technology, Gifu University, 1-1 Yanagido, Gifu 501-1193, Japan
Satoshi Tamura: Faculty of Engineering, Gifu University, 1-1 Yanagido, Gifu 501-1193, Japan
Satoru Hayamizu: Faculty of Engineering, Gifu University, 1-1 Yanagido, Gifu 501-1193, Japan
Yuuto Gotoh: Ricoh Company, Ltd., 2-7-1 Izumi, Ebina, Kanagawa 243-0460, Japan
Masaki Nose: Ricoh Company, Ltd., 2-7-1 Izumi, Ebina, Kanagawa 243-0460, Japan

Future Internet, 2021, vol. 13, issue 7, 1-12

Abstract: Automatic speech recognition (ASR) and visual speech recognition (VSR) have recently been widely researched owing to advances in deep learning. Most VSR research focuses only on frontal face images. In real-world scenes, however, a VSR system should correctly recognize spoken content not only from frontal faces but also from diagonal or profile faces. In this paper, we propose a novel VSR method that is applicable to faces captured at any angle. First, view classification is carried out to estimate the face angle. Based on the result, feature extraction is then conducted using the best combination of pre-trained feature extraction models. Next, lipreading is carried out using the extracted features. We also developed audio-visual speech recognition (AVSR) that uses this VSR in addition to conventional ASR: audio results are obtained from ASR and then combined with the visual results by decision fusion. We evaluated our methods using OuluVS2, a multi-angle audio-visual database. We confirmed that our approach outperformed conventional VSR schemes in a phrase classification task. In addition, we found that our AVSR results are better than both the ASR and the VSR results.
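
The pipeline described in the abstract (view classification, angle-dependent choice of feature extractors, then decision fusion of audio and visual results) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the extractor lookup table, and the weighted log-probability fusion rule with `audio_weight` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from typing import Dict, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize arbitrary scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_extractors(view_probs: Dict[str, float],
                      table: Dict[str, List[str]]) -> List[str]:
    """Pick the extractor combination for the most likely face angle.

    `table` is a hypothetical mapping from an estimated view label to the
    names of pre-trained feature extractors to use for that view.
    """
    best_view = max(view_probs, key=view_probs.get)
    return table[best_view]


def fuse_decisions(audio_probs: Sequence[float],
                   visual_probs: Sequence[float],
                   audio_weight: float = 0.5) -> List[float]:
    """Decision fusion: combine per-class posteriors from ASR and VSR.

    Assumption: a weighted sum of log-probabilities, renormalized with
    softmax. The paper may use a different fusion rule.
    """
    if len(audio_probs) != len(visual_probs):
        raise ValueError("audio and visual outputs must cover the same classes")
    eps = 1e-12  # avoids log(0)
    fused = [
        audio_weight * math.log(a + eps)
        + (1.0 - audio_weight) * math.log(v + eps)
        for a, v in zip(audio_probs, visual_probs)
    ]
    return softmax(fused)


def predict(probs: Sequence[float]) -> int:
    """Return the index of the most probable phrase class."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Hypothetical example with three phrase classes and three views.
view_probs = {"frontal": 0.1, "diagonal": 0.7, "profile": 0.2}
extractor_table = {
    "frontal": ["frontal_model"],
    "diagonal": ["frontal_model", "diagonal_model"],
    "profile": ["profile_model"],
}
chosen = select_extractors(view_probs, extractor_table)

audio = [0.6, 0.3, 0.1]   # stand-in ASR posteriors
visual = [0.2, 0.7, 0.1]  # stand-in VSR posteriors
balanced = predict(fuse_decisions(audio, visual, audio_weight=0.5))
audio_heavy = predict(fuse_decisions(audio, visual, audio_weight=0.9))
```

In this sketch the two modalities disagree, so the fused decision depends on `audio_weight`: with equal weights the visual stream's more confident prediction wins, while a weight of 0.9 lets the audio prediction dominate. In practice such a weight would be tuned on held-out data, for example to reflect how noisy the audio is.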

Keywords: visual speech recognition; multi-angle lipreading; automatic speech recognition; audio-visual speech recognition; deep learning; view classification
JEL-codes: O3
Date: 2021

Downloads:
https://www.mdpi.com/1999-5903/13/7/182/pdf (application/pdf)
https://www.mdpi.com/1999-5903/13/7/182/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:13:y:2021:i:7:p:182-:d:595038

Future Internet is currently edited by Ms. Grace You