
On the Applicability of Speaker Diarization to Audio Indexing of Non-Speech and Mixed Non-Speech/Speech Video Soundtracks

Robert Mertens, Po-Sen Huang, Luke Gottlieb, Gerald Friedland, Ajay Divakaran and Mark Hasegawa-Johnson
Additional contact information
Robert Mertens: International Computer Science Institute, University of California, Berkeley, USA
Po-Sen Huang: Beckman Institute, University of Illinois at Urbana-Champaign, USA
Luke Gottlieb: International Computer Science Institute, University of California, Berkeley, USA
Gerald Friedland: International Computer Science Institute, University of California, Berkeley, USA
Ajay Divakaran: SRI International Sarnoff, USA
Mark Hasegawa-Johnson: Beckman Institute, University of Illinois at Urbana-Champaign, USA

International Journal of Multimedia Data Engineering and Management (IJMDEM), 2012, vol. 3, issue 3, 1-19

Abstract: A video’s soundtrack is usually highly correlated to its content. Hence, audio-based techniques have recently emerged as a means for video concept detection complementary to visual analysis. Most state-of-the-art approaches rely on manual definition of predefined sound concepts such as “engine sounds” or “outdoor/indoor sounds.” These approaches come with three major drawbacks: manual definitions do not scale, as they are highly domain-dependent; manual definitions are highly subjective with respect to annotators; and a large part of the audio content is omitted, since the predefined concepts are usually found in only a fraction of the soundtrack. This paper explores how unsupervised audio segmentation systems like speaker diarization can be adapted to automatically identify low-level sound concepts similar to annotator-defined concepts, and how these concepts can be used for audio indexing. Speaker diarization systems are designed to answer the question “who spoke when?” by finding segments in an audio stream that exhibit similar properties in feature space, i.e., sound similar. Using a diarization system, all the content of an audio file is analyzed and similar sounds are clustered. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and the theoretical fitness of this approach for discerning one document class from another. It also discusses how diarization can be tuned to better reflect the acoustic properties of general sounds as opposed to speech, and introduces a proof-of-concept system for multimedia event classification working with diarization-based indexing.
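The clustering step the abstract describes, grouping audio segments that "sound similar" in feature space, can be illustrated with a minimal sketch. This is not the paper's system: real diarization operates on MFCC features of actual audio and typically uses BIC-based agglomerative clustering; here synthetic 13-dimensional vectors stand in for MFCCs, and the two acoustic classes (e.g., engine-like vs. crowd-like) are hypothetical.

```python
# Sketch: cluster fixed-length audio segments by acoustic similarity,
# in the spirit of diarization's "what sounded when?" grouping.
# Synthetic feature vectors stand in for real MFCCs.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Six synthetic "segments" drawn from two well-separated acoustic
# classes; 13 dimensions, mimicking 13 MFCC coefficients.
class_a = rng.normal(loc=0.0, scale=0.3, size=(3, 13))
class_b = rng.normal(loc=5.0, scale=0.3, size=(3, 13))
segments = np.vstack([class_a, class_b])

# Agglomerative clustering over pairwise Euclidean distances; cutting
# the dendrogram at two clusters recovers the two sound classes.
tree = linkage(pdist(segments), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

In an indexing pipeline like the one the paper proposes, each resulting cluster becomes a low-level sound concept, and a document is then represented by which clusters occur in its soundtrack.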

Date: 2012
References: Add references at CitEc
Citations:

Downloads: (external link)
http://services.igi-global.com/resolvedoi/resolve. ... 018/jmdem.2012070101 (application/pdf)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:igg:jmdem0:v:3:y:2012:i:3:p:1-19

Access Statistics for this article

International Journal of Multimedia Data Engineering and Management (IJMDEM) is currently edited by Chengcui Zhang

More articles in International Journal of Multimedia Data Engineering and Management (IJMDEM) from IGI Global
Bibliographic data for series maintained by Journal Editor.

 
Page updated 2025-03-19
Handle: RePEc:igg:jmdem0:v:3:y:2012:i:3:p:1-19