The goal of the ICASSP-2017 tutorial is to provide a concise overview of the computational aspects of human attention as applied to multimodal signal processing and to the detection of salient events in multimodal (i.e., audio-visual-text) information streams, such as videos with audio and text. It presents state-of-the-art work in multimodal signal processing, audio-visual saliency models, related audio processing and computer vision algorithms, semantic saliency computation for text, multimodal fusion, technological applications such as audio and movie summarization, and outstanding research frontiers in this area. Application areas of saliency computation include audio-visual event detection, video abstraction and summarization, image/video retrieval, scene analysis, action recognition, object recognition, and perception-based video processing. The tutorial also presents specific state-of-the-art algorithms: a unified energy-based audio-visual framework for frontend processing, a method for text saliency computation, the detection of perceptually salient events in videos, and a movie summarization system for the automatic production of summaries. Finally, it presents COGNIMUSE, a state-of-the-art multimodal video database annotated with sensory and semantic saliency, events, cross-media semantics, and emotion. The database can be used for training and evaluating event detection and summarization algorithms, for classifying and recognizing audio-visual and cross-media events, and for emotion tracking.

DOI

https://dx.doi.org/10.17023/pbp5-9j74

Duration

2:56:37

Subtitles

✖

Man vs. Machine in Conversational Speech Recognition

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

Keywords

Automatic Speech Recognition (ASR)

Switchboard and CallHome datasets

Recurrent neural network

Convolutional neural network

Acoustic modeling

Speaker-adversarial training

Human speech recognition

Authors

George Saon

Date

13 January 2018

We live in an era where more and more tasks, once thought to be impregnable bastions of human intelligence, have succumbed to AI. Are we at the cusp where ASR systems have matched expert humans in conversational speech recognition? We try to answer this question with experimental evidence on the Switchboard English conversational telephony corpus. On the human side, we describe listening experiments that established a new human performance benchmark. On the ASR side, we discuss a series of deep learning architectures and techniques for acoustic and language modeling that were instrumental in lowering the word error rate to record levels on this task.

DOI

https://dx.doi.org/10.17023/d235-0947

Duration

0:59:34

Subtitles

✖
