Multimodal Signal Processing, Saliency and Summarization

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

The goal of the ICASSP-2017 tutorial is to provide a concise overview of the computational aspects of human attention as applied to multimodal signal processing and multimodal (i.e., audio-visual-text) salient event detection in multimodal information streams, such as videos with audio and text. It will present state-of-the-art work in multimodal signal processing and audio-visual saliency models; related audio processing and computer vision algorithms; semantic saliency computation for text; multimodal fusion; technological applications such as audio and movie summarization; and open research frontiers in this area. Application areas of saliency computation include audio-visual event detection, video abstraction and summarization, image/video retrieval, scene analysis, action recognition, object recognition, and perception-based video processing. The tutorial will also present specific state-of-the-art algorithms: a unified energy-based audio-visual framework for front-end processing, a method for text saliency computation, detection of perceptually salient events from videos, and a movie summarization system for the automatic production of summaries. Finally, a state-of-the-art multimodal video database, COGNIMUSE, will be presented. The database is annotated with sensory and semantic saliency, events, cross-media semantics and emotion, and can be used for training and evaluating event detection and summarization algorithms, for classification and recognition of audio-visual and cross-media events, and for emotion tracking.
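
The weighted fusion and summary selection mentioned in the abstract can be illustrated with a minimal sketch, assuming per-frame saliency curves have already been computed for each modality; the function names, weights, and summary ratio below are illustrative assumptions, not the tutorial's actual pipeline.

```python
import numpy as np

def fuse_saliency(audio, visual, text, weights=(1.0, 1.0, 1.0)):
    """Linearly fuse per-frame saliency curves from the three modalities.

    Each input is a 1-D array of per-frame saliency scores; each curve is
    normalized to [0, 1] before fusion so no single modality dominates.
    The weights are illustrative, not values from the tutorial.
    """
    def normalize(x):
        x = np.asarray(x, dtype=float)
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    wa, wv, wt = weights
    fused = wa * normalize(audio) + wv * normalize(visual) + wt * normalize(text)
    return fused / (wa + wv + wt)

def select_summary_frames(fused, ratio=0.2):
    """Return sorted indices of the most salient frames for a summary of the given length ratio."""
    k = max(1, int(len(fused) * ratio))
    return np.sort(np.argsort(fused)[-k:])
```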
Duration
2:56:37

Crossing Speaker and Language Barriers in Speech Processing

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

A person’s speech is strongly conditioned by their own articulators and the language(s) they speak; hence, rendering speech across speakers or languages from a source speaker’s speech data collected in their native language is both academically challenging and desirable for technology and applications. The quality of the rendered speech is assessed in three dimensions: naturalness, intelligibility, and similarity to the source speaker. Usually, the three criteria cannot all be met when rendering is done in both cross-speaker and cross-language ways. We will objectively analyze the key factors of rendering quality in both the acoustic and phonetic domains. Monolingual speech databases recorded by different speakers, or bilingual ones recorded by the same speaker(s), are used. Measures in the acoustic and phonetic spaces are adopted to objectively quantify naturalness, intelligibility, and speaker timbre. Our cross-lingual TTS based on the “trajectory tiling” algorithm is used as the baseline system for comparison. To equalize speaker differences automatically, a speaker-independently trained DNN-based ASR acoustic model is used. Kullback-Leibler divergence is proposed to statistically measure the phonetic similarity between any two given speech segments, from different speakers or languages, in order to select good rendering candidates. Demos of voice conversion, speaker-adaptive TTS, and cross-lingual TTS will be shown across speakers, across languages, or both. The implications of this research for low-resource speech research, speaker adaptation, “average speaker’s voice”, accented/dialectal speech processing, speech-to-speech translation, audio-visual TTS, etc. will be discussed.
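
As a rough illustration of the KLD-based candidate selection described above, here is a minimal sketch assuming frame-level phoneme posteriorgrams (e.g., from a speaker-independent DNN ASR model) are already available for each segment; averaging frames into a single distribution per segment, the symmetric form of the divergence, and the function names are assumptions for illustration rather than the exact formulation used in the talk.

```python
import numpy as np

def segment_posterior(posteriorgram, eps=1e-10):
    """Average a (frames x phones) posteriorgram into one phonetic distribution."""
    p = np.asarray(posteriorgram, dtype=float).mean(axis=0) + eps
    return p / p.sum()

def symmetric_kld(posteriorgram_a, posteriorgram_b):
    """Symmetric Kullback-Leibler divergence between two segments' phonetic distributions.

    Lower values mean the segments are phonetically more similar, so they
    make better rendering candidates.
    """
    p = segment_posterior(posteriorgram_a)
    q = segment_posterior(posteriorgram_b)
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

def best_candidates(target_posteriorgram, candidate_posteriorgrams, k=5):
    """Rank candidate segments by symmetric KLD to the target and return the k closest."""
    scores = [symmetric_kld(target_posteriorgram, c) for c in candidate_posteriorgrams]
    return np.argsort(scores)[:k]
```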
Duration
1:01:08

HLT-MMPL