ARS-IIU

Vision and Language: Bridging Vision and Language with Deep Learning (Part 1 of 2)

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

Recognition of visual content has been a fundamental challenge in computer vision for decades, where previous research predominantly focused on understanding visual content using a predefined yet limited vocabulary. Thanks to the recent development of deep learning techniques, researchers in both the computer vision and multimedia communities are now striving to bridge vision with natural language, which can be regarded as the ultimate goal of visual understanding. We will present recent advances in exploring the synergy of visual understanding and language processing techniques, including vision-language alignment, visual captioning and commenting, visual emotion analysis, visual question answering, and visual storytelling, as well as open issues for this emerging research area.
Duration
1:16:00

Vision and Language: Bridging Vision and Language with Deep Learning (Part 2 of 2)

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

Recognition of visual content has been a fundamental challenge in computer vision for decades, where previous research predominantly focused on understanding visual content using a predefined yet limited vocabulary. Thanks to the recent development of deep learning techniques, researchers in both the computer vision and multimedia communities are now striving to bridge vision with natural language, which can be regarded as the ultimate goal of visual understanding. We will present recent advances in exploring the synergy of visual understanding and language processing techniques, including vision-language alignment, visual captioning and commenting, visual emotion analysis, visual question answering, and visual storytelling, as well as open issues for this emerging research area.
Duration
1:28:02

Multimodal Signal Processing, Saliency, and Summarization

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

The goal of this ICASSP-2017 tutorial is to provide a concise overview of the computational aspects of human attention as applied to multimodal signal processing and to salient event detection in multimodal (i.e., audio-visual-text) information streams, such as videos with audio and text. It presents state-of-the-art work in multimodal signal processing, audio-visual saliency models, related audio processing and computer vision algorithms, semantic saliency computation for text, multimodal fusion, technological applications such as audio and movie summarization, and outstanding research frontiers in this area. Application areas of saliency computation include audio-visual event detection, video abstraction and summarization, image/video retrieval, scene analysis, action recognition, object recognition, and perception-based video processing.

The tutorial additionally presents several state-of-the-art algorithms: a unified energy-based audio-visual framework for front-end processing, a method for text saliency computation, detection of perceptually salient events from videos, and a movie summarization system for the automatic production of summaries. It also introduces a state-of-the-art multimodal video database, COGNIMUSE, annotated with sensory and semantic saliency, events, cross-media semantics, and emotion, which can be used for training and evaluating event detection and summarization algorithms, for classifying and recognizing audio-visual and cross-media events, and for emotion tracking.
Duration
2:56:37