IEEE Transactions on Multimedia

You are here

Top Reasons to Join SPS Today!

1. IEEE Signal Processing Magazine
2. Signal Processing Digital Library*
3. Inside Signal Processing Newsletter
4. SPS Resource Center
5. Career advancement & recognition
6. Discounts on conferences and publications
7. Professional networking
8. Communities for students, young professionals, and women
9. Volunteer opportunities
10. Coming soon! PDH/CEU credits
Click here to learn more.

The video captioning task aims to describe video content using several natural-language sentences. Although one-step encoder-decoder models have achieved promising progress, the generations always involve many errors, which are mainly caused by the large semantic gap between the visual domain and the language domain and by the difficulty in long-sequence generation.

The prevailing use of both images and text to express opinions on the web leads to the need for multimodal sentiment recognition. Some commonly used social media data containing short text and few images, such as tweets and product reviews, have been well studied. However, it is still challenging to predict the readers’ sentiment after reading online news articles, since news articles often have more complicated structures, e.g., longer text and more images.

Recently, dense video captioning has made attractive progress in detecting and captioning all events in a long untrimmed video. Despite promising results were achieved, most existing methods do not sufficiently explore the scene evolution within an event temporal proposal for captioning, and therefore perform less satisfactorily when the scenes and objects change over a relatively long proposal. To address this problem, we propose a graph-based partition-and-summarization (GPaS) framework for dense video captioning within two stages.

Benefiting from the powerful discriminative feature learning capability of convolutional neural networks (CNNs), deep learning techniques have achieved remarkable performance improvement for the task of salient object detection (SOD) in recent years.

While current research on multimedia is essentially dealing with the information derived from our observations of the world, internal activities inside human brains, such as imaginations and memories of past events etc., could become a brand new concept of multimedia, for which we coin as “brain-media”.

JPEG lossy image compression is a still image compression algorithm model that is currently widely used in major network media. However, it is unsatisfactory in the quality of compressed images at low bit rates. The objective of this paper is to improve the quality of compressed images and suppress blocking artifacts by improving the JPEG image compression model at low bit rates.

We have recently seen great progress in image classification due to the success of deep convolutional neural networks and the availability of large-scale datasets. Most of the existing work focuses on single-label image classification. However, there are usually multiple tags associated with an image. The existing works on multi-label classification are mainly based on lab curated labels.

The mnemonic descent method (MDM) algorithm is the first end-to-end recurrent convolutional system for high-accuracy face alignment. However, the heavy computational complexity and high memory access demands make it difficult to satisfy the requirements of real-time applications. To address this problem, an improved MDM (I-MDM) algorithm is proposed for efficient hardware implementation based on several hardware-oriented optimizations.

Recently, soft video multicasting has gained a lot of attention, especially in broadcast and mobile scenarios where the bit rate supported by the channel may differ across receivers, and may vary quickly over time. Unlike the conventional designs that force the source to use a single bit rate according to the receiver with the worst channel quality, soft video delivery schemes transmit the video such that the video quality at each receiver is commensurate with its specific instantaneous channel quality.

An automatic speech recognition (ASR) system is a key component in current speech-based systems. However, the surrounding acoustic noise can severely degrade the performance of an ASR system. An appealing solution to address this problem is to augment conventional audio-based ASR systems with visual features describing lip activity. 


SPS on Twitter

  • The Brain Space Initiative Talk Series continues this Friday, 24 September at 11:00 AM EDT when Dr. Jessica Damoise…
  • The 2022 membership year has begun! Join our community of more than 17,000 signal processing and data science profe…
  • Join us this Tuesday, 21 September for the Women in Signal Processing event at ICIP 2021! Registration available on…
  • The SPACE Webinar Series continues this Tuesday, 21 September when Dr. Bin Dong presents "Data- and Task-Driven CT…
  • Join SPS President Ahmed Tewfik on Wednesday, 22 September for the IEEE Signal Processing Society Town Hall in conj…

SPS Videos

Signal Processing in Home Assistants


Multimedia Forensics

Careers in Signal Processing             


Under the Radar