Multimodal Intelligence: Representation Learning, Information Fusion, and Applications


JSTSP Volume 14 Issue 3


By: Chao Zhang; Zichao Yang; Xiaodong He; Li Deng

Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in its input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embeddings, which unify multimodal signals into a single vector space and thereby enable cross-modality signal processing. We also review the properties of many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe that this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.
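
To make the idea of a single shared vector space concrete, the following is a minimal sketch, not a method taken from the paper. It assumes each modality already has a feature vector, maps both into a shared space with linear projections, and compares them with cosine similarity; it also shows plain concatenation as one simple form of fusion. All names and sizes (`IMAGE_DIM`, `TEXT_DIM`, `SHARED_DIM`, `W_image`, `W_text`) are hypothetical, and the projection weights are random here, whereas in practice they would be learned from paired data.

```python
import math
import random


def linear_project(features, weights):
    """Map a feature vector into the shared space (one dot product per output dimension)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def random_matrix(rows, cols, rng):
    """Random Gaussian weights; a stand-in for learned projection parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
IMAGE_DIM, TEXT_DIM, SHARED_DIM = 8, 5, 4  # hypothetical sizes for illustration

# Assumption: untrained random projections; real systems learn these weights.
W_image = random_matrix(SHARED_DIM, IMAGE_DIM, rng)
W_text = random_matrix(SHARED_DIM, TEXT_DIM, rng)

# Stand-ins for unimodal features (e.g. outputs of an image and a text encoder).
image_features = [rng.random() for _ in range(IMAGE_DIM)]
text_features = [rng.random() for _ in range(TEXT_DIM)]

# Both embeddings have SHARED_DIM entries, so they can be compared directly.
image_embedding = linear_project(image_features, W_image)
text_embedding = linear_project(text_features, W_text)
similarity = cosine_similarity(image_embedding, text_embedding)

# One simple fusion strategy: concatenate the unimodal representations.
fused = image_features + text_features
```

With trained projections, a high `similarity` would indicate that an image and a sentence describe the same content, which is the basis for cross-modal retrieval. Concatenation is only the simplest fusion choice; task-specific fusion architectures replace it with more structured interactions between the modalities.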


Tags:

IEEE JSTSP Article
