IEEE TMM Special Issue on Large Multi-modal Models for Dynamic Visual Scene Understanding

Manuscript Due: 25 August 2024
Publication Date: TBD


Breakthroughs in large models, such as ChatGPT in language processing and large vision models (e.g., ViT and SAM), have demonstrated remarkable versatility and prowess across a wide range of fields and tasks. Despite their success, these single-modal models have limitations in meeting the broader requirements of daily-life applications, especially in the pursuit of artificial general intelligence. This has spurred researchers in the multimedia community to explore Large Multi-Modal Models (LMMs), exemplified by models like CLIP, to enhance multi-modal understanding. More recent LMMs, such as Gemini (Google) and Sora (OpenAI), have demonstrated a powerful ability to understand or generate realistic and imaginative videos.

While LMMs have garnered widespread attention, they face numerous challenges in dealing with dynamic visual scenes. These include integrating and aligning data from multiple modalities such as video, music, and 3D data; addressing domain shifts; handling noisy data and label noise; and discovering novel objects or patterns. Additionally, for comprehending video scenes, infusing temporal consistency and coherence into LMMs remains a significant challenge. Moreover, there is a critical need for research on parameter-efficient fine-tuning of LMMs for diverse dynamic scene tasks.

This special issue aims to provide a platform for researchers to share their latest advances in the theory of Large Multi-modal Models for dynamic visual scene understanding. We also encourage submissions that explore the potential of LMMs for improving accessibility, diversity, and inclusivity in visual scene understanding. 

We invite original and high-quality papers including but not limited to:

  1. New LMM algorithms/models for dynamic visual scene understanding;
  2. Text-video/audio-video synthesis/generation and other multimedia algorithms;
  3. Applications of LMMs in various industries and daily-life settings, such as advertising and entertainment;
  4. Dynamic video scene analysis with fundamental multi-modal AI models, such as video segmentation and video understanding;
  5. Dynamic scene graph generation, knowledge discovery, and reasoning;
  6. Training and adaptation of LMMs, such as lightweight or compressed LMMs;
  7. Open-world visual scene perception with large multi-modal models;
  8. 2D/3D visual scene parsing with LMMs.

Submission Guidelines

Prospective authors should prepare their manuscripts following the IEEE TMM guidelines and submit a PDF version of the complete manuscript according to the following schedule:

Important Dates

  • Submission Deadline: 25 August 2024
  • First Review: 15 October 2024
  • Revisions Due: 15 November 2024
  • Second Review: 15 December 2024
  • Final Decision: 31 January 2025

Guest Editors