Self-Guiding Multimodal LSTM - When We Do Not Have a Perfect Training Dataset for Image Captioning

TIP Volume 28 Issue 11

TIP Featured Articles

By: Yang Xian; Yingli Tian

In this paper, a self-guiding multimodal LSTM (sgLSTM) image captioning model is proposed to handle an uncontrolled, imbalanced real-world image-sentence dataset. We collect a FlickrNYC dataset of 306,165 images from Flickr as our testbed, and the original text descriptions uploaded by the users serve as the ground truth for training. Descriptions in the FlickrNYC dataset vary dramatically, ranging from short term-descriptions to long paragraph-descriptions, and they can describe any visual aspect or even refer to objects that are not depicted. To deal with this imbalanced and noisy situation and to fully explore the dataset itself, we propose a novel guiding textual feature extracted with a multimodal LSTM (mLSTM) model. The mLSTM is trained on the portion of the data in which the image content and the corresponding descriptions are strongly bonded. Afterward, during the training of sgLSTM on the rest of the training data, this guiding information serves as an additional input to the network, along with the image representations and the ground-truth descriptions. By integrating these input components into a multimodal block, we aim to form a training scheme in which the textual information is tightly coupled with the image content. The experimental results demonstrate that the proposed sgLSTM model outperforms the traditional state-of-the-art multimodal RNN captioning framework in successfully describing the key components of the input images.
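
The abstract does not spell out how the guiding feature enters the network, so the following is only a rough sketch of one plausible reading: an LSTM step whose gates receive the current word, the previous hidden state, an image feature, and a guiding vector. The class `GuidedLSTMCell`, the helper `encode_caption`, the additive way the four inputs are combined, the omission of bias terms, and all dimensions are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GuidedLSTMCell:
    """Hypothetical LSTM cell: every gate sees word, hidden, image and guide inputs."""

    SOURCES = ("word", "hidden", "image", "guide")

    def __init__(self, word_dim, image_dim, guide_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        dims = {"word": word_dim, "hidden": hidden_dim,
                "image": image_dim, "guide": guide_dim}
        self.hidden_dim = hidden_dim
        # One randomly initialised weight matrix per (gate, input source) pair.
        self.weights = {
            gate: {src: rand_matrix(hidden_dim, dims[src], rng) for src in self.SOURCES}
            for gate in ("input", "forget", "output", "candidate")
        }

    def _pre_activation(self, gate, inputs):
        # Assumption: contributions from the four sources are simply summed.
        parts = [matvec(self.weights[gate][src], inputs[src]) for src in self.SOURCES]
        return [sum(values) for values in zip(*parts)]

    def step(self, word, hidden, cell, image, guide):
        inputs = {"word": word, "hidden": hidden, "image": image, "guide": guide}
        i = [sigmoid(x) for x in self._pre_activation("input", inputs)]
        f = [sigmoid(x) for x in self._pre_activation("forget", inputs)]
        o = [sigmoid(x) for x in self._pre_activation("output", inputs)]
        g = [math.tanh(x) for x in self._pre_activation("candidate", inputs)]
        new_cell = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, cell, i, g)]
        new_hidden = [ok * math.tanh(ck) for ok, ck in zip(o, new_cell)]
        return new_hidden, new_cell


def encode_caption(cell, words, image, guide):
    # Run the cell over a word sequence, feeding image and guide at every step.
    hidden = [0.0] * cell.hidden_dim
    state = [0.0] * cell.hidden_dim
    for word in words:
        hidden, state = cell.step(word, hidden, state, image, guide)
    return hidden


# Toy inputs: two one-hot "words", a 3-d image feature, a 2-d guiding vector.
cell = GuidedLSTMCell(word_dim=4, image_dim=3, guide_dim=2, hidden_dim=5)
words = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image = [0.5, -0.2, 0.1]
with_guide = encode_caption(cell, words, image, guide=[1.0, 0.0])
without_guide = encode_caption(cell, words, image, guide=[0.0, 0.0])
```

Comparing `with_guide` and `without_guide` shows that, in this toy setup, the guiding vector changes the hidden state even when the words and image feature are identical. In the scheme described in the abstract, that vector would come from the separately trained mLSTM instead of being hand-written.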

Tags:

IEEE TIP Article
