Self-Guiding Multimodal LSTM - When We Do Not Have a Perfect Training Dataset for Image Captioning


Yang Xian; Yingli Tian

This paper proposes a self-guiding multimodal LSTM (sgLSTM) image captioning model for handling an uncontrolled, imbalanced, real-world image-sentence dataset. As a testbed, we collect the FlickrNYC dataset from Flickr, which contains 306,165 images; the original text descriptions uploaded by users serve as the ground truth for training. Descriptions in the FlickrNYC dataset vary dramatically, ranging from short term-level descriptions to long paragraph-level descriptions, and can describe any visual aspect or even refer to objects that are not depicted. To deal with this imbalanced and noisy data and to fully exploit the dataset itself, we propose a novel guiding textual feature extracted with a multimodal LSTM (mLSTM) model. The mLSTM is trained on the portion of the data in which the image content and the corresponding descriptions are strongly related. Afterward, during the training of sgLSTM on the remaining training data, this guiding information serves as an additional input to the network, along with the image representations and the ground-truth descriptions. By integrating these inputs into a multimodal block, we aim to form a training scheme in which the textual information is tightly coupled with the image content. Experimental results demonstrate that the proposed sgLSTM model outperforms the traditional state-of-the-art multimodal RNN captioning framework in successfully describing the key components of the input images.
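The two-stage data flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the `Sample` type, the `split_by_bond` and `build_sglstm_inputs` helpers, and the caller-supplied `is_strongly_bonded` and `guide_fn` callables are all hypothetical names. The abstract does not say how the strongly related subset is selected or how the guiding feature is computed, so both are left as placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Sample:
    """One image-description pair (hypothetical container)."""
    image_feature: List[float]  # placeholder for an image representation
    description: str            # user-uploaded text, possibly noisy


def split_by_bond(
    samples: Sequence[Sample],
    is_strongly_bonded: Callable[[Sample], bool],
) -> Tuple[List[Sample], List[Sample]]:
    """Partition the data into the subset used to train mLSTM and the rest.

    The selection criterion is an assumption supplied by the caller.
    """
    bonded: List[Sample] = []
    rest: List[Sample] = []
    for sample in samples:
        (bonded if is_strongly_bonded(sample) else rest).append(sample)
    return bonded, rest


def build_sglstm_inputs(
    rest: Sequence[Sample],
    guide_fn: Callable[[List[float]], List[float]],
) -> List[Tuple[List[float], List[float], str]]:
    """Attach a guiding textual feature to every remaining sample.

    guide_fn stands in for the trained mLSTM; each result is a triple of
    (image representation, guiding feature, ground-truth description).
    """
    return [
        (sample.image_feature, guide_fn(sample.image_feature), sample.description)
        for sample in rest
    ]
```

In such a sketch, a model standing in for mLSTM would be fit on the first list returned by `split_by_bond`, and its output would then be passed as `guide_fn` when preparing the sgLSTM training inputs from the second list.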
