Deep Reinforcement Polishing Network for Video Captioning

By: Wanru Xu; Jian Yu; Zhenjiang Miao; Lili Wan; Yi Tian; Qiang Ji

The video captioning task aims to describe video content using several natural-language sentences. Although one-step encoder-decoder models have made promising progress, the captions they generate often contain many errors. These errors are mainly caused by the large semantic gap between the visual domain and the language domain and by the difficulty of long-sequence generation. The underlying challenge of video captioning, i.e., sequence-to-sequence mapping across different domains, is still not well handled. Inspired by the way humans proofread text, we gradually polish the generated caption to improve its quality. In this paper, we propose a deep reinforcement polishing network (DRPN) to refine caption candidates. It consists of a word-denoising network (WDN) that revises word errors and a grammar-checking network (GCN) that revises grammar errors. On the one hand, the long-term reward in deep reinforcement learning takes the global quality of caption sentences into account, which benefits long-sequence generation. On the other hand, the caption candidate can be considered a bridge between the visual and language domains: the semantic gap is gradually reduced as repeated revisions produce better candidates. In experiments, we present evaluations showing that the proposed DRPN achieves performance comparable to, and in some cases better than, state-of-the-art methods. Furthermore, the DRPN is model-agnostic and can be integrated into any video captioning model to refine its generated caption sentences.
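The repeated-revision idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `polish`, `discounted_return`, `toy_word_fix`, and `toy_grammar_fix` are hypothetical names introduced here, and the two toy revisers are hand-written rules standing in for the learned WDN and GCN. The sketch assumes only that each reviser maps a caption to a revised caption, and that a long-term reward can be modeled as a standard discounted sum of per-step rewards.

```python
from typing import Callable, List

Caption = List[str]
Reviser = Callable[[Caption], Caption]


def polish(caption: Caption, revisers: List[Reviser], max_rounds: int = 5) -> Caption:
    """Apply each reviser in turn, repeating until the caption stops changing."""
    current = list(caption)
    for _ in range(max_rounds):
        revised = current
        for revise in revisers:
            revised = revise(revised)
        if revised == current:  # no reviser changed anything: stop early
            break
        current = revised
    return current


def discounted_return(rewards: List[float], gamma: float = 0.9) -> float:
    """Discounted sum of per-step rewards: sum over t of gamma**t * rewards[t]."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Toy stand-ins (hypothetical rules, NOT the paper's learned networks).
def toy_word_fix(caption: Caption) -> Caption:
    # Swap one known wrong word, mimicking a word-level revision.
    return ["dog" if word == "cat" else word for word in caption]


def toy_grammar_fix(caption: Caption) -> Caption:
    # Drop immediately repeated words, mimicking a grammar-level revision.
    return [w for i, w in enumerate(caption) if i == 0 or w != caption[i - 1]]


if __name__ == "__main__":
    draft = ["a", "cat", "is", "is", "running"]
    print(polish(draft, [toy_word_fix, toy_grammar_fix]))
    print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))
```

Because `polish` only requires a starting caption and a list of revisers, any captioning model could supply the initial draft, which mirrors the claim that the polishing stage can be attached to an existing captioner. The discounted return shows why a reward that arrives only after the final revision can still credit earlier revision steps.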
