Convolutional Networks With Channel and STIPs Attention Model for Action Recognition in Videos


Hanbo Wu; Xin Ma; Yibin Li

With the help of convolutional neural networks (CNNs), video-based human action recognition has made significant progress. Spatial and channel-wise CNN features provide rich information for powerful image description. However, CNNs lack the ability to model the long-term temporal dependencies of an entire video and cannot effectively focus on the informative motion regions of actions. Aiming at these two problems, we propose a novel video-based action recognition framework in this paper. We first represent videos with dynamic image sequences (DISs), which effectively describe videos by modeling local spatial-temporal dynamics and dependencies. Then a channel and spatial-temporal interest points (STIPs) attention model (CSAM) based on CNNs is proposed to focus on the discriminative channels in networks and the informative spatial motion regions of human actions. Specifically, channel attention (CA) is implemented by automatically learning channel-wise convolutional features and assigning different weights to different channels. STIPs attention (SA) is encoded by projecting the STIPs detected on frames of dynamic image sequences into the corresponding convolutional feature map space. The proposed CSAM is embedded after the CNN convolutional layers to refine the feature maps, followed by global average pooling to produce effective frame-level feature representations. Finally, the frame-level video representations are fed into an LSTM to capture temporal dependencies and perform classification. Experiments on three challenging RGB-D datasets show that our method performs well and outperforms state-of-the-art approaches using only depth data.
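The channel attention step described in the abstract, reweighting each feature-map channel by a learned, data-dependent weight before global average pooling, can be illustrated with a minimal framework-free sketch. This is not the authors' implementation: the function name `channel_attention`, the use of a single scalar gate weight per channel, and the plain-list tensor layout are all simplifying assumptions for illustration.

```python
import math

def channel_attention(feature_maps, gate_weights):
    """Reweight convolutional channels (simplified illustrative sketch).

    feature_maps: list of C channels, each an H x W list of lists of floats.
    gate_weights: list of C scalars standing in for learned parameters.

    Each channel is globally average-pooled to a scalar descriptor, the
    descriptor is passed through a sigmoid gate to get an attention weight
    in (0, 1), and the channel is scaled by that weight.
    """
    refined = []
    for channel, w in zip(feature_maps, gate_weights):
        pooled = sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        attn = 1.0 / (1.0 + math.exp(-w * pooled))  # sigmoid gate on pooled descriptor
        refined.append([[v * attn for v in row] for row in channel])
    return refined

# Two 2x2 channels: an informative one and an all-zero one.
fmaps = [[[1.0, 2.0], [3.0, 4.0]],
         [[0.0, 0.0], [0.0, 0.0]]]
refined = channel_attention(fmaps, [1.0, 1.0])
```

In the paper the refined maps would then be globally average-pooled into a frame-level descriptor and fed to the LSTM; here the sketch stops at the channel reweighting itself.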
