Convolutional Networks With Channel and STIPs Attention Model for Action Recognition in Videos


Hanbo Wu; Xin Ma; Yibin Li

With the help of convolutional neural networks (CNNs), video-based human action recognition has made significant progress. Spatial and channel-wise CNN features provide rich information for powerful image description. However, CNNs lack the ability to model the long-term temporal dependencies of an entire video and cannot effectively focus on the informative motion regions of actions. Aiming at these two problems, we propose a novel video-based action recognition framework in this paper. We first represent videos with dynamic image sequences (DISs), which effectively describe videos by modeling local spatial-temporal dynamics and dependencies. Then a channel and spatial-temporal interest points (STIPs) attention model (CSAM) based on CNNs is proposed to focus on the discriminative channels in networks and the informative spatial motion regions of human actions. Specifically, channel attention (CA) is implemented by automatically learning channel-wise convolutional features and assigning different weights to different channels. STIPs attention (SA) is encoded by projecting the STIPs detected on frames of dynamic image sequences into the corresponding convolutional feature map space. The proposed CSAM is embedded after the CNN convolutional layers to refine the feature maps, followed by global average pooling to produce effective frame-level feature representations. Finally, the frame-level video representations are fed into an LSTM to capture temporal dependencies and perform classification. Experiments on three challenging RGB-D datasets show that our method performs well and outperforms state-of-the-art approaches using only depth data.
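The channel attention step described in the abstract, reweighting each feature-map channel by a learned, data-dependent weight before global average pooling, can be illustrated with a minimal framework-free sketch. This is not the authors' implementation: the function name `channel_attention`, the use of a single scalar gate weight per channel, and the plain-list tensor layout are all simplifying assumptions for illustration.

```python
import math

def channel_attention(feature_maps, gate_weights):
    """Reweight convolutional channels (simplified illustrative sketch).

    feature_maps: list of C channels, each an H x W list of lists of floats.
    gate_weights: list of C scalars standing in for learned parameters.

    Each channel is globally average-pooled to a scalar descriptor, the
    descriptor is passed through a sigmoid gate to get an attention weight
    in (0, 1), and the channel is scaled by that weight.
    """
    refined = []
    for channel, w in zip(feature_maps, gate_weights):
        pooled = sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        attn = 1.0 / (1.0 + math.exp(-w * pooled))  # sigmoid gate on pooled descriptor
        refined.append([[v * attn for v in row] for row in channel])
    return refined

# Two 2x2 channels: an informative one and an all-zero one.
fmaps = [[[1.0, 2.0], [3.0, 4.0]],
         [[0.0, 0.0], [0.0, 0.0]]]
refined = channel_attention(fmaps, [1.0, 1.0])
```

In the paper the refined maps would then be globally average-pooled into a frame-level descriptor and fed to the LSTM; here the sketch stops at the channel reweighting itself.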
