Contributed by Dr. Joe (Zhou) Ren, based on the IEEE Xplore® article, “Robust Part-Based Hand Gesture Recognition Using Kinect Sensor”, published in the IEEE Transactions on Multimedia, February 2013, and the SPS webinar, “Human Centric Visual Analysis - Hand, Gesture, Pose, Action, and Beyond,” available on the SPS Resource Center.
Human-centric visual analysis is the task of analyzing humans using computer vision algorithms. It is essential in computer vision since humans are the key subjects for cameras to analyze. Analyzing human characteristics such as the face, hand, gesture, pose, gait, and action is important for answering questions like “who the person is” and “what the person is doing”, as well as for finer-grained human understanding. Different input modalities have been proposed for this task, such as RGBD data, RGB images, and video. In this article I will introduce four visual analysis tasks covering the human hand, gesture, pose, and action: a) hand gesture recognition using an RGBD camera [1]; b) 3D hand shape and pose estimation from a single RGB image [2]; c) human pose estimation and tracking by modeling temporal dynamics [3]; and d) weakly-guided self-supervised pretraining for action detection [4].
In recent years, we have seen great progress in understanding the human hand, body [5], and action [6] using RGBD cameras. With the additional depth information, it is easier to detect and segment the object of interest, which makes recognition more robust to cluttered backgrounds and occlusion. However, compared to the entire human body, the hand is a smaller object with more complex articulations, and it is more easily affected by segmentation errors. Recognizing hand gestures is therefore a very challenging problem even with RGBD input. Our work [1, 7] builds a robust part-based hand gesture recognition system using the Kinect sensor. To handle the noisy hand shapes obtained from the Kinect sensor, we proposed a novel distance metric, the Finger-Earth Mover’s Distance (FEMD), to measure the dissimilarity between hand shapes. Because it matches only the finger parts rather than the whole hand, it can better distinguish hand gestures with slight differences. Extensive experiments demonstrate that the proposed hand gesture recognition system is accurate (93.2% mean accuracy on a challenging 10-gesture dataset), efficient (0.0750 s per frame on average), robust to hand articulations, distortions, and orientation or scale changes, and able to work in uncontrolled environments (cluttered backgrounds and varying lighting conditions). The superiority of the proposed system has been further demonstrated in two real-life HCI applications [8, 9].
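As a rough illustration of the part-based matching idea (a minimal sketch, not the FEMD implementation from [1, 7]), the snippet below classifies a query hand shape by nearest-neighbor template matching, using an earth-mover-style distance computed only over finger-part descriptors. The finger features (per-finger angles and normalized lengths) and the gesture templates are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D earth mover's distance

def finger_distance(query_fingers, template_fingers):
    """Earth-mover-style dissimilarity between two hands described by
    finger-part features (per-finger angle and normalized length).
    A simplified stand-in for FEMD, not the paper's exact metric."""
    q_angles, q_lengths = query_fingers
    t_angles, t_lengths = template_fingers
    # Match only the finger parts: compare finger-angle distributions,
    # weighting each finger by its normalized length.
    return wasserstein_distance(q_angles, t_angles,
                                u_weights=q_lengths, v_weights=t_lengths)

def classify_gesture(query_fingers, templates):
    """Nearest-neighbor gesture classification over labeled templates."""
    best_label, best_dist = None, np.inf
    for label, template_fingers in templates:
        d = finger_distance(query_fingers, template_fingers)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist

# Toy example with hypothetical finger descriptors (angles in degrees,
# lengths normalized by the palm radius).
templates = [
    ("one",  (np.array([90.0]),        np.array([1.0]))),
    ("two",  (np.array([75.0, 105.0]), np.array([1.0, 0.9]))),
]
query = (np.array([80.0, 100.0]), np.array([0.95, 0.9]))
print(classify_gesture(query, templates))  # -> ("two", ...)
```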
Most current methods in 3D hand analysis from monocular RGB images [10, 11] focus only on estimating the 3D locations of hand keypoints, which cannot fully express the 3D shape of the hand. In contrast, our work [2] addressed the novel and challenging problem of estimating the full 3D hand shape and pose from a single RGB image. We proposed a Graph Convolutional Neural Network (Graph CNN)-based method to reconstruct a full 3D mesh of the hand surface, which contains richer information about both 3D hand shape and pose. To train the networks with full supervision, a large-scale synthetic dataset was created that contains both ground-truth 3D meshes and 3D poses. When fine-tuning the networks on real-world datasets without 3D ground truth, our work [2] proposed a weakly-supervised approach that leverages the depth map as weak supervision during training. Through extensive evaluations on the proposed new datasets and two public datasets, we show that our method produces accurate and reasonable 3D hand meshes from a single RGB image and achieves superior 3D hand pose estimation accuracy compared with prior methods.
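To give a flavor of how graph convolutions operate on a hand mesh (a minimal sketch under the common normalized-adjacency formulation, not the architecture from [2]), the layer below propagates per-vertex features over the mesh connectivity; the vertex count and feature sizes are purely illustrative.

```python
import torch
import torch.nn as nn

class MeshGraphConv(nn.Module):
    """One graph convolution over mesh vertices:
    X' = ReLU(A_hat @ X @ W), with A_hat the symmetrically
    normalized adjacency matrix (self-loops added)."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        A = adjacency + torch.eye(adjacency.size(0))   # add self-loops
        d_inv_sqrt = torch.diag(A.sum(dim=1).pow(-0.5))
        self.register_buffer("A_hat", d_inv_sqrt @ A @ d_inv_sqrt)

    def forward(self, x):                # x: (batch, num_vertices, in_dim)
        x = torch.einsum("vw,bwc->bvc", self.A_hat, self.linear(x))
        return torch.relu(x)

# Toy usage: 4 mesh vertices connected in a ring, 64-d features per vertex,
# regressing 3-D vertex coordinates (all sizes are illustrative only).
adj = torch.tensor([[0., 1., 0., 1.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [1., 0., 1., 0.]])
layer = MeshGraphConv(in_dim=64, out_dim=3, adjacency=adj)
coords = layer(torch.randn(2, 4, 64))    # (batch=2, vertices=4, xyz=3)
print(coords.shape)
```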
Multi-person pose estimation and tracking serve as crucial steps for video understanding. Most existing approaches [12, 13] rely on first estimating poses in each frame and only then performing data association and refinement. Despite the promising results achieved, such a strategy is inevitably prone to missed detections, especially in heavily cluttered scenes, since this tracking-by-detection paradigm is, by nature, largely dependent on visual evidence that is absent in the case of occlusion. Our paper [3] proposed a novel online approach to learning the pose dynamics, which are independent of pose detections in the current frame and hence provide a robust estimate even in challenging scenarios such as occlusion. Specifically, it derives this prediction of dynamics through a graph neural network (GNN) that explicitly accounts for both spatiotemporal and visual information. It takes as input the historical pose tracklets and directly predicts the corresponding poses in the following frame for each tracklet. The predicted poses are then aggregated with the detected poses, if any, in the same frame to produce the final pose, potentially recovering occluded joints missed by the estimator. Experiments on the PoseTrack 2017 and PoseTrack 2018 datasets demonstrate that the proposed method achieves state-of-the-art results on both human pose estimation and tracking.
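A rough sketch of the aggregation step, as I read it: joints predicted from the tracklet dynamics fill in whatever the frame-level detector misses, while a confidence-weighted average is used where both sources are available. The threshold and fusion rule below are illustrative assumptions, not the exact scheme in [3].

```python
import numpy as np

def aggregate_poses(predicted, detected, det_conf, conf_thresh=0.3):
    """Fuse a predicted pose (from historical tracklet dynamics) with a
    detected pose in the same frame.
      predicted: (J, 2) joint coordinates predicted for this frame
      detected:  (J, 2) joint coordinates from the frame-level estimator
      det_conf:  (J,)   detector confidence per joint
    Joints whose detection confidence is below `conf_thresh` (e.g.
    occluded joints) fall back to the predicted coordinates; otherwise a
    confidence-weighted average is taken. Simplified illustration only."""
    fused = predicted.copy()
    visible = det_conf >= conf_thresh
    w = det_conf[visible, None]                 # detector weight in [0, 1]
    fused[visible] = w * detected[visible] + (1.0 - w) * predicted[visible]
    return fused

# Toy example with 3 joints; the third joint is "occluded" (low confidence),
# so its location is recovered from the dynamics prediction.
pred = np.array([[10., 10.], [20., 15.], [30., 25.]])
det  = np.array([[11., 10.], [21., 14.], [ 0.,  0.]])
conf = np.array([0.9, 0.8, 0.05])
print(aggregate_poses(pred, det, conf))
```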
Temporal action detection aims to predict action classes per frame, in contrast to the video-level predictions made in action classification (i.e., action recognition). Because detection requires expensive frame-level annotations, the scale of detection datasets is limited. Thus, previous work on temporal action detection commonly resorts to fine-tuning a classification model pretrained on large-scale classification datasets (e.g., Kinetics-400). However, such pretrained models are not ideal for downstream detection performance due to the disparity between the pretraining and downstream fine-tuning tasks. Our work [4] proposed a novel self-supervised pretraining method for detection that leverages classification labels to mitigate this disparity by introducing frame-level pseudo labels, multi-action frames, and action segments. We show that models pretrained with the proposed self-supervised detection task outperform prior work on multiple challenging action detection benchmarks, including Charades and MultiTHUMOS. Our extensive ablations further provide insights into when and how to use the proposed models for action detection.
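The snippet below sketches one simplified reading of the pseudo-label idea: video-level classification labels are turned into frame-level targets by assigning each video's labels to contiguous segments, and two videos can be overlaid to create multi-action frames. The segment placement, lengths, and mixing rule here are my own illustrative assumptions, not the procedure in [4].

```python
import numpy as np

def make_frame_pseudo_labels(num_frames, video_labels, num_classes,
                             seg_frac=0.5, rng=np.random.default_rng(0)):
    """Turn a video-level label set into frame-level pseudo labels by
    assigning each class to a random contiguous segment of the video.
    Returns a (num_frames, num_classes) multi-hot target matrix."""
    targets = np.zeros((num_frames, num_classes), dtype=np.float32)
    seg_len = max(1, int(seg_frac * num_frames))
    for c in video_labels:
        start = rng.integers(0, num_frames - seg_len + 1)
        targets[start:start + seg_len, c] = 1.0
    return targets

def mix_videos(targets_a, targets_b):
    """Create multi-action frames by overlaying the pseudo labels of two
    videos (a frame can then carry actions from both sources)."""
    return np.clip(targets_a + targets_b, 0.0, 1.0)

# Toy example: a 10-frame video labeled {class 2} mixed with one labeled {class 5}.
a = make_frame_pseudo_labels(10, [2], num_classes=8)
b = make_frame_pseudo_labels(10, [5], num_classes=8)
print(mix_videos(a, b).sum(axis=1))   # frames with 2 actions where segments overlap
```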
With finer and better analysis of human characteristics, from hand and gesture to pose and action, we can make cameras smarter and better at analyzing humans. Moreover, different applications can be built on such fine-grained understanding, such as person re-identification (ReID), tracking, activity understanding, and human-computer interaction (HCI). To further answer complicated questions such as “who is doing what” and “why someone is doing it”, we must strive to develop computer vision and machine learning technologies that understand humans more holistically. Looking to the future, we should devote our efforts to advancing approaches that effectively model human characteristics, efficiently utilize the huge amount of multi-modal human data on the Internet, and leverage the available contextual information.
[1] Z. Ren, J. Yuan, J. Meng and Z. Zhang, "Robust Part-Based Hand Gesture Recognition Using Kinect Sensor," in IEEE Transactions on Multimedia, vol. 15, no. 5, pp. 1110-1120, Aug. 2013, doi: https://dx.doi.org/10.1109/TMM.2013.2246148.
[2] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, and Junsong Yuan, “3D Hand Shape and Pose Estimation from a Single RGB Image”, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019 (Oral), doi: https://dx.doi.org/10.1109/CVPR.2019.01109.
[3] Yiding Yang, Zhou Ren, Haoxiang Li, Chunluan Zhou, Xinchao Wang, and Gang Hua, “Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking”, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, doi: https://dx.doi.org/10.1109/CVPR46437.2021.00798.
[4] Kumara Kahatapitiya, Zhou Ren, Haoxiang Li, Zhenyu Wu, Gang Hua, and Michael S. Ryoo, “Weakly-guided Self-supervised Pretraining for Temporal Activity Detection”, in AAAI Conference on Artificial Intelligence (AAAI), 2023.
[5] Qin Cai, David Gallup, Cha Zhang, and Zhengyou Zhang, “3D deformable face tracking with a commodity depth camera”, in European Conference on Computer Vision (ECCV), 2010, doi: https://dl.acm.org/doi/10.5555/1927006.1927026.
[6] Jiang Wang, Zicheng Liu, Ying Wu and Junsong Yuan, “Mining actionlet ensemble for action recognition with depth cameras”, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, doi: https://dx.doi.org/10.1109/CVPR.2012.6247813.
[7] Zhou Ren, Junsong Yuan, and Zhengyou Zhang, “Robust Hand Gesture Recognition based on Finger-Earth Mover’s Distance with a Commodity Depth Camera”, in ACM Multimedia (ACM MM), 2011, doi: https://dl.acm.org/doi/10.1145/2072298.2071946.
[8] Zhou Ren, Jingjing Meng, Junsong Yuan, and Zhengyou Zhang, “Robust Hand Gesture Recognition with Kinect Sensor”, in ACM Multimedia (ACM MM), 2011, doi: https://dl.acm.org/doi/10.1145/2072298.2072443.
[9] Zhou Ren, Jingjing Meng, and Junsong Yuan, “Depth Camera based Hand Gesture Recognition and its Applications in Human-Computer-Interaction”, in IEEE International Conference on Information, Communication, and Signal Processing (ICICS), 2011 (Oral), doi: https://dx.doi.org/10.1109/ICICS.2011.6173545.
[10] Adrian Spurr, Jie Song, Seonwook Park, and Otmar Hilliges, “Cross-modal deep variational hand pose estimation”, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, doi: https://dx.doi.org/10.1109/CVPR.2018.00017.
[11] Christian Zimmermann and Thomas Brox, “Learning to estimate 3d hand pose from single RGB images”, in International Conference on Computer Vision (ICCV), 2017, doi: https://dx.doi.org/10.1109/ICCV.2017.525.
[12] Chunluan Zhou, Zhou Ren, and Gang Hua, “Temporal keypoint matching and refinement network for pose estimation and tracking”, in European Conference on Computer Vision (ECCV), 2020, doi: https://doi.org/10.1007/978-3-030-58542-6_41.
[13] Manchen Wang, Joseph Tighe, and Davide Modolo, “Combining detection and tracking for human pose estimation in videos”, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, doi: https://dx.doi.org/10.1109/CVPR42600.2020.01110.