Audio-Visual Deep Neural Network for Robust Person Verification

Yanmin Qian; Zhengyang Chen; Shuai Wang

Voice and face are two of the most popular biometrics for person verification, used in speaker verification and face verification respectively. It has already been observed that simply combining the information from these two modalities can lead to a more powerful and robust person verification system. In this article, to fully explore multi-modal learning strategies for person verification, we propose three types of audio-visual deep neural network (AVN): feature-level AVN (AVN-F), embedding-level AVN (AVN-E), and embedding-level combination with joint learning AVN (AVN-J). To further enhance system robustness in real noisy conditions, where one or both modalities may not be available in high quality, we propose data augmentation strategies for each AVN: a feature-level multi-modal data augmentation for AVN-F, and an embedding-level data augmentation with novel noise distribution matching for AVN-E. For AVN-J, both the feature-level and embedding-level multi-modal data augmentation methods can be applied. All the proposed models are trained on the VoxCeleb2 dev dataset and evaluated on the standard VoxCeleb1 dataset. The best system achieves 0.558%, 0.441%, and 0.793% equal error rate (EER) on the three official trial lists of VoxCeleb1, which are, to our knowledge, the best published single-system results on this corpus for person verification. To validate the robustness of the proposed approaches, a noisy evaluation set based on VoxCeleb1 is constructed. Experimental results show that the proposed systems significantly improve robustness and still perform well under this noisy scenario.
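For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an equal error rate can be computed from verification scores. It is not the authors' evaluation code: the function name `equal_error_rate` and the toy scores are hypothetical, and the sketch assumes that higher scores mean "more likely the same person" and that the EER is approximated at the threshold where the false acceptance and false rejection rates are closest.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return the EER (as a fraction) for two lists of similarity scores.

    target_scores: scores of same-person trials (hypothetical input).
    nontarget_scores: scores of different-person trials (hypothetical input).
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-person trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-person trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy example: four same-person trials and four different-person trials.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(100 * eer)  # prints 25.0
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the figures quoted in the abstract. Production toolkits typically interpolate between thresholds rather than picking the nearest one, so results on real trial lists may differ slightly from this sketch.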

