End-to-End Audiovisual Speech Recognition System With Multitask Learning

TMM Volume 23 | 2021

By:

Fei Tao; Carlos Busso

An automatic speech recognition (ASR) system is a key component in current speech-based systems. However, surrounding acoustic noise can severely degrade the performance of an ASR system. An appealing solution to this problem is to augment conventional audio-based ASR systems with visual features describing lip activity. This paper proposes a novel end-to-end, multitask learning (MTL), audiovisual ASR (AV-ASR) system. A key novelty of the approach is the use of MTL, where the primary task is AV-ASR and the secondary task is audiovisual voice activity detection (AV-VAD). We obtain a robust and accurate audiovisual system that generalizes across conditions. By detecting segments with speech activity, the AV-ASR performance improves because its connectionist temporal classification (CTC) loss function can leverage the alignment information provided by AV-VAD. Furthermore, the end-to-end system learns a discriminative high-level representation for both speech tasks directly from the raw audiovisual inputs, providing the flexibility to mine information directly from the data. The proposed architecture considers the temporal dynamics within and across modalities, providing an appealing and practical fusion scheme. We evaluate the proposed approach on a large audiovisual corpus (over 60 hours), which contains different channel and environmental conditions, comparing the results with competitive single task learning (STL) and MTL baselines. Although our main goal is to improve the performance of the ASR task, the experimental results show that the proposed approach achieves the best performance across all conditions for both speech tasks. In addition to state-of-the-art performance in AV-ASR, the proposed solution also provides valuable information about speech activity, solving two of the most important tasks in speech-based applications.
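The abstract pairs a CTC loss for the primary AV-ASR task with a secondary AV-VAD task, but does not state how the two objectives are combined. The following is a minimal sketch of one common way to form a multitask objective, assuming a weighted sum of a CTC negative log-likelihood and a frame-level binary cross-entropy for speech activity. The function names, the `vad_weight` parameter, and the weighted-sum form are all illustrative assumptions, not the authors' implementation. The CTC term is computed in the probability domain for readability, so it is only suitable for short toy sequences.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss for one utterance using the standard forward algorithm.

    probs:  list of T frames, each a list of per-symbol probabilities.
    target: list of label indices without blanks.
    """
    # Extended label sequence with blanks interleaved: b, l1, b, l2, ..., b
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states, num_frames = len(ext), len(probs)

    # Paths may start in the leading blank or in the first label.
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end in the last label or in the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return -math.log(likelihood)


def vad_cross_entropy(speech_probs, speech_labels, eps=1e-12):
    """Mean binary cross-entropy over frames (1 = speech, 0 = non-speech)."""
    total = 0.0
    for p, y in zip(speech_probs, speech_labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(speech_labels)


def multitask_loss(asr_probs, target, speech_probs, speech_labels,
                   vad_weight=0.5):
    """Hypothetical MTL objective: primary CTC term plus weighted VAD term."""
    asr_term = ctc_neg_log_likelihood(asr_probs, target)
    vad_term = vad_cross_entropy(speech_probs, speech_labels)
    return asr_term + vad_weight * vad_term


if __name__ == "__main__":
    # Toy example: two frames, vocabulary {0: blank, 1: "a"}, target "a".
    asr_probs = [[0.5, 0.5], [0.5, 0.5]]
    loss = multitask_loss(asr_probs, [1], [0.5, 0.5], [1, 0])
    print(f"multitask loss: {loss:.4f}")
```

In a trained system both terms would be computed from the outputs of a shared audiovisual network, so gradients from the speech activity term also shape the representation used for recognition. A production implementation would use a log-domain CTC routine from a deep learning framework rather than this probability-domain version.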
