SPS Webinar: 19 January 2023, presented by Dr. Fei Tao and Dr. Carlos Busso

Upcoming SPS Webinar!

Title: Advances on Multimodal Machine Learning Solutions for Speech Processing Tasks and Emotion Recognition
Date: 19 January 2023
Time: 1:00 PM Eastern (New York time)
Duration: Approximately 1 Hour
Presenters: Dr. Fei Tao and Dr. Carlos Busso

Based on the IEEE Xplore® article: End-to-End Audiovisual Speech Recognition System with Multitask Learning
Published: IEEE Transactions on Multimedia, February 2020, available in IEEE Xplore®

Download: The original article is available for download.

Register for the Webinar

Abstract:

Recent advances in multimodal processing have led to promising solutions for speech-processing tasks. One example is automatic speech recognition (ASR), a key component of current speech-based systems. Since surrounding acoustic noise can severely degrade the performance of an ASR system, an appealing solution is to augment conventional audio-based ASR systems with visual features describing lip activity. We describe a novel end-to-end, multitask learning (MTL), audiovisual ASR (AV-ASR) system. A key novelty of the approach is the use of MTL, where the primary task is AV-ASR and the secondary task is audiovisual voice activity detection (AV-VAD). We obtain a robust and accurate audiovisual system that generalizes across conditions. By detecting segments with speech activity, the AV-ASR performance improves because its connectionist temporal classification (CTC) loss function can leverage the AV-VAD alignment information. Furthermore, the end-to-end system learns a discriminative high-level representation for both speech tasks from the raw audiovisual inputs, providing the flexibility to mine information directly from the data. The proposed architecture considers the temporal dynamics within and across modalities, providing an appealing and practical fusion scheme. In addition to state-of-the-art performance in AV-ASR, the proposed solution also provides valuable information about speech activity, solving two of the most important tasks in speech-based applications.

This webinar will also discuss advances in multimodal solutions for emotion recognition. We describe multimodal pretext tasks that are carefully designed to learn better representations for predicting emotional cues from speech, leveraging the relationship between acoustic and facial features. We also discuss our current effort to design multimodal emotion recognition strategies that effectively combine auxiliary networks, a transformer architecture, and an optimized training mechanism for aligning modalities, capturing temporal information, and handling missing features. These models offer principled solutions to increase the generalization and robustness of emotion recognition systems.
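
As a rough illustration of the multitask setup described above (not the authors' implementation), the Python sketch below shows a shared audiovisual encoder feeding two heads: a primary AV-ASR head trained with a CTC loss and a secondary AV-VAD head trained with a frame-level cross-entropy loss. The layer types, dimensions, concatenation-based fusion, and loss weight alpha are illustrative assumptions, and the audio and video streams are assumed to be already synchronized to a common frame rate.

# Minimal sketch of the multitask AV-ASR / AV-VAD idea (illustrative only).
import torch
import torch.nn as nn

class AVMultitaskModel(nn.Module):
    def __init__(self, audio_dim=40, video_dim=512, hidden=256, vocab_size=30):
        super().__init__()
        # Modality-specific recurrent encoders model within-modality temporal dynamics.
        self.audio_enc = nn.LSTM(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.video_enc = nn.LSTM(video_dim, hidden, batch_first=True, bidirectional=True)
        # Fusion layer models cross-modality dynamics over the concatenated features.
        self.fusion = nn.LSTM(4 * hidden, hidden, batch_first=True, bidirectional=True)
        # Primary task head: per-frame character posteriors for CTC-based AV-ASR.
        self.asr_head = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank
        # Secondary task head: per-frame speech/non-speech posteriors (AV-VAD).
        self.vad_head = nn.Linear(2 * hidden, 2)

    def forward(self, audio, video):
        # audio: (B, T, audio_dim), video: (B, T, video_dim), assumed time-aligned.
        a, _ = self.audio_enc(audio)
        v, _ = self.video_enc(video)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        return self.asr_head(fused), self.vad_head(fused)

def multitask_loss(asr_logits, vad_logits, targets, input_lens, target_lens,
                   vad_labels, alpha=0.5):
    # CTC loss for the primary transcription task (blank = last index).
    ctc = nn.CTCLoss(blank=asr_logits.size(-1) - 1, zero_infinity=True)
    log_probs = asr_logits.log_softmax(-1).transpose(0, 1)  # CTCLoss expects (T, B, C)
    asr_loss = ctc(log_probs, targets, input_lens, target_lens)
    # Frame-level cross-entropy for the secondary AV-VAD task.
    vad_loss = nn.functional.cross_entropy(vad_logits.reshape(-1, 2), vad_labels.reshape(-1))
    return asr_loss + alpha * vad_loss

In the webinar's framing, the AV-VAD supervision informs the CTC alignment about which frames contain speech; in this sketch that coupling is reduced to a simple weighted sum of the two losses.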

Biography:

Dr. Fei Tao

Dr. Fei Tao received the B.S. degree in electrical engineering from Beijing Jiaotong University (BJTU), Beijing, China, in 2009, the M.S. degree from Texas Southern University (TSU), Houston, TX, USA, and the Ph.D. degree from the University of Texas at Dallas, Richardson, TX, USA, in 2018.

He is currently a Senior Applied Scientist at Amazon Web Services (AWS), where he leads a team developing multimodal artificial intelligence/machine learning (AIML) models. His research interest is in multimodal machine learning involving audio, video, and text. He has worked on speech recognition, speaker verification, multimodal emotion recognition, multimodal active speaker detection, source separation, text-to-speech synthesis, music synthesis, and multimodal advertisement recommendation.

Dr. Tao has served as a reviewer for top conferences, including ICMI, Interspeech, ICASSP, NAACL, ACL, EMNLP, and AAAI, and for top journals, such as the IEEE Transactions on Acoustics, Speech, and Signal Processing.

 

Dr. Carlos Busso

Dr. Carlos Busso received the B.S. and M.S. (hons.) degrees in electrical engineering from the University of Chile, Santiago, Chile, in 2000 and 2003, respectively, and the Ph.D. degree in electrical engineering from the University of Southern California (USC), Los Angeles, CA, USA, in 2008.

He is a Professor in the Department of Electrical and Computer Engineering at the University of Texas at Dallas, where he is also the director of the Multimodal Signal Processing (MSP) Laboratory. His research interest is in human-centered multimodal machine intelligence and applications, with a focus on the broad areas of affective computing, multimodal human-machine interfaces, in-vehicle active safety systems, and machine learning methods for multimodal processing. He has worked on audio-visual emotion recognition, analysis of emotional modulation in gestures and speech, design of realistic human-like virtual characters, and detection of driver distractions. His research group has received funding from agencies including the National Science Foundation (NSF), the National Institutes of Health (NIH), the Biometric Center of Excellence (BCOE), and the Semiconductor Research Corporation (SRC), as well as grants from industry (Samsung, Robert Bosch LLC, Microsoft, Honda Research Institute).

Dr. Busso is a recipient of an NSF CAREER Award. In 2014, he received the ICMI Ten-Year Technical Impact Award. In 2015, his student received the third-prize IEEE ITSS Best Dissertation Award (N. Li). He also received the Hewlett Packard Best Paper Award at IEEE ICME 2011 (with J. Jain) and the Best Paper Award at AAAC ACII 2017 (with Yannakakis and Cowie). He received the Best of IEEE Transactions on Affective Computing Paper Collection in 2021 (with R. Lotfian) and the Best Paper Award from the IEEE Transactions on Affective Computing in 2022 (with Yannakakis and Cowie). He is a co-author of the winning paper of the Classifier Sub-Challenge event at the Interspeech 2009 emotion challenge. He has served in chair positions for top conferences in the field, including the International Conference on Multimodal Interaction (ICMI), Interspeech, the IEEE International Conference on Multimedia & Expo (ICME), the IEEE International Conference on Automatic Face and Gesture Recognition (FG), and the AAAC Conference on Affective Computing and Intelligent Interaction (ACII). He has served as the general chair of ACII 2017 and ICMI 2021, and as program chair of ICMI 2016, VCIP 2017, and ASRU 2021. He is currently serving as an associate editor of the IEEE Transactions on Affective Computing. He is an IEEE Fellow, a member of the International Speech Communication Association (ISCA) and the Association for the Advancement of Affective Computing (AAAC), and a senior member of the Association for Computing Machinery (ACM).
