SPS SLTC/AASP TC Webinar: Audio Signal Enhancement: A Weakly Supervised Deep Learning Approach

Date: 15 January 2025
Time: 7:00 AM ET (New York Time)
Presenter(s): Dr. Nobutaka Ito, Dr. Yoshiaki Bando

Abstract

Audio signals are often contaminated by noise, degrading perceptual quality and system performance (e.g., automatic speech recognition). Audio signal enhancement aims to extract the desired signal from such noisy signals.

The predominant supervised deep learning approach relies on paired noisy and clean signals during training. However, this approach often fails in the real world because it assumes ideal conditions that are rarely met. First, clean signals are assumed to be readily available and are used to synthesize noisy signals by adding reverberation and noise, yielding paired training data. This assumption is often unrealistic, as recording clean signals can be time-consuming, costly, or even infeasible. Second, the training data are assumed to follow the same probability distribution as the real test data. In practice, however, domain shifts commonly arise because it is difficult to simulate real data accurately and to capture the full diversity of real noise environments in the training data, which may result in poor performance on real test data.

This webinar introduces weakly supervised deep learning as a robust alternative for such non-ideal conditions. We begin by outlining fundamental concepts and then explore two approaches that circumvent the need for paired training data. The first is a generative speech enhancement approach, in which a variational autoencoder is pretrained on unpaired clean speech and a nonnegative matrix factorization (NMF) noise model is optimized on the fly during inference. This allows the approach to adapt to diverse, in-the-wild noise environments. The second approach is based on positive–unlabeled (PU) learning from unpaired noisy signals and noise, both of which can be recorded directly in real noisy environments. This approach is applicable even when clean signals are unavailable during training.
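To make the generative approach concrete, here is a minimal PyTorch sketch of VAE-NMF-style inference. The toy decoder stands in for a VAE decoder pretrained on clean speech, the data are random placeholders, and all shapes, step counts, and the Itakura-Saito-style objective are illustrative assumptions rather than the presenters' exact recipe.

    import torch

    F_bins, T_frames, D_lat, K_nmf = 257, 100, 16, 8

    # Stand-in for a decoder pretrained on clean speech: latent z -> clean power spectrogram.
    decoder = torch.nn.Sequential(
        torch.nn.Linear(D_lat, 128), torch.nn.Tanh(),
        torch.nn.Linear(128, F_bins), torch.nn.Softplus(),
    )
    for p in decoder.parameters():
        p.requires_grad_(False)  # pretrained and frozen at inference time

    X = torch.rand(T_frames, F_bins) + 0.1  # |noisy STFT|^2 (placeholder data)

    # Latent speech code and NMF noise factors, all fitted on the fly during inference.
    z = torch.zeros(T_frames, D_lat, requires_grad=True)
    W_log = torch.randn(F_bins, K_nmf, requires_grad=True)   # log noise spectral bases
    H_log = torch.randn(K_nmf, T_frames, requires_grad=True) # log noise activations

    opt = torch.optim.Adam([z, W_log, H_log], lr=1e-2)
    for _ in range(500):
        opt.zero_grad()
        S = decoder(z)                      # speech power estimate, shape (T, F)
        N = (W_log.exp() @ H_log.exp()).T   # NMF noise power, nonnegative by construction
        V = S + N + 1e-8                    # noisy power under a local Gaussian model
        loss = (X / V + V.log()).sum()      # Itakura-Saito-style fit to the noisy power
        loss.backward()
        opt.step()

    with torch.no_grad():
        S = decoder(z)
        N = (W_log.exp() @ H_log.exp()).T
        mask = S / (S + N)   # Wiener-like mask to apply to the noisy STFT

Because only z and the NMF factors are optimized at test time, the noise model can adapt to whatever environment the recording comes from, which is what enables the in-the-wild behavior described above.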
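The PU learning approach can likewise be summarized in a short sketch. The non-negative PU risk below follows Kiryo et al. (2017); the class prior pi_p, the feature dimension, and the toy per-frame classifier are assumptions for illustration, and the step from classifier outputs to an enhancement mask is omitted.

    import torch

    pi_p = 0.5   # assumed class prior of the positive (noise-dominant) class

    # Toy per-frame classifier on 257-dim spectral features; architecture is illustrative.
    net = torch.nn.Sequential(
        torch.nn.Linear(257, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1),
    )

    def sigmoid_loss(scores, y):
        # Sigmoid surrogate loss l(z, y) = sigmoid(-y * z), averaged over the batch.
        return torch.sigmoid(-y * scores).mean()

    def nn_pu_risk(scores_p, scores_u):
        # Non-negative PU risk: the negative-class risk is estimated from
        # unlabeled data and clipped at zero to curb overfitting.
        risk_pos = pi_p * sigmoid_loss(scores_p, +1.0)
        risk_neg = sigmoid_loss(scores_u, -1.0) - pi_p * sigmoid_loss(scores_p, -1.0)
        return risk_pos + torch.clamp(risk_neg, min=0.0)

    # One illustrative optimization step on random stand-in batches.
    x_p = torch.randn(64, 257)   # features from noise-only recordings (positive)
    x_u = torch.randn(64, 257)   # features from noisy recordings (unlabeled)
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    opt.zero_grad()
    loss = nn_pu_risk(net(x_p).squeeze(-1), net(x_u).squeeze(-1))
    loss.backward()
    opt.step()

Note that both training sets, noise-only and noisy recordings, can be collected directly in the target environment, which is why no clean speech is needed at any point during training.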

Biography

Nobutaka Ito (M’12–SM’20) received the B.E. degree from the University of Tokyo, Tokyo, Japan, in 2007, and the M.S. and Ph.D. degrees in information science and technology from the same university in 2009 and 2012, respectively.

He is currently a Project Lecturer at the Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo. He joined the Communication Science Laboratories, NTT Corporation, Japan, in 2012, where he held the position of Senior Research Scientist from 2020 to 2021. From 2019 to 2020, he served as a Visiting Industrial Fellow at the University of Cambridge, UK. His research interests include source separation, microphone array processing, signal enhancement, dereverberation, speaker diarization, and multi-target tracking.

Dr. Ito received the Best Paper Award at the IEEE International Conference on Acoustics, Speech, and Signal Processing in 2023, the Best Paper Award at the International Workshop on Acoustic Signal Enhancement in 2018, the Best Paper Award Honorable Mention at the IEEE Workshop on Automatic Speech Recognition and Understanding in 2015, and the JSIAM Transactions Best Paper Award of the Japan Society for Industrial and Applied Mathematics (JSIAM) in 2009. He chairs the EDICS Subcommittee of the Audio and Acoustic Signal Processing Technical Committee and serves as an elected member of the Speech and Language Technical Committee of the IEEE Signal Processing Society. Additionally, he is a Senior Area Editor of the IEEE Signal Processing Letters and an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing. He is a member of the Acoustical Society of Japan and the Institute of Electronics, Information and Communication Engineers.

Yoshiaki Bando (M’18) received the M.S. and Ph.D. degrees in informatics from Kyoto University, Kyoto, Japan, in 2015 and 2018, respectively.

He has been a Senior Researcher at the Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Japan, since 2022, having been a Researcher there from 2018 to 2022. He has also been a Visiting Researcher at the RIKEN Center for Advanced Intelligence Project (AIP), Japan, since 2020. His research interests include microphone array signal processing, deep Bayesian learning, and robot audition.

Dr. Bando received the IEEE SPS Japan Student Conference Paper Award in 2018, the Advanced Robotics Best Paper Award in 2016, and the Most Innovative Paper Award at IEEE SSRR 2015. He is a member of the Acoustical Society of Japan and the Robotics Society of Japan.