SPS SL TC Webinar: Neural Target Speech and Sound Extraction: An Overview

Date: 6 June 2024
Time: 7:30 PM ET (New York Time)
Presenter(s): Dr. Marc Delcroix

Abstract

Humans can listen to a desired sound within a complex acoustic scene consisting of a mixture of various sounds. This phenomenon, called the cocktail party effect or selective hearing, enables us to listen to an interlocutor in a noisy cafe, focus on a particular instrument in a song, or notice a siren on the road. One of the long-term goals of speech and audio processing research is to reproduce the selective hearing ability of humans computationally.

In this webinar, the presenter will discuss target speech/sound extraction (TSE), which is one approach towards achieving this goal. TSE isolates the speech signal of a target speaker or a target sound from a mixture of several speakers or sounds using clues that identify the target in the mixture. Such clues might be a spatial clue indicating the direction of the target, a video of the target, or a prerecorded enrollment audio from which the speaker’s voice or the target sound characteristics can be derived.

TSE is an emerging field of research that has received increased attention in recent years because it offers a practical approach to the cocktail party problem and draws on several areas of signal processing, including audio, visual, and array processing, as well as deep learning. The presenter will introduce the foundations of neural-based TSE for speech and arbitrary sounds and present recent research in the area. He will then guide the audience through the major approaches, emphasizing the similarities among frameworks and discussing potential future directions.

Biography

Marc Delcroix (Sr. M. IEEE) received the M.Eng. degree from the Free University of Brussels, Brussels, Belgium, and Ecole Centrale Paris, Paris, France, in 2003, and the Ph.D. degree from the Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan, in 2007.

He is currently a Distinguished Researcher with NTT Communication Science Laboratories, NTT Corporation, Japan. He was a Research Associate with NTT Communication Science Laboratories from 2007 to 2008 and from 2010 to 2012, and then he became a Permanent Research Scientist with the same lab in 2012. He was a Visiting Lecturer with the Faculty of Science and Engineering of Waseda University, Tokyo, Japan, from 2015 to 2018. His research interests include various aspects of speech and audio processing, such as target speech and sound extraction, speech enhancement, robust speech recognition, model adaptation, and speaker diarization.

Dr. Delcroix is a member of the CHiME challenge steering committee. He was a member of the IEEE Signal Processing (SP) Society Speech and Language Processing Technical Committee (SL-TC) from 2018 to 2023, and served on the organizing committees of the REVERB Challenge 2014, ASRU 2017, and SLT 2022. He was the recipient of the 2006 Student Paper Award from the IEEE Kansai Section, the 2015 IEEE Automatic Speech Recognition and Understanding Workshop Best Paper Award honorable mention, and the ACM Multimedia 2022 Best Paper Runner-Up Award.