SPS Webinar: The Changing Landscape of Speech Foundation Models

Date: 6 August 2024
Time: 1:00 PM ET (New York Time)
Presenter(s): Dr. Shinji Watanabe, Dr. Abdelrahman Mohamed, Dr. Karen Livescu, Dr. Hung-yi Lee, Dr. Tara Sainath, Dr. Katrin Kirchhoff & Dr. Shang-Wen Li

The original article will be publicly available for download for 48 hours, starting on the day of the webinar.


The paper "Self-Supervised Speech Representation Learning: A Review", published in 2022, examined how representation learning transformed the landscape of speech perception models and AI applications. In the two years since its publication, however, numerous developments in building "foundation models" have blurred the boundaries between domains. Generative models have claimed the largest share of research innovation, owing to their impressive performance across many modalities and their applicability to a wider set of scenarios. In this talk, the presenters will connect their 2022 review of self-supervised approaches to current developments in foundation perception and generative models. They will highlight active directions of research in foundation models, methods for analyzing them, and how they compare to other approaches across a wide range of speech applications.


Dr. Shinji Watanabe received the B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan.  His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing.

He is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He was previously a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011; a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, in 2009; and a senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017. Immediately before joining Carnegie Mellon, he was an associate research professor at Johns Hopkins University, Baltimore, MD, USA, from 2017 to 2020.

Dr. Watanabe has published over 400 papers in peer-reviewed journals and conferences and received several awards, including the Best Paper Award at IEEE ASRU 2019. He is a Senior Area Editor of the IEEE Transactions on Audio, Speech, and Language Processing. He has served as a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and the Machine Learning for Signal Processing Technical Committee (MLSP). He is a Fellow of the IEEE and of ISCA.

Dr. Abdelrahman Mohamed received the Ph.D. in computer science from the University of Toronto, where he was part of the team that started the deep learning revolution in spoken language processing in 2009. His research spans representation learning; weakly-, semi-, and self-supervised methods; 2D and 3D computer vision; speech recognition; language understanding; and modular deep learning.

He is a co-founder and Chief Scientist of Rembrand, an augmented reality company. Before founding Rembrand, he was a Research Director at FAIR, the Fundamental AI Research group at Meta, a Principal Scientist/Manager at Amazon Alexa AI, and a researcher at Microsoft Research.

Dr. Mohamed has more than 70 journal and conference publications with more than 60,000 citations. He is the recipient of the 2016 IEEE Signal Processing Society Best Journal Paper Award. He has served as a member of the IEEE Speech and Language Processing Technical Committee.

Dr. Karen Livescu received the B.S. in Physics at Princeton University and the Ph.D. at MIT.  Her interests span a variety of topics in spoken, written, and signed language processing, representation learning, and learning from multi-modal signals.

She is currently a Professor at TTI-Chicago. Her group’s work has received multiple awards, including a 2023 ICML Test of Time Honorable Mention and the 2023 IEEE ASRU Best Student Paper award.

Dr. Livescu has served as a program chair/co-chair for ICLR, Interspeech, and ASRU, and is an Associate Editor for TACL and IEEE T-PAMI. She served as a member of the IEEE Speech and Language Technical Committee (SLTC) from 2011 to 2017. She is an ISCA Fellow and a recent IEEE Distinguished Lecturer.

Dr. Hung-yi Lee received the M.S. and Ph.D. degrees from National Taiwan University, Taipei, Taiwan, in 2010 and 2012, respectively. His recent research focuses on developing technology that can reduce the requirement of annotated data for speech processing (including voice conversion and speech recognition) and natural language processing (including abstractive summarization and question answering).

He is a professor of the Department of Electrical Engineering at National Taiwan University (NTU), with a joint appointment at the Department of Computer Science & Information Engineering of the university.

Dr. Lee received the Salesforce Research Deep Learning Grant in 2019, an AWS ML Research Award in 2020, the Outstanding Young Engineer Award from the Chinese Institute of Electrical Engineering in 2018, the Young Scholar Innovation Award from the Foundation for the Advancement of Outstanding Scholarship in 2019, the Ta-You Wu Memorial Award from the Ministry of Science and Technology of Taiwan in 2019, and the 59th Ten Outstanding Young Person Award in Science and Technology Research & Development of Taiwan. He runs a YouTube channel teaching deep learning in Mandarin, with about 210k subscribers.

Dr. Tara Sainath received the S.B., M.Eng., and Ph.D. degrees in electrical engineering and computer science (EECS) from MIT. Her primary research interests are in deep neural networks for speech and audio processing.

She is currently a Distinguished Research Scientist and the co-lead of the Gemini Audio Pillar at Google DeepMind.  There, she focuses on the integration of audio capabilities with large language models (LLMs).

Dr. Sainath is a Fellow of the IEEE and of ISCA and has received awards including the 2021 IEEE SPS Industrial Innovation Award and the 2022 IEEE SPS Signal Processing Magazine Best Paper Award. She has served as a member of the IEEE Speech and Language Processing Technical Committee (SLTC) and as an Associate Editor for the IEEE/ACM Transactions on Audio, Speech, and Language Processing. She served as Program Chair for ICLR (2017, 2018) and has co-organized numerous influential conferences and workshops, including Interspeech (2010, 2016, 2019), ICML (2013, 2017), and NeurIPS 2020.

Dr. Katrin Kirchhoff is a Director of Applied Science at AWS AI Labs, where she leads multiple science efforts on speech and natural language processing, Health AI, and Data Analytics. Previously she was a Research Professor at the Department of Electrical Engineering at the University of Washington, where she co-founded the Signal, Speech and Language Interpretation Lab. Her research interests span multilingual speech processing, NLP, unsupervised learning, and multimodality.

Dr. Kirchhoff has over 20 years of experience in speech processing. She has served as a member of the IEEE Speech and Language Processing Technical Committee (SLTC) and on the Editorial Boards of Speech Communication, the IEEE/ACM Transactions on Audio, Speech, and Language Processing, and Computer Speech and Language.

Dr. Shang-Wen Li received the Ph.D. from MIT in 2016, where he worked in the Spoken Language Systems group of the Computer Science and Artificial Intelligence Laboratory (CSAIL). His recent research focuses on multimodal large language models, multimodal representation learning, and spoken language understanding.

He is a Research Lead and Manager on Meta's Fundamental AI Research (FAIR) team; before joining FAIR, he worked at Apple Siri, Amazon Alexa, and AWS.

Dr. Li has served as an area chair for influential conferences, including ICLR (2023, 2024), NeurIPS (2024), Interspeech (2022), and AAAI (2023).