Date: 12-August-2026
Time: 11:30 AM ET (New York Time)
Presenter: Dr. Haohe Liu
Based on the IEEE Xplore® article of the same title
Published in IEEE Transactions on Audio, Speech, and Language Processing, May 2024.
Download article: The original article will be made publicly available for download for 48 hours, starting on the day of the webinar.
Abstract
AudioLDM 2 is a holistic framework for unified audio generation that produces speech, music, and sound effects using a single model. Unlike prior approaches that require separate architectures with task-specific designs for each audio type, AudioLDM 2 introduces a general audio representation called the "language of audio" (LOA), learned through AudioMAE, a self-supervised pretrained model. During generation, a GPT-2 model translates input conditions such as text into LOA, which then guides a latent diffusion model to synthesize high-quality audio. This design unifies diverse audio generation tasks under one framework while enabling advantages such as in-context learning and reusable pretrained components.
The talk will cover the motivation behind holistic audio generation, the AudioLDM 2 architecture, key experimental findings, practical lessons learned from building a unified audio generation system, and recent advances in this rapidly evolving field.
Biography
Haohe Liu (M’26) received the Ph.D. degree from the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, U.K., in 2026, supervised by Prof. Mark D. Plumbley and Prof. Wenwu Wang.
He is currently a Research Scientist at Meta SuperIntelligence Lab (FAIR), Seattle, WA, USA. His research spans generative AI for audio, speech, and music, with a focus on developing foundational models that address core machine learning challenges. He is best known as the creator of the AudioLDM series. His open-source projects, including VoiceFixer, AudioSR, and AudioLDM, have collectively received over 11,000 GitHub stars, and his work has received over 5,600 citations.
Dr. Liu received the Postgraduate Researcher of the Year 2024 Award from CVSSP, the Best Technical Paper Award at the 159th AES Convention, and the Judges' Award in the DCASE 2023 Foley Sound Synthesis Challenge. His work has been published at venues including ICML, NeurIPS, AAAI, TPAMI, JSTSP, ICASSP, INTERSPEECH, and IEEE/ACM Transactions on Audio, Speech, and Language Processing.
