SPS Webinar: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

Date: 29 February 2024
Time: 8:00 AM ET (New York Time)
Presenter(s): Dr. Qiuqiang Kong

Original article: Download Open Access article
The original article will be made freely available for download on the day of the webinar for 48 hours.

Abstract

Audio pattern recognition is an important research topic in machine learning and includes tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification, and sound event detection. In this webinar, the presenter will introduce pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, which can be transferred to other audio-related tasks. The presenter will examine the performance and computational complexity of PANNs built from a variety of convolutional neural networks, and will present an architecture called Wavegram-Logmel-CNN that uses both the log-mel spectrogram and the raw waveform as input features. The best PANN system achieves a state-of-the-art mean average precision (mAP) of 0.439 on AudioSet tagging, outperforming the previous best system's 0.392. The PANNs are transferred to six audio pattern recognition tasks, achieving state-of-the-art performance on several of them.
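The two-branch idea behind Wavegram-Logmel-CNN can be illustrated with a minimal PyTorch sketch: one branch learns a time-frequency-like representation directly from the raw waveform with 1-D convolutions, while the other processes a precomputed log-mel spectrogram with 2-D convolutions; the pooled embeddings are fused for multi-label tagging. All layer sizes and the class name here are illustrative assumptions, not the published PANN architecture.

```python
import torch
import torch.nn as nn


class TwoBranchAudioTagger(nn.Module):
    """Illustrative two-branch audio tagger (not the official PANN code).

    One branch consumes the raw waveform, the other a log-mel
    spectrogram; their globally pooled embeddings are concatenated
    before a sigmoid classifier for multi-label tagging.
    """

    def __init__(self, n_classes: int = 527):  # 527 = AudioSet classes
        super().__init__()
        # Waveform branch: strided 1-D convs downsample the raw signal.
        self.wave_branch = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=11, stride=5, padding=5),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global average pooling over time
        )
        # Log-mel branch: 2-D convs over the (mel, time) plane.
        self.mel_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling over mel x time
        )
        self.classifier = nn.Linear(64 + 64, n_classes)

    def forward(self, waveform: torch.Tensor, logmel: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples); logmel: (batch, n_mels, frames)
        w = self.wave_branch(waveform.unsqueeze(1)).flatten(1)
        m = self.mel_branch(logmel.unsqueeze(1)).flatten(1)
        logits = self.classifier(torch.cat([w, m], dim=1))
        return torch.sigmoid(logits)  # per-clip tag probabilities


# Example: a batch of two 1-second clips at 16 kHz, with matching
# 64-mel log-mel inputs (101 frames is an arbitrary example length).
model = TwoBranchAudioTagger()
probs = model(torch.randn(2, 16000), torch.randn(2, 64, 101))
print(probs.shape)  # torch.Size([2, 527])
```

Fusing the two branches lets the model combine a fixed, well-understood log-mel front end with a learned waveform front end, which is the design motivation the webinar's Wavegram-Logmel-CNN explores.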

Biography

Qiuqiang Kong received the Ph.D. degree from the University of Surrey, Guildford, UK, in 2019. Following his Ph.D., he joined ByteDance as a research scientist.

He is currently an assistant professor in the EE department of the Chinese University of Hong Kong. His research interests include the classification, detection, separation, and generation of general sounds and music.

Dr. Kong was listed among the top 2% of scientists in the 2021 "Updated science-wide author databases of standardized citation indicators." He is known for developing pretrained audio neural networks (PANNs) for audio tagging, for which he received the IEEE SPS Young Author Best Paper Award in 2023. He won the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge in 2017, and he transcribed GiantMIDI-Piano, the largest piano MIDI dataset in the world. He has co-authored over 50 papers in journals and conferences, including IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), ICASSP, INTERSPEECH, IJCAI, DCASE, EUSIPCO, and LVA-ICA. As of September 2023, his work has been cited 3,156 times, with an h-index of 28. He is a frequent reviewer for well-known journals and conferences, including TASLP, TMM, SPL, TKDD, JASM, EURASIP, Neurocomputing, Neural Networks, ISMIR, and CSMT. He assisted with organizing LVA-ICA 2018 in Guildford, UK, and the DCASE 2018 Workshop in Woking, UK. He serves as a co-editor for the Frontiers in Signal Processing journal.