PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition


Monday, 1 April, 2024
Dr. Qiuqiang Kong

Contributed by Dr. Qiuqiang Kong, based on the IEEE Xplore® article, “PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition,” published in the IEEE/ACM Transactions on Audio, Speech, and Language Processing in October 2020, and the SPS Webinar of the same name, available on the SPS Resource Center.

We find ourselves immersed in a symphony of sounds that carry valuable information about our surroundings and ongoing events. Audio pattern recognition plays a crucial role in identifying these sound events and scenes in our everyday lives. This field is connected with speech processing [1] and encompasses various sub-tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification, and sound event detection. As a pivotal research area within machine learning, audio pattern recognition has garnered growing interest in recent years.

Early work in audio pattern recognition relied on private datasets, using hidden Markov models (HMMs) to categorize three distinct types of sounds: wooden doors opening and closing, dropped metal objects, and poured water [2]. More recently, the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge series [3] has made acoustic scene classification and sound event detection datasets publicly available. Nevertheless, it remained an open question how an audio pattern recognition system would perform when trained on datasets 100 times larger than the DCASE datasets. A pivotal development came with the introduction of AudioSet [4], a dataset containing over 5,000 hours of audio recordings across 527 sound classes. We introduced pretrained audio neural networks (PANNs) [5], which use AudioSet as the training foundation for a variety of audio pattern recognition tasks. PANNs helped close the gap between computer audition and the pretrained models long available in computer vision and natural language processing.

PANNs comprise a suite of convolutional neural networks (CNNs) trained on the 5,800-hour AudioSet dataset. Training them posed several challenges. First, the audio recordings are weakly labeled: each clip carries only binary labels indicating the presence or absence of sound events, without their occurrence times, even though some events, such as gunshots, last only hundreds of milliseconds within a clip. To address this, PANNs adopted an end-to-end training strategy in which the entire 10-second audio clip serves as input and the presence probabilities of the tags are predicted directly under weak-label supervision; a global pooling operation then automatically identifies the regions of interest (ROIs) most likely to contain sound events. Second, AudioSet is heavily imbalanced: sound events such as speech and music occur in more than 40% of the audio clips, while others, such as toothbrush sounds, appear in less than 0.01% of the dataset. PANNs therefore used a balanced training strategy, sampling sound events uniformly across all classes when constituting each mini-batch. Third, sound classes with few training samples tended to overfit. PANNs countered this with spectrogram augmentation, randomly erasing time and frequency stripes, and with the mixup technique, which randomly blends the waveforms and targets of two audio clips to simulate more diverse training data.
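The two augmentations described above are simple to sketch. Below is a minimal NumPy illustration of waveform mixup and time/frequency stripe erasing; the function names and default stripe widths are my own choices for illustration, not the exact values used in the PANNs codebase, which is implemented in PyTorch:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=1.0):
    """Blend two waveforms and their multi-hot target vectors (mixup)."""
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2       # blended waveform
    y = lam * y1 + (1.0 - lam) * y2       # soft targets sum to the same mass
    return x, y

def spec_augment(spec, num_time_stripes=2, num_freq_stripes=2,
                 max_time_width=64, max_freq_width=8):
    """Randomly erase time and frequency stripes from a (time, mel) spectrogram."""
    spec = spec.copy()
    t, f = spec.shape
    for _ in range(num_time_stripes):     # zero out horizontal (time) stripes
        w = int(rng.integers(0, max_time_width))
        start = int(rng.integers(0, max(1, t - w)))
        spec[start:start + w, :] = 0.0
    for _ in range(num_freq_stripes):     # zero out vertical (frequency) stripes
        w = int(rng.integers(0, max_freq_width))
        start = int(rng.integers(0, max(1, f - w)))
        spec[:, start:start + w] = 0.0
    return spec
```

In practice both augmentations are applied on the fly inside the training loop, so each epoch sees a different blend of clips and a different erasure pattern.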

PANNs explored a broad range of CNN architectures, including time-domain CNNs, frequency-domain CNNs, and the hybrid Wavegram-Logmel-CNN, which combines both domains. The investigation spanned CNNs from 6 to 54 layers and found that performance improves with depth when training on large-scale data. Notably, the 14-layer CNN struck a good balance between tagging performance and computational complexity, achieving a mean average precision (mAP) of 0.431 and outperforming the AudioSet baseline system, which scored 0.314. To leverage the strengths of both the time- and frequency-domain systems, PANNs introduced the Wavegram-Logmel-CNN architecture, which combines Wavegram features learned from raw waveforms with logarithmic mel spectrogram features; this integration further raised the tagging mAP to 0.439. Training on the full 5,800-hour dataset yielded substantial gains, with an mAP of 0.431 compared to 0.278 when training on only 1% of the data (58 hours), underscoring the clear benefit of large-scale data for training effective audio pattern recognition systems. PANNs also explored lightweight MobileNet-based systems with only 6.7% of the FLOPs and 5.0% of the parameters of the CNN14 system; despite the reduced computational footprint, these lightweight PANNs achieved an mAP of 0.383, making them well suited for deployment on portable devices with limited computational resources.
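The mAP figures quoted above are computed per sound class and then averaged over the 527 classes. As a small illustrative sketch (not the paper's evaluation code), average precision for one class is the mean of the precision at each correctly retrieved positive, ranked by score:

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at each positive hit in score-ranked order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)   # precision at this recall point
    return sum(precisions) / max(1, sum(labels))

def mean_average_precision(score_matrix, label_matrix):
    """mAP: average precision averaged over classes (columns of the matrices)."""
    num_classes = len(score_matrix[0])
    aps = [average_precision([row[c] for row in score_matrix],
                             [row[c] for row in label_matrix])
           for c in range(num_classes)]
    return sum(aps) / num_classes
```

Because AP is computed per class before averaging, rare classes such as toothbrush sounds carry the same weight in the final mAP as speech or music, which is why the balanced training strategy matters for this metric.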

PANNs have since served as pretrained models for a variety of audio pattern recognition tasks, achieving state-of-the-art results in several of them. As of 2021, notable results included music genre tagging on the GTZAN dataset, audio tagging on the Making Sense of Sounds (MSOS) dataset, and speech emotion recognition on the RAVDESS dataset. Users can employ PANNs either as feature extractors or by fine-tuning them on their own data. PANNs were adopted as official baseline systems for subtasks of the DCASE challenges from 2020 to 2022 and the Holistic Evaluation of Audio Representations (HEAR) challenge in 2021, and have found practical application in industry, for example in the television programme classification system at the BBC [6]. In essence, PANNs are purposefully designed for recognizing the audio patterns of everyday life and serve as robust backbone architectures for a diverse array of audio-related tasks.


[1] H. Meng, “Advancing Technological Equity in Speech and Language Processing: Aspects, Challenges, Successes, and Future Actions,” IEEE SPS Blog, 2020.

[2] J. P. Woodard, “Modeling and classification of natural sounds by product code hidden Markov models,” IEEE Transactions on Signal Processing, vol. 40, no. 7, pp. 1833-1835, July 1992.

[3] DCASE 2022 Challenge, “IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events,” 2022.

[4] J. F. Gemmeke et al., “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 2017, pp. 776-780.

[5] Q. Kong, Y. Cao et al., “PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880-2894, 2020.

[6] L. Pham et al., “An Audio-Based Deep Learning Framework for BBC Television Programme Classification,” in Proc. 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 2021, pp. 56-60.



