Generative Model Driven Representation Learning in a Hybrid Framework for Environmental Audio Scene and Sound Event Recognition

You are here

Top Reasons to Join SPS Today!

1. IEEE Signal Processing Magazine
2. Signal Processing Digital Library*
3. Inside Signal Processing Newsletter
4. SPS Resource Center
5. Career advancement & recognition
6. Discounts on conferences and publications
7. Professional networking
8. Communities for students, young professionals, and women
9. Volunteer opportunities
10. Coming soon! PDH/CEU credits
Click here to learn more.

Generative Model Driven Representation Learning in a Hybrid Framework for Environmental Audio Scene and Sound Event Recognition

S. Chandrakala; S. L. Jayalakshmi

The analysis of sound information is helpful for audio surveillance, multimedia information retrieval, audio tagging, and forensic applications. Environmental audio scene recognition (EASR) and sound event recognition (SER) for audio surveillance are challenging tasks due to the presence of multiple sound sources, background noises, and the existence of overlapping or polyphonic contexts. We focus on learning robust and compact representations for environmental audio scenes and sound events using mel-frequency cepstral coefficients as basic features, which have proved to be effective in speech and audio-related tasks. In this paper, we propose a common hybrid model-based framework that learns representations with the help of generative models. We explore instance-specific adapted Gaussian mixture models for environmental audio scenes and instance-specific hidden Markov models for sound events to compute a robust, compact, and discriminatory representations. A discriminative model based classifier is then used to recognize these representations as environmental audio scenes and sound events. The performance of the proposed approaches is evaluated using the DCASE2013 scene dataset and TUT-DCASE2016 scene dataset for EASR task. Environmental Sound Classification (ESC-10) and UrbanSound8K datasets are used for SER task. The recognition accuracy of the proposed framework is significantly better than many of the state-of-the-art approaches proposed in the recent literature. The discriminative nature of the model-driven representations leads to improved efficiency for EASR and SER task. The proposed approaches are more suitable for tasks with less training data.

SPS on Twitter

  • THIS FRIDAY: Join our Vice President-Membership, K.V.S. Hari, and Membership Development Committee Chair, Arash Moh…
  • The SPACE webinar series continues tomorrow, Tuesday, 11 August at 11 AM ET with Dr. Xiao Xiang Zhu presenting "Dat…
  • now accepting submissions for special sessions, tutorials, and papers! The conference is set for June 2…
  • DEADLINE EXTENDED: The IEEE Journal of Selected Topics in Signal Processing is now accepting papers for a Special I…
  • NEW WEBINAR: Join us on Friday, 14 August at 11:00 AM ET for the 2021 SPS Membership Preview! Society leadership wi…

SPS Videos

Signal Processing in Home Assistants


Multimedia Forensics

Careers in Signal Processing             


Under the Radar