Clean vs. Overlapped Speech-Music Detection Using Harmonic-Percussive Features and Multi-Task Learning

Mrinmoy Bhattacharjee; S. R. M. Prasanna; Prithwijit Guha

Detecting speech and music signals in isolated and overlapped conditions is an essential preprocessing step for many audio applications. Speech signals have wavy and continuous harmonics, while music signals exhibit horizontally linear and discontinuous harmonic patterns. Music signals also contain more percussive components than speech signals, which appear as vertical striations in spectrograms. When speech and music overlap, automatic feature learning systems may struggle to extract class-specific horizontal and vertical striations from the combined spectrogram representation. Separating the harmonic and percussive components as a preprocessing step before training might aid the classifier. This work therefore proposes using a harmonic-percussive source separation method to generate features for better detection of speech and music signals. It also explores the traditional and cascaded-information multi-task learning (MTL) frameworks to design better classifiers. An MTL framework aids the training of the main task through simultaneous learning of several related auxiliary tasks. Results are reported on both synthetically generated speech-music overlapped signals and real recordings. Four state-of-the-art approaches are used for performance comparison. Experiments show that the harmonic and percussive decomposition of spectrograms performs better as a feature representation. Moreover, the MTL-based classifiers further improve performance.
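To make the idea of separating horizontal (harmonic) and vertical (percussive) spectrogram structure concrete, the following is a minimal sketch of one common approach, median-filtering-based harmonic-percussive separation with soft masks. The abstract does not state which separation method or parameters the authors use, so the function names (`median_filter_rows`, `hpss`), the filter width, and the toy spectrogram are illustrative assumptions rather than the paper's implementation.

```python
from statistics import median


def median_filter_rows(spec, width):
    """Median-filter every row of a 2-D list along its columns.

    Windows are truncated at the edges instead of being padded.
    """
    half = width // 2
    filtered = []
    for row in spec:
        n = len(row)
        filtered.append(
            [median(row[max(0, t - half):min(n, t + half + 1)]) for t in range(n)]
        )
    return filtered


def transpose(matrix):
    """Swap rows and columns of a 2-D list."""
    return [list(col) for col in zip(*matrix)]


def hpss(spec, width=3, eps=1e-10):
    """Split a magnitude spectrogram into harmonic and percussive parts.

    spec: 2-D list, rows = frequency bins, columns = time frames.
    width: median filter length (hypothetical default, not from the paper).
    Returns (harmonic, percussive), which sum to spec element-wise.
    """
    # Filtering along time keeps horizontal ridges (harmonics).
    harm_enh = median_filter_rows(spec, width)
    # Filtering along frequency keeps vertical ridges (percussive onsets).
    perc_enh = transpose(median_filter_rows(transpose(spec), width))

    harmonic, percussive = [], []
    for s_row, h_row, p_row in zip(spec, harm_enh, perc_enh):
        h_out, p_out = [], []
        for s, h, p in zip(s_row, h_row, p_row):
            # Soft mask: share of energy attributed to the harmonic part.
            mask_h = (h * h) / (h * h + p * p + eps)
            h_out.append(s * mask_h)
            p_out.append(s * (1.0 - mask_h))
        harmonic.append(h_out)
        percussive.append(p_out)
    return harmonic, percussive


# Toy 5x5 spectrogram: one sustained tone (row 2) and one click (column 2).
toy_spec = [[1.0 if f == 2 or t == 2 else 0.0 for t in range(5)] for f in range(5)]
toy_harmonic, toy_percussive = hpss(toy_spec)
```

In this toy case the sustained tone ends up almost entirely in `toy_harmonic` and the click in `toy_percussive`, with the energy at their crossing point shared between the two. A classifier could then be given the two components as separate input channels, which is the general idea the abstract describes.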

Speech and music are the most frequently encountered audio categories in movies, TV shows, web series, and radio broadcasts. Researchers have been tackling the problem of speech vs. music classification for a long time. State-of-the-art methods [1][4] can identify isolated speech and music segments with impressive accuracy. However, in most practical scenarios speech and music often occur as overlapping mixtures. For example, sentimental scenes in movies and TV shows frequently have speech with background music to highlight the scene’s mood. If such segments are not identified beforehand and processed separately, they may degrade the performance of high-level applications like automatic speech recognition and music information retrieval. Hence, this work focuses on discriminating isolated speech and music segments from their overlapping mixtures.
