Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features

By: Ryandhimas E. Zezario; Szu-Wei Fu; Fei Chen; Chiou-Shann Fuh; Hsin-Min Wang; Yu Tsao

This study proposes a cross-domain multi-objective speech assessment model, called MOSA-Net, which can simultaneously estimate the speech quality, intelligibility, and distortion assessment scores of an input speech signal. MOSA-Net comprises a convolutional neural network and a bidirectional long short-term memory architecture for representation extraction, followed by a multiplicative attention layer and a fully connected layer for each assessment metric prediction. Additionally, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learned (SSL) models are used as inputs, combining rich acoustic information to obtain more accurate assessments. Experimental results show that in both seen and unseen noise environments, MOSA-Net can improve the linear correlation coefficient (LCC) scores in perceptual evaluation of speech quality (PESQ) prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC scores in short-time objective intelligibility (STOI) prediction, compared to STOI-Net, an existing single-task model for STOI prediction. Moreover, MOSA-Net can be used as a pre-trained model and effectively adapted, with a limited amount of training data, into an assessment model for predicting subjective quality and intelligibility scores. Experimental results show that MOSA-Net can improve LCC scores in mean opinion score (MOS) prediction, compared to MOS-SSL, a strong single-task model for MOS prediction. We further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach. Experimental results show that QIA-SE outperforms the baseline SE system, achieving higher PESQ scores in both seen and unseen noise environments.
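
The abstract reports its results as linear correlation coefficient (LCC) scores between predicted and reference assessment scores. As a rough illustration of what that metric measures, the following is a minimal sketch of the standard Pearson correlation in plain Python. The function name `lcc` and the sample scores are assumptions made up for this example; they are not taken from the paper or from the authors' evaluation code.

```python
import math


def lcc(predicted, target):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    mean_p = sum(predicted) / len(predicted)
    mean_t = sum(target) / len(target)
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("LCC is undefined when either sequence is constant")
    return cov / math.sqrt(var_p * var_t)


# Hypothetical scores, invented for illustration only (not results from the paper).
reference_scores = [1.2, 2.0, 2.8, 3.5]
predicted_scores = [1.7, 2.5, 3.3, 4.0]  # each reference score shifted by 0.5

# A constant offset does not change the linear relationship,
# so this value is 1.0 up to floating-point rounding.
print(lcc(predicted_scores, reference_scores))
```

Because LCC only measures linear agreement, a predictor with a constant bias can still score close to 1, which is one reason correlation is often reported together with an error measure.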
