Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification

You are here

Top Reasons to Join SPS Today!

1. IEEE Signal Processing Magazine
2. Signal Processing Digital Library*
3. Inside Signal Processing Newsletter
4. SPS Resource Center
5. Career advancement & recognition
6. Discounts on conferences and publications
7. Professional networking
8. Communities for students, young professionals, and women
9. Volunteer opportunities
10. Coming soon! PDH/CEU credits
Click here to learn more.

Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification

By: 
Shuai Wang; Zili Huang; Yanmin Qian; Kai Yu

Short duration text-independent speaker verification remains a hot research topic in recent years, and deep neural network based embeddings have shown impressive results in such conditions. Good speaker embeddings require the property of both small intra-class variation and large inter-class difference, which is critical for the ability of discrimination and generalization. Current embedding learning strategies can be grouped into two frameworks: “Cascade embedding learning” with multiple stages and “direct embedding learning” from spectral feature directly. We propose new approaches to achieve more discriminant speaker embeddings. Within the cascade framework, a neural network based deep discriminant analysis (DDA) is proposed to project i-vector to more discriminative embeddings. Within the direct embedding framework, a deep model with more advanced center loss and A-softmax loss is used, the focal loss is also investigated in this framework. Moreover, the traditional i-vector and neural embeddings are finally combined with neural network based DDA to achieve further gain. Main experiments are carried out on a short-duration text-independent speaker verification dataset generated from the SRE corpus. The results show that the newly proposed method is promising for short-duration text-independent speaker verification, and it is consistently better than traditional i-vector and neural embedding baselines. The best embeddings achieve roughly 30% relative EER reduction compared to the i-vector baseline, which could be further enhanced when combined with the i-vector system.

SPS on Twitter

SPS Videos


Signal Processing in Home Assistants

 


Multimedia Forensics


Careers in Signal Processing             

 


Under the Radar