Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

By: Jing Zhao; Wei-Qiang Zhang

Self-supervised learning for speech has attracted much attention because of its promising performance on multiple downstream tasks, and it has become a new growth engine for speech recognition in low-resource languages. In this paper, we exploit and analyze a series of wav2vec pre-trained models for speech recognition in 15 low-resource languages from the OpenASR21 Challenge. The investigation covers two important variables in pre-training, three fine-tuning methods, and applications in End-to-End and hybrid systems. First, we explore how pre-trained models that differ in pre-training audio data and architecture (wav2vec2.0, HuBERT and WavLM) perform on speech recognition in low-resource languages. Second, we investigate data utilization, multilingual learning, and the use of a phoneme-level recognition task in fine-tuning. Furthermore, we explore how fine-tuning affects the similarity of representations extracted from different transformer layers. The similarity analyses cover different pre-trained architectures and fine-tuning languages. Finally, we apply pre-trained representations to End-to-End and hybrid systems to confirm our representation analyses; these systems also obtain better performance.
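To make the layer-wise similarity analysis mentioned in the abstract more concrete, the following is a minimal sketch of how representations from matching transformer layers might be compared before and after fine-tuning. The abstract does not say which similarity measure the authors use; linear CKA (centered kernel alignment) is assumed here purely for illustration. The function names `linear_cka` and `layer_similarity` are hypothetical, and the random arrays stand in for real per-layer features.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two (frames, dims) feature matrices.

    Assumption: linear CKA is an illustrative choice of similarity
    measure, not necessarily the one used in the paper.
    """
    # Center each feature dimension over the frame axis.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def layer_similarity(pretrained_layers, finetuned_layers):
    """Compare matching layers of a pre-trained and a fine-tuned model."""
    return [linear_cka(p, f) for p, f in zip(pretrained_layers, finetuned_layers)]


# Synthetic stand-ins for per-layer features (200 frames, 16 dimensions).
rng = np.random.default_rng(0)
pretrained = [rng.standard_normal((200, 16)) for _ in range(3)]
finetuned = [
    pretrained[0].copy(),                                  # layer unchanged
    pretrained[1] + 0.5 * rng.standard_normal((200, 16)),  # layer mildly changed
    rng.standard_normal((200, 16)),                        # layer fully changed
]
scores = layer_similarity(pretrained, finetuned)
print(scores)
```

In this toy setup, a layer that fine-tuning leaves untouched scores close to 1, while a layer whose features are replaced by unrelated ones scores much lower. With a real model, the per-layer features would come from the hidden states of the pre-trained and fine-tuned networks on the same audio.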

Low-resource languages make up a large proportion of the world's languages: 94% of languages are spoken by fewer than 1,000,000 people [1]. Research on these languages is urgently needed to conserve both the languages and the cultural heritage they carry. Automatic Speech Recognition (ASR) in low-resource languages remains challenging, and a series of studies have focused on the low-resource problem [2][7]. Compared with widely spoken languages, building a usable ASR system for a low-resource language is much harder because transcribed speech data, language scripts and pronunciation lexicons are lacking.
