Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

You are here

Top Reasons to Join SPS Today!

1. IEEE Signal Processing Magazine
2. Signal Processing Digital Library*
3. Inside Signal Processing Newsletter
4. SPS Resource Center
5. Career advancement & recognition
6. Discounts on conferences and publications
7. Professional networking
8. Communities for students, young professionals, and women
9. Volunteer opportunities
10. Coming soon! PDH/CEU credits
Click here to learn more.

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

Jing Zhao; Wei-Qiang Zhang

Speech self-supervised learning has attracted much attention due to its promising performance in multiple downstream tasks, and has become a new growth engine for speech recognition in low-resource languages. In this paper, we exploit and analyze a series of wav2vec pre-trained models for speech recognition in 15 low-resource languages in the OpenASR21 Challenge. The investigation covers two important variables during pre-training, three fine-tuning methods, as well as applications in End-to-End and hybrid systems. First, pre-trained models with different pre-training audio data and architectures (wav2vec2.0, HuBERT and WavLM) are explored for their speech recognition performance in low-resource languages. Second, we investigate data utilization, multilingual learning, and the use of a phoneme-level recognition task in fine-tuning. Furthermore, we explore what effect fine-tuning has on the similarity of representations extracted from different transformer layers. The similarity analyses cover different pre-trained architectures and fine-tuning languages. We apply pre-trained representations to End-to-End and hybrid systems to confirm our representation analyses, which have obtained better performances as well.

Low-resource languages occupy a large proportion of the languages in the world as 94% of languages are spoken by fewer than 1,000,000 people [1]. It is urgent and necessary to pay attention to the research on these languages to conserve the languages as well as the corresponding cultural heritages. Automatic Speech Recognition (ASR) in low-resource languages remains challenging. There are a series of studies focusing on the low-resource problem [2][7]. Compared with common languages, it is much more challenging to build an applicable ASR system for low-resource languages due to the lack of transcribed speech data, language scripts and pronunciation lexicons.

SPS on Twitter

  • New SPS Webinar: On Wednesday, 8 February, join Dr. Roula Nassif for "Decentralized learning over multitask graphs"…
  • CALL FOR PAPERS: IEEE Signal Processing Magazine welcomes submissions for a Special Issue on Hypercomplex Signal an…
  • New SPS Webinar: On 15 February, join Mr. Wei Liu, Dr. Li Chen and Dr. Wenyi Zhang presenting "Decentralized Federa…
  • New SPS Webinar: On Monday, 13 February, join Dr. Joe (Zhou) Ren when he presents "Human Centric Visual Analysis -…
  • Help us illustrate the SPS story! In honor of our 75th anniversary, we need your support to capture the people, mem…

SPS Videos

Signal Processing in Home Assistants


Multimedia Forensics

Careers in Signal Processing             


Under the Radar