Advancing Technological Equity in Speech and Language Processing: Aspects, Challenges, Successes, and Future Actions


Thursday, June 23, 2022
Dr. Helen Meng

Contributed by Dr. Helen Meng, based on a Plenary Talk at IEEE ICASSP 2021.


Recent years have seen great strides in the R&D of speech and language technologies. As these technologies continue to permeate our daily lives, they need to support diverse users and usage contexts, especially those with inputs that deviate from the mainstream. Examples include non-native speakers, code-switched speech, speech carrying a myriad of emotions and styles, and speakers with impairments and disorders. State-of-the-art technologies often suffer performance degradation in the face of such users and usage contexts and cannot sufficiently meet their needs. The problem is rooted in data scarcity and sparsity in training the models and algorithms.

For example, it is very difficult to find speakers with speech disorders to record data, and even if we do, their condition will quickly cause exhaustion as they record for us. Also, the problem is exacerbated by high variability, due to the etiology of a disorder, different levels of severity, broad ranges in regional accents with different gradations, continual biological changes such as aging, and progressive pathological changes such as neurocognitive disorder or dementia.

Hence, data scarcity, data sparsity, and high data variability lead to technological inequity, but this disadvantage is remediable. To handle this low data resource challenge, approaches for knowledge transfer can be positioned at various levels of the speech and language processing pipeline, including leveraging through data, representations, and models.

Data Leverage

Data leverage through data augmentation has become widely used for dealing with data sparsity and improving the performance of acoustic deep learning models for Automatic Speech Recognition (ASR). Dysarthric Speech Recognition (DSR) [1] and older adult speech recognition are highly challenging areas of research in which to study data augmentation. Dysarthric speech may be caused by strokes, neurological diseases, or other conditions that affect neuro-motor control. These conditions create a large mismatch against normal speech and lead to speech with nonstandard articulation, unstable prosody, and reduced intelligibility. The unstable conditions of people with speech impairments often make it difficult to collect a large amount of dysarthric data for training deep neural network models. In such a scenario, both disordered speech and normal speech are used in the augmentation process. Three common perturbation techniques for data augmentation are perturbing the vocal tract length, the tempo, and the speed of speech signals. Normal speech is transformed into disorder-like speech, and existing disordered speech is used to generate more disordered speech. When data augmentation is used to help train dysarthric speech recognition systems, a significant reduction in word error rate is achieved. We have also applied this approach in developing technologies for older adult speech recognition, to support the automatic (early) screening of Alzheimer’s disease. Preliminary results demonstrate the feasibility of a fully automatic pipeline [2], [3].
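
To make the idea of speed perturbation concrete, here is a minimal sketch in plain Python. It is an illustration only, not the pipeline used in [1]: the function names, the linear-interpolation resampler, and the factors 0.9 and 1.1 are all assumptions chosen for brevity. Speed perturbation resamples the waveform, so it changes both duration and pitch; tempo perturbation (duration only) and vocal tract length perturbation (frequency warping) need additional signal processing that is not shown.

```python
def speed_perturb(samples, factor):
    """Resample a waveform so that it plays `factor` times faster.

    Uses linear interpolation between neighbouring samples (an assumed,
    deliberately simple resampler). factor > 1 shortens the signal,
    factor < 1 lengthens it.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    last = len(samples) - 1
    out_len = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(out_len):
        pos = min(i * factor, last)   # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


def augment(samples, factors=(0.9, 1.0, 1.1)):
    """Return one perturbed copy of the utterance per speed factor."""
    return {factor: speed_perturb(samples, factor) for factor in factors}
```

Calling `augment(waveform)` on a list of samples yields three versions of the same utterance, which can all be added to the training set with the original transcript.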

Representation Leverage

Aside from leveraging data, we can also leverage the representation of spoken language. One example is our recent approach for mispronunciation detection and diagnosis (MDD) of Chinese-accented English speech for computer-aided pronunciation training (CAPT). We have devoted much effort to collecting Chinese-accented English speech to support our research. In adapting ASR technologies for MDD, our baseline approach used CNN-RNN-CTC with a training set of 24 hours and a test set of seven hours of Chinese-accented English speech [6]. We also augmented the data with five hours of native American English speech from the TIMIT corpus. We took the first step in MDD research to use a self-supervised pre-trained speech representation, Facebook AI’s wav2vec 2.0, which is trained on over 900 hours of unlabeled raw audio through self-supervised learning, and then further trained on the 24 hours of accented speech data mentioned above [7]. We found that leveraging the self-supervised pre-trained speech representation drastically reduced the phone error, false rejection, and diagnostic error rates, each by approximately 50%.
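
For readers unfamiliar with the MDD metrics named above, the following sketch shows how false rejection and diagnostic error rates are commonly computed. It assumes, for simplicity, that the canonical, annotated (what the learner actually said), and recognized phone sequences have already been aligned one-to-one; the function name and input format are hypothetical and are not taken from [6] or [7].

```python
def mdd_metrics(aligned):
    """Compute MDD rates from aligned (canonical, annotated, recognized) phones.

    canonical  : the phone the learner should have produced
    annotated  : the phone a human annotator heard
    recognized : the phone the system output
    """
    true_accept = false_reject = false_accept = 0
    correct_diag = diag_error = 0
    for canonical, annotated, recognized in aligned:
        pronounced_correctly = canonical == annotated
        accepted = recognized == canonical
        if pronounced_correctly and accepted:
            true_accept += 1
        elif pronounced_correctly:
            false_reject += 1      # correct phone flagged as an error
        elif accepted:
            false_accept += 1      # mispronunciation missed
        elif recognized == annotated:
            correct_diag += 1      # error detected and identified correctly
        else:
            diag_error += 1        # error detected but misidentified

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    mispronounced = false_accept + correct_diag + diag_error
    return {
        "false_rejection_rate": ratio(false_reject, true_accept + false_reject),
        "false_acceptance_rate": ratio(false_accept, mispronounced),
        "diagnostic_error_rate": ratio(diag_error, correct_diag + diag_error),
    }
```

A false rejection is the costly case for a learner, since the system criticizes a pronunciation that was in fact correct.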

Another example of representation leverage is applied in speaker diarization, i.e., identifying who spoke when in a recorded exchange between a clinician and an older adult participant, where the dialog is targeted for cognitive assessment. At the start of our project, the training data for diarization came from NIST SRE and Switchboard, and there was a serious lack of older adult speech data for training. So, we adopted domain-adversarial training of neural networks to devise an adversarially trained, age-invariant speaker representation [4], which has thus far achieved the best performance in speaker diarization in our work.
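
The core mechanism of domain-adversarial training is gradient reversal: the shared encoder descends the speaker-classification loss while ascending the age-classification loss, so that age information is pushed out of the embedding. The sketch below shows only that update rule on a flat parameter list. The function names, the weighting `lam`, and the learning rate are illustrative assumptions and do not describe the system in our work.

```python
def reversed_gradient(grad_age, lam):
    """Gradient reversal: flip and scale the age classifier's gradient."""
    return [-lam * g for g in grad_age]


def encoder_update(params, grad_speaker, grad_age, lam=1.0, lr=0.1):
    """One gradient step for the shared encoder parameters.

    grad_speaker : gradient of the speaker loss w.r.t. the encoder parameters
    grad_age     : gradient of the age loss w.r.t. the encoder parameters
    The encoder follows the speaker gradient but the reversed age gradient.
    """
    reversed_age = reversed_gradient(grad_age, lam)
    return [
        p - lr * (gs + ga)
        for p, gs, ga in zip(params, grad_speaker, reversed_age)
    ]
```

The age classifier itself is still trained normally on its own loss; only the gradient flowing back into the encoder is reversed.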

Model Leverage

In addition to the use of data and representations, we can also leverage knowledge through modelling. For example, we have applied meta-learning to re-initialize a base model using dysarthric data, so that the re-initialized model can quickly adapt to unseen dysarthric speakers with improved recognition performance [8]. We have also worked on the special problem of dysarthric speech reconstruction, which aims to accept dysarthric speech as input and automatically transform it into normal speech, using articulation reconstruction. Here, we apply the technique of knowledge distillation, where the text encoder in a text-to-speech (TTS) synthesis system trained on a sizeable normal speech corpus is used as the teacher neural network to guide the student neural network, which is a speech encoder that faces the challenge of scarce dysarthric speech training data. We also apply multi-task learning to further enhance the speech encoder by imparting knowledge of speech-to-character mappings from transcribed dysarthric speech. The speech encoder, enhanced with knowledge distillation and multi-task learning, can generate spectral embeddings from previously unseen dysarthric speech. The speech encoder can be further integrated with the generator from the TTS system to produce a normal-sounding version of the input dysarthric speech. Our experiments show that this approach outperforms a reference baseline approach that integrates direct dysarthric speech recognition with TTS for reconstruction [9].
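
As a rough illustration of how knowledge distillation and multi-task learning can be combined into one training objective, consider the sketch below. It assumes a mean-squared-error distillation term between the teacher (text encoder) and student (speech encoder) embeddings, plus a weighted cross-entropy term for character prediction. The exact losses and weights used in [9] may differ; all names and the weight `alpha` here are hypothetical.

```python
import math


def mse(teacher_embedding, student_embedding):
    """Distillation term: how far the student embedding is from the teacher's."""
    if len(teacher_embedding) != len(student_embedding):
        raise ValueError("embeddings must have the same dimension")
    squared = [(t - s) ** 2 for t, s in zip(teacher_embedding, student_embedding)]
    return sum(squared) / len(squared)


def cross_entropy(char_probs, target_index):
    """Auxiliary term: negative log-probability of the correct character."""
    return -math.log(char_probs[target_index])


def student_loss(teacher_embedding, student_embedding,
                 char_probs, target_index, alpha=0.5):
    """Total loss for the speech encoder: distillation + weighted auxiliary task."""
    distill = mse(teacher_embedding, student_embedding)
    auxiliary = cross_entropy(char_probs, target_index)
    return distill + alpha * auxiliary
```

The teacher's parameters stay fixed; only the student speech encoder is updated to reduce this loss.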

Another interesting example of model leverage is applied to code-switched text-to-speech synthesis. Monolingual TTS synthesis has attained high quality in the naturalness and intelligibility of synthetic speech. But because code-switching between languages often occurs in everyday usage, we wish to extend TTS technologies to handle code-switched speech well. In this regard, we need to find a proper speaker-independent phonetic representation for multiple languages in a compact feature space, and the representation needs to disentangle language and speaker characteristics; we consider that Phonetic Posteriorgrams (PPGs) can be such a representation. A PPG is a time-versus-class matrix representing the posterior probabilities of each phonetic class for each specific time frame of one utterance. We have found PPGs to be very effective in our work on voice conversion [10]. We reference Google’s work on Tacotron2-based cross-lingual voice cloning, but adapt it with only 10 hours each of English and Chinese monolingual speech data. This is made possible by the incorporation of bilingual PPGs, which capture the articulation of speech sounds in a speaker-independent form and capture the phonetic information of both languages in the same feature space [11].
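
The time-versus-class structure of a PPG can be sketched as follows. Each frame's phonetic-class scores are normalized into posterior probabilities with a softmax, giving one row per frame. The bilingual variant here simply concatenates the English and Chinese posteriors frame by frame; this construction, the function names, and the assumption that both recognizers produce the same number of frames are illustrative and are not confirmed details of [11].

```python
import math


def softmax(logits):
    """Turn one frame's class scores into posterior probabilities."""
    peak = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ppg(frame_logits):
    """Time-versus-class matrix: one row of posteriors per time frame."""
    return [softmax(frame) for frame in frame_logits]


def bilingual_ppg(english_logits, chinese_logits):
    """Concatenate per-frame posteriors from two monolingual recognizers."""
    if len(english_logits) != len(chinese_logits):
        raise ValueError("both inputs must cover the same time frames")
    english = ppg(english_logits)
    chinese = ppg(chinese_logits)
    return [e_row + c_row for e_row, c_row in zip(english, chinese)]
```

Because the rows contain class posteriors rather than spectral detail, they carry little speaker identity, which is what makes them useful as a speaker-independent input to a synthesizer.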


Looking to the future, we must strive to develop speech and language technologies that can gracefully adapt to and accommodate a diversity of users and usage contexts around the world. Additionally, we are mindful that a large proportion of the world's languages suffer from data scarcity and sparsity, and many languages have no written form at all. These endangered languages are disappearing, and we must try to preserve them. In the coming decades and beyond, we should devote our efforts to advancing technological equity towards universal usability of speech and language technologies.


[1] Geng, Mengzhe, Xurong Xie, Shansong Liu, Jianwei Yu, Shoukang Hu, Xunying Liu, and Helen Meng, “Investigation of Data Augmentation Techniques for Disordered Speech Recognition,” presented at INTERSPEECH, 2020.

[2] Ye, Zi, Shoukang Hu, Jinchao Li, Xurong Xie, Mengzhe Geng, Jianwei Yu, Junhao Xu, Boyang Xue, Shansong Liu, Xunying Liu, Helen Meng, “Development of the CUHK Elderly Speech Recognition System for Neurocognitive Disorder Detection using the DementiaBank Corpus,” in Int. Conf. Acoust., Speech, Signal Processing (ICASSP), 2021, DOI: 10.1109/ICASSP39728.2021.9413634.

[3] Li, Jinchao, Jianwei Yu, Zi Ye, Simon Wong, Manwai Mak, Brian Mak, Xunying Liu, and Helen Meng, “A Comparative Study of Acoustic and Linguistic Features Classification for Alzheimer's Disease Detection,” in ICASSP, 2021, DOI: 10.1109/ICASSP39728.2021.9414147.

[4] Xu, Sean Shensheng, Man-Wai Mak, Ka Ho Wong, Helen Meng, and Timothy CY Kwok, “Age-Invariant Speaker Embedding for Diarization of Cognitive Assessments,” in Proc. Int. Symp. Chinese Spoken Language Process. (ISCSLP), 2021.

[5] K. Li and H. Meng, "Mispronunciation detection and diagnosis in l2 English speech using multi-distribution Deep Neural Networks," in Proc. 9th Int. Symp. Chinese Spoken Language Process., 2014, pp. 255-259, DOI: 10.1109/ISCSLP.2014.6936724.

[6] W. Leung, X. Liu and H. Meng, "CNN-RNN-CTC Based End-to-end Mispronunciation Detection and Diagnosis," in Proc. ICASSP 2019 IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), 2019, pp. 8132-8136, DOI: 10.1109/ICASSP.2019.8682654.

[7] Wu, Minglin, Kun Li, Wai-Kim Leung, and Helen Meng, “Transformer Based End-to-End Mispronunciation Detection and Diagnosis,” presented at INTERSPEECH, 2021.

[8] D. Wang, J. Yu, X. Wu, L. Sun, X. Liu and H. Meng, "Improved End-to-End Dysarthric Speech Recognition via Meta-learning Based Model Re-initialization," 2021 12th Int. Symp. Chinese Spoken Language Process. (ISCSLP), 2021, pp. 1-5, DOI: 10.1109/ISCSLP49672.2021.9362068.

[9] Wang, Disong, Jianwei Yu, Xixin Wu, Songxiang Liu, Lifa Sun, Xunying Liu, and Helen Meng, "End-To-End Voice Conversion Via Cross-Modal Knowledge Distillation for Dysarthric Speech Reconstruction," in Proc. ICASSP 2020 IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), 2020, pp. 7744-7748, DOI: 10.1109/ICASSP40776.2020.9054596.

[10] Lifa Sun, Kun Li, Hao Wang, Shiyin Kang and Helen Meng, “Phonetic Posteriorgrams for Many-to-One Voice Conversion without Parallel Data Training,” in Proc. IEEE Int. Conf. Multimedia Expo. (ICME 2016), Seattle, USA, 11-15 July, 2016 (Best Paper Award), DOI: 10.1109/ICME.2016.7552917.

[11] Cao, Yuewen, Songxiang Liu, Xixin Wu, Shiyin Kang, Peng Liu, Zhiyong Wu, Xunying Liu, Dan Su, Dong Yu, and Helen Meng, “Code-switched Speech Synthesis using Bilingual Phonetic Posteriorgram with Only Monolingual Corpora,” presented at IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), 2020.


