Self-Supervised Language Learning From Raw Audio: Lessons From the Zero Resource Speech Challenge

JSTSP Volume 16 Issue 6

JSTSP Articles

By: Ewan Dunbar; Nicolas Hamilakis; Emmanuel Dupoux

Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply supervised methods to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Such methods have shown success in the natural language processing and computer vision domains, achieving new levels of performance while reducing the number of labels required for many downstream scenarios. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech. Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years. This review presents approaches for self-supervised speech representation learning and their connection to other research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, we review recent efforts on benchmarking learned representations to extend their application beyond speech recognition.

For a long time, language technology has been developed principally using large quantities of textual resources. This makes sense, since, as far as technological applications are concerned, language has primarily been used in written form. When it comes to dealing with spoken language, however, this has given rise to a division of labor between, on the one hand, speech components which aim at converting speech to text or text to speech (ASR, automatic speech recognition, and TTS, text-to-speech synthesis), and, on the other hand, components that perform a variety of language tasks based on text (language understanding, dialogue, language generation). As a result, even speech-first applications like speech-to-speech translation or speech assistants like Alexa or Siri are cobbled together in a Frankensteinian fashion, with some components trained on text and others trained on speech (see Fig. 1(a)), and with all the speech components trained using large amounts of supervision (textual transcription) so that they can communicate with the text-based components. But is this a necessity? Could we build spoken-language-based applications directly from the audio stream without using any text?
