The papers in this special section focus on self-supervised learning for speech and audio processing. A current trend in the machine learning community is the adoption of self-supervised approaches to pretrain deep networks. Self-supervised learning utilizes proxy-supervised learning tasks (or pretext tasks)—for example, distinguishing parts of the input signal from distractors or reconstructing masked input segments conditioned on unmasked segments—to obtain training data from unlabeled corpora. These approaches make it possible to use the tremendous amount of unlabeled data available on the web to train large neural models. Recent self-supervised approaches for speech and audio processing are also gaining attention.
There have already been special sessions at INTERSPEECH 2020 and relevant workshops at recent machine learning conferences, including ICML 2020, NeurIPS 2020, and AAAI 2022.
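The masked-reconstruction pretext task mentioned above can be illustrated with a minimal sketch: labels are derived from the unlabeled signal itself by hiding random frames and scoring the reconstruction only at the hidden positions. The "encoder" below (the mean of the unmasked frames) is a hypothetical stand-in used only to make the loss computable; a real system would use a deep network.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss(features, mask_ratio=0.15):
    """L1 loss on masked frames only; targets come from the input itself."""
    T, D = features.shape
    mask = rng.random(T) < mask_ratio        # choose which frames to hide
    corrupted = features.copy()
    corrupted[mask] = 0.0                    # zero out the masked frames
    # Hypothetical "encoder": predict every masked frame as the mean of
    # the unmasked frames, conditioned only on what the model can see.
    prediction = corrupted[~mask].mean(axis=0, keepdims=True)
    prediction = np.repeat(prediction, mask.sum(), axis=0)
    # Score reconstruction only where the signal was hidden.
    return float(np.abs(prediction - features[mask]).mean())

feats = rng.standard_normal((100, 40))       # 100 frames, 40-dim features
loss = masked_reconstruction_loss(feats)
```

No human annotation enters the loop: the supervision signal is manufactured from the unlabeled audio features themselves, which is what lets such objectives scale to web-sized corpora.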