NeurIPS Workshop on IRASL: Interpretability and Robustness in Audio, Speech, and Language
Organizers: Mirco Ravanelli¹ · Dmitriy Serdyuk¹ · Ehsan Variani² · Bhuvana Ramabhadran²
¹ Montreal Institute for Learning Algorithms (MILA), University of Montreal
² Google, Inc.
Spoken language processing has a rich history deeply rooted in information theory, statistical signal processing, digital signal processing, and machine learning. With the rapid rise of deep learning (the “deep learning revolution”), many of these systematic approaches have been replaced by variants of deep neural methods. As more and more of the spoken language processing pipeline is replaced by sophisticated neural layers, feature extraction, adaptation, and noise robustness are learned inherently within the network. More recently, end-to-end frameworks that learn a mapping from speech (audio) to target labels (intents, words, phones, graphemes, sub-word units, etc.) have become increasingly popular across speech processing, in tasks ranging from speech recognition, speaker identification, language/dialect identification, multilingual speech processing, and code switching to natural language processing, speech synthesis, and much more. A key aspect behind the success of deep learning lies in the discovered low- and high-level representations, which can potentially capture relevant underlying structure in the training data. In the NLP domain, for instance, researchers have mapped word and sentence embeddings to semantic and syntactic similarity and argued that the models capture latent representations of meaning.
Nevertheless, recent work on adversarial examples has shown that a neural network (such as a speech recognizer or a speaker verification system) can easily be fooled by adding a small amount of specially constructed noise to its input. Such remarkable sensitivity to adversarial attacks highlights how superficial the discovered representations can be, raising crucial concerns about the actual robustness, security, and interpretability of modern deep neural networks. Such weaknesses lead researchers to ask crucial questions about what these models are really learning, how we can interpret what they have learned, and how the representations provided by current neural networks can be revealed or explained in a fashion that further enhances modeling power.
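To make the idea of "specially constructed noise" concrete, the sketch below applies one step of the fast gradient sign method (a standard attack from the adversarial examples literature, not a method attributed to the workshop) to a toy logistic classifier. Everything here is an illustrative assumption: the weights, the 1000-dimensional input, and the value of `epsilon` are made up, and a real speech system would be a deep network attacked through automatic differentiation rather than a hand-derived gradient.

```python
import math

def predict(weights, bias, x):
    """Probability of the positive class under a logistic (linear) model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_perturb(weights, bias, x, label, epsilon):
    """One fast-gradient-sign step: shift every feature by at most epsilon
    in the direction that increases the cross-entropy loss."""
    p = predict(weights, bias, x)
    # For a logistic model, d(loss)/d(x_i) = (p - label) * w_i.
    grad = [(p - label) * w for w in weights]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy setup: 1000 features with values close to 1.0.
weights = [0.5, -0.5] * 500
bias = 0.0
x = [1.01, 0.99] * 500      # confidently classified as positive
epsilon = 0.015             # each feature changes by about 1.5%

x_adv = fgsm_perturb(weights, bias, x, label=1, epsilon=epsilon)
clean_prob = predict(weights, bias, x)
adv_prob = predict(weights, bias, x_adv)
print(f"clean: {clean_prob:.3f}  adversarial: {adv_prob:.3f}")
```

In this toy case the per-feature change is tiny, but because it is aligned with the model's weights across many dimensions, the accumulated effect is enough to flip the decision. This is one common explanation for why high-dimensional inputs such as audio can be so sensitive to small perturbations.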
These open questions were addressed in the NeurIPS Workshop on Interpretability and Robustness in Audio, Speech, and Language (https://irasl.gitlab.io/), held on December 8, 2018 at the Palais des Congrès de Montréal, with over 300 attendees. The accepted papers covered a broad range of topics, from audio and text processing and classification to machine learning algorithms and their interpretability. The program highlights included invited talks on machine interpretability from Dr. Rich Caruana (Microsoft) and Dr. Jason Yosinski (Uber); on speech signal processing from Dr. Hynek Hermansky (Johns Hopkins University) and Dr. Michiel Bacchiani (Google); on generative models and neural networks from Dr. Ralf Schlüter (RWTH Aachen) and Dr. Erik McDermott (Google); and on text generation and translation from Dr. Mike Schuster (Two Sigma), Dr. Alexander Rush (Harvard University), and Dr. Jason Eisner (Johns Hopkins University). The workshop also included demonstrations of visualization tools for neural networks and the benefits of incorporating them into modeling strategies.
A panel discussion that included researchers from industry and academia, as well as workshop participants, was moderated by Dr. Eisner. The topics of discussion included connections between deep learning and conventional machine learning, linguistic analysis, signal processing, and speech recognition, framed around open-ended key questions.
The workshop highlighted how human speech has evolved to be understood and to fit the properties of human hearing, and how models have followed suit by adapting their signal processing and modeling power to do the same. An emerging set of models presented at the workshop showed how to interpret knowledge gained from statistical models and priors and combine it with the more recent neural approaches. These include deep density neural network models for speech recognition that are interpretable and practical; Neural Hidden Markov Models, which are similar to the popular attention-based mechanisms used in speech and language processing; and conditional neural Hidden Semi-Markov Models and neural FSTs for text generation and morphological reinflection.