NeurIPS Workshop on IRASL: Interpretability and Robustness in Audio, Speech, and Language
Organizers: Mirco Ravanelli1 · Dmitriy Serdyuk1 · Ehsan Variani2 · Bhuvana Ramabhadran2
1 Montreal Institute for Learning Algorithms (MILA), University of Montreal
2 Google, Inc.
Spoken language processing has a rich history deeply rooted in information theory, statistical signal processing, digital signal processing, and machine learning. With the rapid rise of deep learning (the “deep learning revolution”), many of these systematic approaches have been replaced by variants of deep neural methods. As more and more of the spoken language processing pipeline is replaced by sophisticated neural layers, feature extraction, adaptation, and noise robustness are learned inherently within the network. More recently, end-to-end frameworks that learn a mapping from speech (audio) to target labels (intents, words, phones, graphemes, sub-word units, etc.) have become increasingly popular across speech processing, in tasks ranging from speech recognition, speaker identification, and language/dialect identification to multilingual speech processing, code switching, natural language processing, and speech synthesis. A key aspect behind the success of deep learning lies in the discovered low- and high-level representations, which can potentially capture relevant underlying structure in the training data. In the NLP domain, for instance, researchers have mapped word and sentence embeddings to semantic and syntactic similarity and argued that the models capture latent representations of meaning.
Nevertheless, recent work on adversarial examples has shown that it is possible to fool a neural network (such as a speech recognizer or a speaker verification system) simply by adding a small amount of specially constructed noise. Such a remarkable sensitivity to adversarial attacks highlights how superficial the discovered representations can be, raising serious concerns about the actual robustness, security, and interpretability of modern deep neural networks. These weaknesses lead researchers to ask crucial questions about what these models are really learning, how we can interpret what they have learned, and how the representations provided by current neural networks can be revealed or explained in a way that further enhances modeling power.
These open questions were addressed in the NeurIPS Workshop on Interpretability and Robustness in Audio, Speech, and Language (https://irasl.gitlab.io/), held on December 8, 2018 at the Palais des Congrès de Montréal, with over 300 attendees. The accepted papers covered a broad range of topics, from audio and text processing and classification to machine learning algorithms and their interpretability. The program highlights included invited talks on machine learning interpretability from Dr. Rich Caruana (Microsoft) and Dr. Jason Yosinski (Uber), speech signal processing from Dr. Hynek Hermansky (Johns Hopkins University) and Dr. Michiel Bacchiani (Google), generative models and neural networks from Dr. Ralf Schlüter (RWTH Aachen) and Dr. Erik McDermott (Google), and text generation and translation from Dr. Mike Schuster (Two Sigma), Dr. Alexander Rush (Harvard University), and Dr. Jason Eisner (Johns Hopkins University). The workshop also included demonstrations of visualization tools for neural networks and of the benefits of incorporating them into modeling strategies.
A panel discussion that included researchers from industry and academia as well as workshop participants was moderated by Dr. Eisner. The topics of discussion included connections between deep learning and conventional machine learning, linguistic analysis, signal processing, and speech recognition, framed around open-ended key questions such as:
- End-to-end deep learning systems are gaining popularity, but they make neural networks even more of a "black box". Is it possible to build interpretable end-to-end neural networks?
- How can we leverage the availability of unlabeled audio to improve the robustness of modern speech recognition systems?
- In order to create interpretable models (for hypothesis testing and discovery in science), should we look more in the direction of causal inference?
- Deploying a system in a production setting involves a tradeoff between performance, robustness, and interpretability. Given a choice between systems (one that performs best, one that is highly robust, and one that is highly interpretable), which should be chosen? Is the choice application dependent?
The workshop highlighted how human speech has evolved to be understood, fitting the properties of human hearing, and how models have followed suit by adapting their signal processing and modeling power to do the same. An emerging set of models presented at the workshop showed how to interpret knowledge gained from statistical models and priors and combine it with more recent neural approaches. These include deep density neural network models for speech recognition that are both interpretable and practical; neural Hidden Markov Models, which are similar to the popular attention-based mechanisms used in speech and language processing; conditional neural Hidden Semi-Markov Models; and neural FSTs for text generation and morphological reinflection.