Interspeech 2019 Recap

September, 2019

Interspeech 2019 was held in Graz, Austria, 1.9 km from the alma mater of Nikola Tesla. The conference honored Austria's rich cultural and musical history with a great deal of music, tracing the evolution of classical music through great composers including Mozart, Schubert, Mahler, and Schoenberg. This culminated in a sequence of original trombone fanfares composed specifically to honor the new ISCA Fellows, the winners of the Best Paper awards, and others. In so many ways, speech and music are close cousins. The satellite workshop on Speech, Music, and Mind explored the two modalities together by delving into the analysis of affective expression in speech, language, and music, and exploring the impact of these sonic expressions on humans. Shri Narayanan gave a keynote focusing on the analysis and modeling of multimodal, affective, and behavioral signals, while John Ashley Burgoyne gave a keynote that addressed what people hear musically and expressively, how what they hear influences them, and the problem of bridging the gap between traditional signal processing features and human perception and understanding.

The exploration of affective, expressive speech and language continued into at least nine of the main conference sessions, which was exciting to see. These sessions covered various aspects of emotion recognition, interpretation of social signals in speech and language, dynamics of emotion in communication, sentiment analysis, generation of expressive speech, and paralinguistic expression. Much of the work applied neural networks and deep learning techniques in novel ways to model a selected set of emotions and expressions. The work on emotion, however, frequently focused on a short list of basic emotions covered in a limited, acted speech corpus, and these emotions were usually selected from one of the theories of basic emotion. Natural speech differs from acted speech, however, and humans perceive emotion with much more subtlety. It would be great to see more exploration of spontaneous emotion and expression across a more subtle range of expressivity. This remains very much an open challenge.

Much of the conference content, however, focused on topics directly related to ASR, TTS, NLP, and dialog. Plenary lectures covered speech synthesis, EMG, ECoG, aerodynamics, and natural language dialog. In the opening plenary lecture, ISCA Medalist Keiichi Tokuda described the progress in HMM-based and neural speech synthesis over the past two decades, concluding with examples in which the neural speech synthesizer outperforms concatenative synthesis on exactly the same task. On the second day of the conference, Tanja Schultz had a recommendation for anybody who wanted to hold a telephone conversation in the auditorium during her talk: she demonstrated an electromyographic system that can convert silent speech, locally, into audible speech at the remote end of the telephone. The question and answer session at the end of her talk emphasized the difference between "imagined speech" (a phonologically precise imagination, which can be read using electrocorticography) and "inner speech" (the usual inner voice, a sequence of concepts arranged into logical order but devoid of imagined phonation). In the third plenary session, Manfred Kaltenbacher showed that, with a sufficiently detailed simulation of the vocal folds, it is possible to predict the outcome of surgical interventions. On the last day of the conference, Professor Mirella Lapata from Edinburgh re-cast Spike Jonze's "Her" as a problem of representations, arguing, essentially, that with a sufficiently precise and accurate embedding of the meanings of natural language dialog segments, it becomes possible to decide what to say.

LibriSpeech continued to drive advances in automatic speech recognition. In his survey talk on Monday morning, Ralf Schlüter discussed a wide variety of methods for converting from the time axis of the audio signal to the time axis of the grapheme string, including connectionist temporal classification, attention-based methods, and even a two-dimensional LSTM. In the end, he showed experimentally that the TDNN-HMM is still the reigning monarch, by demonstrating word error rates on the large-vocabulary LibriSpeech task that are barely above 2%. Transformer models are not far behind, though: the next two papers, by Pham et al. and by Jason Li et al., showed very deep transformer models (48 layers, in Pham's model) with very low word error rates (2.8% on the LibriSpeech test-clean set, in Li's paper).
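
For readers less familiar with the metric behind these numbers, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch only: the function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration, and it is not the scoring pipeline used in any of the papers mentioned above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; real evaluations usually also
    normalize case and punctuation before scoring.
    """
    ref = reference.split()   # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a 2.8% word error rate corresponds to roughly 28 word errors per 1,000 reference words.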

The Paralinguistic challenge moved beyond emotion this year. Contestants attempted four quite disparate tasks: recognizing Styrian dialects (dialects of German from neighboring towns in the Austrian province surrounding Graz), scoring the sleepiness of speech on a continuous scale (the 9-point Karolinska Sleepiness Scale), recognizing the meanings of infant vocalizations (laugh, cry, babble, coo), and recognizing the activity of orcas recorded off the coast of northern British Columbia in 2017 and 2018. The winning paper for two of the four competitions was a transformer model that used novel methods of adversarial data augmentation.

This year it was wonderful to see an increasing range of papers covering new applications in speech analytics, particularly in the areas of health, wellness, and speech pathology. What if you could determine disease states based on speech? In many cases you can. Some of the health topics covered included Parkinson’s disease, cognitive impairment, schizophrenia, bipolar disorder, depression, and child speech disorders. What if you could build assistive health care applications that extend care beyond the limits of in-office doctor visits? We are beginning to do just that. This is an area to watch, with a potentially large impact on society.

Generative models played a big role in a variety of speech processing papers. Many new ideas were proposed to improve the neural waveform generation process: current autoregressive vocoders can generate natural-sounding speech, but their computational complexity is a challenge in real-time systems. Neekhara et al.’s work showed that generative adversarial networks (GANs) can speed up this process by hundreds of times. Aside from neural speech synthesis, a singing voice generation demo presented by Lee et al. showed that cutting-edge adversarial training can create a natural-sounding singing voice while preserving formant information (best student paper). Last but not least, GAN-based speech enhancement systems are generalizing to handle even harsher recording environments, as discussed in Pascual et al.’s work.

Another important area to watch is the acquisition and annotation of new speech corpora. Many of the current limitations of speech recognition and synthesis systems stem from the limitations of the corpora used to train the models. We do not spontaneously speak like newscasters unless we are newscasting; nor is our speech expression limited to how we behave in business meetings. We don’t speak to our children in the same way we speak to our adult friends. A wider range of natural speech is required for model improvement. Some of the new corpora that offer a wider range of expression, emotion, accents, and disfluencies in natural speech include the MALACH corpus (Holocaust oral histories), the NITK Kid’s Speech Corpus, and the VESUS corpus (crowd annotation for emotion).

Overall, the conference offered many opportunities for exchange of ideas and connection. The poster sessions were exceptionally well designed to facilitate discussion this year, with an artistic and acoustically aware layout. Few conferences give poster and oral presentation sessions equal standing, but this one does. The poster sessions facilitated discussion not only with the authors but also among the other visitors to a poster, a great opportunity for serendipitous connection with like-minded researchers. Some oral presentation sessions were overcrowded, though, suggesting that some kind of attendance survey prior to the conference would help the organizers plan room assignments (some large conferences have started to do this). The many social events, particularly the gala and industry-hosted events, provided venues for meeting in a relaxed atmosphere. The topic of remote attendance comes up frequently, but these are exactly the kinds of exchanges that remote participants would miss.
