Interspeech 2016 took place in a chilly, autumnal San Francisco from 8-12 September. For me it was a fun chance to welcome the broader speech community to my newly adopted home. The Hyatt Regency proved a beautiful, pleasantly compact venue, and the nearby Ferry Building was a lovely place for lunch and more. As we’ve come to expect in recent years, there was some overpacking of oral sessions on popular topics such as neural nets, but there was a definite improvement over previous speech conferences in this regard. As ever, I found that despite my best planning I ended up doing something of a random walk between sessions, so let me preface this by saying that everything that follows is based on my own very biased conference experience!
The single most noticeable trend for me was the rise in popularity of deep convolutional nets / time-delay neural nets (TDNNs), as exemplified by Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention, Compact Feedforward Sequential Memory Networks for Large Vocabulary Continuous Speech Recognition, Stacked Long-Term TDNN for Spoken Language Recognition, Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Recognition, Advances in Very Deep Convolutional Neural Networks for LVCSR, Context Adaptive Neural Network for Rapid Adaptation of Deep CNN Based Acoustic Models, and the extra-program news of WaveNet. This seems to be motivated in part by the greater parallelization available during training compared to recurrent neural nets, and in part by the success of Visual Geometry Group (VGG)-like models in computer vision.
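For readers who want a concrete picture of the "layer-wise context expansion" idea behind TDNNs and dilated convolutions, here is a minimal NumPy sketch (my own illustration, not code from any of the papers above; all dimensions and names are invented): each layer splices frames at increasing temporal offsets, so the receptive field grows with depth, while every output frame within a layer can be computed independently, which is exactly the training-time parallelism that recurrent nets lack.

```python
# Toy TDNN / dilated 1-D convolution stack in NumPy (illustrative sketch only).
import numpy as np

def tdnn_layer(x, w, b, dilation):
    """One TDNN-style layer.

    x: (T, d_in) input frames; w: (k, d_in, d_out) weights over k spliced
    offsets; b: (d_out,) bias; dilation: spacing between spliced frames.
    Returns (T, d_out) outputs, zero-padded at the edges.
    """
    T, _ = x.shape
    k, _, d_out = w.shape
    half = (k - 1) // 2
    pad = np.pad(x, ((half * dilation, half * dilation), (0, 0)))
    y = np.zeros((T, d_out))
    for i in range(k):                      # sum contributions of the k spliced offsets
        offset = i * dilation
        y += pad[offset:offset + T] @ w[i]  # every frame computed independently
    return np.maximum(y + b, 0.0)           # ReLU

# Toy stack: kernel width 3 with dilations 1, 2, 4.
rng = np.random.default_rng(0)
h = rng.standard_normal((100, 40))          # 100 frames of 40-dim features
d_in = 40
for dilation in (1, 2, 4):
    w = 0.1 * rng.standard_normal((3, d_in, 64))
    h = tdnn_layer(h, w, np.zeros(64), dilation)
    d_in = 64
print(h.shape)                              # (100, 64)
```

With kernel width 3 and dilations 1, 2 and 4, each output frame of the top layer sees 15 input frames, and doubling the dilation at each layer grows that context window exponentially with depth.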
Recent years have seen several variants on the idea of propagating information deeply within neural networks by having some form of linear value and gradient propagation across layers (much as long short-term memory (LSTM) and gated recurrent unit (GRU) architectures do across time), e.g. highway networks, residual networks, skip connections, grid LSTMs, linearly augmented networks. The interest in this area continued this year with papers such as Small-Footprint Deep Neural Networks with Highway Connections for Speech Recognition, Multidimensional Residual Learning Based on Recurrent Neural Networks for Acoustic Modeling, Exploiting Depth and Highway Connections in Convolutional Recurrent Deep Neural Networks for Speech Recognition and LSTM, GRU, Highway and a Bit of Attention: An Empirical Overview for Language Modeling in Speech Recognition.
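As a concrete illustration of the shared idea here, below is a minimal NumPy sketch of a highway layer (again my own toy example with invented dimensions, not code from any of the papers above): a learned transform gate mixes a nonlinear transform of the input with the input itself, so wherever the gate is near zero the layer acts as an identity, letting values, and during training gradients, pass across many layers essentially unchanged.

```python
# Toy highway layer in NumPy (illustrative sketch only).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def highway_layer(x, w_h, b_h, w_t, b_t):
    """y = t * H(x) + (1 - t) * x, with transform gate t = sigmoid(x W_t + b_t)."""
    h = np.tanh(x @ w_h + b_h)        # nonlinear transform H(x)
    t = sigmoid(x @ w_t + b_t)        # transform gate in (0, 1)
    return t * h + (1.0 - t) * x      # carry the input through where t is small

rng = np.random.default_rng(1)
d = 64
x = rng.standard_normal((10, d))      # batch of 10 hidden vectors
for _ in range(8):                    # a deep stack of highway layers
    w_h = 0.1 * rng.standard_normal((d, d))
    w_t = 0.1 * rng.standard_normal((d, d))
    # A negative transform-gate bias starts the network close to an identity
    # mapping, the commonly recommended highway-network initialization.
    x = highway_layer(x, w_h, np.zeros(d), w_t, -1.0 * np.ones(d))
print(x.shape)                        # (10, 64)
```

Residual networks can be read as the special case where the gate is fixed rather than learned, adding the transform to the unchanged input instead of interpolating between them.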
I’ve always been interested in coherent approaches to training, so I was pleased to see lattice-free sequence training, which typically starts sequence training from scratch, gaining traction (Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI, GMM-Free Flat Start Sequence-Discriminative DNN Training). Also in the "coherent training" vein there were some "end-to-end" recognition papers (Segmental Recurrent Neural Networks for End-to-End Speech Recognition, Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks), though I felt this topic generated noticeably less buzz than in previous years. I was also personally happy that it now seems to be widely understood that recurrent nets, as typically used in acoustic modeling, do nothing to alleviate the traditional conditional independence assumptions we’ve had since the days of discriminatively trained hidden Markov models, whereas the current crop of end-to-end models do.
Dan Jurafsky’s keynote on the last day deserves special mention for being yet another great example of the fact that compelling research can be presented, well, compellingly. Dan is an extremely entertaining speaker, and performs a very impressive balancing act between technical detail and easily understandable and entertaining high-level explanations. The topic of his talk was the spread of innovation through scientific communities, with examples drawn from the history of speech and language technology and (it seemed natural at the time!) restaurant menu vocabulary.
I've always enjoyed the social side of conferences immensely, especially the random conversations jumping from highly technical to light-hearted topics and back. I had more fun conversations than I can count over the week, and it was great to see old friends. The welcome reception on the first night was in a beautiful part of the hotel, with retro-futuristic elevators looming above a vast open area. Food distribution is always a challenge at the reception, and this year was no exception, but the waiters held their ground and coped admirably with hungry researchers! The banquet took place in the California Academy of Sciences, and offered a rainforest, an albino alligator and two large earthquake simulations for the adventurous. I thought the format of lots of temporary tables beautifully avoided the usual issue of spending the whole evening wherever you happen to first sit down. It was one of my favorite banquets of recent years, and nicely rounded off the social program.