The Automatic Speech Recognition and Understanding Workshop (ASRU) was recently held in Scottsdale, Arizona, from December 9-13, 2015. The workshop included lectures and poster sessions across a variety of speech and signal processing areas. The Best Paper award went to Suman Ravuri of UC Berkeley for his paper “Hybrid DNN-Latent Structured SVM Acoustic Models for Continuous Speech Recognition”. In this article, we go through the main themes of each day.
The first day began with an interesting keynote by Rich Caruana of Microsoft. Caruana discussed ideas behind knowledge distillation, which involves training a smaller, simpler model to mimic the behavior of a larger, more complex model. The work has many interesting implications for applications that require shrinking models while preserving accuracy. The first invited talk was given by Oriol Vinyals of Google, who spoke on “Recurrent Architectures”. His talk focused on newer research areas such as sequence-to-sequence learning and attention models. Heiga Zen of Google also gave an invited talk on the latest research results in text-to-speech (TTS) synthesis.
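To make the distillation idea concrete, here is a minimal sketch of how such a student-teacher loss is commonly written; the framework (PyTorch), temperature, and loss weighting are our own illustrative choices, not details from Caruana's talk.

```python
# Minimal sketch of knowledge distillation: a small "student" network is trained
# to match the softened output distribution of a larger "teacher" network.
# Temperature and loss weighting below are illustrative placeholders.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: teacher posteriors at a raised temperature expose the
    # relative probabilities the teacher assigns to incorrect classes.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_preds = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_preds, soft_targets, reduction="batchmean")
    # Hard-label cross-entropy keeps the student anchored to the true labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    # temperature**2 rescales the gradients of the soft term (the usual convention).
    return alpha * (temperature ** 2) * soft_loss + (1 - alpha) * hard_loss
```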
The first poster session featured many interesting papers on ASR. One notable paper was Suman Ravuri’s “Hybrid DNN-Latent Structured SVM Acoustic Models for Continuous Speech Recognition,” which looks at replacing a sequence-trained DNN hybrid system with a DNN-SVM model. Results on the ICSI meeting room corpus show over a 6% relative improvement over a sequence-trained DNN. In the second poster session on ASR, Yajie Miao’s paper “EESEN: End-to-End Speech Recognition Using Deep RNN Models and WFST-Based Decoding” explored Connectionist Temporal Classification (CTC) on Switchboard, showing improvements consistent with those reported in the literature for larger tasks. This is one of the first papers to show CTC benefits on smaller tasks, and the code has been released to the community at https://github.com/srvk/eesen.
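For readers unfamiliar with CTC training, the following is a small illustrative training step for an RNN acoustic model with a CTC loss; it is a generic PyTorch sketch with placeholder shapes and sizes, not code from the EESEN toolkit linked above.

```python
# Illustrative CTC training step for a bidirectional LSTM acoustic model.
# All dimensions (features, layers, label counts) are placeholders.
import torch
import torch.nn as nn

num_phones = 46          # CTC output units (e.g., phonemes); +1 below for blank
model = nn.LSTM(input_size=40, hidden_size=320, num_layers=4,
                bidirectional=True, batch_first=True)
proj = nn.Linear(2 * 320, num_phones + 1)      # +1 for the CTC blank symbol
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(8, 500, 40)                # (batch, frames, filterbank dims)
feat_lens = torch.full((8,), 500, dtype=torch.long)
labels = torch.randint(1, num_phones + 1, (8, 60))   # label ids, 0 reserved for blank
label_lens = torch.full((8,), 60, dtype=torch.long)

hidden, _ = model(feats)
log_probs = proj(hidden).log_softmax(dim=-1)   # (batch, frames, classes)
loss = ctc(log_probs.transpose(0, 1),          # CTCLoss expects (T, N, C)
           labels, feat_lens, label_lens)
loss.backward()
```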
Day 2 kicked off with a keynote by Hermann Ney of RWTH Aachen. Ney discussed open issues surrounding optimizing ASR systems with the Bayes decision rule, and the discrepancies between this criterion and WER. In addition, he highlighted open questions regarding the link between language model perplexity and WER. Next, there was an invited talk discussing the ASRU challenges. This included the CHiME-3 challenge on distant-microphone speech recognition in multisource noise environments, as well as the ASpIRE challenge on improving speech recognition quality in reverberant environments. The second invited talk, from Tomohiro Nakatani of NTT, discussed some of the latest advances in signal processing and neural networks to improve speech recognition with multiple microphones.
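As background for Ney's point, the standard MAP decision rule minimizes sentence error, whereas a rule matched to WER would minimize expected edit distance; the contrast can be summarized as follows (our own summary, not taken from the keynote slides):

```latex
% Standard MAP decoding minimizes sentence (0/1) error:
\hat{W}_{\mathrm{MAP}} = \arg\max_{W} \, p(W \mid X) = \arg\max_{W} \, p(X \mid W)\, p(W)

% A WER-matched rule would instead minimize the expected edit distance:
\hat{W}_{\mathrm{WER}} = \arg\min_{W} \sum_{W'} p(W' \mid X)\, \mathrm{Lev}(W, W')
```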
The poster session which followed was dedicated to papers from these challenges. One interesting paper was by Takuya Yoshioka et al., titled “The NTT CHiME-3 System: Advances in Speech Enhancement and Recognition for Mobile Multi-Microphone Devices”. The paper highlighted the NTT system for the CHiME-3 challenge, which won the challenge with an impressive WER of 5.8%. The main algorithmic components of their system included using masking to aid beamforming, removing reverberation with weighted prediction error (WPE), and acoustic modeling with the “network in network” concept. One of the papers from the ASpIRE challenge, “Analysis of Factors Affecting System Performance in the ASpIRE Challenge,” presented by Melot et al., studied how SNR, room, distance from the microphone, microphone type, and other factors affect WER. The analysis showed that even factors like the quality of speech activity detection and the angle between speaker and microphone have significant effects on WER. Given the success of multi-condition training for neural network based systems, such analyses should prove fruitful in improving the current state-of-the-art in real conditions.
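As a rough illustration of mask-driven beamforming in this spirit (not the actual NTT implementation), the sketch below accumulates speech and noise spatial covariances from a time-frequency mask and derives an MVDR beamformer per frequency bin; all names, shapes, and regularization values are our own placeholders.

```python
# Sketch of mask-driven MVDR beamforming: a time-frequency mask marks which
# bins are dominated by speech, spatial covariances are accumulated from the
# mask, and per-frequency beamforming weights are derived from them.
import numpy as np

def mvdr_from_mask(stft, speech_mask):
    """stft: (channels, frames, freqs) complex STFT of the microphone array.
    speech_mask: (frames, freqs) values in [0, 1], e.g. from a neural mask estimator."""
    C, T, F = stft.shape
    enhanced = np.zeros((T, F), dtype=complex)
    for f in range(F):
        X = stft[:, :, f]                                   # (C, T)
        w_s = speech_mask[:, f]
        # Mask-weighted spatial covariance matrices for speech and noise.
        phi_s = (X * w_s) @ X.conj().T / max(w_s.sum(), 1e-8)
        phi_n = (X * (1 - w_s)) @ X.conj().T / max((1 - w_s).sum(), 1e-8)
        phi_n += 1e-6 * np.eye(C)                           # diagonal loading
        # Steering vector: principal eigenvector of the speech covariance.
        d = np.linalg.eigh(phi_s)[1][:, -1]
        w = np.linalg.solve(phi_n, d)
        w /= (d.conj() @ w)                                 # MVDR weights
        enhanced[:, f] = w.conj() @ X                       # beamformed output
    return enhanced
```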
The second poster session on ASR had many interesting papers, including a paper from Jinyu Li et al. of Microsoft entitled “LSTM Time and Frequency Recurrence for Automatic Speech Recognition”. This paper looked at an alternative to convolution: a frequency LSTM that is unrolled across frequency steps rather than time. Their proposed method gives an additional 3% relative improvement in WER on a Windows Mobile Phone task.
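A hedged sketch of what such a frequency LSTM can look like is shown below: the same LSTM is unrolled over overlapping chunks of the frequency axis within each frame rather than over time. The chunk size, stride, and hidden dimension here are illustrative, not the values used in the paper.

```python
# Sketch of a frequency LSTM (F-LSTM): the LSTM walks over chunks of the
# frequency axis of each frame instead of over time steps.
import torch
import torch.nn as nn

class FrequencyLSTM(nn.Module):
    def __init__(self, chunk=24, stride=8, hidden=64):
        super().__init__()
        self.chunk, self.stride = chunk, stride
        self.lstm = nn.LSTM(input_size=chunk, hidden_size=hidden, batch_first=True)

    def forward(self, feats):
        # feats: (batch, frames, freq_bins), e.g. log-mel filterbanks.
        # Slice the frequency axis into overlapping chunks, forming a
        # "frequency sequence" per frame for the LSTM to unroll over.
        chunks = feats.unfold(2, self.chunk, self.stride)   # (B, T, num_chunks, chunk)
        B, T, N, C = chunks.shape
        out, _ = self.lstm(chunks.reshape(B * T, N, C))     # treat (B*T) as batch
        # Concatenate hidden states over frequency steps as the frame representation.
        return out.reshape(B, T, -1)

# Example: 40-dim filterbanks, chunks of 24 with stride 8 -> 3 frequency steps.
x = torch.randn(4, 100, 40)
print(FrequencyLSTM()(x).shape)   # torch.Size([4, 100, 192])
```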
Jason Eisner of The Johns Hopkins University gave the keynote on the third day, presenting techniques that model words from morphs and subword units using phonology. Such systems provide a convenient way to model unseen words. This was followed by Dan Bohus’s invited talk about building a spoken dialog system for a context-aware robot assistant. The talk was interspersed with interesting video snippets that showcased some of the challenges and the many potential points of failure for a real robot assistant. Apart from dealing with conversational speech, he showed how such systems are inherently multimodal and have to address various aspects of interaction like attention, uncertainty, state evolution, and confidence. One of the challenges in this space is the lack of real data to train large-scale machine learning systems. The second invited talk was by Kai Yu, who presented recent improvements in dialog state tracking using hybrid systems that combine rule-based and data-driven techniques.
The subsequent demonstration session covered a wide range of topics, such as visualization of the hidden states of deep neural network acoustic models (“What the DNN Heard? Dissecting the DNN for a Better Insight” by Khe Chai Sim), an online pronunciation feedback tool to help students learn (“NetProF iOS Pronunciation Feedback Demonstration” by Marius et al.), and novel ways of representing and visualizing acoustic signals via Hilbert transforms (“Visualization of the Hilbert Transform” by Sandoval and De Leon).
The keynote for the final day was given by Jerry Chen from NVIDIA. Focusing on NVIDIA's efforts to improve GPUs for scientific computing, Chen talked about some of the areas his team is focusing on that are relevant to speech, such as model compression and speeding up sequence models (RNNs, LSTMs) on GPUs. An interesting suggestion that came out during the talk was to organize information about libraries and tools available through NVIDIA and the research community on a website, with help and support from the IEEE SPL committee. The keynote was followed by Steve Renals' invited talk about model adaptation for neural network acoustic models and speech synthesizers. Among other techniques, adaptation using auxiliary input features, for example an LDA-based domain code, was shown to work well, providing roughly 8% relative gains over the baseline on the MGB challenge task (an ASRU challenge task this year). Multi-task learning also seems to be gaining popularity as a way to improve robustness. Strategies like predicting auxiliary monophone targets and clean features in addition to CD state posteriors were shown to help improve generalization. The latter approach was presented in the final poster session by Qian et al. (“Multi-task Joint-Learning of Deep Neural Networks for Robust Speech Recognition”), where it showed substantial gains on a medium-vocabulary robust ASR task (Aurora4).
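To illustrate the multi-task setup described above, here is a minimal sketch of an acoustic model with a primary CD-state head plus auxiliary monophone and clean-feature heads; the layer sizes, output dimensions, and loss weights are our own placeholders rather than values from the cited papers.

```python
# Sketch of multi-task training for robust ASR: a shared encoder feeds a
# primary CD-state classifier plus auxiliary monophone and clean-feature heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskAM(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, n_cd_states=3000, n_monophones=42):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.cd_head = nn.Linear(hidden, n_cd_states)     # primary task: CD state posteriors
        self.mono_head = nn.Linear(hidden, n_monophones)  # auxiliary monophone targets
        self.clean_head = nn.Linear(hidden, feat_dim)     # clean-feature regression

    def forward(self, noisy_feats):
        h = self.encoder(noisy_feats)
        return self.cd_head(h), self.mono_head(h), self.clean_head(h)

def multitask_loss(outputs, cd_labels, mono_labels, clean_feats,
                   w_mono=0.3, w_clean=0.1):
    cd_logits, mono_logits, clean_pred = outputs
    loss = F.cross_entropy(cd_logits, cd_labels)              # primary loss
    loss += w_mono * F.cross_entropy(mono_logits, mono_labels)
    loss += w_clean * F.mse_loss(clean_pred, clean_feats)
    return loss
```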
Tara Sainath and Arun Narayanan are Research Scientists on the acoustic modeling team at Google. Emails: tsainath@google.com, arunnt@google.com