The 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2017) was held on December 16-20, 2017 in Okinawa, a southern island of Japan.
The workshop drew a record number of participants (271 registrations, 48% from industry and 47% from academia). It featured 4 keynote speeches and 6 invited talks, some of which are outlined in more detail below.
Two talks were related to actual commercial services.
Dr. Tara N. Sainath from Google Research gave a keynote speech, "Multichannel Raw-Waveform Neural Network Acoustic Models toward Google HOME", detailing the research ideas behind Google HOME, a smart speaker scenario in which both additive noise and reverberation are challenges. The talk highlighted a number of specific technical approaches that Google used to address these challenges.
A series of neural beamforming layers that are trained jointly with the backend acoustic model, including "unfactored raw-waveform", "factored raw-waveform", "factored Complex Linear Projection", and "Neural Adaptive Beamforming", were detailed and compared. In the actual Google HOME, "factored Complex Linear Projection", which best balanced computation and WER, was combined with the backend grid-LSTM acoustic model. A WER of 4.9% was reported on real Google HOME traffic data, with a 16% relative WER reduction obtained by combining Weighted Prediction Error dereverberation with "factored Complex Linear Projection".
Dr. Mike Schuster from Google Brain gave an invited talk, "Moving to Neural Machine Translation at Google".
He first introduced the shift from phrase-based translation to neural machine translation, including sequence-to-sequence models and attention models. He then addressed the wordpiece model, which makes a huge vocabulary tractable by segmenting words into pieces in a data-driven way. The talk then presented model training and inference with detail and candor. For example, only 15% of the available data is used for English-to-French translation, and specialized hardware, the TPU, combined with improved algorithms, reduced the latency from 10 sec./sentence to 0.2 sec./sentence. Experimental results for the multilingual model and zero-shot translation were presented, along with the possibility of mixing languages on both the source and target sides.
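To make the idea of segmenting words into pieces concrete, here is a minimal sketch of applying a piece vocabulary to a word. It is an illustration only: the talk's vocabulary is learned from data, whereas `TOY_VOCAB` below is hand-written and hypothetical, and greedy longest-match-first is just one simple way to apply such a vocabulary, not necessarily the procedure used at Google. Real systems also mark word boundaries, which is omitted here.

```python
def segment(word, vocab, unk="<unk>"):
    """Split `word` into vocabulary pieces, greedily taking the longest match.

    Returns [unk] if some part of the word cannot be covered by `vocab`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is a known piece.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece in the vocabulary matches at this position.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary, hand-written for illustration only.
TOY_VOCAB = {"trans", "lat", "ion", "s", "un"}

print(segment("translations", TOY_VOCAB))
print(segment("xyz", TOY_VOCAB))
```

With a vocabulary of pieces, a rare word such as "translations" is represented by a handful of frequent units instead of needing its own entry, which is what keeps the effective vocabulary size manageable.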
Probing the limits of current speech recognition, Dr. George Saon from IBM Research AI gave an invited talk, "Man vs. Machine in Conversational Speech Recognition."
He presented the history of ASR improvement on the standard benchmarks for conversational telephone speech recognition: Switchboard, CallHome, and other test sets.
The latest system achieved WERs of 5.1% and 9.9% on Switchboard and CallHome, respectively, while human performance on the two tasks was 5.1% and 6.8% according to their investigation. On Switchboard, which contains conversations between strangers on a preassigned topic, ASR matches human performance. However, a significant gap remains on CallHome, which contains conversations between friends and family with no predefined topic and has only a limited amount (18 hours) of matched training data. Dr. Saon emphasized the importance of tackling test data other than Switchboard.
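For readers less familiar with the metric, the sketch below shows how a word error rate is conventionally computed (word-level edit distance divided by the number of reference words) and how a relative gap between machine and human WER can be expressed. The function names and the example sentences are illustrative assumptions and are not taken from the talk; only the 9.9% and 6.8% figures come from the report above.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_gap(machine_wer, human_wer):
    """Gap between machine and human WER, relative to the human WER."""
    return (machine_wer - human_wer) / human_wer


# One substitution and one deletion against a six-word reference.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
# CallHome figures reported in the talk: machine 9.9%, human 6.8%.
print(round(relative_gap(9.9, 6.8), 3))
```

On the CallHome figures, the machine WER is 3.1 points above the human WER in absolute terms, which is roughly 46% higher in relative terms, while the Switchboard gap is zero.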
Two talks suggested directions of speech research toward its deployment to real-world applications:
First, Dr. Hiroshi Ishiguro gave a keynote speech, "Conversational Robots and the Fundamental Issues."
In addition to introducing his famous androids, he discussed what "conversation" is and presented a simplified form of conversation that uses only touch panels.
He also suggested that two modalities are the minimum needed to feel human presence, and that more modalities, including embodiment, will be required for better services in the future.
Additionally, Dr. Chieko Asakawa from Carnegie Mellon University and IBM Research AI gave an invited talk, "Cognitive Assistant for the Blind."
A "Cognitive Assistant" augments missing or diminished abilities with the power of "cognitive computing". Realizing this requires interaction with the real world based on knowledge, recognition, and localization. Various deployed techniques, including localization and object recognition, were presented. The need for further improvements in speech technology was also pointed out.
Apart from keynotes and invited talks, all accepted papers were presented in poster sessions.
End-to-end speech recognition is still being studied extensively, as the work below demonstrates.
Direct acoustic-to-word modeling with CTC was further advanced:
Soltau et al. presented a significant reduction in computational cost through lower frame rates realized by a Time Delay Neural Network (TDNN).
Li et al. presented a hybrid of word and character models for better handling of Out-Of-Vocabulary (OOV) words.
Beyond CTC, Battenberg et al. presented an extensive comparison of CTC, RNN transducer, and attention models.
The RNN transducer did not outperform CTC and attention models under streaming constraints, although its simplicity, especially in decoding, remains attractive.
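As background for the CTC-based acoustic-to-word models mentioned above, the following sketch shows the standard CTC collapsing rule for turning per-frame labels into an output sequence: merge consecutive repeats, then drop blanks. The frame labels are hand-written, hypothetical examples standing in for the per-frame argmax of a trained model, and the blank symbol `<b>` is an arbitrary choice for illustration. This is a generic illustration of CTC, not the decoding procedure of any of the papers listed below.

```python
from itertools import groupby

BLANK = "<b>"  # arbitrary blank symbol chosen for this illustration


def ctc_collapse(frame_labels, blank=BLANK):
    """Apply the CTC collapsing rule to a sequence of per-frame labels."""
    # Step 1: merge runs of identical consecutive labels into one label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blank symbols.
    return [label for label in merged if label != blank]


# Hypothetical per-frame word labels (e.g. argmax of a word-level CTC model).
frames = [BLANK, "hello", "hello", BLANK, BLANK, "world", BLANK]
print(ctc_collapse(frames))

# A blank between two identical labels keeps them as separate words.
print(ctc_collapse(["no", BLANK, "no"]))
print(ctc_collapse(["no", "no"]))
```

Because one label is emitted per frame, a lower frame rate directly reduces the number of frames the model has to evaluate, which is the source of the computational savings mentioned above.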
In an emerging line of research, Tjandra et al. proposed a "speech chain" mechanism based on end-to-end speech recognition and speech synthesis.
They presented promising results in which ASR and TTS improve by teaching each other using only unpaired data.
The speech chain paper received the best student paper award.
Hagen Soltau, Hank Liao, and Hasim Sak, "Reducing the computational complexity for whole word models"
 Jinyu Li, Guoli Ye, Rui Zhao, Jasha Droppo, and Yifan Gong, "Acoustic-to-word model without OOV"
Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, Anuroop Sriram, and Zhenyao Zhu, "Exploring neural transducers for end-to-end speech recognition"
 Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, "Listening while speaking: speech chain by deep learning"