Man vs. Machine in Conversational Speech Recognition

We live in an era where more and more tasks, once thought to be impregnable bastions of human intelligence, succumb to AI. Are we at the cusp where ASR systems have matched expert humans in conversational speech recognition? We try to answer this question with some experimental evidence on the Switchboard English conversational telephony corpus. On the human side, we describe some listening experiments which established a new human performance benchmark. On the ASR side, we discuss a series of deep learning architectures and techniques for acoustic and language modeling that were instrumental in lowering the word error rate to record levels on this task.
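The figure of merit throughout is the word error rate. As a point of reference only (this is not the authors' scoring pipeline, and the function name is ours), WER is conventionally computed from a word-level edit distance between the reference and the hypothesis transcript:

# Minimal word-error-rate sketch: Levenshtein distance over words,
# WER = (substitutions + deletions + insertions) / reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution and one deletion against a 6-word reference.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2/6 = 0.333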
Duration: 0:59:34

Super Human Speech Analysis? Getting Broader, Deeper and Faster

Human performance often appears as a glass ceiling when it comes to automatic speech and speaker analysis. In some tasks, however, such as health monitoring, automatic analysis has successfully started to break through this ceiling. The field has by now benefited from more than a decade of deep neural learning approaches such as recurrent LSTM networks and deep RBMs; recently, however, a further major boost could be witnessed. This includes the injection of convolutional layers for end-to-end learning, as well as active and autoencoder-based transfer learning and generative adversarial network topologies to better cope with the ever-present bottleneck of severe data scarcity in the field. At the same time, multi-task learning has made it possible to broaden the range of tasks handled in parallel and to model the uncertainty that often affects the gold standard for subjective labels such as the emotion or perceived personality of speakers. This talk highlights these and further recent trends, such as increasingly deep networks and the use of deep image networks for speech analysis, on the road to 'holistic' superhuman speech analysis that 'sees the whole picture' of the person behind a voice. At the same time, increasing efficiency is shown for an ever-'bigger'-data and increasingly mobile application world that requires fast and resource-aware processing. The exploitation of these advances in ASR and SLU is featured throughout.
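To make the multi-task idea concrete, a minimal sketch follows; it is not the presenter's actual model, and the layer sizes, task heads and label dimensions are invented for illustration. A shared convolutional front-end feeds separate heads, here one for emotion classification and one for perceived-personality regression, trained with a joint loss:

import torch
import torch.nn as nn

class MultiTaskSpeechNet(nn.Module):
    """Shared convolutional front-end with one output head per task.
    Sizes and task heads are illustrative only."""
    def __init__(self, n_mels=40, n_emotions=4, n_personality_traits=5):
        super().__init__()
        # Shared representation learned from log-mel spectrogram frames.
        self.shared = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over time -> utterance embedding
            nn.Flatten(),
        )
        # Task-specific heads trained jointly on the shared embedding.
        self.emotion_head = nn.Linear(64, n_emotions)                  # classification
        self.personality_head = nn.Linear(64, n_personality_traits)    # regression

    def forward(self, mel):            # mel: (batch, n_mels, time)
        h = self.shared(mel)
        return self.emotion_head(h), self.personality_head(h)

model = MultiTaskSpeechNet()
mel = torch.randn(8, 40, 300)          # batch of short utterances
emo_logits, traits = model(mel)
# Joint loss: cross-entropy for emotion, MSE for personality traits.
loss = (nn.functional.cross_entropy(emo_logits, torch.randint(0, 4, (8,)))
        + nn.functional.mse_loss(traits, torch.randn(8, 5)))
loss.backward()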
Duration: 1:01:36

Moving to Neural Machine Translation at Google

Machine learning and in particular neural networks have made great advances in the last few years for products that are used by millions of people, most notably in speech recognition, image recognition and, most recently, neural machine translation. Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which addresses many of these issues. The model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To accelerate final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units for both input and output. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves results competitive with the state of the art. In human side-by-side evaluations, it reduces translation errors by more than 60% compared to Google's phrase-based production system. The new Google Translate was launched in late 2016 and has significantly improved translation quality for all Google users.
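A minimal sketch of the residual idea in such deep LSTM stacks follows; it is not Google's implementation, and the vocabulary size, model width and class name are illustrative assumptions only. Each layer's output is added to its input, which eases optimization of very deep encoders and decoders:

import torch
import torch.nn as nn

class ResidualLSTMEncoder(nn.Module):
    """Stack of LSTM layers with residual (skip) connections, in the spirit
    of deep NMT encoders; all sizes are illustrative only."""
    def __init__(self, vocab_size=32000, d_model=512, num_layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            nn.LSTM(d_model, d_model, batch_first=True) for _ in range(num_layers)
        )

    def forward(self, tokens):             # tokens: (batch, src_len) of sub-word ids
        x = self.embed(tokens)
        for i, lstm in enumerate(self.layers):
            out, _ = lstm(x)
            # Residual connection from the second layer onward.
            x = out + x if i > 0 else out
        return x                            # per-token states consumed by decoder attention

enc = ResidualLSTMEncoder()
states = enc(torch.randint(0, 32000, (4, 20)))   # -> (4, 20, 512)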
Duration: 1:10:54

Crossing Speaker and Language Barriers in Speech Processing

A person’s speech is strongly conditioned by their own articulators and the language(s) they speak; hence, rendering speech across speakers or across languages from a source speaker’s speech data collected in their native language is both academically challenging and desirable for technology and applications. The quality of the rendered speech is assessed along three dimensions: naturalness, intelligibility, and similarity to the source speaker. Usually, all three criteria cannot be met when rendering is done in both a cross-speaker and a cross-language manner. We will analyze the key factors of rendering quality objectively, in both the acoustic and phonetic domains. Monolingual speech databases recorded by different speakers, as well as bilingual ones recorded by the same speaker(s), are used. Measures in the acoustic space and the phonetic space are adopted to quantify naturalness, intelligibility and the speaker’s timbre objectively. Our “trajectory tiling” algorithm-based cross-lingual TTS is used as the baseline system for comparison. To equalize speaker differences automatically, a speaker-independently trained, DNN-based ASR acoustic model is used. Kullback-Leibler divergence is proposed to statistically measure the phonetic similarity between any two given speech segments, which may come from different speakers or languages, in order to select good rendering candidates. Demos of voice conversion, speaker-adaptive TTS and cross-lingual TTS will be shown across speakers, across languages, or both. The implications of this research for low-resource speech research, speaker adaptation, “average speaker’s voice”, accented/dialectal speech processing, speech-to-speech translation, audio-visual TTS, etc. will be discussed.
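A minimal sketch of such a KL-based similarity measure is given below, assuming each segment is summarized by its frame-level phonetic posteriors (from the speaker-independent ASR model) averaged into a single distribution; the symmetrized form used here is one common choice, not necessarily the exact measure of the talk:

import numpy as np

def segment_posterior(frame_posteriors, eps=1e-10):
    """Average frame-level phonetic posteriors (frames x phones) into one
    distribution describing the segment, then renormalize."""
    p = np.asarray(frame_posteriors).mean(axis=0) + eps
    return p / p.sum()

def symmetric_kl(p, q):
    """Symmetrized Kullback-Leibler divergence between two distributions;
    smaller values indicate phonetically more similar segments."""
    kl_pq = np.sum(p * np.log(p / q))
    kl_qp = np.sum(q * np.log(q / p))
    return 0.5 * (kl_pq + kl_qp)

# Toy example with 3 phone classes: segments from different speakers or
# languages are compared purely in the phonetic (posterior) space.
seg_a = segment_posterior([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]])
seg_b = segment_posterior([[0.75, 0.15, 0.1], [0.8, 0.1, 0.1]])
seg_c = segment_posterior([[0.1, 0.1, 0.8], [0.2, 0.1, 0.7]])
print(symmetric_kl(seg_a, seg_b))   # small divergence: good rendering candidate
print(symmetric_kl(seg_a, seg_c))   # large divergence: poor candidate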
Duration: 1:01:08
