End-to-End Speech Recognition Systems Explored

Pricing: SPS Members $0.00 | IEEE Members $11.00 | Non-members $15.00
Duration: 0:53:48

Multimedia and Autism

Pricing: SPS Members $0.00 | IEEE Members $11.00 | Non-members $15.00
Duration: 1:05:58

Forty Years of Automatic Speech Recognition (ASR): From Statistical Decision Theory to Deep Learning

Pricing: SPS Members $0.00 | IEEE Members $11.00 | Non-members $15.00
Duration: 0:55:58

Striving for Computational and Physical Efficiency in Speech Enhancement

Pricing: SPS Members $0.00 | IEEE Members $11.00 | Non-members $15.00

As commonplace speech-enabled devices get smaller and lighter, we are faced with a need for simpler processing and simpler hardware. In this talk I will present some alternative ways to approach multi-channel and single-channel speech enhancement under these constraints. More specifically, I will talk about new ways to formulate beamforming that are numerically more lightweight and operate best with physically compact arrays. I will then discuss single-channel approaches using a deep network which, in addition to imposing a light computational load, are amenable to aggressive hardware optimizations that can result in massive power savings and reductions in hardware footprint.
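
For orientation, the sketch below shows the classic delay-and-sum baseline that lightweight beamforming formulations are usually measured against; it is not the formulation presented in the talk, and the array geometry, sampling rate, and steering direction are illustrative assumptions.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Steer a compact microphone array toward `direction` and average.

    signals:       (num_mics, num_samples) time-domain recordings
    mic_positions: (num_mics, 3) microphone coordinates in metres
    direction:     (3,) unit vector pointing from the array toward the source
    fs:            sampling rate in Hz
    c:             speed of sound in m/s
    """
    num_mics, num_samples = signals.shape
    # Far-field plane wave: mics closer to the source hear it earlier,
    # so delay each channel to line all of them up.
    delays = mic_positions @ direction / c        # seconds, per microphone
    delays -= delays.min()                        # keep all delays non-negative

    # Apply fractional-sample delays in the frequency domain.
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    shifted = spectra * np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])

    # Average the aligned channels: one FFT/IFFT pair per channel,
    # which keeps the computational load low.
    return np.fft.irfft(shifted.mean(axis=0), n=num_samples)
```

Replacing the plain average with learned or adaptive per-channel weights is the usual next step toward more robust formulations, but that is beyond this sketch.
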
Duration: 1:03:44

Man vs. Machine in Conversational Speech Recognition

Pricing: SPS Members $0.00 | IEEE Members $11.00 | Non-members $15.00

We live in an era where more and more tasks, once thought to be impregnable bastions of human intelligence, succumb to AI. Are we at the cusp where ASR systems have matched expert humans in conversational speech recognition? We try to answer this question with some experimental evidence on the Switchboard English conversational telephony corpus. On the human side, we describe some listening experiments which established a new human performance benchmark. On the ASR side, we discuss a series of deep learning architectures and techniques for acoustic and language modeling that were instrumental in lowering the word error rate to record levels on this task.
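
The human-versus-machine comparison above is expressed in word error rate. As a reference point only (not the evaluation pipeline used in the talk), a minimal WER computation via word-level edit distance looks like this:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("the cat sat", "the cat sat down") == 1/3
```
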
Duration: 0:59:34

Representation, Extraction, and Visualization of Speech Information

Pricing: SPS Members $0.00 | IEEE Members $11.00 | Non-members $15.00

The speech signal is complex and contains a tremendous quantity of diverse information. The first step in extracting this information is to define an efficient representation that can model as much of it as possible and facilitate the extraction process. The i-vector representation is a statistical, data-driven approach to feature extraction that provides an elegant framework for speech classification and identification in general. This representation has become the state of the art in several speech processing tasks and has recently been integrated with deep learning methods. This talk will focus on presenting a variety of applications of the i-vector representation to speech and audio tasks, including speaker profiling, speaker diarization, and speaker health analysis. We will also show the possibility of using this representation to model and visualize the information present in the hidden layers of deep neural networks.
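
To make the classification framework concrete, the snippet below sketches the common cosine-scoring step applied to already-extracted i-vectors in speaker verification. It is not necessarily the method demonstrated in the talk; the extraction itself (training a total-variability model over GMM supervectors) is omitted, and the dimensionalities are illustrative assumptions.

```python
import numpy as np

def length_normalize(ivector):
    """Project an i-vector onto the unit sphere (standard preprocessing)."""
    return ivector / np.linalg.norm(ivector)

def cosine_score(enrollment_ivectors, test_ivector):
    """Score a test utterance against a speaker enrolled from several utterances.

    enrollment_ivectors: (num_utterances, dim) i-vectors of the claimed speaker
    test_ivector:        (dim,) i-vector of the test utterance
    """
    speaker_model = length_normalize(enrollment_ivectors.mean(axis=0))
    test = length_normalize(test_ivector)
    return float(speaker_model @ test)  # higher score => more likely the same speaker

# Illustrative use with random 400-dimensional i-vectors:
rng = np.random.default_rng(0)
enrolled = rng.standard_normal((3, 400))
probe = rng.standard_normal(400)
print(cosine_score(enrolled, probe))
```
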
Duration: 1:02:30

Cognitive Assistance for the Blind

Pricing: SPS Members $0.00 | IEEE Members $11.00 | Non-members $15.00

Computers have been changing the lives of blind people. Voice synthesis technology has improved their educational environment and job opportunities by allowing them to access online services. Now, new AI technologies are reaching the point where computers can help in sensing, recognizing, and understanding our living world, the real world. I will first introduce the concept of a cognitive assistant for the blind, which helps blind and visually impaired people explore their surroundings and enjoy city environments by compensating for the missing visual sense with the power of integrated AI technologies. I will then introduce the latest technologies, including an accurate indoor navigation system and a personal object recognition system, followed by a discussion of the role of the blind in how we can accelerate the advancement of AI technologies.
Duration: 1:03:44

Super Human Speech Analysis? Getting Broader, Deeper and Faster

Pricing: SPS Members $0.00 | IEEE Members $11.00 | Non-members $15.00

Human performance often appears as a glass ceiling when it comes to automatic speech and speaker analysis. In some tasks, however, such as health monitoring, automatic analysis has successfully started to break through this ceiling. The field has by now benefited from more than a decade of deep neural learning approaches such as recurrent LSTM networks and deep RBMs; recently, however, a further major boost has been seen. This includes the injection of convolutional layers for end-to-end learning, as well as active and autoencoder-based transfer learning and generative adversarial network topologies, to better cope with the ever-present bottleneck of severe data scarcity in the field. At the same time, multi-task learning has made it possible to broaden the range of tasks handled in parallel and to account for the uncertainty often present in the gold standard when labels are subjective, such as the emotion or perceived personality of speakers. This talk highlights these and further recent trends, such as increasingly deep networks and the use of deep image networks for speech analysis, on the road to 'holistic' superhuman speech analysis that 'sees the whole picture' of the person behind a voice. At the same time, increasing efficiency is shown for an ever 'bigger' data and increasingly mobile application world that requires fast and resource-aware processing. The exploitation of these methods in ASR and SLU is featured throughout.
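
As a structural illustration of the multi-task idea mentioned above (not the specific architectures from the talk), the following PyTorch sketch shares one recurrent encoder across two heads; the tasks, feature dimension, and layer sizes are assumptions chosen only for readability.

```python
import torch
import torch.nn as nn

class SharedSpeechAnalyzer(nn.Module):
    """Minimal multi-task sketch: one shared encoder, several task heads.

    Input is assumed to be a sequence of acoustic feature frames with shape
    (batch, time, feat_dim); tasks and dimensions are illustrative only.
    """
    def __init__(self, feat_dim=40, hidden=128, num_emotions=4):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.emotion_head = nn.Linear(hidden, num_emotions)   # categorical task
        self.personality_head = nn.Linear(hidden, 5)          # e.g. trait regression

    def forward(self, frames):
        _, (last_hidden, _) = self.encoder(frames)
        utterance = last_hidden[-1]                 # (batch, hidden) summary vector
        return self.emotion_head(utterance), self.personality_head(utterance)

# Training would sum the per-task losses so the shared encoder
# benefits from all available labels at once.
model = SharedSpeechAnalyzer()
emotions, traits = model(torch.randn(8, 200, 40))
```
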
Duration: 1:01:36

Multichannel Raw-Waveform Neural Network Acoustic Models

Pricing: SPS Members $0.00 | IEEE Members $11.00 | Non-members $15.00

Far-field speech recognition has become a popular research area in the past few years, from more research-focused activities such as the CHiME challenges to the launches of Amazon Echo and Google Home. This talk will describe the research efforts around Google Home. Most multichannel ASR systems separate speech enhancement, including localization, beamforming, and postfiltering, from acoustic modeling. In this talk, we will introduce a framework that performs multichannel enhancement jointly with acoustic modeling using deep neural networks. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture that performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single-channel filterbank that computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain.
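
The "multichannel filtering in the first layer" idea can be pictured as an ordinary 1-D convolution whose filters span all microphone channels of the raw waveform at once. The sketch below is only a structural illustration under assumed channel counts and filter sizes, not the factored or adaptive architectures described in the talk.

```python
import torch
import torch.nn as nn

class MultichannelFirstLayer(nn.Module):
    """Learned spatio-temporal filters applied directly to raw waveforms.

    Each output filter covers all microphone channels jointly, so it can learn
    direction-dependent (beamforming-like) behaviour together with a
    frequency decomposition; channel count and filter sizes are assumptions.
    """
    def __init__(self, num_mics=2, num_filters=128, filter_len=400, hop=160):
        super().__init__()
        # Input: (batch, num_mics, num_samples) raw multi-microphone waveform.
        self.filters = nn.Conv1d(num_mics, num_filters, filter_len, stride=hop)

    def forward(self, waveforms):
        filtered = self.filters(waveforms)          # (batch, num_filters, frames)
        # Rectify and log-compress, roughly mimicking a learned filterbank
        # output for the rest of the acoustic model to consume.
        return torch.log1p(torch.relu(filtered))

features = MultichannelFirstLayer()(torch.randn(4, 2, 16000))  # one second at 16 kHz
```

Factoring this single layer into a spatial filtering stage followed by a single-channel filterbank, as the abstract describes, would split the convolution into two smaller ones rather than the joint filter shown here.
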
Duration: 0:52:30
