As commonplace speech-enabled devices are getting smaller and lighter, we are faced with a need for simpler processing and simpler hardware. In this talk I will present some alternative ways to approach multi-channel and single-channel speech enhancement under these constraints. More specifically, I will talk about new ways to formulate beamforming that are numerically more lightweight, and operate best when using physically compact arrays, and then I will discuss single-channel approaches using a deep network which, in addition to imposing a lightweight computational load, are amenable to aggressive hardware optimizations that can result in massive power savings and reductions in hardware footprint.

DOI

https://dx.doi.org/10.17023/weq6-xx32

Duration

1:03:44

Subtitles

✖

Man vs. Machine in Conversational Speech Recognition

View on the SPS Resource Center

Category

Proficiency

Language

Media Type

EDICs

Intended Audience

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

Keywords

Automatic Speech Recognition (ASR)

Switchboard and CallHome datasets

Recurrent neural network

Convolutional neural network

Acoustic modeling

Speaker-adversarial training

Human speech recognition

Authors

George Saon

Date

13 January 2018

We live in an era where more and more tasks, once thought to be impregnable bastions of human intelligence, have succumb to AI. Are we at the cusp where ASR systems have matched expert humans in conversational speech recognition? We try to answer this question with some experimental evidence on the Switchboard English conversational telephony corpus. On the human side, we describe some listening experiments that established a new human performance benchmark. On the ASR side, we discuss a series of deep learning architectures and techniques for acoustic and language modeling that were instrumental in lowering the word error rate to record levels on this task.

DOI

https://dx.doi.org/10.17023/d235-0947

Duration

0:59:34

Subtitles

✖

Representation, Extraction, and Visualization of Speech Information

View on the SPS Resource Center

Category

Proficiency

Language

Media Type

EDICs

Intended Audience

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

Keywords

Speech representation

Gaussian Mixture model

2017 IEEE Automatic Speech Recognition and Understanding Workshop

ASRU 2017

Authors

Najim Dehak

Date

13 January 2018

The speech signal is complex and contains a tremendous quantity of diverse information. The first step of extracting this information is to define an efficient representation that can model as much information as possible and will facilitate the extraction process. The I-vector representation is a statistical data-driven approach for feature extraction, which provides an elegant framework for speech classification and identification in general. This representation became the state-of-the-art in several speech processing tasks and has been recently integrated with deep learning methods. This talk will focus on presenting variety of applications of the I-vector representation for speech and audio tasks including speaker profiling, speaker diarization and speaker health analysis. We will also show the possibility of using this representation to model and visualize information present in deep neural network hidden layers.

DOI

https://dx.doi.org/10.17023/ywf8-zh44

Duration

1:02:30

Subtitles

✖

Cognitive Assistance for the Blind

View on the SPS Resource Center

Category

Proficiency

Language

Media Type

EDICs

Intended Audience

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

Keywords

Assistive technology

Accessibility technology

Screen readers

Visual recognition systems

Personal object recognition

Visually impaired people

Vision impairments

2017 IEEE Automatic Speech Recognition and Understanding Workshop

ASRU 2017

Authors

Chieko Asakawa

Date

13 January 2018

Computers have been changing the lives of the blind people. Voice synthesis technology has improved their educational environment and job opportunities by allowing them to access online services. Now, the new AI technologies are reaching the point where computers can help in sensing, recognizing, and understanding our living world, real-world. I will first introduce the concept of cognitive assistant for the blind, which will help blind and visually impaired to explore surroundings and enjoy city environment by assisting their missing visual sense by the power of integrated AI technologies. I will then introduce the latest technologies including the accurate indoor navigation system and the personal object recognition system, followed by the discussion of the role of the blind - how we can accelerate the advancement of AI technologies.

DOI

https://dx.doi.org/10.17023/ymxx-2e44

Duration

1:03:44

Subtitles

✖

Super-Human Speech Analysis? Getting Broader, Deeper, and Faster

View on the SPS Resource Center

Category

Proficiency

Language

Media Type

EDICs

Intended Audience

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

Keywords

Automatic speech recognition (ASR)

Human-like speech

Speech analysis

Deep recurrent networks

Convolutional neural networks

Multi-task learning

Cross-task labelling

Deep paralinguistics

Speech emotion recognition (SER)

2017 IEEE Automatic Speech Recognition and Understanding Workshop

ASRU 2017

Authors

Bjorn Schuller

Date

13 January 2018

Human performance is often appearing as a glass ceiling when it comes to automatic speech and speaker analysis. In some tasks, such as health monitoring, however, automatic analysis has successfully started to break this ceiling. The field has benefited from more than a decade of deep neural learning approaches such as recurrent LSTM nets and deep RBMs by now; however, recently, a further major boost could be witnessed. This includes the injection of convolutional layers for end-to-end learning, as well as active and autoencoder-based transfer learning and generative adversarial network topologies to better cope with the ever-present bottleneck of severe data scarcity in the field. At the same time, multi-task learning allowed to broaden up on tasks handled in parallel and include the often met uncertainty in the gold standard due to subjective labels such as emotion or perceived personality of speakers. This talk highlights the named and further latest trends such as increasingly deeper nets and the usage of deep image nets for speech analysis on the road to "holistic" superhuman speech analysis "seeing the whole picture" of the person behind a voice. At the same time, increasing efficiency is shown for an ever "bigger" data and increasingly mobile application world that requires fast and resource-aware processing. The exploitation in ASR and SLU is featured throughout.

DOI

https://dx.doi.org/10.17023/d2bm-3019

Duration

1:01:36

Subtitles

✖

Multichannel Raw-Waveform Neural Network Acoustic Models

View on the SPS Resource Center

Category

Proficiency

Language

Media Type

EDICs

Intended Audience

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

Keywords

Automatic speech recognition (ASR)

Far-field automatic speech recognition

Multi-channel signal processing

Neural beamforming architecture

Unfactored raw-waveform architecture

time-frequence representation

Learned filters

Factored raw-waveform model

Neural adaptive beamforming

Factored Complex Linear Prediction

2017 IEEE Automatic Speech Recognition and Understanding Workshop

ASRU 2017

Authors

Tara Sainath

Date

13 January 2018

Farfield speech recognition has become a popular research area in the past few years, from more research focused activities such as the CHiME Challenges, to the launches of Amazon Echo and Google Home. This talk will describe the research efforts around Google Home. Most multichannel ASR systems commonly separate speech enhancement, including localization, beamforming, and postfiltering from acoustic modeling. In this talk, we will introduce a framework to do multichannel enhancement jointly with acoustic modeling using deep neural networks. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture that performs multichannel filtering in the first layer of the network and shows that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single-channel filterbank that computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain.

DOI

https://dx.doi.org/10.17023/zge2-2k82

Duration

0:52:30

Subtitles

✖

Subscribe to SPE-ANLS

Publications & Resources

Conferences & Events

Education & Training

Community & Involvement

Career & Industry

About IEEE SPS

For Volunteers

SPE-ANLS

End-to-End Speech Recognition Systems Explored

Generative Adversarial Network and Its Applications on Speech Signal and Natural Language Processing

Multimedia and Autism

Forty Years of Automatic Speech Recognition (ASR): From Statistical Decision Theory to Deep Learning

Striving for Computational and Physical Efficiency in Speech Enhancement

Man vs. Machine in Conversational Speech Recognition

Representation, Extraction, and Visualization of Speech Information

Cognitive Assistance for the Blind

Super-Human Speech Analysis? Getting Broader, Deeper, and Faster

Multichannel Raw-Waveform Neural Network Acoustic Models

IEEE Signal Processing Society on

Publications & Resources

Conferences & Events

Education & Training

Community & Involvement

About IEEE SPS

For Volunteers

Career & Industry

Education & Training