
Super Human Speech Analysis? Getting Broader, Deeper and Faster

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

Human performance often appears to be a glass ceiling for automatic speech and speaker analysis. In some tasks, such as health monitoring, however, automatic analysis has begun to break through this ceiling. The field has now benefited from more than a decade of deep neural learning approaches such as recurrent LSTM networks and deep RBMs, and it has recently seen a further major boost. This includes the injection of convolutional layers for end-to-end learning, as well as active and autoencoder-based transfer learning and generative adversarial network topologies to better cope with the ever-present bottleneck of severe data scarcity in the field. At the same time, multi-task learning has made it possible to broaden the range of tasks handled in parallel and to model the uncertainty frequently found in the gold standard due to subjective labels such as the emotion or perceived personality of speakers (a minimal multi-task sketch follows this entry). This talk highlights these and further recent trends, such as increasingly deep networks and the use of deep image networks for speech analysis, on the road to 'holistic', superhuman speech analysis that 'sees the whole picture' of the person behind a voice. It also shows how efficiency can be increased for an ever 'bigger'-data and increasingly mobile application world that requires fast, resource-aware processing. Applications to ASR and SLU are featured throughout.
Duration
1:01:36
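
As a rough illustration of the multi-task learning idea mentioned in the abstract above, here is a minimal PyTorch sketch: a shared speech encoder with two task heads, one for categorical emotion trained on soft (rater-distribution) targets to reflect gold-standard uncertainty, and one for speaker-trait scores. The layer sizes, task names, and loss weighting are assumptions for illustration, not details from the talk.

import torch
import torch.nn as nn

class MultiTaskSpeechNet(nn.Module):
    """Hypothetical shared encoder with two task-specific heads."""
    def __init__(self, n_mels=40, hidden=128, n_emotions=4, n_traits=5):
        super().__init__()
        # Shared front-end over log-mel frames shaped (batch, time, n_mels).
        self.conv = nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        # Task-specific heads: emotion classification and trait regression.
        self.emotion_head = nn.Linear(hidden, n_emotions)
        self.trait_head = nn.Linear(hidden, n_traits)

    def forward(self, x):
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.rnn(h)
        h = h.mean(dim=1)  # utterance-level pooling
        return self.emotion_head(h), self.trait_head(h)

model = MultiTaskSpeechNet()
features = torch.randn(8, 300, 40)           # 8 utterances, 300 frames (dummy data)
emotion_logits, trait_scores = model(features)

# Soft emotion targets encode rater disagreement, one simple way to represent
# the subjective-label uncertainty mentioned in the abstract.
soft_emotion_targets = torch.softmax(torch.randn(8, 4), dim=-1)
trait_targets = torch.randn(8, 5)
loss = (-(soft_emotion_targets * emotion_logits.log_softmax(-1)).sum(-1).mean()
        + 0.5 * nn.functional.mse_loss(trait_scores, trait_targets))
loss.backward()                               # one joint multi-task update step

Because both heads share the encoder, gradients from either task shape the same representation; the 0.5 weight on the trait loss is an arbitrary choice that would normally be tuned.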

Multichannel Raw-Waveform Neural Network Acoustic Models

Pricing

SPS Members $0.00
IEEE Members $11.00
Non-members $15.00

Far-field speech recognition has become a popular research area in the past few years, spanning research-focused activities such as the CHiME challenges as well as the launches of Amazon Echo and Google Home. This talk will describe the research efforts around Google Home. Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming, and postfiltering, from acoustic modeling. In this talk, we will introduce a framework that performs multichannel enhancement jointly with acoustic modeling using deep neural networks. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture that performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction (a minimal sketch of such a first layer follows this entry). Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single-channel filterbank that computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain.
Duration
0:52:30
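
The abstract above describes performing multichannel filtering on the raw waveform in the first layer of the network. Below is a minimal PyTorch sketch of that idea in its unfactored form: one FIR filter per microphone and per "look direction", summed across microphones as in filter-and-sum beamforming, followed by rectification, frame-level max pooling, and log compression. The 2-microphone, 10-filter setup and the filter and frame lengths are assumptions, not the configuration used in the talk.

import torch
import torch.nn as nn

class MultichannelWaveformFrontEnd(nn.Module):
    """Hypothetical raw-waveform first layer for a multichannel acoustic model."""
    def __init__(self, n_mics=2, n_lookdirs=10, filter_len=400,
                 frame_len=400, frame_shift=160):
        super().__init__()
        self.frame_len, self.frame_shift = frame_len, frame_shift
        # One time-domain filter per (look direction, microphone); the sum over
        # input channels inside Conv1d realises filter-and-sum beamforming.
        self.spatial = nn.Conv1d(n_mics, n_lookdirs, kernel_size=filter_len,
                                 bias=False)

    def forward(self, waveform):
        # waveform: (batch, n_mics, samples) of raw multichannel audio
        y = torch.relu(self.spatial(waveform))                    # (B, P, samples')
        # Pool each look direction's output within short frames and log-compress,
        # giving a frame-level feature map for the acoustic-model layers on top.
        frames = y.unfold(2, self.frame_len, self.frame_shift)    # (B, P, T, frame_len)
        return torch.log(frames.max(dim=-1).values + 1e-6)        # (B, P, T)

front_end = MultichannelWaveformFrontEnd()
audio = torch.randn(4, 2, 16000)     # 4 utterances, 2 mics, 1 s at 16 kHz (dummy)
features = front_end(audio)          # frame-level features for the acoustic model

The factored and adaptive variants mentioned in the abstract respectively split this single convolution into a short spatial-filtering stage followed by a single-channel filterbank, and predict the spatial filter coefficients anew at each frame; by the convolution theorem, the same filtering can be carried out in the frequency domain as element-wise multiplications, which is the more efficient implementation the abstract refers to.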
