IEEE Transactions on Audio, Speech and Language Processing

Deep neural networks (DNNs) have been proven to be powerful models for acoustic scene classification tasks. State-of-the-art DNNs have millions of connections and are computationally intensive, making them difficult to deploy on systems with limited resources.

Tailoring an Interpretable Neural Language Model

Neural networks have shown great potential in language modeling. Currently, the dominant approach to language modeling is based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Nonetheless, it is not clear why RNNs and CNNs are suitable for the language modeling task, since these neural models lack interpretability.

Robust Joint Estimation of Multimicrophone Signal Model Parameters

One of the biggest challenges in multimicrophone applications is the estimation of the parameters of the signal model, such as the power spectral densities (PSDs) of the sources, the early (relative) acoustic transfer functions of the sources with respect to the microphones, the PSD of late reverberation, and the PSDs of microphone self-noise.

Low Resource Keyword Search With Synthesized Crosslingual Exemplars

The transfer of acoustic data across languages has been shown to improve keyword search (KWS) performance in data-scarce settings. In this paper, we propose a way of performing this transfer that reduces the impact of the prevalence of out-of-vocabulary (OOV) terms on KWS in such a setting.

Subjective and Objective Assessment of Monaural and Binaural Aspects of Audio Quality

Recently, the binaural auditory-model-based quality prediction (BAM-Q) was successfully applied to predict binaural audio quality degradations, while the generalized power-spectrum model for quality (GPSM^q) has been demonstrated to account for a large variety of monaural signal distortions.

A Geometric Model for Prediction of Spatial Aliasing in 2.5D Sound Field Synthesis

The avoidance of spatial aliasing is a major challenge in the practical implementation of sound field synthesis. Such methods aim at a physically accurate reconstruction of a desired sound field inside a target region using a finite ensemble of loudspeakers. In the past, different theoretical treatises of the inherent spatial sampling process led to anti-aliasing criteria for simple loudspeaker array arrangements, e.g., lines and circles, and fundamental sound fields, e.g., plane and spherical waves. Many criteria were independent of the listener's position inside the target region.

GlotNet—A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis

Recently, generative neural network models which operate directly on raw audio, such as WaveNet, have improved the state of the art in text-to-speech synthesis (TTS). Moreover, there is increasing interest in using these models as statistical vocoders for generating speech waveforms from various acoustic features. However, there is also a need to reduce the model complexity, without compromising the synthesis quality.

Late Reverberation Cancellation Using Bayesian Estimation of Multi-Channel Linear Predictors and Student's t-Source Prior

Multi-channel linear prediction (MCLP) can model the late reverberation in the short-time Fourier transform domain using a delayed linear predictor, and the prediction residual is taken as the desired early reflection component. Traditionally, a Gaussian source model with time-dependent precision (inverse of variance) is considered for the desired signal. In this paper, we propose a Student's t-distribution model for the desired signal, which is realized as a Gaussian source with a Gamma-distributed precision.
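
The equivalence between a Student's t-distribution and a Gaussian whose precision is Gamma distributed can be checked numerically. The sketch below is only an illustration of that general identity, not the method of the paper: it assumes a real-valued, zero-mean, unit-scale variable (the paper works in the short-time Fourier transform domain), and the Gamma shape and rate of nu/2 are the textbook parameterization, which the abstract does not specify. The names `sample_student_t` and `sample_variance` are hypothetical helpers.

```python
import math
import random


def sample_student_t(nu, n, seed=0):
    """Draw n samples via a Gaussian scale mixture (assumed textbook form)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Precision ~ Gamma(shape=nu/2, rate=nu/2).
        # random.gammavariate takes (shape, scale), so scale = 2/nu.
        precision = rng.gammavariate(nu / 2.0, 2.0 / nu)
        # Conditioned on the precision, the sample is Gaussian
        # with variance 1/precision.
        samples.append(rng.gauss(0.0, 1.0 / math.sqrt(precision)))
    return samples


def sample_variance(xs):
    """Unbiased sample variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


nu = 5.0  # degrees of freedom, chosen only for illustration
xs = sample_student_t(nu, 200_000)
empirical = sample_variance(xs)
theoretical = nu / (nu - 2.0)  # variance of a Student's t for nu > 2
print(f"empirical variance:   {empirical:.3f}")
print(f"theoretical variance: {theoretical:.3f}")
```

With enough samples the two variances should agree closely, which is consistent with the marginal of the mixture being a Student's t with `nu` degrees of freedom. Smaller values of `nu` give heavier tails than a single Gaussian.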

Sound Event Detection in the DCASE 2017 Challenge