TASLPRO Articles

Disentangling Prosody Representations With Unsupervised Speech Reconstruction

Human speech can be characterized by different components, including semantic content, speaker identity, and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in speech recognition and speaker verification tasks, respectively. However, extracting prosodic information remains an open challenge, both because different attributes such as timbre and rhythm are intrinsically associated and because robust speech recognition requires supervised training schemes.

Speech Dereverberation With Frequency Domain Autoregressive Modeling

TASLP Volume 32 | 2024

Speech applications in far-field, real-world settings often deal with signals corrupted by reverberation. Dereverberation is an important step for improving audible quality and reducing error rates in applications such as automatic speech recognition (ASR). We propose a unified framework for speech dereverberation that improves both speech quality and ASR performance using the envelope-carrier decomposition provided by an autoregressive (AR) model.
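To make the idea of an AR envelope estimated in the frequency domain concrete, here is a minimal sketch of frequency domain linear prediction: take the DCT of a frame, fit an AR model to the DCT coefficients, and read the model's power response as a temporal envelope. This is a generic illustration under stated assumptions, not the framework proposed in the paper; the function names, the unnormalised DCT-II, and the model order are all choices made for this example.

```python
import math

def dct2(x):
    """Unnormalised DCT-II, O(N^2); adequate for a short illustrative frame."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (2 * t + 1) * k / (2 * n)) for t in range(n))
            for k in range(n)]

def levinson(r, order):
    """Levinson-Durbin recursion; returns (a, err) with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def fdlp_envelope(x, order):
    """AR estimate of the temporal envelope of x (hypothetical helper)."""
    n = len(x)
    c = dct2(x)
    # Autocorrelation of the DCT sequence (autocorrelation method of LPC).
    r = [sum(c[k] * c[k + lag] for k in range(n - lag)) for lag in range(order + 1)]
    a, err = levinson(r, order)
    env = []
    for t in range(n):
        # Time index t maps to the "frequency" pi * (2t + 1) / (2n) of the AR model.
        w = math.pi * (2 * t + 1) / (2 * n)
        re = sum(a[k] * math.cos(w * k) for k in range(order + 1))
        im = -sum(a[k] * math.sin(w * k) for k in range(order + 1))
        env.append(err / (re * re + im * im))
    return env

# Toy frame: a single click at sample 20 in a 64-sample frame.
frame = [0.0] * 64
frame[20] = 1.0
envelope = fdlp_envelope(frame, order=2)
peak = max(range(len(envelope)), key=envelope.__getitem__)
```

In this toy case the envelope peaks near the click. A carrier can then be obtained by dividing the frame by the square root of the envelope, which is one simple way to realise an envelope-carrier split.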

Operation-Augmented Numerical Reasoning for Question Answering

TASLP Volume 32 | 2024

Question answering that requires numerical reasoning, which generally involves symbolic operations such as sorting, counting, and addition, is a challenging task. To address this problem, existing mixture-of-experts (MoE)-based methods design several specific answer predictors to handle different question types and achieve promising performance. However, they neither model nor exploit the fine-grained reasoning-related operations that support numerical reasoning, which limits both their reasoning capability and their interpretability.

Statistical Analysis for Speaker Recognition Evaluation With Data Dependence and Three Score Distributions

TASLP Volume 32 | 2024

Speaker recognition evaluation is conducted in a framework that employs three score distributions and two decision thresholds. The statistic of interest is the average of two weighted sums of the probabilities of type I and type II errors, one sum at each threshold. Data dependence is ubiquitous because, with limited resources, the same subjects are used multiple times to generate more samples.
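The statistic described above can be sketched as follows. This is only an illustration under assumptions: the excerpt does not say how the three score distributions enter the error probabilities, so the sketch pools all non-target scores into one list, treats the two error types as miss and false-alarm rates, and uses made-up weights and thresholds. It also ignores the data dependence that the paper analyses.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Empirical miss and false-alarm rates at one decision threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa

def averaged_cost(target_scores, nontarget_scores, thresholds, weights):
    """Average, over the thresholds, of a weighted sum of the two error rates.

    weights[i] is a (w_miss, w_fa) pair used at thresholds[i]; both are
    hypothetical inputs chosen for this sketch.
    """
    costs = []
    for threshold, (w_miss, w_fa) in zip(thresholds, weights):
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, threshold)
        costs.append(w_miss * p_miss + w_fa * p_fa)
    return sum(costs) / len(costs)

# Made-up scores, thresholds, and weights, for illustration only.
targets = [2.0, 3.0, 1.0, 4.0]
nontargets = [0.0, 1.5, -1.0, 2.5]
cost = averaged_cost(targets, nontargets,
                     thresholds=(1.2, 2.2),
                     weights=((1.0, 2.0), (1.0, 4.0)))
```

Because the same subjects contribute many scores, the trials are not independent, so a naive standard error for this statistic would need adjusting.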

Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features

TASLP Volume 31 | 2023

This study proposes a cross-domain multi-objective speech assessment model, called MOSA-Net, which can simultaneously estimate the speech quality, intelligibility, and distortion assessment scores of an input speech signal. MOSA-Net comprises a convolutional neural network and bidirectional long short-term memory architecture for representation extraction, followed by a multiplicative attention layer and a fully connected layer for predicting each assessment metric. Additionally, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learning (SSL) models are used as inputs, combining rich acoustic information to obtain more accurate assessments.

A Diffeomorphic Flow-Based Variational Framework for Multi-Speaker Emotion Conversion

TASLP Volume 31 | 2023

This paper introduces a new framework for non-parallel emotion conversion in speech. Our framework is based on two key contributions. First, we propose a stochastic version of the popular Cycle-GAN model. Our modified loss function introduces a Kullback–Leibler (KL) divergence term that aligns the source and target data distributions learned by the generators, thus overcoming the limitations of sample-wise generation. By using a variational approximation to this stochastic loss function, we show that our KL divergence term can be implemented via a paired density discriminator.

Integrating Lattice-Free MMI Into End-to-End Speech Recognition

TASLP Volume 31 | 2023

In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, adopting discriminative criteria is a promising way to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems. However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is only used in system training, which creates a mismatch between training and decoding.

Decoupling Speaker-Independent Emotions for Voice Conversion via Source-Filter Networks

TASLP Volume 31 | 2023

Emotional voice conversion (VC) aims to convert a neutral voice to an emotional one while retaining the linguistic information and speaker identity. We note that decoupling emotional features from other speech information (such as content and speaker identity) is the key to achieving promising performance. Some recent approaches to speech representation decoupling that were developed for neutral speech do not work well on emotional speech, because acoustic properties are more complexly entangled in the latter.

Clean vs. Overlapped Speech-Music Detection Using Harmonic-Percussive Features and Multi-Task Learning

TASLP Volume 31 | 2023

Detection of speech and music signals in isolated and overlapped conditions is an essential preprocessing step for many audio applications. Speech signals have wavy and continuous harmonics, while music signals exhibit horizontally linear and discontinuous harmonic patterns. Music signals also contain more percussive components than speech signals, manifested as vertical striations in the spectrograms.
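The horizontal-versus-vertical distinction above is what median-filtering harmonic-percussive separation exploits. The sketch below shows that general technique on a toy spectrogram; it is swapped in for illustration and is not claimed to be the feature extraction used in the paper. The function names, the filter width, and the tie-breaking rule are assumptions made for this example.

```python
from statistics import median

def median_filter_1d(values, width):
    """Sliding median with windows truncated at the edges."""
    half = width // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]

def harmonic_mask(spec, width=3):
    """spec[f][t] is a magnitude spectrogram (rows are frequency bins).

    Returns mask[f][t], True where the harmonic estimate is at least as
    large as the percussive estimate (ties go to harmonic by assumption).
    """
    n_freq, n_time = len(spec), len(spec[0])
    # A median along time keeps horizontal ridges (harmonics).
    harm = [median_filter_1d(row, width) for row in spec]
    # A median along frequency keeps vertical striations (percussive onsets).
    perc_cols = [median_filter_1d([spec[f][t] for f in range(n_freq)], width)
                 for t in range(n_time)]
    return [[harm[f][t] >= perc_cols[t][f] for t in range(n_time)]
            for f in range(n_freq)]

# Toy spectrogram: one horizontal ridge (row 2) crossing one vertical line (column 2).
toy = [[1.0 if (f == 2 or t == 2) else 0.0 for t in range(5)] for f in range(5)]
mask = harmonic_mask(toy)
```

On the toy input, bins along the horizontal ridge are marked harmonic, while bins on the vertical line away from the ridge are marked percussive.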

Self-Attending RNN for Speech Enhancement to Improve Cross-Corpus Generalization

TASLP Volume 30 | 2022

Deep neural networks (DNNs) represent the mainstream methodology for supervised speech enhancement, primarily due to their capability to model complex functions using hierarchical representations. However, a recent study revealed that DNNs trained on a single corpus fail to generalize to untrained corpora, especially in low signal-to-noise ratio (SNR) conditions.
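For readers unfamiliar with the SNR conditions mentioned above, the following sketch shows how SNR in decibels is commonly computed and how a noise signal can be scaled to hit a target SNR when building a test mixture. The helper names and the toy signals are assumptions for this example and are not taken from the paper.

```python
import math

def power(x):
    """Mean squared amplitude of a signal given as a list of samples."""
    return sum(v * v for v in x) / len(x)

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(power(speech) / power(noise))

def scale_noise_to_snr(speech, noise, target_snr_db):
    """Return a scaled copy of noise so that snr_db(speech, result) hits the target."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (target_snr_db / 10)))
    return [gain * v for v in noise]

# Toy signals; a low-SNR mixture at -5 dB means the noise is stronger than the speech.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, -0.5, 0.5, 0.5]
scaled = scale_noise_to_snr(speech, noise, target_snr_db=-5.0)
mixture = [s + n for s, n in zip(speech, scaled)]
```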
