IEEE Transactions on Audio, Speech and Language Processing

Adaptive Multimodal Graph Integration Network for Multimodal Sentiment Analysis

TASLPRO Volume 33 | 2025

Most current models for analyzing multimodal sequences disregard the imbalanced contributions of individual modal representations caused by varying information densities, as well as the inherent multi-relational interactions across distinct modalities. This can foster a biased understanding of the intricate interplay among modalities, limiting prediction accuracy and effectiveness.

Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception

TASLPRO Volume 33 | 2025

Audio and visual signals complement each other in human speech perception, and the same applies to automatic speech recognition. The visual signal is less evident than the acoustic signal, but more robust in a complex acoustic environment, as far as speech perception is concerned.

Memory-Tuning: A Unified Parameter-Efficient Tuning Method for Pre-Trained Language Models

TASLPRO Volume 33 | 2025

Conventional fine-tuning encounters increasing difficulties given the size of current pre-trained language models, which has made parameter-efficient tuning a focal point of frontier research. Recent advances in this field include unified tuning methods that aim to tune the representations of both multi-head attention (MHA) and the fully connected feed-forward network (FFN) simultaneously, but they rely on existing tuning methods and do not explicitly model domain knowledge for downstream tasks.

Disentangling Prosody Representations With Unsupervised Speech Reconstruction

TASLP Volume 32 | 2024

TASLPRO Articles

Human speech can be characterized by different components, including semantic content, speaker identity, and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in speech recognition and speaker verification tasks, respectively. However, extracting prosodic information remains an open and challenging question because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust speech recognition.

Speech Dereverberation With Frequency Domain Autoregressive Modeling

TASLP Volume 32 | 2024

Speech applications in far-field real world settings often deal with signals that are corrupted by reverberation. The task of dereverberation constitutes an important step to improve the audible quality and to reduce the error rates in applications like automatic speech recognition (ASR). We propose a unified framework of speech dereverberation for improving the speech quality and the ASR performance using the approach of envelope-carrier decomposition provided by an autoregressive (AR) model.
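As a rough illustration of the autoregressive (AR) modeling step mentioned above, the sketch below estimates AR coefficients from autocorrelation values using the standard Levinson-Durbin recursion. It is a generic textbook routine, not the paper's pipeline: the abstract does not say how the AR model is fitted or how the envelope and carrier are separated, and the function name `levinson_durbin` and the example values are assumptions chosen for illustration.

```python
def levinson_durbin(r, order):
    """Estimate AR coefficients from autocorrelation values r[0..order].

    Returns (a, err), where the predictor is
        x[n] ~ sum(a[k] * x[n - 1 - k] for k in range(order))
    and err is the remaining prediction error power.
    Assumes r[0] > 0 (illustrative sketch, no input validation).
    """
    a = []
    err = r[0]
    for m in range(order):
        # Reflection coefficient for model order m + 1.
        acc = r[m + 1] - sum(a[k] * r[m - k] for k in range(m))
        k_m = acc / err
        # Update the existing coefficients and append the new one.
        a = [a[k] - k_m * a[m - 1 - k] for k in range(m)] + [k_m]
        # Each stage shrinks the prediction error by (1 - k_m^2).
        err *= 1.0 - k_m * k_m
    return a, err


# Hypothetical autocorrelation of a first-order AR process with coefficient 0.5.
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For an autocorrelation sequence of this form, a second-order fit should recover the single true coefficient and assign roughly zero to the extra one.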

Operation-Augmented Numerical Reasoning for Question Answering

TASLP Volume 32 | 2024

Question answering that requires numerical reasoning, which generally involves symbolic operations such as sorting, counting, and addition, is a challenging task. To address this problem, existing mixture-of-experts (MoE)-based methods design several specific answer predictors to handle different types of questions and achieve promising performance. However, they ignore the modeling and exploitation of fine-grained reasoning-related operations to support numerical reasoning, which leaves them inadequate in reasoning capability and interpretability.

Statistical Analysis for Speaker Recognition Evaluation With Data Dependence and Three Score Distributions

TASLP Volume 32 | 2024

The speaker recognition evaluation is conducted in a framework in which three score distributions and two decision thresholds are employed, and the statistic of interest is an average of the two weighted sums of the probabilities of type I and type II errors at the two thresholds, respectively. Data dependence is ubiquitous because, with limited resources, the same subjects are used multiple times to generate more samples.
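The shape of such a statistic can be sketched as follows: at each threshold, compute a weighted sum of the two error rates, then average across thresholds. This is a simplified illustration only. It uses two score lists (target and non-target) rather than the three score distributions the abstract mentions, it ignores data dependence, and the weights, thresholds, scores, and function names are all hypothetical.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (miss rate, false-alarm rate); scores >= threshold are accepted."""
    misses = sum(1 for s in target_scores if s < threshold)
    false_alarms = sum(1 for s in nontarget_scores if s >= threshold)
    return misses / len(target_scores), false_alarms / len(nontarget_scores)


def averaged_cost(target_scores, nontarget_scores, thresholds, weights):
    """Average over thresholds of w_miss * P_miss + w_fa * P_fa.

    weights holds one (w_miss, w_fa) pair per threshold (assumed values).
    """
    costs = []
    for threshold, (w_miss, w_fa) in zip(thresholds, weights):
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, threshold)
        costs.append(w_miss * p_miss + w_fa * p_fa)
    return sum(costs) / len(costs)


# Hypothetical scores, thresholds, and weights, for illustration only.
targets = [2.0, 1.0, 0.5, 3.0]
nontargets = [0.0, 0.6, -1.0, 1.5]
cost = averaged_cost(targets, nontargets,
                     thresholds=[0.55, 1.2],
                     weights=[(1.0, 1.0), (1.0, 2.0)])
```

Treating the scores as independent samples, as this sketch does, is exactly the assumption that reuse of the same subjects calls into question.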

Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features

TASLP Volume 31 | 2023

This study proposes a cross-domain multi-objective speech assessment model, called MOSA-Net, which can simultaneously estimate the speech quality, intelligibility, and distortion assessment scores of an input speech signal. MOSA-Net comprises a convolutional neural network and bidirectional long short-term memory architecture for representation extraction, and a multiplicative attention layer and a fully connected layer for each assessment metric prediction. Additionally, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learned (SSL) models are used as inputs to combine rich acoustic information to obtain more accurate assessments.

A Diffeomorphic Flow-Based Variational Framework for Multi-Speaker Emotion Conversion

TASLP Volume 31 | 2023

This paper introduces a new framework for non-parallel emotion conversion in speech. Our framework is based on two key contributions. First, we propose a stochastic version of the popular Cycle-GAN model. Our modified loss function introduces a Kullback–Leibler (KL) divergence term that aligns the source and target data distributions learned by the generators, thus overcoming the limitations of sample-wise generation. By using a variational approximation to this stochastic loss function, we show that our KL divergence term can be implemented via a paired density discriminator.
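For readers unfamiliar with the KL divergence, the sketch below computes the textbook quantity for two discrete distributions. It is not the paper's loss: the abstract describes aligning learned data distributions through a variational approximation and a paired density discriminator, whereas this example only evaluates the basic definition on hypothetical probability lists.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # a zero-probability outcome under p contributes nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical distributions, for illustration only.
forward = kl_divergence([0.9, 0.1], [0.5, 0.5])
reverse = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

The two directions generally differ, so the divergence is not a symmetric distance, and it is zero only when the two distributions match.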

Integrating Lattice-Free MMI Into End-to-End Speech Recognition

TASLP Volume 31 | 2023

In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, the adoption of discriminative criteria is promising to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems. However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is only used in system training, which creates a mismatch between training and decoding;
