A Diffeomorphic Flow-Based Variational Framework for Multi-Speaker Emotion Conversion

This paper introduces a new framework for non-parallel emotion conversion in speech. Our framework is based on two key contributions. First, we propose a stochastic version of the popular Cycle-GAN model. Our modified loss function introduces a Kullback–Leibler (KL) divergence term that aligns the source and target data distributions learned by the generators, thus overcoming the limitations of sample-wise generation. By using a variational approximation to this stochastic loss function, we show that our KL divergence term can be implemented via a paired density discriminator.
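As a point of reference for the KL divergence term mentioned above, the following is a minimal sketch of the quantity itself for two discrete distributions. It is not the paper's loss or its paired density discriminator; the function name `kl_divergence` and the example distributions are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) in nats for two discrete distributions.

    Illustrative helper (not from the paper): p and q are equal-length
    lists of probabilities that each sum to 1.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical source and target distributions over three bins.
source = [0.7, 0.2, 0.1]
target = [0.5, 0.3, 0.2]
print(kl_divergence(source, target))
print(kl_divergence(source, source))  # identical distributions give 0.0
```

The divergence is zero only when the two distributions match, which is why minimizing it pulls the learned source and target distributions together.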

Integrating Lattice-Free MMI Into End-to-End Speech Recognition

In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, adopting discriminative criteria is a promising way to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayesian risk (MBR) criterion, one of the discriminative criteria, into E2E ASR systems. However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is only used in system training, which creates a mismatch between training and decoding;

Decoupling Speaker-Independent Emotions for Voice Conversion via Source-Filter Networks

Emotional voice conversion (VC) aims to convert a neutral voice to an emotional one while retaining the linguistic information and speaker identity. We note that decoupling emotional features from other speech information (such as content and speaker identity) is the key to achieving promising performance. Some recent attempts at speech representation decoupling that were designed for neutral speech do not work well on emotional speech, because acoustic properties are more complexly entangled in the latter.

Clean vs. Overlapped Speech-Music Detection Using Harmonic-Percussive Features and Multi-Task Learning

Detection of speech and music signals in isolated and overlapped conditions is an essential preprocessing step for many audio applications. Speech signals have wavy and continuous harmonics, while music signals exhibit horizontally linear and discontinuous harmonic patterns. Music signals also contain more percussive components than speech signals, manifested as vertical striations in the spectrograms.

ET: Edge-Enhanced Transformer for Image Splicing Detection

A key challenge in image splicing detection is how to localize tampered regions in their entirety without false alarms. Although current forgery detection approaches have achieved promising performance, the integrality of the localized regions and the false alarm rate are often overlooked. In this paper, we argue that insufficient use of the splicing boundary is a main reason for poor accuracy. To tackle this problem, we propose an Edge-enhanced Transformer (ET) for tampered region localization. Specifically, to capture rich tampering traces, a two-branch edge-aware transformer is built to integrate splicing edge clues into the forgery localization network, generating forgery features and edge features.

Learn to Zoom in Single Image Super-Resolution

In this letter, we propose a novel solution to the problem of single image super-resolution at multiple scaling factors, with a single network architecture. In applications where only a detail needs to be super-resolved, traditional solutions must use as input either the low-resolution detail, thus losing the information about the context, or the whole low-resolution image, and then crop the desired output detail, which is wasteful in terms of computation and storage.

Spatial Diversity in Radar Detection via Active Reconfigurable Intelligent Surfaces

Active reconfigurable intelligent surfaces (RISs) are a novel and promising technology that allows controlling the radio propagation environment while compensating for the product path loss along the RIS-assisted path. In this letter, we consider the classical radar detection problem and propose to use an active RIS to get a second independent look at a prospective target illuminated by the radar transmitter.

False Discovery Rate (FDR) and Familywise Error Rate (FER) Rules for Model Selection in Signal Processing Applications

Model selection is an omnipresent problem in signal processing applications. The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are the most commonly used solutions to this problem. These criteria have been found to perform satisfactorily in many cases and have had a dominant role in the model selection literature since their introduction several decades ago, despite numerous attempts to dethrone them. Model selection can be viewed as a multiple hypothesis testing problem.
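For context, the two baseline criteria named above can be sketched as follows, using their standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of parameters, n the number of samples, and L the maximized likelihood. This illustrates AIC and BIC only, not the FDR or FER rules of the paper; the function names and the log-likelihood values are hypothetical.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * num_params - 2.0 * log_likelihood


def bic(log_likelihood, num_params, num_samples):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return num_params * math.log(num_samples) - 2.0 * log_likelihood


def select_order(log_likelihoods, num_samples, criterion="bic"):
    """Pick the model order with the smallest criterion value.

    log_likelihoods maps a candidate parameter count k to the maximized
    log-likelihood of the model with k parameters (illustrative interface).
    """
    scores = {}
    for k, ll in log_likelihoods.items():
        if criterion == "aic":
            scores[k] = aic(ll, k)
        else:
            scores[k] = bic(ll, k, num_samples)
    return min(scores, key=scores.get)


# Hypothetical maximized log-likelihoods for nested models fitted to n = 50 samples.
candidates = {1: -120.0, 2: -105.0, 3: -103.5, 4: -103.0}
print(select_order(candidates, 50, "aic"))  # AIC favors the 3-parameter model here
print(select_order(candidates, 50, "bic"))  # BIC's heavier penalty favors 2 parameters
```

On these made-up numbers the two criteria disagree, which shows how the choice of penalty term alone can change the selected model.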

Natural Thresholding Algorithms for Signal Recovery With Sparsity

Algorithms based on the technique of optimal k-thresholding (OT) were recently proposed for signal recovery, and they differ substantially from the traditional family of hard thresholding methods. However, the computational cost of OT-based algorithms remains high at the current stage of their development. This motivates the development of the so-called natural thresholding (NT) algorithm and its variants in this paper. The family of NT algorithms is developed through a first-order approximation of the so-called regularized optimal k-thresholding model, and thus the computational cost of this family of algorithms is significantly lower than that of the OT-based algorithms.
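To make the baseline concrete, here is a minimal sketch of the traditional hard thresholding operator that the abstract contrasts with OT and NT: it keeps the k largest-magnitude entries of a vector and zeroes the rest. This is not the NT algorithm, whose details are not given here; the function name `hard_threshold` is an illustrative assumption.

```python
def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and set the others to 0.

    Illustrative helper: x is a list of floats, k a non-negative integer.
    Ties are broken in favor of the earlier index.
    """
    if k <= 0:
        return [0.0] * len(x)
    # Indices sorted by decreasing magnitude; the sort is stable, so equal
    # magnitudes keep their original order.
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


# Hypothetical vector with two dominant entries.
signal = [0.5, -3.0, 0.1, 2.0]
print(hard_threshold(signal, 2))
```

Hard thresholding selects entries by magnitude alone, without checking how the selection affects the residual; that is the limitation that thresholding rules based on an optimization model aim to address.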