A Diffeomorphic Flow-Based Variational Framework for Multi-Speaker Emotion Conversion

By: Ravi Shankar; Hsi-Wei Hsieh; Nicolas Charon; Archana Venkataraman, Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA

This paper introduces a new framework for non-parallel emotion conversion in speech. Our framework is based on two key contributions. First, we propose a stochastic version of the popular Cycle-GAN model. Our modified loss function introduces a Kullback–Leibler (KL) divergence term that aligns the source and target data distributions learned by the generators, thus overcoming the limitations of sample-wise generation. By using a variational approximation to this stochastic loss function, we show that our KL divergence term can be implemented via a paired density discriminator. We term this new architecture a variational Cycle-GAN (VCGAN). Second, we model the prosodic features of the target emotion as a smooth and learnable deformation of the source prosodic features. This approach provides implicit regularization whose key advantage is better range alignment for unseen and out-of-distribution speakers. We conduct rigorous experiments and comparative studies to demonstrate that our proposed framework achieves robust, high performance compared with several state-of-the-art baselines.
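To make the first contribution concrete, below is a minimal PyTorch-style sketch of such an objective. It assumes, hypothetically, that the KL term is approximated with the standard density-ratio trick: a discriminator over (source, converted) pairs is trained separately to distinguish the two generation directions, and its mean logit then serves as a Monte-Carlo estimate of the KL term. All names (G_xy, D_pair, lam_kl, etc.) are illustrative and not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_loss(disc, fake):
    # Generator side of the GAN game: push the discriminator toward "real" on fakes.
    logits = disc(fake)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

def paired_kl_estimate(D_pair, src, converted):
    # Density-ratio trick: if D_pair is trained (elsewhere, with a BCE loss) to score
    # pairs from one generation direction high and the other low, its mean logit on
    # (src, converted) pairs is a Monte-Carlo estimate of a KL divergence between
    # the two paired distributions.
    return D_pair(torch.cat([src, converted], dim=-1)).mean()

def vcgan_generator_loss(G_xy, G_yx, D_x, D_y, D_pair, x, y,
                         lam_cyc=10.0, lam_kl=1.0):
    fake_y, fake_x = G_xy(x), G_yx(y)            # convert in both directions
    adv = adversarial_loss(D_y, fake_y) + adversarial_loss(D_x, fake_x)
    cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)   # cycle consistency
    kl  = paired_kl_estimate(D_pair, x, fake_y)                     # distribution-level term
    return adv + lam_cyc * cyc + lam_kl * kl
```

The distribution-level KL estimate is what separates this sketch from a vanilla Cycle-GAN objective, whose adversarial and cycle-consistency terms operate purely sample-wise.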

Speech is perhaps the primary mode of human communication. It is a rich medium, in the sense that semantic information and speaker intent are intertwined in a complex manner. The ability to convey emotion is an important yet poorly understood attribute of speech. Much of the work in speech analysis focuses on decomposing the signal into compact representations and probing their relative importance in imparting one emotion versus another. These representations can be broadly categorized into two groups: acoustic features and prosodic features. Acoustic features (e.g., the spectrum) control resonance and speaker identity. Prosodic features (e.g., F0, energy contour) are linked to vocal inflections such as the relative pitch, duration, and intensity of each phoneme. Together, the prosodic features encode stress, intonation, and rhythm, all of which impact emotion perception. For example, expressions of anger often exhibit large variations in pitch, coupled with increases in both articulation rate and signal energy. In this paper, we develop an automated framework to transform an utterance from one emotional class to another. This problem, known as emotion conversion, is an important stepping stone to affective speech synthesis.
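As a concrete illustration of the prosodic contours mentioned above, the following short sketch extracts an F0 contour and a frame-wise energy contour with librosa. The file path, sample rate, and hop size are placeholders, and this is not the paper's feature pipeline.

```python
import librosa

def prosodic_contours(wav_path, hop_length=256):
    y, sr = librosa.load(wav_path, sr=16000)
    # F0 contour via probabilistic YIN; unvoiced frames come back as NaN
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
        sr=sr, hop_length=hop_length)
    # Energy contour as frame-wise RMS
    energy = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    return f0, voiced_flag, energy
```

In a conversion setting, contours such as these are the prosodic quantities that a learned deformation would act on, while spectral (acoustic) features are handled separately.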
