A Diffeomorphic Flow-Based Variational Framework for Multi-Speaker Emotion Conversion

By: Ravi Shankar; Hsi-Wei Hsieh; Nicolas Charon; Archana Venkataraman, Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA

This paper introduces a new framework for non-parallel emotion conversion in speech. Our framework is based on two key contributions. First, we propose a stochastic version of the popular Cycle-GAN model. Our modified loss function introduces a Kullback–Leibler (KL) divergence term that aligns the source and target data distributions learned by the generators, thus overcoming the limitations of sample-wise generation. By using a variational approximation to this stochastic loss function, we show that our KL divergence term can be implemented via a paired density discriminator. We term this new architecture a variational Cycle-GAN (VCGAN). Second, we model the prosodic features of the target emotion as a smooth and learnable deformation of the source prosodic features. This approach provides implicit regularization whose key advantage is better range alignment for unseen and out-of-distribution speakers. We conduct rigorous experiments and comparative studies to demonstrate that our proposed framework achieves robust, high performance compared with several state-of-the-art baselines.
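To make the first contribution concrete, below is a minimal PyTorch-style sketch of such an objective. It assumes, hypothetically, that the KL term is approximated with the standard density-ratio trick: a discriminator over (source, converted) pairs is trained separately to distinguish the two generation directions, and its mean logit then serves as a Monte-Carlo estimate of the KL term. All names (G_xy, D_pair, lam_kl, etc.) are illustrative and not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_loss(disc, fake):
    # Generator side of the GAN game: push the discriminator toward "real" on fakes.
    logits = disc(fake)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

def paired_kl_estimate(D_pair, src, converted):
    # Density-ratio trick: if D_pair is trained (elsewhere, with a BCE loss) to score
    # pairs from one generation direction high and the other low, its mean logit on
    # (src, converted) pairs is a Monte-Carlo estimate of a KL divergence between
    # the two paired distributions.
    return D_pair(torch.cat([src, converted], dim=-1)).mean()

def vcgan_generator_loss(G_xy, G_yx, D_x, D_y, D_pair, x, y,
                         lam_cyc=10.0, lam_kl=1.0):
    fake_y, fake_x = G_xy(x), G_yx(y)            # convert in both directions
    adv = adversarial_loss(D_y, fake_y) + adversarial_loss(D_x, fake_x)
    cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)   # cycle consistency
    kl  = paired_kl_estimate(D_pair, x, fake_y)                     # distribution-level term
    return adv + lam_cyc * cyc + lam_kl * kl
```

The distribution-level KL estimate is what separates this sketch from a vanilla Cycle-GAN objective, whose adversarial and cycle-consistency terms operate purely sample-wise.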

Speech is perhaps the primary mode of human communication. It is a rich medium, in the sense that semantic information and speaker intent are intertwined in a complex manner. The ability to convey emotion is an important yet poorly understood attribute of speech. Much of the work in speech analysis focuses on decomposing the signal into compact representations and probing their relative importance in imparting one emotion versus another. These representations can be broadly categorized into two groups: acoustic features and prosodic features. Acoustic features (e.g., the spectrum) control resonance and speaker identity. Prosodic features (e.g., F0, energy contour) are linked to vocal inflections such as the relative pitch, duration, and intensity of each phoneme. Together, the prosodic features encode stress, intonation, and rhythm, all of which impact emotion perception. For example, expressions of anger often exhibit large variations in pitch, coupled with increases in both articulation rate and signal energy. In this paper, we develop an automated framework to transform an utterance from one emotional class to another. This problem, known as emotion conversion, is an important stepping stone to affective speech synthesis.
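As a concrete illustration of the prosodic contours mentioned above, the following short sketch extracts an F0 contour and a frame-wise energy contour with librosa. The file path, sample rate, and hop size are placeholders, and this is not the paper's feature pipeline.

```python
import librosa

def prosodic_contours(wav_path, hop_length=256):
    y, sr = librosa.load(wav_path, sr=16000)
    # F0 contour via probabilistic YIN; unvoiced frames come back as NaN
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
        sr=sr, hop_length=hop_length)
    # Energy contour as frame-wise RMS
    energy = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    return f0, voiced_flag, energy
```

In a conversion setting, contours such as these are the prosodic quantities that a learned deformation would act on, while spectral (acoustic) features are handled separately.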
