Disentangling Prosody Representations With Unsupervised Speech Reconstruction

TASLP Volume 32 | 2024

By:

Leyuan Qu; Taihao Li; Cornelius Weber; Theresa Pekarek-Rosin; Fuji Ren; Stefan Wermter

Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in speech recognition and speaker verification tasks, respectively. However, extracting prosodic information remains an open and challenging question, both because different attributes, such as timbre and rhythm, are intrinsically associated and because robust speech recognition requires supervised training schemes. The aim of this article is to address the disentanglement of emotional prosody based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed model, Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model to generate speaker identity embeddings, and (3) a trainable prosody encoder to learn prosody representations. We first pretrain Prosody2Vec on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective and subjective evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary and beneficial for the performance of widely used speech pretraining models, and that combining Prosody2Vec with HuBERT representations surpasses state-of-the-art methods. Audio samples can be found on our demo website.
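The three-branch structure described in the abstract can be illustrated with a toy data-flow sketch. This is not the authors' implementation: every function name, dimension, and stand-in computation below is a hypothetical chosen for illustration. In particular, the speaker branch is replaced by a simple time average rather than a real pretrained speaker verification model, the prosody branch is a single linear projection, and the decoder that would reconstruct speech from the combined features is omitted.

```python
import random

def nearest_unit(frame, codebook):
    """Index of the codebook vector closest to `frame` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, codebook[k])),
    )

def unit_encoder(frames, codebook):
    """(1) Content branch: map every frame to a discrete unit id."""
    return [nearest_unit(frame, codebook) for frame in frames]

def speaker_encoder(frames):
    """(2) Speaker branch: ASSUMED stand-in for a frozen pretrained speaker
    verification model. Here it is just the per-dimension mean over time."""
    return [sum(col) / len(frames) for col in zip(*frames)]

def prosody_encoder(frames, weights):
    """(3) Prosody branch: the only part with trainable parameters (`weights`).
    ASSUMED toy design: project the mean absolute frame-to-frame change."""
    deltas = [
        sum(abs(b - a) for a, b in zip(col, col[1:])) / (len(frames) - 1)
        for col in zip(*frames)
    ]
    return [sum(w * d for w, d in zip(row, deltas)) for row in weights]

def decoder_inputs(units, unit_table, speaker_vec, prosody_vec):
    """Concatenate, per frame, the unit embedding with the two utterance-level
    vectors. A reconstruction decoder (not shown) would consume these rows."""
    return [unit_table[u] + speaker_vec + prosody_vec for u in units]

# Toy utterance: 5 frames of 4-dimensional features (random placeholders).
random.seed(0)
frames = [[random.random() for _ in range(4)] for _ in range(5)]
codebook = [[random.random() for _ in range(4)] for _ in range(3)]    # 3 units
unit_table = [[random.random() for _ in range(2)] for _ in range(3)]  # unit embeddings
weights = [[random.random() for _ in range(4)] for _ in range(3)]     # trainable

units = unit_encoder(frames, codebook)
rows = decoder_inputs(
    units, unit_table, speaker_encoder(frames), prosody_encoder(frames, weights)
)
print(len(rows), len(rows[0]))  # number of frames, combined feature size
```

The point of the sketch is the separation of roles: content is reduced to discrete unit ids, speaker identity comes from a fixed model, and only the prosody branch carries parameters that training would update.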



