Pricing
SPS Members: $0.00
IEEE Members: $11.00
Non-members: $15.00

A person’s speech is strongly conditioned by their own articulators and the language(s) they speak. Rendering speech across speakers or across languages from a source speaker’s data, collected in that speaker’s native language, is therefore both academically challenging and desirable for technology and applications. The quality of the rendered speech is assessed along three dimensions: naturalness, intelligibility, and similarity to the source speaker. Usually, the three criteria cannot all be met when rendering is done in both cross-speaker and cross-language ways. We analyze the key factors of rendering quality objectively in both the acoustic and phonetic domains. Monolingual speech databases recorded by different speakers, as well as bilingual databases recorded by the same speaker(s), are used. Measures in the acoustic and phonetic spaces are adopted to quantify naturalness, intelligibility, and the speaker’s timbre objectively. Our “trajectory tiling” algorithm-based cross-lingual TTS is used as the baseline system for comparison. To equalize speaker differences automatically, a speaker-independently trained DNN-based ASR acoustic model is used. Kullback-Leibler divergence is proposed to statistically measure the phonetic similarity between any two given speech segments, whether from different speakers or different languages, in order to select good rendering candidates. Demos of voice conversion, speaker-adaptive TTS, and cross-lingual TTS will be shown across speakers, across languages, or both. The implications of this research for low-resource speech research, speaker adaptation, the “average speaker’s voice,” accented/dialectal speech processing, speech-to-speech translation, audio-visual TTS, etc. will be discussed.
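The abstract proposes Kullback-Leibler divergence over phonetic representations as a similarity measure for selecting rendering candidates. The talk does not specify the exact computation; the following is a minimal sketch assuming each speech frame is represented by a discrete phone-posterior vector (e.g., from a speaker-independent DNN acoustic model), and using a symmetrized KLD as the distance, since plain KLD is asymmetric. All function and variable names here are illustrative, not from the talk.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions such as phone posteriors.

    A small epsilon is added and the vectors are renormalized to
    avoid log(0) and division by zero.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def symmetric_kld(p, q):
    """Symmetrized KLD: KL(p||q) + KL(q||p).

    Usable as a phonetic distance between two frames or segments,
    regardless of which speaker or language each comes from.
    """
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical phone-posterior vectors for two speech frames:
frame_a = [0.70, 0.20, 0.10]
frame_b = [0.60, 0.30, 0.10]

# Lower values indicate more similar phonetic content, so a
# candidate selector would prefer segments minimizing this distance.
distance = symmetric_kld(frame_a, frame_b)
```

For segment-level comparison, such frame-wise distances would typically be accumulated along an alignment of the two segments; the symmetrization ensures the measure does not depend on which segment is treated as the reference.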
Duration: 1:01:08
Crossing Speaker and Language Barriers in Speech Processing