BLTRCNN-Based 3-D Articulatory Movement Prediction: Learning Articulatory Synchronicity From Both Text and Audio Inputs

IEEE Transactions on Multimedia

By: Lingyun Yu; Jun Yu; Qiang Ling

Predicting articulatory movements from audio or text has diverse applications, such as speech visualization. Various approaches have been proposed to solve the acoustic-articulatory mapping problem. However, their precision remains limited when only acoustic features are available. Recently, deep neural networks (DNNs) have achieved tremendous success in various fields, such as speech recognition and image processing. To increase prediction accuracy, we propose a new network architecture for articulatory movement prediction with both text and audio inputs, called the bottleneck long-term recurrent convolutional neural network (BLTRCNN). To the best of our knowledge, this is the first DNN-based method to predict articulatory movements by fusing text and audio inputs. Our BLTRCNN consists of two networks. The first is the bottleneck network, which generates compact bottleneck features from text information for each frame independently. The second, comprising a convolutional neural network, long short-term memory (LSTM) units, and skip connections, is called the long-term recurrent convolutional neural network (LTRCNN). The LTRCNN predicts articulatory movements from the integrated bottleneck, acoustic, and text features. Experiments show that the proposed BLTRCNN achieves a state-of-the-art root-mean-square error (RMSE) of 0.528 mm and a correlation coefficient of 0.961. Moreover, we demonstrate how text information complements acoustic features in this prediction task.
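The abstract describes a two-network design: a bottleneck network that compresses per-frame text features, and an LTRCNN (convolution + LSTM + skip connection) that maps the fused bottleneck, acoustic, and text features to articulator positions. Below is a minimal PyTorch sketch of how such a fusion architecture might be wired up. All layer widths, feature dimensions, and the output size (here 8 sensors x 3 coordinates = 24) are hypothetical placeholders, not values from the paper.

```python
# Minimal sketch of a BLTRCNN-style fusion model; dimensions are illustrative.
import torch
import torch.nn as nn


class BottleneckNet(nn.Module):
    """Maps per-frame text features to compact bottleneck features."""

    def __init__(self, text_dim=100, bottleneck_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim), nn.ReLU(),  # bottleneck layer
        )

    def forward(self, text_feats):            # (batch, frames, text_dim)
        return self.net(text_feats)           # (batch, frames, bottleneck_dim)


class LTRCNN(nn.Module):
    """Conv1d front end + LSTM, with a skip connection around both."""

    def __init__(self, in_dim, hidden=256, out_dim=24):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True,
                            bidirectional=True)
        # Skip connection: the fused input bypasses the conv/LSTM stack
        # and is concatenated before the output projection.
        self.out = nn.Linear(2 * hidden + in_dim, out_dim)

    def forward(self, fused):                 # (batch, frames, in_dim)
        h = self.conv(fused.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        return self.out(torch.cat([h, fused], dim=-1))


class BLTRCNN(nn.Module):
    """Fuses acoustic, text, and bottleneck features for prediction."""

    def __init__(self, audio_dim=39, text_dim=100, bottleneck_dim=32):
        super().__init__()
        self.bottleneck = BottleneckNet(text_dim, bottleneck_dim)
        self.ltrcnn = LTRCNN(audio_dim + text_dim + bottleneck_dim)

    def forward(self, audio_feats, text_feats):
        bn = self.bottleneck(text_feats)
        fused = torch.cat([audio_feats, text_feats, bn], dim=-1)
        return self.ltrcnn(fused)             # (batch, frames, 24) positions


if __name__ == "__main__":
    model = BLTRCNN()
    audio = torch.randn(2, 100, 39)   # e.g. MFCC-style acoustic frames
    text = torch.randn(2, 100, 100)   # per-frame text/linguistic features
    print(model(audio, text).shape)   # torch.Size([2, 100, 24])
```

One natural design choice this sketch illustrates: because the bottleneck features are computed per frame independently, all temporal modeling of articulatory synchronicity is left to the convolutional and recurrent layers of the LTRCNN.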
