BLTRCNN-Based 3-D Articulatory Movement Prediction: Learning Articulatory Synchronicity From Both Text and Audio Inputs

You are here

Top Reasons to Join SPS Today!

1. IEEE Signal Processing Magazine
2. Signal Processing Digital Library*
3. Inside Signal Processing Newsletter
4. SPS Resource Center
5. Career advancement & recognition
6. Discounts on conferences and publications
7. Professional networking
8. Communities for students, young professionals, and women
9. Volunteer opportunities
10. Coming soon! PDH/CEU credits
Click here to learn more.

BLTRCNN-Based 3-D Articulatory Movement Prediction: Learning Articulatory Synchronicity From Both Text and Audio Inputs

Lingyun Yu; Jun Yu; Qiang Ling

Predicting articulatory movements from audio or text has diverse applications, such as speech visualization. Various approaches have been proposed to solve the acoustic-articulatory mapping problem. However, their precision is not high enough with only acoustic features available. Recently, deep neural network (DNN) has brought tremendous success in various fields, like speech recognition and image processing. To increase the accuracy, we propose a new network architecture for articulatory movement prediction with both text and audio inputs, called a bottleneck long-term recurrent convolutional neural network (BLTRCNN). To the best of our knowledge, it is the first time to predict articulatory movements based on DNN by fusing text and audio inputs. Our BLTRCNN consists of two networks. The first is the bottleneck network, generating a compact bottleneck features of text information for each frame independently. The second, including convolutional neural network, long short-term memory and skip connection, is called the long-term recurrent convolutional neural network (LTRCNN). LTRCNN is used for articulatory movement prediction when bottleneck features, acoustic features, and text features are integrated as inputs together. Experiments show that the proposed BLTRCNN achieves the state-of-the-art root-mean-square error (RMSE) 0.528 mm and the correlation coefficient 0.961. Moreover, we also demonstrate how text information complements acoustic features in this prediction task.

SPS on Twitter

  • THIS FRIDAY: Join our Vice President-Membership, K.V.S. Hari, and Membership Development Committee Chair, Arash Moh…
  • The SPACE webinar series continues tomorrow, Tuesday, 11 August at 11 AM ET with Dr. Xiao Xiang Zhu presenting "Dat…
  • now accepting submissions for special sessions, tutorials, and papers! The conference is set for June 2…
  • DEADLINE EXTENDED: The IEEE Journal of Selected Topics in Signal Processing is now accepting papers for a Special I…
  • NEW WEBINAR: Join us on Friday, 14 August at 11:00 AM ET for the 2021 SPS Membership Preview! Society leadership wi…

SPS Videos

Signal Processing in Home Assistants


Multimedia Forensics

Careers in Signal Processing             


Under the Radar