Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception
Audio and visual signals complement each other in human speech perception, and the same holds for automatic speech recognition. For speech perception, the visual signal is less evident than the acoustic signal but more robust in complex acoustic environments.
