Deep-learning-based audio-visual speech enhancement

Thursday, 5 January, 2023

By:

Dr. Daniel Michelsanti

Contributed by Dr. Daniel Michelsanti, based on the IEEE Xplore® article "An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation," published in the IEEE/ACM Transactions on Audio, Speech, and Language Processing in March 2021, and the SPS webinar "Audio-visual Speech Enhancement and Separation Based on Deep Learning," available on the SPS Resource Center.

We have all experienced the discomfort of trying to talk with friends at a cocktail party or in a pub with loud background music. In difficult acoustic scenarios like these, we tend to rely on visual cues, such as the speaker's lip and mouth movements, to understand the speech of interest. Visual information is, in fact, essentially unaffected by acoustic background noise. Designing an automatic system that can effectively extract the speech of interest from both acoustic and visual information is a challenging task that can benefit several applications.

Applications

Audio-visual speech enhancement and separation systems can be particularly useful in a range of different applications.

When using a videoconference system, users might be speaking from noisy environments (such as a cafe or a hall with talkers in the background). Adopting a speech enhancement method to suppress the background noise would benefit the communication among the users.

Audio-visual speech enhancement and separation may also be important for noise reduction in video post-production or in live videos (consider, for example, the scenario where a news correspondent is speaking from a busy square).

In the future, audio-visual speech enhancement systems could also be used in hearing aid applications, where multimodal wearable devices connected to a hearing instrument could improve its noise reduction capabilities.

Terminology

Let x[n] and d[n] indicate the clean speech of interest and an additive noise signal, respectively, where n denotes a discrete-time index. The observed acoustic signal can then be modelled as y[n] = x[n] + d[n]. The task of determining an estimate x̂[n] of x[n] given y[n] is known as audio-only speech enhancement. When a visual signal, generally consisting of video frames capturing the mouth region of the target speaker, is also provided as input to the system, the task is called audio-visual speech enhancement. If the acoustic signal y[n] is not accessible, the task of estimating the target speech signal solely from visual information of the speaker is known as speech reconstruction from silent videos.
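
As a minimal sketch of the additive model y[n] = x[n] + d[n], the snippet below scales a noise signal so that the mixture has a chosen signal-to-noise ratio (SNR). The function name `mix_at_snr`, the sine wave standing in for clean speech, and the 5 dB SNR are illustrative assumptions, not part of the original article.

```python
import numpy as np


def mix_at_snr(x, d, snr_db):
    """Return (y, scaled_noise) with y = x + g * d, where the gain g is
    chosen so that the power ratio between x and g * d equals snr_db."""
    x = np.asarray(x, dtype=np.float64)
    d = np.asarray(d, dtype=np.float64)
    if x.shape != d.shape:
        raise ValueError("x and d must have the same shape")
    power_x = np.mean(x ** 2)
    power_d = np.mean(d ** 2)
    if power_x == 0.0 or power_d == 0.0:
        raise ValueError("x and d must both have non-zero power")
    # SNR = 10 * log10(power_x / (g**2 * power_d)), solved for g
    gain = np.sqrt(power_x / (power_d * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * d
    return x + scaled_noise, scaled_noise


# Toy signals (assumptions): a 220 Hz sine stands in for clean speech x[n],
# white Gaussian noise stands in for d[n], both one second long at 16 kHz.
rng = np.random.default_rng(0)
n = np.arange(16000)
x = np.sin(2.0 * np.pi * 220.0 * n / 16000.0)
d = rng.standard_normal(n.size)

y, d_scaled = mix_at_snr(x, d, snr_db=5.0)
measured = 10.0 * np.log10(np.mean(x ** 2) / np.mean(d_scaled ** 2))
print(f"measured SNR of the mixture: {measured:.2f} dB")
```

A speech enhancement system would receive only y (plus the video, in the audio-visual case) and would be asked to recover x.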

Sometimes, the observed acoustic signal is a mixture of several speech signals from different speakers. The task of extracting each of these speech signals from the acoustic mixture and visual information of the speakers is known as audio-visual speech separation.

A representation of the difference between the aforementioned tasks is shown in Figure 1.

Figure 1. Difference between audio-visual speech enhancement, speech reconstruction from silent videos and audio-visual speech separation. [1]

System Elements

Audio-visual speech enhancement and separation have recently been addressed with deep learning methods. In most cases, a supervised learning framework is adopted: a deep learning model is trained to find a mapping between pairs of degraded and clean speech signals, together with the video of the speakers. During inference, an audio-visual signal is used as input to the system, and an estimated clean speech signal (or multiple clean speech signals in the case of speech separation) is provided as output. Deep-learning-based audio-visual speech enhancement and separation systems usually consist of six main elements: acoustic features; visual features; deep learning methods; fusion techniques; training targets; objective functions.

Figure 2 shows how these elements are interconnected in a typical audio-visual system. For further details, please refer to [2].

Figure 2. Main elements of a speech enhancement and separation system.
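
To make the six elements concrete, here is a toy NumPy sketch that walks through them once, without any training. Every choice in it is an assumption made for illustration: magnitude spectrograms as acoustic features, random vectors standing in for mouth-region embeddings, concatenation as the fusion technique, a single untrained linear layer as the "model", one common form of the ideal ratio mask as training target, and mean squared error as objective function. It is not the architecture of any specific system in [2].

```python
import numpy as np

rng = np.random.default_rng(0)


def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Magnitudes of a Hann-windowed short-time Fourier transform,
    returned with shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))


# Toy one-second signals at 16 kHz (assumptions, not real speech).
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2.0 * np.pi * 300.0 * t)
noise = 0.5 * rng.standard_normal(fs)
noisy = clean + noise

# 1) Acoustic features: magnitude spectrogram of the noisy signal.
A = magnitude_spectrogram(noisy)
S = magnitude_spectrogram(clean)
N = magnitude_spectrogram(noise)

# 2) Visual features: placeholder 8-dimensional embeddings at 25 frames/s.
V = rng.standard_normal((25, 8))
# Repeat video frames so that each audio frame has a visual vector.
T = A.shape[0]
idx = np.minimum((np.arange(T) * V.shape[0]) // T, V.shape[0] - 1)
V_up = V[idx]

# 3) Fusion technique: concatenate the two modalities per time frame.
fused = np.concatenate([A, V_up], axis=1)

# 4) Deep learning method: stand-in single linear layer with a sigmoid.
W = 0.01 * rng.standard_normal((fused.shape[1], A.shape[1]))
b = np.zeros(A.shape[1])
mask_pred = 1.0 / (1.0 + np.exp(-(fused @ W + b)))

# 5) Training target: one common form of the ideal ratio mask.
irm = S ** 2 / (S ** 2 + N ** 2 + 1e-8)

# 6) Objective function: mean squared error between the two masks.
loss = np.mean((mask_pred - irm) ** 2)

# At inference, the predicted mask would be applied to the noisy magnitudes.
enhanced_mag = mask_pred * A
print(fused.shape, mask_pred.shape, float(loss))
```

In a real system, step 4 would be a trained neural network and the loss in step 6 would drive its parameter updates; the sketch only shows how the pieces connect.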

Audio-Visual Speech Datasets

Datasets have a central role in the development of deep-learning-based audio-visual speech enhancement and separation systems. Most audio-visual speech datasets contain clean speech signals of several speakers, which are then used to create synthetic mixtures with overlapping speakers and/or acoustic background noise.
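
The sketch below shows one way such a synthetic training example with overlapping speakers could be assembled from clean utterances. The helper `make_separation_example`, the zero-padding of shorter utterances, and the peak rescaling are illustrative assumptions; actual datasets and papers use their own mixing recipes.

```python
import numpy as np


def make_separation_example(utterances, noise=None):
    """Sum clean utterances (and optional noise) into a single mixture.

    Shorter utterances are zero-padded to the longest one. Returns the
    mixture and an array of the padded clean targets, one row per speaker.
    Both are rescaled by the same factor if the mixture peak exceeds 1.
    """
    length = max(len(u) for u in utterances)
    targets = np.zeros((len(utterances), length))
    for i, u in enumerate(utterances):
        targets[i, :len(u)] = np.asarray(u, dtype=np.float64)
    mixture = targets.sum(axis=0)
    if noise is not None:
        noise = np.asarray(noise, dtype=np.float64)
        if len(noise) < length:
            raise ValueError("noise must be at least as long as the mixture")
        mixture = mixture + noise[:length]
    peak = np.max(np.abs(mixture))
    scale = 1.0 if peak <= 1.0 else 1.0 / peak
    return mixture * scale, targets * scale


# Toy stand-ins for two clean utterances of different lengths (assumptions).
fs = 16000
speaker_a = 0.6 * np.sin(2.0 * np.pi * 200.0 * np.arange(fs) / fs)
speaker_b = 0.6 * np.sin(2.0 * np.pi * 330.0 * np.arange(12000) / fs)

mixture, targets = make_separation_example([speaker_a, speaker_b])
print(mixture.shape, targets.shape)
```

In an audio-visual dataset, each row of `targets` would be paired with the video of the corresponding speaker.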

The datasets can be divided into two categories: data collected in controlled environments and data gathered in the wild. For data collected in controlled environments, we usually have complete control of the experimental setup in terms of recording equipment, speakers (language, gender, etc.), environment, and linguistic content. Such data is well suited to investigating specific topics, such as the effect of different viewing angles on system performance, the impact of the Lombard effect, and how well a system preserves a speaker's emotion through the enhancement or separation process. Datasets in the wild, on the other hand, are characterised by a wide variety of speakers, sentences, languages, and auditory/visual environments. They are particularly suitable for training large deep learning models that generalise well to real-world situations.

Figure 3 shows some of the datasets that can be used for audio-visual speech enhancement and separation. A more detailed list can be found in https://github.com/danmic/av-se.

Figure 3. Non-exhaustive list of datasets collected in controlled environments and datasets in the wild.

Performance Assessment

The performance of speech enhancement and separation approaches is usually assessed in terms of two perceptual aspects: speech quality, which consists of all the attributes concerning how a speaker produces an utterance, and speech intelligibility, which is related to what a speaker says.

Ideally, subjective listening tests with a panel of subjects are conducted to assess the speech quality and intelligibility performance of a system. However, such tests are often costly and time-consuming. Therefore, objective measures are used to estimate the results of listening tests.
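
As an illustration of what an objective measure looks like in code, the sketch below computes the scale-invariant signal-to-distortion ratio (SI-SDR), a measure often reported in the speech separation literature. Its choice here is an assumption: the article does not say which measures appear in Figure 4. The mean removal and the small `eps` constant are implementation conveniences, and the toy signals are not real speech.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if estimate.shape != reference.shape:
        raise ValueError("estimate and reference must have the same shape")
    # Remove the mean so that a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the target component.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


# Toy check (assumption): a sine as reference and two noisy estimates of it.
rng = np.random.default_rng(0)
reference = np.sin(2.0 * np.pi * 220.0 * np.arange(16000) / 16000.0)
noise = rng.standard_normal(reference.size)
good_estimate = reference + 0.1 * noise
poor_estimate = reference + 1.0 * noise

score_good = si_sdr(good_estimate, reference)
score_poor = si_sdr(poor_estimate, reference)
score_scaled = si_sdr(0.5 * good_estimate, reference)
print(score_good > score_poor)
```

Rescaling the estimate leaves the score essentially unchanged, which is the property the name refers to. Perceptually motivated measures of quality and intelligibility are considerably more involved and are best taken from their reference implementations.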

Some of the most used evaluation methods are reported in Figure 4. A more detailed list can be found in https://github.com/danmic/av-se.

Figure 4. Non-exhaustive list of evaluation methods for speech enhancement and separation.

References:

[1] D. Michelsanti. (2021). Audio-Visual Speech Enhancement Based on Deep Learning. Aalborg Universitetsforlag. Ph.d.-serien for Det Tekniske Fakultet for IT og Design, Aalborg Universitet. https://doi.org/10.54337/aau422930170.

[2] D. Michelsanti, Z.-H. Tan, S.-X. Zhang, Y. Xu, M. Yu, D. Yu, & J. Jensen. (2021). An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 1368-1396. https://doi.org/10.1109/TASLP.2021.3066303.
