
Multimodal Information Based Speech Processing (MISP) 2023 Challenge: ICASSP 2024

Speech-enabled systems often suffer performance degradation in real-world scenarios, primarily because of adverse acoustic conditions and interactions among multiple speakers. Improving front-end speech processing is vital for improving the performance of back-end systems. However, most existing front-end techniques rely solely on the audio modality and have reached performance plateaus. Building on the observation that visual cues aid human speech perception, the Multimodal Information Based Speech Processing (MISP) 2023 Challenge focuses on the Audio-Visual Target Speaker Extraction (AVTSE) problem, which aims to extract the target speaker’s speech from mixtures containing multiple speakers and background noise. The MISP 2023 Challenge addresses this problem explicitly in a real scenario with a complex acoustic environment. It provides a benchmark dataset collected in home TV environments, reflecting the challenges of such settings. In addition, to explore the impact of AVTSE on the back-end task, we use a pre-trained speech recognition model to evaluate the performance of AVTSE systems.

Visit the Challenge website for details and more information!