Project Details

Audio-visual Speech Enhancement for Spatial Audio under Adverse Conditions

Subject Area Communication Technology and Networks, High-Frequency Technology and Photonic Systems, Signal Processing and Machine Learning for Information Technology
Term since 2025
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 554605289
 
Speech enhancement in real-world environments has been a significant research topic for decades. Recent advances in wearable devices bring new opportunities and challenges: acoustic environments are increasingly complex, with multiple speakers and background noise, and virtual and augmented reality applications create a growing need for spatial audio output. New opportunities arise from the availability of multi-channel audio recordings with video and from advances in data-driven processing techniques. However, two major gaps remain in the state of the art in speech enhancement: performance degrades significantly at very low signal-to-noise ratios (SNRs), and spatial cues are often not preserved in the processed audio signals.

This project aims to address these critical gaps by developing algorithms that recover speech at very low SNRs through the integration of visual information. Additionally, we will focus on preserving spatial information through dedicated spatial processing. Despite rapid progress in audio-visual (AV) speech enhancement and spatial audio, existing approaches fail to deliver high-quality binaural signals from multi-channel AV input. Specifically, current methods lack reliable performance at low SNRs and lack a comprehensive approach that incorporates both intelligibility and spatial perception throughout the design and evaluation stages.

The primary aim of this project is to develop generative and discriminative methods for AV speech enhancement. We will systematically evaluate, compare, and optimize these methods to ensure the preservation of spatial information, speech intelligibility, and robustness at low SNRs. Expected contributions include a publicly available database of AV data, comprising single- and multi-channel audio annotated with speaker activity and identity, together with reference transcripts for speech intelligibility assessment.
We will also develop and provide openly available source code for generative and discriminative multi-channel AV speech enhancement, applicable to arbitrary arrays. Comprehensive performance evaluations of the new algorithms under various acoustic conditions will be conducted using instrumental metrics, with comparisons to existing baselines. Additionally, we will organize a machine learning challenge on multi-channel AV speech enhancement to foster reproducible research in this area. Finally, we will study the effectiveness of discriminative versus generative models in AV speech enhancement, highlighting interactions between factors affecting intelligibility and spatial perception and providing insights into listener preferences across multiple dimensions. By addressing these gaps, our research will advance the field of speech enhancement, particularly in challenging real-world environments, and support the development of natural and intelligible audio experiences in virtual and augmented reality applications.
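As a point of reference for the low-SNR regimes discussed above, the SNR of a noisy mixture quantifies the power ratio between the clean speech and the additive noise on a logarithmic (dB) scale. The following minimal sketch illustrates this definition; the function name and the synthetic-signal setup are our own illustration and are not part of the project's codebase.

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """SNR in dB between a clean speech signal and the additive
    noise component of a mixture: 10 * log10(P_speech / P_noise)."""
    speech_power = np.mean(speech.astype(np.float64) ** 2)
    noise_power = np.mean(noise.astype(np.float64) ** 2)
    return 10.0 * np.log10(speech_power / noise_power)

# Illustration: noise with ~10x the power of the (surrogate) speech
# yields an SNR of roughly -10 dB, a "very low SNR" regime.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                  # unit-power surrogate signal
noise = np.sqrt(10.0) * rng.standard_normal(16000)   # ~10x power noise
print(f"{snr_db(speech, noise):.1f} dB")
```

At such SNRs the noise dominates the mixture energy, which is why the project turns to visual information as an additional, noise-independent cue.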
DFG Programme Research Grants
International Connection Israel
International Co-Applicant Professor Dr. Boaz Rafaely