
Automatic Transcription of Conversations

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing, Communication Technology and Networks, High-Frequency Technology and Photonic Systems, Signal Processing and Machine Learning for Information Technology
Term since 2021
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 448568305
 
Multi-talker conversational speech recognition is concerned with transcribing distant-microphone audio recordings of formal meetings or informal get-togethers into machine-readable form. Current solutions are far from reaching human performance. The difficulty of the task can be attributed to three factors. First, the recording conditions are challenging: the speech signal captured by distant microphones is noisy and reverberant and often contains nonstationary acoustic distortions, which makes it hard to decode. Second, a significant percentage of the time contains overlapped speech, where multiple speakers talk at the same time. Finally, the interaction dynamics of the scenario are challenging, because speakers articulate themselves intermittently, with alternating segments of speech inactivity, single-talker speech, and multi-talker speech.

We aim to develop a transcription system that can operate on input of arbitrary length, correctly handles segments of overlapped as well as non-overlapped speech, and consistently transcribes the speech of different speakers into separate output streams. Existing approaches that use separately trained subsystems for diarization, separation, and recognition fall far short of human performance; we believe the missing piece is a formulation that encapsulates all aspects of meeting transcription and allows a joint approach to be designed under a single optimization criterion. This project aims at such a coherent formulation. We are going to develop an integrated solution for multi-talker conversational speech recognition with distant microphones, where the number of simultaneously active speakers and the extent of speaker overlap are unknown and time-varying. The tasks of diarization, enhancement, and recognition will be addressed under a common objective function, ultimately approaching end-to-end training.
Complementarily, we intend to avoid premature decisions, with end-to-end recognition as the goal. We will derive and consider different architectures (all-neural, hybrid, cascaded, integrated) and assess them with respect to diarization and transcription performance, the interpretability of intermediate results, and the ease of use of the corresponding recipes.
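One standard building block for assigning output streams to speakers when their correspondence is unknown (as in overlapped speech) is a permutation-invariant loss: the training criterion is evaluated under every possible output-to-speaker assignment and the cheapest one is used. The sketch below is purely illustrative and not the project's actual method; the function name and loss values are hypothetical.

```python
from itertools import permutations

def pit_loss(pairwise_loss, num_speakers):
    """Permutation-invariant training (PIT) criterion, illustrative sketch.

    pairwise_loss[i][j] is the loss of assigning model output stream i
    to reference speaker j. Returns the minimum total loss over all
    possible output-to-speaker assignments, so the model is free to
    emit speakers in any order across its output streams.
    """
    best = float("inf")
    for perm in permutations(range(num_speakers)):
        total = sum(pairwise_loss[i][perm[i]] for i in range(num_speakers))
        best = min(best, total)
    return best

# Toy example with two speakers: output stream 0 matches reference
# speaker 1 well (loss 1.0) and vice versa, so the swapped assignment
# is chosen: 1.0 + 1.0 = 2.0.
losses = [[5.0, 1.0],
          [1.0, 5.0]]
```

In a joint system of the kind described above, such a per-segment criterion could be combined with diarization and enhancement terms into one objective, but the exact coupling is precisely what the project sets out to formulate.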
DFG Programme Research Grants
 
 
