Bayesian Learning of a Hierarchical Representation of Language from Raw Speech
Final Report Abstract
The goal of this project was to develop unsupervised learning techniques that extract a hierarchical model comprising the phonetic and lexical building blocks of a language from spoken input only. The phonetic building blocks, so-called acoustic units, were to be learnt at the lower level of the hierarchy, while lexical building blocks, i.e., word-like entities, were targeted at the upper level. Finally, even semantic categories comprising multiple words were to be extracted. Concerning the first goal, the extraction of acoustic units, we initially intended to apply nonparametric Bayesian methods to learn hidden Markov models (HMMs) with Gaussian mixture model (GMM) emission distributions. This approach, however, was also pursued by parallel work at Brno University, with whom we entered into a cooperation within the framework of an international summer workshop (the 2016 Frederick Jelinek Memorial Workshop on "Building speech recognition systems from untranscribed speech"). We therefore decided to approach the unit discovery problem from a different direction, building on the concept of structured variational autoencoders (VAEs). This is a class of generative neural models which, at that time, had not yet been developed and applied to this kind of problem in speech. We developed a VAE with an HMM in the latent space for acoustic unit discovery, which clearly outperformed the GMM-HMM approach. It was then extended to a nonparametric Bayesian model in the latent space, so as to infer the size of the acoustic unit inventory autonomously. The unsupervised word discovery/segmentation component developed in the predecessor project served as the lexicon discovery component in the above summer workshop. We developed a lattice-based interface to the acoustic unit discovery, resulting in a hierarchical model for joint acoustic and lexical unit discovery.
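The GMM-HMM generative view of acoustic unit discovery mentioned above can be illustrated with a minimal sketch. All parameter values (inventory size, transition matrix, mixture means) are made up for illustration and are not the project's actual model; real systems typically use a left-to-right HMM with several states per unit rather than a single state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inventory of 3 acoustic units; each unit is modelled by a single
# HMM state here (hypothetical simplification for illustration).
num_units = 3
trans = np.array([[0.8, 0.1, 0.1],   # hypothetical unit-transition matrix
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
# Each unit emits 2-dim features from a 2-component GMM (made-up parameters).
means = rng.normal(size=(num_units, 2, 2))   # (unit, mixture comp., feature dim)
weights = np.full((num_units, 2), 0.5)       # mixture weights per unit

def sample_utterance(num_frames):
    """Sample a frame-level unit sequence and feature matrix from the model."""
    units, feats = [], []
    state = rng.integers(num_units)
    for _ in range(num_frames):
        comp = rng.choice(2, p=weights[state])             # pick mixture component
        feats.append(rng.normal(means[state, comp], 0.1))  # Gaussian emission
        units.append(int(state))
        state = rng.choice(num_units, p=trans[state])      # Markov transition
    return np.array(units), np.stack(feats)

units, feats = sample_utterance(50)
print(units.shape, feats.shape)
```

Unsupervised unit discovery amounts to inverting this generative process: given only the feature matrix, infer the unit inventory and the hidden unit sequence. The HMM-VAE replaces the GMM emission model with a neural decoder while keeping the discrete Markov structure in the latent space.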
The word discovery system employed a Bayesian language model, a so-called nested hierarchical Pitman-Yor model, by which the lexicon size need not be known a priori. The initially formulated problem of unit discovery from speech was slightly modified in the course of the project by assuming that, in addition to speech, unrelated text data from the target language is also available. This appeared to be no major restriction for the intended use cases. In order to relate the discovered phonetic symbols to the text data, an acoustic unit-to-grapheme converter was developed, whose training, however, required some speech utterances transcribed at the word level. With this component, the hierarchical acoustic unit and word discovery system could benefit from the given text data: the text data was used to initialize the language model employed in word discovery, leading to significantly better overall performance. Finally, semantic inference was studied by learning a mapping from speech to semantic categories. First, acoustic units were learnt in an unsupervised manner on the given speech. The speech signal was then transcribed as a sequence of acoustic units, and these units were mapped to semantic categories. This mapping was realized by Markov logic networks, whose training required only a transcription of utterances with semantic categories, but no transcription at the word or even phoneme level. The system thus bypasses explicit word discovery, mapping acoustic unit sequences directly to semantic categories. However, it has only been tested on fairly small tasks, and a transfer to larger tasks appears challenging.
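The Pitman-Yor prior underlying such a language model can be sketched via its Chinese-restaurant seating process. The following is a simplified single-level sketch with made-up hyperparameters, not the nested hierarchical model actually used; its purpose is only to show why the lexicon size need not be fixed in advance: the number of "tables" (word types) grows with the data.

```python
import random

def pitman_yor_crp(num_customers, d=0.5, theta=1.0, seed=1):
    """Seat customers by the Pitman-Yor Chinese restaurant process.

    An occupied table k with n_k customers attracts the next customer with
    probability proportional to (n_k - d); a new table opens with probability
    proportional to (theta + d * K), where K is the number of open tables.
    Hyperparameters d (discount) and theta (concentration) are illustrative.
    """
    rng = random.Random(seed)
    tables = []  # customer counts per table
    for n in range(num_customers):
        total = theta + n            # normalizer: masses below sum to this
        r = rng.uniform(0, total)
        acc = theta + d * len(tables)  # mass reserved for opening a new table
        if r < acc:
            tables.append(1)
            continue
        for k, nk in enumerate(tables):
            acc += nk - d
            if r < acc:
                tables[k] += 1
                break
        else:
            tables[-1] += 1          # guard against floating-point rounding
    return tables

tables = pitman_yor_crp(1000)
print(len(tables), "tables for", sum(tables), "customers")
```

With discount d > 0 the number of tables grows as a power of the number of customers, which matches the heavy-tailed type/token statistics of natural language; this is the property that lets the word discovery system grow its lexicon as more speech is observed.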
To conclude, we believe we have made some interesting contributions to the state of the art in unsupervised learning from speech by demonstrating how concepts such as nonparametric Bayesian methods and generative neural networks, well known in other domains, can be successfully applied to acoustic unit and word discovery from spoken input.
Publications
- Building speech recognition systems from untranscribed speech. Report on Third Frederick Jelinek Memorial Summer Workshop
L. Burget, S. Khudanpur, N. Dehak, J. Trmal, R. Haeb-Umbach, G. Neubig, S. Watanabe, D. Mochihashi, T. Shinozaki, M. Sun, C. Liu, M. Wiesner, R. Pappagari, L. Ondel, M. Hannemann, S. Kesiraju, T. Glarner, L. Sari, J. Yang, O. Cifka, Y. Yang
- Unsupervised Word Discovery from Speech using Bayesian Hierarchical Models, in 38th German Conference on Pattern Recognition (GCPR), September 2016
O. Walter, R. Haeb-Umbach
- Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery, in INTERSPEECH 2017, Stockholm, Sweden, August 2017 [Best Student Paper Award]
J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach, B. Raj
(See online at https://doi.org/10.21437/Interspeech.2017-1160)
- Leveraging Text Data for Word Segmentation for Underresourced Languages, in INTERSPEECH 2017, Stockholm, Sweden, August 2017
T. Glarner, B. Boenninghoff, O. Walter, R. Haeb-Umbach
(See online at https://doi.org/10.21437/Interspeech.2017-1262)
- Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery, in INTERSPEECH 2018, Hyderabad, India, September 2018
T. Glarner, P. Hanebrink, J. Ebbers, R. Haeb-Umbach
(See online at https://doi.org/10.21437/Interspeech.2018-2148)
- Machine learning techniques for semantic analysis of dysarthric speech: An experimental study, Speech Communication 99 (2018) 242-251 (Elsevier B.V.), April 2018
V. Despotovic, O. Walter, R. Haeb-Umbach
(See online at https://doi.org/10.1016/j.specom.2018.04.005)