Bayesian Learning of a Hierarchical Representation of Language from Raw Speech
Final Report Abstract
The goal of this project was to develop unsupervised learning techniques that extract a hierarchical model comprising the phonetic and lexical building blocks of a language from spoken input only. The phonetic building blocks, so-called acoustic units, were to be learnt at the lower level of the hierarchy, while lexical building blocks, i.e., word-like entities, were targeted at the upper level. Finally, even semantic categories comprising multiple words were to be extracted. Concerning the first goal, the extraction of acoustic units, we initially intended to apply nonparametric Bayesian methods to learn hidden Markov models (HMMs) with Gaussian mixture model (GMM) emission distributions. This approach, however, was also pursued by parallel work at Brno University, with whom we entered into a cooperation within the framework of an international summer workshop (the 2016 Frederick Jelinek Memorial Workshop on "Building speech recognition systems from untranscribed speech"). We therefore decided to approach the unit discovery problem from a different direction, building on the concept of structured variational autoencoders (VAEs). This is a class of generative neural models which, at that time, had not yet been developed and applied to this kind of problem in speech. We developed a VAE with an HMM in the latent space for acoustic unit discovery, which clearly outperformed the GMM-HMM approach. It was then extended to a nonparametric Bayesian model in the latent space, so as to infer the size of the acoustic unit inventory autonomously. The unsupervised word discovery/segmentation component developed in the predecessor project served as the lexicon discovery component in the above summer workshop. We developed a lattice-based interface to the acoustic unit discovery, resulting in a hierarchical model for joint acoustic and lexical unit discovery.
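The GMM-HMM generative view of acoustic unit discovery mentioned above can be illustrated with a minimal sketch. All parameter values (inventory size, transition matrix, mixture means) are made up for illustration and are not the project's actual model; real systems typically use a left-to-right HMM with several states per unit rather than a single state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inventory of 3 acoustic units; each unit is modelled by a single
# HMM state here (hypothetical simplification for illustration).
num_units = 3
trans = np.array([[0.8, 0.1, 0.1],   # hypothetical unit-transition matrix
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
# Each unit emits 2-dim features from a 2-component GMM (made-up parameters).
means = rng.normal(size=(num_units, 2, 2))   # (unit, mixture comp., feature dim)
weights = np.full((num_units, 2), 0.5)       # mixture weights per unit

def sample_utterance(num_frames):
    """Sample a frame-level unit sequence and feature matrix from the model."""
    units, feats = [], []
    state = rng.integers(num_units)
    for _ in range(num_frames):
        comp = rng.choice(2, p=weights[state])             # pick mixture component
        feats.append(rng.normal(means[state, comp], 0.1))  # Gaussian emission
        units.append(int(state))
        state = rng.choice(num_units, p=trans[state])      # Markov transition
    return np.array(units), np.stack(feats)

units, feats = sample_utterance(50)
print(units.shape, feats.shape)
```

Unsupervised unit discovery amounts to inverting this generative process: given only the feature matrix, infer the unit inventory and the hidden unit sequence. The HMM-VAE replaces the GMM emission model with a neural decoder while keeping the discrete Markov structure in the latent space.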
The word discovery system employed a Bayesian language model, a so-called nested hierarchical Pitman-Yor model, by which the lexicon size need not be known a priori. The initially formulated problem of unit discovery from speech was slightly modified in the course of the project by assuming that, in addition to speech, unrelated text data from the target language is also available. This appeared to be no major restriction for the intended use cases. In order to relate the discovered phonetic symbols to the text data, an acoustic unit-to-grapheme converter was developed, whose training, however, required some speech utterances transcribed at the word level. With this component, the hierarchical acoustic unit and word discovery system could benefit from the given text data: the text data was used to initialize the language model employed in word discovery, leading to significantly better overall performance. Finally, semantic inference was studied by learning a mapping from speech to semantic categories. First, acoustic units were learnt in an unsupervised manner on the given speech. The speech signal was then transcribed as a sequence of acoustic units, and these units were mapped to semantic categories. This mapping was realized by Markov logic networks, whose training required only a transcription of utterances with semantic categories, but no transcription at the word or even phoneme level. The system thus bypasses explicit word discovery, mapping acoustic unit sequences directly to semantic categories. However, it has only been tested on fairly small tasks, and a transfer to larger tasks appears challenging.
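The Pitman-Yor prior underlying such a language model can be sketched via its Chinese-restaurant seating process. The following is a simplified single-level sketch with made-up hyperparameters, not the nested hierarchical model actually used; its purpose is only to show why the lexicon size need not be fixed in advance: the number of "tables" (word types) grows with the data.

```python
import random

def pitman_yor_crp(num_customers, d=0.5, theta=1.0, seed=1):
    """Seat customers by the Pitman-Yor Chinese restaurant process.

    An occupied table k with n_k customers attracts the next customer with
    probability proportional to (n_k - d); a new table opens with probability
    proportional to (theta + d * K), where K is the number of open tables.
    Hyperparameters d (discount) and theta (concentration) are illustrative.
    """
    rng = random.Random(seed)
    tables = []  # customer counts per table
    for n in range(num_customers):
        total = theta + n            # normalizer: masses below sum to this
        r = rng.uniform(0, total)
        acc = theta + d * len(tables)  # mass reserved for opening a new table
        if r < acc:
            tables.append(1)
            continue
        for k, nk in enumerate(tables):
            acc += nk - d
            if r < acc:
                tables[k] += 1
                break
        else:
            tables[-1] += 1          # guard against floating-point rounding
    return tables

tables = pitman_yor_crp(1000)
print(len(tables), "tables for", sum(tables), "customers")
```

With discount d > 0 the number of tables grows as a power of the number of customers, which matches the heavy-tailed type/token statistics of natural language; this is the property that lets the word discovery system grow its lexicon as more speech is observed.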
To conclude, we believe we have made some interesting contributions to the state of the art in unsupervised learning from speech by demonstrating how concepts such as nonparametric Bayesian methods and generative neural networks, well known in other domains, can be successfully applied to acoustic unit and word discovery from spoken input.
Publications
- Building speech recognition systems from untranscribed speech. Report on Third Frederick Jelinek Memorial Summer Workshop
L. Burget, S. Khudanpur, N. Dehak, J. Trmal, R. Haeb-Umbach, G. Neubig, S. Watanabe, D. Mochihashi, T. Shinozaki, M. Sun, C. Liu, M. Wiesner, R. Pappagari, L. Ondel, M. Hannemann, S. Kesiraju, T. Glarner, L. Sari, J. Yang, O. Cifka, Y. Yang
- Unsupervised Word Discovery from Speech using Bayesian Hierarchical Models, in 38th German Conference on Pattern Recognition (GCPR), September 2016
O. Walter, R. Haeb-Umbach
- Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery, in INTERSPEECH 2017, Stockholm, Sweden, August 2017 [Best Student Paper Award]
J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach, B. Raj
(See online at https://doi.org/10.21437/Interspeech.2017-1160)
- Leveraging Text Data for Word Segmentation for Underresourced Languages, in INTERSPEECH 2017, Stockholm, Sweden, August 2017
T. Glarner, B. Boenninghoff, O. Walter, R. Haeb-Umbach
(See online at https://doi.org/10.21437/Interspeech.2017-1262)
- Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery, in INTERSPEECH 2018, Hyderabad, India, September 2018
T. Glarner, P. Hanebrink, J. Ebbers, R. Haeb-Umbach
(See online at https://doi.org/10.21437/Interspeech.2018-2148)
- Machine learning techniques for semantic analysis of dysarthric speech: An experimental study, Speech Communication 99 (2018) 242-251 (Elsevier B.V.), April 2018
V. Despotovic, O. Walter, R. Haeb-Umbach
(See online at https://doi.org/10.1016/j.specom.2018.04.005)