Computational-Linguistic Implementation of a Large, Robust Grammar for Urdu/Hindi in the Context of Parallel Grammar Development
Final Report Abstract
The project created a large computational grammar for the South Asian language Urdu. Urdu is closely related to Hindi and is spoken predominantly in India and Pakistan, as well as worldwide as part of the South Asian diaspora. Although Urdu has millions of speakers, theoretical linguistic research on the language and the concomitant computational resources have been lacking. The work in the project contributed to a linguistic understanding of the structure of Urdu on the one hand and created various computational resources for the Natural Language Processing (NLP) of Urdu on the other. We achieved new linguistic insights, in particular with respect to the structure of the noun phrase and the verbal complex. We uncovered new distributional patterns that pose interesting problems for our theoretical linguistic understanding of language structure. We were also able to achieve a deeper understanding of the lexical semantics of the language, particularly with respect to the meanings associated with verbs and nouns. This allowed us to build new computational lexical resources for Urdu, a cornerstone for NLP applications.
We also developed a finite-state morphological analyzer for Urdu. This analyzer is integrated into the Urdu grammar but can also be used independently of it. It works together with a finite-state transliterator, which is capable of transliterating Urdu script to a Latin script representation (and vice versa). The design decision to work with a transliterated representation enables us to process Hindi as well as Urdu. Although Urdu and Hindi are structurally almost identical, they are written in very different scripts: the Urdu script is based on Perso-Arabic conventions, while Hindi is written in Devanagari. The use of different scripts opens up a divide between the two versions of the language. This divide is particularly problematic for NLP, and our design decision and computational resources aim at overcoming it.
The Urdu grammar proper was implemented to the standards of an international collaborative effort, the ParGram (Parallel Grammars) initiative, which has been underway since 1996. Its goal is the development of computationally viable yet theoretically well-informed deep precision grammars based on the same underlying theoretical framework (Lexical Functional Grammar) and the same grammar development platform (XLE). ParGram grammars aim at producing analyses that are as parallel as possible across languages, i.e., they treat phenomena such as passivization or relative clauses from a crosslinguistically informed perspective. This has the advantage of avoiding the implementation of inefficient, language-particular solutions that may have no crosslinguistic validity. The parallel analyses as well as a version of the Urdu grammar are being made available via the INESS treebanking infrastructure, which is in turn part of the European CLARIN effort to provide infrastructures for research in the humanities. The grammar consists of hand-crafted precision rules but also incorporates knowledge based on preferential usages in order to resolve cases of ambiguity as far as possible. The grammar is able to both parse and generate Urdu texts and also includes the beginnings of a semantic parser, which remains to be expanded as more lexical semantic resources become available for Urdu/Hindi. Potential applications of the Urdu grammar and its attendant resources include extracting information useful for sentiment analysis from social media and acting as the core engine for ICALL (Intelligent Computer-Assisted Language Learning).
Publications
- 2010. Transliterating Urdu for a Broad-Coverage Urdu/Hindi LFG Grammar. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), pages 2921–2927
Malik, Muhammad Kamran, Tafseer Ahmed, Sebastian Sulger, Tina Bögel, Atif Gulzar, Ghulam Raza, Sarmad Hussain and Miriam Butt
- 2011. Discovering Semantic Classes for Urdu N-V Complex Predicates. In Proceedings of the International Conference on Computational Semantics (IWCS 2011), pages 305–309
Ahmed, Tafseer and Miriam Butt
- 2012. Adding an Annotation Layer to the Hindi/Urdu Treebank. Linguistic Issues in Language Technology 7(3), pages 1–18
Hautli, Annette, Sebastian Sulger and Miriam Butt
- 2012. Identifying Urdu Complex Predication via Bigram Extraction. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), Technical Papers, pages 409–424, Mumbai, India
Butt, Miriam, Tina Bögel, Annette Hautli, Sebastian Sulger and Tafseer Ahmed
- 2012. Urdu–Roman Transliteration via Finite State Transducers. In Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing (FSMNLP), pages 25–29, Donostia - San Sebastian, Spain
Bögel, Tina
- 2013. ParGramBank: The ParGram Parallel Treebank. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), pages 550–560, Sofia, Bulgaria
Sulger, Sebastian, Miriam Butt, Tracy Holloway King, Paul Meurer, Tibor Laczkó, György Rákosi, Cheikh Bamba Dione, Helge Dyvik, Victoria Rosén, Koenraad De Smedt, Agnieszka Patejuk, Özlem Çetinoğlu, I Wayan Arka and Meladel Mistica
- 2013. Possessive Clitics and Ezafe in Urdu. In K. Börjars, D. Denison, and A. Scott (eds), Morphosyntactic Categories and the Expression of Possession. Amsterdam: John Benjamins, pages 291–322
Bögel, Tina and Miriam Butt