Project Details

Efficient statistical parsing and decoding for expressive grammar formalisms based on tree automata

Subject Areas: Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing; General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term: 2014 to 2023
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 252303250
 

Final Report Abstract

The goal of this project was to develop a fast algorithm for broad-coverage parsing that works across a wide range of grammar formalisms. When the project started, computational linguists were using a large variety of mathematical formalisms to describe languages of grammatically correct sentences, or to translate such sentences into trees and graphs. In this project, we took a generalizing perspective on this landscape of grammar formalisms by translating many of them into our own formalism, called Interpreted Regular Tree Grammars (IRTGs). IRTGs take an algebraic approach to describing languages and relations: a grammar generates derivation trees, which are then evaluated in appropriate algebras of strings, trees, graphs, or other objects. An IRTG parser can read a sentence as input and translate it, for instance, into a tree that describes the sentence's grammatical structure or into a graph that describes its meaning, all with the same algorithm.

Research in the project evolved in two distinct phases. In the first phase, we worked out the novel IRTG formalism, developed a number of algebras that allowed us to capture grammars from other formalisms as IRTG grammars, and developed highly efficient parsing algorithms for IRTG. This solved a decades-old challenge of making the algebraic approach to parsing efficient, even with the very large grammars used in computational linguistics. Our implementation of these algorithms, the Alto tool, is the fastest parser in the world for certain grammar formalisms, including feature-based tree-adjoining grammars (FTAG) and hyperedge replacement grammars (HRG).

In the second phase, the project's focus shifted to the development of the AM parser, which predicts a derivation tree for a given input sentence and evaluates it algebraically into a graph. In contrast to the work from the first phase, the AM parser no longer uses a grammar; it predicts the derivation tree with neural models instead. While the AM parser is limited to describing relations between strings and graphs, it has set new states of the art in accuracy on a number of semantic parsing tasks; it is also extremely fast and can parse up to 10,000 words per second. This shows that the algebraic and neural perspectives can be fruitfully combined.

The overall landscape of computational linguistics and NLP changed drastically during the project's ten-year lifespan. When the project started in 2014, grammar-based methods were a popular approach to parsing; by its end in 2023, these methods had been almost completely replaced by neural models that no longer use grammars. The project was originally intended to be mostly grammar-based, but it radically changed its focus to neural models. This allowed us to incorporate modern methods, sidestep technical challenges that had made learning broad-coverage grammars difficult, and continue to produce impactful research.

The project has demonstrated conclusively that algebraic parsing methods can be made accurate and efficient, and that they can be combined with both grammar-based and neural methods. Alto and the AM parser are both available open source. Alto is still in active use in research and teaching, and the very fast AM parser is one of the very few semantic parsers in the world that performs well both in broad-coverage settings and in "compositional generalization" setups, where the parser has to generalize from very simple training sentences.
At a larger scale, the success of the AM parser is a compelling demonstration of the value of neurosymbolic models, which combine neural and symbolic elements, and has been one of the seeds from which the RTG 2853 Neuroexplicit Models of Language, Vision, and Action evolved.
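
To make the algebraic idea behind IRTG concrete, the following minimal Python sketch (with hypothetical names; this is not the Alto API) evaluates one and the same derivation tree in two different algebras, a string algebra and a toy constituency-tree algebra:

    # Illustrative sketch of the IRTG idea, not the Alto implementation.
    # A derivation tree: each node carries a grammar rule label, followed
    # by the subtrees for the rule's arguments.
    deriv = ("S_rule", ("NP_john",), ("VP_rule", ("V_sleeps",)))

    # String algebra: each rule label is interpreted as an operation on
    # strings (here: concatenation and word constants).
    string_interp = {
        "S_rule":   lambda np, vp: f"{np} {vp}",
        "NP_john":  lambda: "John",
        "VP_rule":  lambda v: v,
        "V_sleeps": lambda: "sleeps",
    }

    # A second interpretation of the SAME derivation tree, here into a
    # toy algebra that builds a constituency tree instead of a string.
    tree_interp = {
        "S_rule":   lambda np, vp: ("S", np, vp),
        "NP_john":  lambda: ("NP", "John"),
        "VP_rule":  lambda v: ("VP", v),
        "V_sleeps": lambda: ("V", "sleeps"),
    }

    def evaluate(tree, interp):
        """Evaluate a derivation tree bottom-up in the given algebra."""
        label, *children = tree
        return interp[label](*(evaluate(c, interp) for c in children))

    print(evaluate(deriv, string_interp))  # -> "John sleeps"
    print(evaluate(deriv, tree_interp))    # -> ('S', ('NP', 'John'), ('VP', ('V', 'sleeps')))

Both calls walk the same derivation tree; only the algebra changes. This separation is what lets a single parsing algorithm serve strings, trees, and graphs alike, and it is the part the AM parser keeps when it replaces the grammar with a neural model that predicts the derivation tree.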

Publications

  • A constrained graph algebra for semantic parsing with AMRs. Proceedings of the 12th International Conference on Computational Semantics (IWCS) (2017).
    Groschwitz, Jonas; Fowlie, Meaghan; Johnson, Mark & Koller, Alexander
  • AMR dependency parsing with a typed semantic algebra. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2018), 1831-1841. Association for Computational Linguistics.
    Groschwitz, Jonas; Lindemann, Matthias; Fowlie, Meaghan; Johnson, Mark & Koller, Alexander
  • Generalized chart constraints for efficient PCFG and TAG parsing. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (2018), 626-631. Association for Computational Linguistics.
    Grünewald, Stefan; Henning, Sophie & Koller, Alexander
  • Compositional Semantic Parsing across Graphbanks. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019). Association for Computational Linguistics.
    Lindemann, Matthias; Groschwitz, Jonas & Koller, Alexander
  • Fast semantic parsing with well-typedness guarantees. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020), 3929-3951. Association for Computational Linguistics.
    Lindemann, Matthias; Groschwitz, Jonas & Koller, Alexander
  • Learning compositional structures for semantic graph parsing. Proceedings of the 5th Workshop on Structured Prediction for NLP (SPNLP 2021) (2021), 22-36. Association for Computational Linguistics.
    Groschwitz, Jonas; Fowlie, Meaghan & Koller, Alexander
  • Structural generalization is hard for sequence-to-sequence models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (2022), 5048-5062. Association for Computational Linguistics.
    Yao, Yuekun & Koller, Alexander
  • Compositionality in Computational Linguistics. Annual Review of Linguistics, 9(1) (2023), 463-481.
    Donatelli, Lucia & Koller, Alexander
  • SLOG: A Structural Generalization Benchmark for Semantic Parsing. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023), 3213-3232. Association for Computational Linguistics.
    Li, Bingzhi; Donatelli, Lucia; Koller, Alexander; Linzen, Tal; Yao, Yuekun & Kim, Najoung
  • Simple and effective data augmentation for compositional generalization. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (2024), 434-449. Association for Computational Linguistics.
    Yao, Yuekun & Koller, Alexander
 
 
