Project Details

Efficient statistical parsing and decoding for expressive grammar formalisms based on tree automata

Subject Areas: Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing; General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term: 2014 to 2023
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 252303250
 

Final Report Abstract

The goal of this project was to develop a fast algorithm for broad-coverage parsing that works across a wide range of grammar formalisms. When the project started, computational linguists were using a large variety of mathematical formalisms to describe languages of grammatically correct sentences, or to translate such sentences into trees and graphs. In this project, we took a generalizing perspective on this landscape of grammar formalisms by translating many of them into our own formalism, called Interpreted Regular Tree Grammars (IRTGs). IRTGs take an algebraic approach to describing languages and relations: a grammar generates derivation trees, which are then evaluated in appropriate algebras of strings, trees, graphs, or other objects. An IRTG parser can read a sentence as input and translate it, for instance, into a tree that describes the sentence's grammatical structure or into a graph that describes its meaning, all with the same algorithm.

Research in the project evolved in two distinct phases. In the first phase, we worked out the novel IRTG formalism, developed a number of algebras that allowed us to capture grammars from other formalisms as IRTG grammars, and developed highly efficient parsing algorithms for IRTG. This solved a decades-old challenge of making the algebraic approach to parsing efficient, even with the very large grammars used in computational linguistics. Our implementation of these algorithms, the Alto tool, is the fastest parser in the world for certain grammar formalisms, including feature-based tree-adjoining grammars (FTAG) and hyperedge replacement grammars (HRG).

In the second phase, the project's focus shifted to the development of the AM parser, which predicts a derivation tree for a given input sentence and evaluates it algebraically into a graph. In contrast to the work from the first phase, the AM parser no longer uses a grammar; it predicts the derivation tree with neural models instead. While the AM parser is limited to describing relations between strings and graphs, it has set new states of the art in accuracy on a number of semantic parsing tasks; it is also extremely fast and can parse up to 10,000 words per second. This shows that the algebraic and neural perspectives can be fruitfully combined.

The overall landscape of computational linguistics and NLP changed drastically during the project's ten-year lifespan. When the project started in 2014, grammar-based methods were a popular approach to parsing; by its end in 2023, these methods had been almost completely replaced by neural models that no longer use grammars. The project was originally intended to be mostly grammar-based, but it radically changed its focus to neural models. This allowed us to incorporate modern methods, sidestep technical challenges that had made learning broad-coverage grammars difficult, and continue to produce impactful research.

The project has demonstrated conclusively that algebraic parsing methods can be made accurate and efficient, and that they can be combined with both grammar-based and neural methods. Alto and the AM parser are both available open source. Alto is still in active use in research and teaching, and the very fast AM parser is one of the very few semantic parsers in the world that performs well both in broad-coverage settings and in "compositional generalization" setups, where the parser has to generalize from very simple training sentences.
At a larger scale, the success of the AM parser is a compelling demonstration of the value of neurosymbolic models, which combine neural and symbolic elements, and has been one of the seeds from which the RTG 2853 Neuroexplicit Models of Language, Vision, and Action evolved.
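
To make the algebraic idea behind IRTG concrete, the following minimal Python sketch (with hypothetical names; this is not the Alto API) evaluates one and the same derivation tree in two different algebras, a string algebra and a toy constituency-tree algebra:

    # Illustrative sketch of the IRTG idea, not the Alto implementation.
    # A derivation tree: each node carries a grammar rule label, followed
    # by the subtrees for the rule's arguments.
    deriv = ("S_rule", ("NP_john",), ("VP_rule", ("V_sleeps",)))

    # String algebra: each rule label is interpreted as an operation on
    # strings (here: concatenation and word constants).
    string_interp = {
        "S_rule":   lambda np, vp: f"{np} {vp}",
        "NP_john":  lambda: "John",
        "VP_rule":  lambda v: v,
        "V_sleeps": lambda: "sleeps",
    }

    # A second interpretation of the SAME derivation tree, here into a
    # toy algebra that builds a constituency tree instead of a string.
    tree_interp = {
        "S_rule":   lambda np, vp: ("S", np, vp),
        "NP_john":  lambda: ("NP", "John"),
        "VP_rule":  lambda v: ("VP", v),
        "V_sleeps": lambda: ("V", "sleeps"),
    }

    def evaluate(tree, interp):
        """Evaluate a derivation tree bottom-up in the given algebra."""
        label, *children = tree
        return interp[label](*(evaluate(c, interp) for c in children))

    print(evaluate(deriv, string_interp))  # -> "John sleeps"
    print(evaluate(deriv, tree_interp))    # -> ('S', ('NP', 'John'), ('VP', ('V', 'sleeps')))

Both calls walk the same derivation tree; only the algebra changes. This separation is what lets a single parsing algorithm serve strings, trees, and graphs alike, and it is the part the AM parser keeps when it replaces the grammar with a neural model that predicts the derivation tree.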

Publications

  • A constrained graph algebra for semantic parsing with AMRs. Proceedings of the 12th International Conference on Computational Semantics (IWCS) (2017).
    Groschwitz, Jonas; Fowlie, Meaghan; Johnson, Mark & Koller, Alexander
  • AMR dependency parsing with a typed semantic algebra. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2018), 1831-1841. Association for Computational Linguistics.
    Groschwitz, Jonas; Lindemann, Matthias; Fowlie, Meaghan; Johnson, Mark & Koller, Alexander
  • Generalized chart constraints for efficient PCFG and TAG parsing. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (2018), 626-631. Association for Computational Linguistics.
    Grünewald, Stefan; Henning, Sophie & Koller, Alexander
  • Compositional Semantic Parsing across Graphbanks. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019). Association for Computational Linguistics.
    Lindemann, Matthias; Groschwitz, Jonas & Koller, Alexander
  • Fast semantic parsing with well-typedness guarantees. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020), 3929-3951. Association for Computational Linguistics.
    Lindemann, Matthias; Groschwitz, Jonas & Koller, Alexander
  • Learning compositional structures for semantic graph parsing. Proceedings of the 5th Workshop on Structured Prediction for NLP (SPNLP 2021) (2021), 22-36. Association for Computational Linguistics.
    Groschwitz, Jonas; Fowlie, Meaghan & Koller, Alexander
  • Structural generalization is hard for sequence-to-sequence models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (2022), 5048-5062. Association for Computational Linguistics.
    Yao, Yuekun & Koller, Alexander
  • Compositionality in Computational Linguistics. Annual Review of Linguistics, 9(1) (2023), 463-481.
    Donatelli, Lucia & Koller, Alexander
  • SLOG: A Structural Generalization Benchmark for Semantic Parsing. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023), 3213-3232. Association for Computational Linguistics.
    Li, Bingzhi; Donatelli, Lucia; Koller, Alexander; Linzen, Tal; Yao, Yuekun & Kim, Najoung
  • Simple and effective data augmentation for compositional generalization. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (2024), 434-449. Association for Computational Linguistics.
    Yao, Yuekun & Koller, Alexander
 
 
