Project Details
Lifelong Multimodal Language Learning by Explaining and Exploiting Compositional Knowledge
Applicants
Dr. Jae Hee Lee; Professor Dr. Stefan Wermter
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term
since 2025
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 551629603
Learning and using language for understanding, conceptualizing, and communicating is a particular hallmark of humans. This has motivated the development of multimodal deep learning models that learn and think like humans. Existing multimodal models, however, struggle in a lifelong language learning setting, where they are confronted with changing tasks while having to retain previously learned knowledge. This remains a major obstacle to their application in real-world scenarios. The goal of the proposed project LUMO is to explore the important yet challenging research question of how to make multimodal models robust against task changes (or distribution shifts) by explaining and exploiting compositional knowledge. In devising such Lifelong Learning Multimodal Models (LLMMs), our first objective is to develop datasets and environments for two representative multimodal language learning tasks, namely vision-and-language integration and language-conditioned robotic manipulation, with a focus on concepts, relations, and actions that can be combined in novel ways across changing tasks. Our second objective is to understand why certain approaches lead to more robust LLMMs. We will address this by scrutinizing how concepts and relations emerge inside an LLMM using concept-based XAI methods. We also aim to understand the training dynamics of concept and relation formation in an LLMM in order to elucidate compositional generalization on the one hand and catastrophic forgetting on the other. Our third objective is to develop a neuro-symbolic approach that is tightly integrated with the model and improves its lifelong learning performance. We observe that such inner interpretability not only helps us understand why an approach is robust or brittle, but also has the potential to reveal and debug spurious correlations in an LLMM. We hypothesize that the features of a concept form a region in the embedding space, so that symbolic constraints based on vector space semantics can be applied to these regions to improve the robustness of an LLMM. The insights obtained from this research will be examined in real language-conditioned robotic manipulation scenarios, where we aim at sim2real transfer, i.e., transferring skills from simulation to the real world.
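To make the hypothesis about concept regions concrete, the following minimal sketch (not part of the project description) illustrates one possible reading: features of a concept are pulled into a region of the embedding space, here parameterized as a hypersphere, and symbolic constraints such as subsumption or disjointness are imposed on these regions as differentiable penalties. The class names, the sphere parameterization, and the example constraints are assumptions made purely for illustration, not the project's actual method.

```python
# Illustrative sketch only: concept regions as hyperspheres in an embedding
# space, with symbolic constraints expressed as differentiable penalties.
# All names and the sphere parameterization are assumptions for this example.
import torch
import torch.nn.functional as F


class ConceptRegion(torch.nn.Module):
    """A concept modelled as a hypersphere (center, radius) in embedding space."""

    def __init__(self, dim: int):
        super().__init__()
        self.center = torch.nn.Parameter(torch.randn(dim))
        self._radius = torch.nn.Parameter(torch.zeros(()))  # softplus keeps it positive

    @property
    def radius(self) -> torch.Tensor:
        return F.softplus(self._radius)

    def membership_loss(self, features: torch.Tensor) -> torch.Tensor:
        """Pull features of this concept inside its region."""
        dist = torch.linalg.norm(features - self.center, dim=-1)
        return F.relu(dist - self.radius).mean()


def subsumption_penalty(a: ConceptRegion, b: ConceptRegion) -> torch.Tensor:
    """Constraint 'a is a kind of b': the sphere of a must lie inside the sphere of b."""
    gap = torch.linalg.norm(a.center - b.center) + a.radius - b.radius
    return F.relu(gap)


def disjointness_penalty(a: ConceptRegion, b: ConceptRegion) -> torch.Tensor:
    """Constraint 'a and b are disjoint': the two spheres must not overlap."""
    overlap = a.radius + b.radius - torch.linalg.norm(a.center - b.center)
    return F.relu(overlap)


if __name__ == "__main__":
    dim = 16
    cube, graspable, liquid = (ConceptRegion(dim) for _ in range(3))
    cube_features = torch.randn(32, dim)  # stand-in for features extracted from a model

    loss = (
        cube.membership_loss(cube_features)
        + subsumption_penalty(cube, graspable)   # hypothetical constraint: "every cube is graspable"
        + disjointness_penalty(cube, liquid)     # hypothetical constraint: "cubes are not liquids"
    )
    loss.backward()
    print(float(loss))
```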
DFG Programme
Research Grants