Project Details
SFB 1404: FONDA - Foundations of Workflows for Large-Scale Scientific Data Analysis
Subject Area
Computer Science, Systems and Electrical Engineering
Biology
Geosciences
Materials Science and Engineering
Medicine
Physics
Term
since 2020
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 414984028
Scientific discoveries in the natural sciences rely on the computational analysis of large data sets, which is carried out by complex data analysis workflows (DAWs) executed on a distributed infrastructure. Most research on DAWs focuses on techniques for minimizing their runtime on a specific infrastructure, which leads to solutions that are difficult to maintain and that depend on highly specialized and scarce data engineers. However, in most data science projects, the decisive factor is not runtime but development time. FONDA set out in 2020 to address this long-standing and increasingly pressing problem. Our overarching goal is to research languages, technologies, and algorithms that increase human productivity when designing, maintaining, or reusing DAWs for large-scale scientific data analysis. Within its first funding period, FONDA focused on three specific properties of DAWs that are directly linked to human productivity, namely portability, adaptability, and dependability. FONDA achieved groundbreaking results in these regards, such as improved portability through flexible interfaces between infrastructure components, improved adaptability via intelligent scheduling, and improved dependability through contract-driven DAW development. In its second phase, FONDA will further develop its research topics by lifting three restrictions we imposed on ourselves in phase I. First, we drop the assumption that DAWs are executed in a single data center hosting all necessary data and will study multi-site DAWs, i.e., DAWs whose sub-workflows are executed in different data centers. Second, we extend our scope in terms of the DAW lifecycle by addressing the usability of DAW systems, i.e., empirical investigations of human-computing interfaces and a systematic approach to DAW design. Third, we generalize from single workflows to workflow reuse by researching the technical sustainability of DAWs.
Furthermore, as human productivity in data analysis is increasingly threatened by excessive energy costs, we also focus on improving environmental sustainability. Besides its scientific results, FONDA’s first phase also excelled in several overarching topics. With the recent founding of the new HPC@HU service, it had a long-lasting structural impact on the speaker university. The recognition of its highly important research topic at the interface between computer science and the natural sciences is reflected in many recent appointments in the region, which allowed a perfectly matching extension of our PI group. We are proud to have achieved an outstandingly high percentage of female PhD students (38%), and we are looking forward to the new edited book “Workflows for Large-Scale Scientific Data Analysis”, for which more than 100 authors from 15 countries have confirmed contributions and which will appear in summer 2024 with the newly created open-access publisher BerlinUP.
DFG Programme
Collaborative Research Centres
Current projects
- A01 - Query-driven Validation of Distributed DAWs (Project Heads Schweikardt, Nicole; Weidlich, Matthias)
- A02 - Energy-Aware Optimization of DAWs in Bioinformatics (Project Heads Leser, Ulf; Reinert, Knut)
- A03 - Hardening Computational Materials-Science Workflows against Human Errors (Project Heads Draxl, Claudia; Grunske, Lars; Pavone, Pasquale)
- A05 - DAWs for Annotation Efficient Machine Learning in Biomedical Imaging Research (Project Heads Kainmüller, Dagmar; Ritter, Kerstin)
- A07 - Semantic Composition and Validation of Interacting DAWs in Computational Materials Science (Project Heads Grunske, Lars; Hickel, Tilmann; Lamprecht, Anna-Lena)
- B01 - Carbon-aware Multi-Site Workflow Scheduling Under Uncertainty (Project Heads Kao, Odej; Meyerhenke, Henning)
- B04 - Proactive Network, I/O and Storage Steering for Multiple DAWs on Shared Infrastructures (Project Heads Reinefeld, Alexander; Scheuermann, Björn; Schintke, Florian)
- B05 - Transparent Multi-Site Data Analysis Workflows for Earth Observation (Project Heads Hostert, Patrick; Leser, Ulf)
- B06 - End-to-end Energy Profiles of Data Analysis Workflows (Project Heads Böhm, Matthias; Grunske, Lars; Rabl, Tilmann)
- B07 - Efficient DAW Execution Using Incremental Data for Forest Disturbances (Project Heads Herold, Martin; Hostert, Patrick; Kao, Odej)
- C01 - Collaborative Design of Exploratory DAWs in Neuroimaging (Project Heads Deniz, Ph.D., Fatma; Kehr, Birte; Weidlich, Matthias)
- C02 - Early Workflow Design: From Collaborative Scientific Problem-Solving to DAW Specifications (Project Heads Lamprecht, Anna-Lena; Mendling, Jan; Weidlich, Matthias)
- C03 - User-Centered Design for DAW Definition Languages (Project Heads Grunske, Lars; Kosch, Thomas)
- MGKS02 - Integrated Research Training Group (Project Heads Grunske, Lars; Mendling, Jan; Reinert, Knut)
- S01 - Testbeds and Repositories (Project Heads Dreyer, Malte; Kao, Odej; Leser, Ulf)
- Z - Central Administrative Project (Project Head Leser, Ulf)
Applicant Institution
Humboldt-Universität zu Berlin
Participating University
Charité - Universitätsmedizin Berlin; Freie Universität Berlin; Technische Universität Berlin; Technische Universität Darmstadt; Universität Potsdam
Participating Institution
Bundesanstalt für Materialforschung und -prüfung (BAM); Hasso-Plattner-Institut für Digital Engineering gGmbH; Helmholtz-Zentrum Potsdam - Deutsches GeoForschungsZentrum (GFZ); Max-Delbrück-Centrum für Molekulare Medizin (MDC); Zuse-Institut Berlin (ZIB)
Spokesperson
Professor Dr. Ulf Leser