Project Details
FFMK - A fast and fault tolerant microkernel-based system for exa-scale computing
Applicants
Professor Amnon Barak, Ph.D.; Professor Dr. Hermann Härtig; Professor Dr. Alexander Reinefeld
Subject Area
Security and Dependability, Operating-, Communication- and Distributed Systems
Term
from 2012 to 2021
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 230674776
FFMK intends to continue the approach used in phase 1: to design, build, and evaluate a software system architecture that addresses the challenges expected in an exascale system. Chief among these are performance losses caused by the much larger impact of runtime variability in applications, hardware, and the operating system, and by vulnerability to failures.
The architecture builds upon a small node-local operating system kernel that supports a combination of specialized runtimes with a full-blown general-purpose operating system, and upon global platform management that supports dynamically changing partitions. Applications and runtimes act in split operation, i.e. they combine processes running on the node-local kernel with proxy processes running on the general-purpose OS.
For FFMK, we have instantiated the architecture components with the L4 microkernel, a virtualized Linux, bulk-synchronous applications based on MPI as the application runtime, on-node checkpointing based on XtreemFS, and a combination of gossip with decentralized decision-making algorithms as global platform management. The integration of the components into the architecture is almost complete.
In phase 1, we did not encounter fundamental obstacles that would call into question the design or our selection of components and approaches. As can be expected, we did encounter difficulties that caused delays, especially in mastering the complexity of network hardware, analyzing the characteristics of applications, and exposing their dynamic character. In phase 2, we plan to complete and tune an architecture prototype and then explore its potential. Thanks in particular to the relationships developed with several application projects of the priority program and with colleagues who have deeper knowledge of network hardware, we will draw on a deeper understanding of application characteristics and interfaces to explore the prototype.
We plan to use the system to address research questions such as: the feasibility of global platform management using decentralized decision making, the scalability of gossip algorithms, the use of prediction for platform management, scrutinizing the promises regarding deterministic execution of a microkernel-based system, analyzing the required and possible frequency of load changes and checkpoints, obtaining a holistic view of fault tolerance that makes use of componentized systems software, and the scalability of algorithms.
DFG Programme
Priority Programmes
Subproject of
SPP 1648: Software for Exascale Computing
International Connection
Israel
Co-Investigator
Professor Dr. Wolfgang E. Nagel