Project Details
Fault Tolerance and Elasticity for Global Task Pools
Applicant
Professorin Dr. Claudia Fohry
Subject Area
Software Engineering and Programming Languages
Term
from 2015 to 2020
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 279378532
Parallel computing faces challenges such as raising programmer productivity, handling errors, and managing resources flexibly. Novel parallel programming systems adopt the Partitioned Global Address Space (PGAS) programming model and use tasks as their central construct for specifying parallelism. Tasks are units of computation that are generated at runtime, maintained in a task pool, and assigned to computing resources as these become free. In parallel programming, tasks can typically generate new tasks. Tasks are often managed automatically by the runtime system of a single computing node. Globally, task pools can be implemented in user programs or frameworks if they are required for load balancing of irregular applications.
Fault tolerance is frequently implemented at system level, but application-specific techniques may be more efficient. Research on the latter is hampered by limited language support. Recently, the PGAS language X10 has incorporated exception handling for hardware failures. The same language also offers two further concepts of interest: a framework for global task pools, and elasticity, i.e., the ability to increase or decrease resources at runtime.
Our project will bring these concepts together. The goal is to develop task pools that are both fault-tolerant and elastic. Above all, this requires algorithmic techniques for dealing with errors and dynamic resources. During the project, we will consider different task pool schemes of increasing complexity.
Beyond algorithms, the major techniques will be programming and experiments. First, we will implement our algorithms efficiently in X10. Later, a second programming system will be selected and the techniques transferred to it. As a side effect, our work will further the development of the programming systems. The results will be prototypes of frameworks, whose source code will be published. For validation and time measurements, experiments will use simple benchmarks, branch-and-bound methods, and a simulation program from environmental sciences.
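To illustrate the task pool concept described above, the following is a minimal sketch in Python, not code from the project and not the X10 framework's API. It assumes a hypothetical function `run_task_pool` that simulates several workers in a single thread, each with a local pool, and lets idle workers take tasks from the fullest pool. Fault tolerance and elasticity, the actual subjects of the project, are deliberately left out.

```python
from collections import deque


def run_task_pool(initial_tasks, process, num_workers=4):
    """Run tasks until all pools are empty and return the list of results.

    `process(task)` must return a pair (result, new_tasks).
    Workers are simulated round-robin in one thread, so this sketch shows
    the scheduling logic only, not real parallel execution.
    """
    # One local pool per worker; together they form the global task pool.
    pools = [deque() for _ in range(num_workers)]
    pools[0].extend(initial_tasks)
    results = []

    while any(pools):
        for w in range(num_workers):
            if not pools[w]:
                # Idle worker: take the oldest task from the fullest pool.
                victim = max(range(num_workers), key=lambda v: len(pools[v]))
                if not pools[victim]:
                    continue  # no work left anywhere in this round
                pools[w].append(pools[victim].popleft())
            task = pools[w].pop()  # newest local task first
            result, new_tasks = process(task)
            results.append(result)
            pools[w].extend(new_tasks)  # tasks may generate new tasks
    return results


def count_node(depth):
    """Hypothetical task: visit one tree node, spawn two children above depth 0."""
    children = [depth - 1, depth - 1] if depth > 0 else []
    return 1, children


# Count the nodes of a complete binary tree of depth 3.
total = sum(run_task_pool([3], count_node))
```

Each task is processed exactly once, so `total` equals the number of tree nodes. A fault-tolerant variant would additionally have to recover the tasks held by a failed worker, and an elastic one would have to add and remove workers while the loop is running.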
DFG Programme
Research Grants