Project Details
A Scalable, Massively-Parallel Runtime System with Predictable Performance
Applicant
Professor Dr. Odej Kao
Subject Area
Security and Dependability, Operating-, Communication- and Distributed Systems
Term
from 2013 to 2017
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 132320961
The goal of project B within the Stratosphere II Research Unit is to research and develop the runtime system “Aura”, which executes multiple concurrent data analysis programs in a massively parallel fashion on a distributed cloud or cluster infrastructure. Project B is divided into two primary research areas, RA-B.1 and RA-B.2, which enhance the runtime environment with capabilities to handle evolving datasets and to represent and manage the resulting distributed state, with the goal of supporting novel iterative data analysis algorithms in a massively parallel, fault-tolerant way. We plan to establish a novel execution model, the so-called Supremo Execution Plan (SEP), which models the framework’s workload in the form in which it is executed. A SEP is a restricted cyclic graph that combines evolving datasets, represented as physical views, with semantically rich operators. Physical views can represent traditional data sources and sinks, but will also be able to hold the state of iterative data analyses as well as the state arising in stateful operators over infinite data, e.g. windowed operators. In contrast to the UDF black boxes of Stratosphere I’s Nephele, the added knowledge about operator characteristics, in combination with workload-aware (re-)scheduling policies, allows the runtime core’s scheduler to provide predictable runtime behavior for individual deployed data analysis programs. In particular, this project aims at answering the following questions:
1. How must a runtime system be architected to optimize the execution of iterative data analysis programs on various hardware architectures, exploiting the advantages of virtualized hardware?
2. How can we efficiently maintain state and provide fault-tolerant execution of programs with iterations on large compute clusters?
3. How can we adapt to the characteristics of virtualization methods to achieve predictable performance in terms of low latency bounds and resource guarantees? How do we provide up- and down-scaling based on computational needs or on ingestion rates, assuming the on-demand elasticity of cloud systems?
4. How can large, complex models be shared and distributed between the workloads of concurrent queries?
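To make the SEP notion more concrete, the following is a minimal sketch of how such a restricted cyclic graph could be modeled in code, with physical views and operators as vertices and a back edge representing iteration state. All names (SepSketch, Vertex, VertexKind, etc.) are hypothetical illustrations under these assumptions and are not part of the actual Aura runtime.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of a Supremo Execution Plan (SEP) as a restricted cyclic
 * graph. Class and member names are illustrative, not the Aura API.
 */
public class SepSketch {

    /** A vertex is either a physical view (data source/sink or held state)
     *  or a semantically rich operator. */
    enum VertexKind { PHYSICAL_VIEW, OPERATOR }

    static final class Vertex {
        final String name;
        final VertexKind kind;
        final List<Vertex> successors = new ArrayList<>();

        Vertex(String name, VertexKind kind) {
            this.name = name;
            this.kind = kind;
        }
    }

    public static void main(String[] args) {
        // Source and sink are modeled as physical views.
        Vertex source = new Vertex("input", VertexKind.PHYSICAL_VIEW);
        Vertex sink   = new Vertex("result", VertexKind.PHYSICAL_VIEW);

        // A physical view holding the evolving state of an iteration.
        Vertex iterationState = new Vertex("iteration-state", VertexKind.PHYSICAL_VIEW);

        // An operator that reads the input and the current state.
        Vertex updateOp = new Vertex("update", VertexKind.OPERATOR);

        // Wire the graph: the edge from the operator back to the state view
        // is the restricted cycle that models the iteration.
        source.successors.add(updateOp);
        iterationState.successors.add(updateOp);
        updateOp.successors.add(iterationState); // back edge: iteration
        updateOp.successors.add(sink);
    }
}
```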
DFG Programme
Research Units