Project Details
Web Data Analytics and Scientific Workflows
Applicant
Professor Dr. Ulf Leser
Subject Area
Security and Dependability, Operating-, Communication- and Distributed Systems
Term
from 2013 to 2017
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 132320961
The main objective of this project is to enhance Stratosphere’s ability to quickly analyze large, evolving datasets in problems that require iterative analytical programs. We focus on two demanding areas: Web data and Scientific Workflows (SWfs). The first area deals with the analysis of unstructured, heterogeneous, and distributed web content. It builds on the first phase of the project, in which our subproject researched declarative information extraction. In the second phase, we focus on the analysis of textual web data, including the dynamic acquisition of such data through focused web crawls. This requires a number of enhancements to Stratosphere, such as specific operators for web data, methods that use advanced information extraction to improve the accuracy of focused crawling, and a novel execution model that supports the inherently iterative nature of focused crawling.
The second research area targets the implementation, optimization, and efficient execution of SWfs as dataflow programs, in particular analysis workflows for next-generation sequencing data in the life sciences. The cost of producing genome data has fallen so steeply that sequencing is close to becoming standard in clinical practice, creating an avalanche of data that must be analyzed by a multitude of programs running in pipelines. SWfs are essentially dataflow programs and are thus generally well suited to being managed by a system such as Stratosphere. However, genome analysis in particular has properties that must be considered during dataflow optimization and parallelization. One is that large-scale sequencing is becoming routine even outside large centers, which means that analysis must also run efficiently on small clusters and should make optimal use of multi-core machines.
Another property of this area is that no single, standardized analysis pipeline exists; rather, genomes are subjected to various independent analyses, resulting in complex workloads with partly overlapping functionality, which leaves much room for holistic workload optimization.
DFG Programme
Research Units