Project Details
Scalable Information Extraction in Stratosphere
Applicant
Professor Dr. Ulf Leser
Subject Area
Security and Dependability, Operating-, Communication- and Distributed Systems
Term
from 2010 to 2015
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 132320961
The main objective of this project is to enable query-based analysis of large quantities of unstructured text. We envision users formulating information extraction (IE) tasks in the Stratosphere query language. Such a query is parsed, optimized, parallelized, executed, and re-optimized on a cloud infrastructure by methods developed in projects A, B, and C by Markl, Freytag, and Kao. This project develops the IE-specific operators, which transform text into structured representations. Furthermore, in cooperation with Project E, we develop operators for the systematic aggregation of extracted information that fully take the uncertainty of that information into account. All IE operators will be configurable to support different IE strategies, geared towards high throughput, high precision / low uncertainty, or high recall. The high-level operator interfaces must be domain-independent, while their concrete instantiations need to be easily adaptable to the text domain at hand. These requirements call for a carefully balanced mixture of simple IE techniques, advanced NLP, and machine learning. All methods developed within this project will be evaluated on large and realistic IE tasks in the biomedical domain.
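To make the idea of strategy-configurable operators and uncertainty-aware aggregation more concrete, the following is a minimal sketch and not the project's actual design or the Stratosphere API. All names (`Strategy`, `DictionaryTagger`, `Extraction`, `aggregate`) and the threshold values are hypothetical. The aggregation uses a noisy-OR combination, which is an assumption made for illustration and treats repeated mentions as independent evidence.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    """Hypothetical IE strategies named after the goals in the description."""
    HIGH_THROUGHPUT = "throughput"
    HIGH_PRECISION = "precision"
    HIGH_RECALL = "recall"


@dataclass(frozen=True)
class Extraction:
    entity: str        # normalized surface form of the extracted mention
    confidence: float  # assumed probability that the extraction is correct


class DictionaryTagger:
    """Hypothetical operator: the interface is domain-independent,
    while the dictionary passed in adapts it to a concrete text domain."""

    # Illustrative confidence cut-offs per strategy (assumed values).
    THRESHOLDS = {
        Strategy.HIGH_PRECISION: 0.8,
        Strategy.HIGH_THROUGHPUT: 0.5,
        Strategy.HIGH_RECALL: 0.0,
    }

    def __init__(self, dictionary, strategy):
        self.dictionary = dictionary  # maps term -> confidence
        self.threshold = self.THRESHOLDS[strategy]

    def extract(self, text):
        """Return one Extraction per dictionary hit that passes the threshold."""
        tokens = re.findall(r"\w+", text.lower())
        return [
            Extraction(token, self.dictionary[token])
            for token in tokens
            if token in self.dictionary
            and self.dictionary[token] >= self.threshold
        ]


def aggregate(extractions):
    """Combine repeated extractions per entity with noisy-OR:
    1 - product(1 - c_i), assuming the mentions are independent."""
    miss_probability = defaultdict(lambda: 1.0)
    for extraction in extractions:
        miss_probability[extraction.entity] *= 1.0 - extraction.confidence
    return {entity: 1.0 - miss for entity, miss in miss_probability.items()}
```

In this sketch, switching from `Strategy.HIGH_PRECISION` to `Strategy.HIGH_RECALL` only changes the threshold, so the same operator interface serves both goals, and swapping the dictionary is what would adapt it to, for example, biomedical text.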
DFG Programme
Research Units