Project Details
HySim: Hybrid-parallel similarity search for the analysis of big genomic and proteomic data
Subject Area
Bioinformatics and Theoretical Biology
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Term
from 2016 to 2021
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 329350978
Recent years have seen a tremendous increase in the volume of data generated in the life sciences. The analysis of these data sets poses difficult computational challenges and is an active field of research. Currently, a popular strategy in data-rich scenarios across many areas of science and industry is to adopt big data technologies. However, the characteristics of typical biological data sets and their intended uses differ significantly from most other big data application areas. Biological data processing often requires more complex analysis techniques than big data technology can afford, as it is often constrained to algorithms or heuristics with linear or sublinear complexity. In many application scenarios, rough approximations of the true outcomes are perfectly acceptable, but in the life sciences this is rarely the case: a biomedical application will typically be unable to tolerate even moderate numbers of classification mistakes. Consequently, the computational life sciences today tend to rely on a different computational model for large-scale applications, namely high performance computing (HPC). However, HPC is tailored more towards problems with a significant amount of computational work (big compute) than towards those with enormous storage requirements (big data). The peculiarities of biological data sets and the complexity of the required data analysis pose challenges that neither of the two approaches is perfectly suited to overcome. Instead, a hybrid approach combining ideas from big data with HPC methodologies might be preferable, as ideas from big data algorithms can help flexible and highly performant HPC methods scale towards data sets that would otherwise be too large for them.

In this project, we propose to study such hybrid methods in order to meet the challenge of processing large-scale genomic and proteomic data sets efficiently yet accurately. Our particular focus is similarity search, an important algorithmic technique for a number of applications in both genomics and proteomics. The corresponding data sets are produced by two types of high-throughput technologies: Next Generation Sequencers (NGS) and Mass Spectrometers (MS).

Our specific project goals are threefold: (i) design of efficient and accurate big data algorithms for similarity search in NGS data, with applications to metagenomics and read error correction, based on locality sensitive hashing (LSH) techniques; (ii) design of efficient and accurate big data algorithms for similarity search in MS raw data, with applications to proteomics, based on LSH techniques; (iii) development of efficient implementations of these new algorithms on a hybrid big data/HPC platform that provide strong scalability for large-scale NGS and MS data sets.
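To illustrate the kind of LSH technique underlying goals (i) and (ii), the following is a minimal sketch of MinHash-based similarity estimation over the k-mer sets of sequencing reads. The reads, the choice of k = 5, and the number of hash functions are illustrative assumptions for this sketch, not parameters taken from the project.

```python
import random

def kmers(seq, k=5):
    """Decompose a sequence into its set of overlapping k-mers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(items, seeds):
    """MinHash signature: for each seed, keep the minimum hash over the set.
    Two sets agree in a given signature slot with probability equal to their
    Jaccard similarity, which is what makes MinHash locality sensitive."""
    return tuple(min(hash((seed, x)) for x in items) for seed in seeds)

# Illustrative reads (hypothetical data, not from the project).
reads = {
    "r1": "ACGTACGTGGTT",
    "r2": "ACGTACGTGGAT",   # one substitution away from r1
    "r3": "TTTTGGGGCCCC",   # unrelated sequence
}

rng = random.Random(42)
seeds = [rng.getrandbits(32) for _ in range(64)]

signatures = {name: minhash_signature(kmers(seq), seeds)
              for name, seq in reads.items()}

def estimated_jaccard(a, b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(estimated_jaccard(signatures["r1"], signatures["r2"]))  # high: similar reads
print(estimated_jaccard(signatures["r1"], signatures["r3"]))  # low: dissimilar reads
```

In a full LSH index, such signatures would additionally be split into bands and hashed into buckets so that only reads colliding in at least one band are compared, avoiding an all-pairs comparison over the entire data set.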
DFG Programme
Research Grants