
HySim: Hybrid-Parallel Similarity Search for the Analysis of Big Genomic and Proteomic Data

Subject Area Bioinformatics and Theoretical Biology
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Term from 2016 to 2021
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 329350978
 
Year of creation 2021

Summary of Project Results

Recent years have seen a tremendous increase in the volume of data generated in the life sciences, and the analysis of these datasets poses difficult computational challenges. The objective of the HySim project was to combine methods from big data analytics and high-performance computing into new algorithmic approaches to similarity search for large-scale genomic and proteomic data that are computationally efficient and potentially more accurate than the current state of the art. The corresponding datasets are produced by two types of high-throughput technologies: next-generation sequencing (NGS) and mass spectrometry (MS). The main algorithmic approach is based on locality-sensitive hashing (LSH) with efficient parallelization on big data clusters and multi-GPU systems.

Our initial research questions were:

1. How can we develop and implement novel algorithms for the classification of metagenomic reads, metagenomic abundance estimation, and read error correction based on an LSH approach? How do we evaluate and compare their performance to the state of the art in terms of accuracy, runtime, and memory consumption?
2. How can we develop and implement novel algorithms for feature detection and database search in MS data based on an LSH approach? How do we evaluate and compare their performance to the state of the art in terms of accuracy, runtime, and memory consumption?
3. How do we parallelize these algorithms on big data clusters and HPC systems to scale towards large-scale datasets? How do these approaches compare?

Our research resulted in a number of key findings that were published in leading journals and conferences:

Metagenomics: We have shown that an LSH-based approach can outperform the state of the art in metagenomics. Our new tools MetaCache and AFS-MetaCache achieve significantly better performance than popular tools such as Kraken2 on both simulated and real-world data, for metagenomic classification as well as abundance estimation. Our scalable GPU-accelerated version (MetaCache-GPU) and cluster-based versions (MetaCache-Spark and MPI-MetaCache) achieve order-of-magnitude speedups compared to existing tools. We have also shown that this approach can be successfully adapted to other tasks, such as fast mapping of RNA-seq reads to transcriptomes and rapid activation matrix computation for single-cell RNA-seq reads.

Error Correction: By developing CARE, we have shown that an LSH-based approach can be used to design the first highly accurate yet scalable error correction method based on multiple sequence alignments (MSAs). CARE can reduce the number of false-positive corrections by at least an order of magnitude compared to existing approaches while delivering a similar number of true positives. It also scales efficiently towards billions of reads sequenced from complex genomes (such as human) on both CPUs and GPUs. The multi-GPU versions designed in HySim are based on our new Gossip and WarpCore libraries, which can also be applied to a variety of other problems in bioinformatics and beyond.

Mass Spectrometry: We have developed file formats and infrastructure to support mass spectrometry analysis on big data clusters. Building on these, we have developed a feature detection approach for raw data processing in mass spectrometry that is suited to multi-dimensional datasets. We have designed, implemented, and tested a parametric approach that uses knowledge about expected isotopic distributions. This approach was shown to yield results similar to those of established techniques. Based on this work, we have created a fully non-parametric feature detection technique based on locality-sensitive hashing, which is the first of its kind. Finally, we have designed and validated LSH-based techniques for database search in mass spectrometry and have concluded that the technique is applicable, but has significant advantages only in situations where annotations are intrinsically unreliable.
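
The summary names locality-sensitive hashing as the main algorithmic approach but does not describe a concrete scheme. As an illustration only, the following is a minimal sketch of one common LSH variant for sequences, minhashing over k-mers: each sequence is reduced to the smallest hash values of its k-mers, and similar sequences tend to share many of these values. The function names, the hash function, and the parameters `k` and `sketch_size` are assumptions chosen for this example and are not taken from MetaCache, CARE, or any other HySim tool.

```python
import hashlib


def kmers(seq, k):
    """Yield all overlapping substrings of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (the hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=16, sketch_size=8):
    """Return up to sketch_size smallest distinct k-mer hashes, sorted."""
    hashes = {kmer_hash(km) for km in kmers(seq, k)}
    return tuple(sorted(hashes)[:sketch_size])


def shared_features(sketch_a, sketch_b):
    """Count hash values present in both sketches (a crude similarity score)."""
    return len(set(sketch_a) & set(sketch_b))


if __name__ == "__main__":
    # Hypothetical sequences, invented for illustration.
    reference = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGCCGATCGGATTACAGGCTTAACGGTACA"
    read = reference[5:55]  # a read drawn from the reference
    score = shared_features(minhash_sketch(reference), minhash_sketch(read))
    print("shared sketch features:", score)
```

In a classification setting, sketches of reference sequences would typically be stored in a hash table keyed by feature, so that a query read only needs a handful of lookups rather than a full alignment against every reference.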
