HySim: Hybrid-parallel similarity search for the analysis of big genomic and proteomic data
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Final Report Abstract
Recent years have seen a tremendous increase in the volume of data generated in the life sciences. The analysis of these datasets poses difficult computational challenges. The objective of the HySim project has been to combine methods from big data analytics and high-performance computing in order to develop new algorithmic approaches to similarity search for the analysis of large-scale genomic and proteomic data. These approaches are intended to be computationally efficient and potentially more accurate than the current state of the art. The corresponding datasets are produced by two types of high-throughput technologies: next-generation sequencing (NGS) and mass spectrometry (MS). The main algorithmic approach is based on Locality Sensitive Hashing (LSH) with efficient parallelization on big data clusters and multi-GPU systems.

Our initial research questions were:
1. How can we develop and implement novel algorithms for the classification of metagenomic reads, metagenomic abundance estimation, and read error correction based on an LSH approach? How do we evaluate and compare their performance to the state of the art in terms of accuracy, runtime, and memory consumption?
2. How can we develop and implement novel algorithms for feature detection and database search in MS data based on an LSH approach? How do we evaluate and compare their performance to the state of the art in terms of accuracy, runtime, and memory consumption?
3. How do we parallelize these algorithms on big data clusters and HPC systems to scale towards large-scale datasets? How do these approaches compare?

Our research resulted in a number of key findings that were published in leading journals and conferences:

Metagenomics: We have shown that an LSH-based approach can outperform the state of the art in metagenomics. Our new tools MetaCache and AFS-MetaCache achieve significantly better performance than popular tools such as Kraken2 on both simulated and real-world data, for metagenomic classification as well as abundance estimation. In addition, our scalable GPU-accelerated version (MetaCache-GPU) and cluster-based versions (MetaCache-Spark and MPI-MetaCache) achieve order-of-magnitude speedups compared to existing tools. Furthermore, we have shown how this approach can be successfully adapted to other tasks such as fast mapping of RNA-seq reads to transcriptomes and rapid activation matrix computation for single-cell RNA-seq reads.

Error Correction: By developing CARE, we have shown that an LSH-based approach can be used to design the first highly accurate yet scalable error correction method based on multiple sequence alignments (MSAs). CARE can reduce the number of false positive corrections by at least an order of magnitude compared to existing approaches while delivering a similar number of true positives. It also scales efficiently towards billions of reads sequenced from complex genomes (such as human) on both CPUs and GPUs. The multi-GPU versions designed in HySim are based on our new Gossip and WarpCore libraries, which can also be applied to a variety of other problems in bioinformatics and beyond.

Mass Spectrometry: We have developed file formats and infrastructure to support mass spectrometry analysis on big data clusters. Based on these underlying structures, we have developed a feature detection approach for raw data processing in mass spectrometry that is suited to multi-dimensional datasets. We have designed, implemented, and tested a parametric approach that uses knowledge about expected isotopic distributions. This approach was shown to yield results similar to those of established techniques. Based on this work, we have created a fully non-parametric feature detection technique based on locality sensitive hashing, which is the first of its kind. Finally, we have designed and validated LSH-based techniques for database search in mass spectrometry and have concluded that the technique is applicable but offers significant advantages only in situations where annotations are intrinsically unreliable.
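The minhashing idea behind LSH-based similarity search on sequencing reads can be illustrated with a short Python sketch. It is a simplified illustration only and not the implementation used in MetaCache, CARE, or any other HySim tool: the function names, the choice of BLAKE2b as hash function, and the parameters k (k-mer length) and s (sketch size) are assumptions made for this example. Each sequence is reduced to the s smallest hash values of its k-mers, and two sequences are considered similar when their sketches share many values.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer.

    hashlib is used because Python's built-in hash() of strings is
    salted per process and would not be reproducible across runs.
    """
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=8, s=16):
    """Bottom-s sketch: the s smallest distinct k-mer hash values of seq."""
    return sorted(kmer_hash(km) for km in kmers(seq, k))[:s]


def sketch_similarity(a, b):
    """Fraction of hash values shared by two sketches.

    This is only a rough proxy for the Jaccard similarity of the
    underlying k-mer sets, chosen here for simplicity (assumption).
    """
    if not a or not b:
        return 0.0
    return len(set(a) & set(b)) / min(len(a), len(b))


if __name__ == "__main__":
    # Hypothetical toy sequences, not real genomic data.
    reference = "ACGTTGCATGCCGATAGCTAGCTAGGATCCGATCGATTACG"
    read = reference[5:35]           # read drawn from the reference
    unrelated = "TTTTTTTTTTTTGGGGGGGGGGGGCCCCCCCCCCCC"
    print(sketch_similarity(minhash_sketch(reference), minhash_sketch(read)))
    print(sketch_similarity(minhash_sketch(reference), minhash_sketch(unrelated)))
```

In a database setting, the sketch values of the reference sequences would be stored in a hash table, so that a query read only needs s lookups instead of a comparison against every reference.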
Publications
- “MetaCache: context-aware classification of metagenomic reads using minhashing”. In: Bioinformatics 33.23 (2017), pp. 3740–3748
A. Müller, C. Hundt, A. Hildebrandt, T. Hankeln, and B. Schmidt
(See online at https://doi.org/10.1093/bioinformatics/btx520)
- “Gossip: Efficient Communication Primitives for Multi-GPU Systems”. In: Proceedings of the 48th International Conference on Parallel Processing. 2019, pp. 1–10
R. Kobus, D. Jünger, C. Hundt, and B. Schmidt
(See online at https://doi.org/10.1145/3337821.3337889)
- “Suffix Array Construction on Multi-GPU Systems”. In: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. 2019, pp. 183–194
F. Büren, D. Jünger, R. Kobus, C. Hundt, and B. Schmidt
(See online at https://doi.org/10.1145/3307681.3325961)
- “A big data approach to metagenomics for all-food-sequencing”. In: BMC Bioinformatics 21.1 (2020), pp. 1–15
R. Kobus, J. M. Abuín, A. Müller, S. L. Hellmann, J. C. Pichel, T. F. Pena, A. Hildebrandt, T. Hankeln, and B. Schmidt
(See online at https://doi.org/10.1186/s12859-020-3429-6)
- “Big Data in metagenomics: Apache Spark vs MPI”. In: PLoS One 15.10 (2020), e0239741
J. M. Abuín, N. Lopes, L. Ferreira, T. F. Pena, and B. Schmidt
(See online at https://doi.org/10.1371/journal.pone.0239741)
- “RainDrop: Rapid activation matrix computation for droplet-based single-cell RNA-seq reads”. In: BMC Bioinformatics 21.1 (2020), pp. 1–14
S. Niebler, A. Müller, T. Hankeln, and B. Schmidt
(See online at https://doi.org/10.1186/s12859-020-03593-4)
- “WarpCore: A Library for fast Hash Tables on GPUs”. In: 27th IEEE International Conference on High Performance Computing, Data, and Analytics, HiPC 2020, Pune, India, December 16-19, 2020. IEEE, 2020, pp. 11–20
D. Jünger, R. Kobus, A. Müller, C. Hundt, K. Xu, W. Liu, and B. Schmidt
(See online at https://doi.org/10.1109/HiPC50609.2020.00015)
- “CARE: context-aware sequencing read error correction”. In: Bioinformatics 37.7 (2021), pp. 889–895
F. Kallenborn, A. Hildebrandt, and B. Schmidt
(See online at https://doi.org/10.1093/bioinformatics/btaa738)
- “Locality-sensitive hashing enables signal classification in high-throughput mass spectrometry raw data at scale”. 2021
K. Bob, D. Teschner, T. Kemmer, D. Gomez-Zepeda, S. Tenzer, B. Schmidt, and A. Hildebrandt
(See online at https://doi.org/10.1101/2021.07.01.450702)
- “MetaCache-GPU: Ultra-Fast Metagenomic Classification”. 2021
R. Kobus, A. Müller, D. Jünger, C. Hundt, and B. Schmidt
(See online at https://doi.org/10.48550/arXiv.2106.08150)
- “RNACache: Fast Mapping of RNA-Seq Reads to Transcriptomes Using Min-Hashing”. In: Computational Science - ICCS 2021 - 21st International Conference, Krakow, Poland, June 16-18, 2021, Proceedings, Part I. Ed. by M. Paszynski, et al. Vol. 12742. Lecture Notes in Computer Science. Springer, 2021, pp. 367–381
J. Cascitti, S. Niebler, A. Müller, and B. Schmidt
(See online at https://doi.org/10.1007/978-3-030-77961-0_31)