Project Details
Compute and Storage Cluster
Subject Area
Basic Research in Biology and Medicine
Term
Funded in 2021
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 469073465
Storing and efficiently processing data from modern high-throughput methods requires, among other things, a storage and CPU compute cluster. The cluster applied for will mainly store and process metagenomic data (the entire genomes of all bacteria in a given sample) and single-cell research data (single-cell RNA-seq and spatial single-cell transcriptomics). These two data types are currently among the highest-volume types in the molecular data environment. The original compute cluster, acquired in 2013, was designed to process molecular data obtained from microarray technology. This technology is no longer used in the participating groups and has been completely replaced by sequencing. Sequencing data, however, require significantly more storage and computational effort.

Together with its partners, including the Helmholtz Institute for Pharmaceutical Research Saarland (HIPS) and several departments of Saarland University Hospital (UKS), the bioinformatics department generates about 2,000 metagenomes per year with a depth of 15 gigabases per sample. The human genetics and bioinformatics departments of Saarland University, together with their partners, additionally generate and process approximately 2 million RNA single-cell profiles annually using so-called Drop-seq methods. In pilot projects, ATAC-seq is currently performed in addition to RNA-seq, and RNA profiles are collected at subcellular resolution. A single single-cell experiment with 50,000 cells - sequenced in 2 days - requires about 3 TB of storage and 20 days of pure primary data analysis time. During the analysis, intermediate results are sometimes generated that are larger than the primary data themselves.

The requested large-scale system should be able to store at least 100 experiments in parallel and redundantly, while reducing the processing time from 20 days to about 3 days. To achieve this, the system requires at least 1,700 TB of gross storage capacity (for example, 6 x 16 x 18 TB HDDs) and at least 512 compute cores (for example, 16 x 32-core processors) with a clock frequency between 2.5 and 3 GHz. Since the analyses performed are generally memory-intensive, 8 TB of RAM should be available. A decisive factor is avoiding so-called swapping, i.e. the frequent copying of data between RAM and hard disk. Therefore, a total of at least 16 TB of buffer memory and 100 TB of fast data storage (solid-state disks, SSDs) across all processors are required. Furthermore, a 100 Gb network - consisting of network cards and a corresponding switch - is essential so that copying the data does not become a bottleneck. In addition, a so-called metadata server is needed to distribute jobs and processes optimally across the individual components. The metadata server should be equipped with four 32-core processors and 2 TB of RAM.
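The headline sizing figures above can be cross-checked with a short calculation. The sketch below is illustrative only: it recomputes the storage, core, and experiment-capacity numbers from the per-component figures given in the abstract, with an assumed redundancy factor of 2 for the mirrored experiment data (the proposal states "redundantly" but does not specify the factor).

# Illustrative sanity check of the cluster sizing quoted above.
# The 2x redundancy factor is an assumption, not stated in the proposal.

ENCLOSURES, DRIVES_PER_ENCLOSURE, DRIVE_TB = 6, 16, 18
gross_storage_tb = ENCLOSURES * DRIVES_PER_ENCLOSURE * DRIVE_TB      # 6 * 16 * 18 = 1728 TB >= 1700 TB

SOCKETS, CORES_PER_SOCKET = 16, 32
total_cores = SOCKETS * CORES_PER_SOCKET                             # 16 * 32 = 512 cores

EXPERIMENTS, TB_PER_EXPERIMENT, REDUNDANCY = 100, 3, 2               # redundancy factor assumed
experiment_storage_tb = EXPERIMENTS * TB_PER_EXPERIMENT * REDUNDANCY # 600 TB of redundant primary data

print(gross_storage_tb, total_cores, experiment_storage_tb)          # 1728 512 600

The remaining gross capacity beyond the redundant primary data leaves room for the intermediate results mentioned above, which can exceed the primary data in size.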
DFG Programme
Major Research Instrumentation
Major Instrumentation
Compute and Storage Cluster
Instrumentation Group
7000 Data processing systems, central computing facilities
Applicant Institution
Universität des Saarlandes
Leader
Professor Dr. Andreas Keller