Project Details
Trustworthy multi-scale manifold learning for genomic and transcriptomic data
Applicant
Dr. Dmitry Kobak
Subject Area
Bioinformatics and Theoretical Biology
Term
since 2021
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 471473934
In recent years, large high-dimensional datasets have become commonplace in biology. For example, single-cell transcriptomics routinely produces datasets with sample sizes in the hundreds of thousands of cells and dimensionality in the tens of thousands of genes. Similarly, genomic datasets can encompass the genomes of hundreds of thousands of people, profiled using millions of single-nucleotide polymorphisms. One defining feature of such datasets is their hierarchical organization, with biologically meaningful structure present on several levels. Such datasets require adequate computational methods for data analysis, including unsupervised data exploration, that allow researchers to compactly represent and make sense of their data. It is common practice in single-cell transcriptomics to generate low-dimensional embeddings of the data using algorithms such as t-SNE or UMAP, but the existing methods fall short of representing the hierarchical structure of the data. Whereas they excel at preserving local structure, they are unable to recapitulate the larger-scale global structure often present in the data, which makes it difficult to interpret the embedding correctly. In this project, our first aim is to develop a dimensionality reduction method able to preserve crucial properties of high-dimensional data, such as local cluster structure, continuous trajectories, and global hierarchical organization. The second aim is to develop a suite of quality metrics that will allow us to benchmark existing and novel algorithms on a range of challenging datasets. Finally, the third aim is to adapt this machinery to ultra-high-dimensional data from population genomics. On the technical level, we are going to rely on k-nearest-neighbour graphs and graph coarse-graining. Our work will be useful in practical applications in biology and bioinformatics, while at the same time being of high interest to the manifold learning part of the machine learning community.
DFG Programme
Research Grants