Project Details

Improving variant effect predictions of regulatory sequences in human disease using Machine Learning and High-throughput Assays

Applicant Dr. Max Schubach
Subject Area Bioinformatics and Theoretical Biology
Term since 2021
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 464313370
 
The majority of variants associated with common diseases and an unknown proportion of causal variants for rare diseases fall in non-coding regions of the genome. Although catalogs of regulatory elements are steadily improving, we have a limited understanding of the functional effects of variants within them. In the context of precision medicine, machine learning (ML) methods are developed and applied to prioritize and implicate deleterious variants in human disease. Their major focus has been on coding sequence, and the much larger non-coding part of the human genome remains underexplored. We believe that ML methods can create valuable models of regulatory variant effects, but obtaining comprehensive training and validation datasets remains a major challenge. Massively parallel reporter assays (MPRAs) can overcome this shortage, but are limited in their throughput given the large universe of hundreds of millions of potential variants. Here, we aim to develop improved predictors of regulatory function using ML and high-throughput assays.
We propose an innovative variant selection approach on a genome-wide scale that will select more than 120,000 variants across more than 60,000 regions for MPRA testing in multiple cell-types. We will use deep neural networks trained on active and non-active open chromatin sequences from multiple cell-types to select potential high-effect and no-effect changes. This initial model will provide insights into the sequence encoding of regulatory variant effects in different cell-types, and the resulting MPRA datasets will provide a better understanding of regulatory sequence function when analyzed in the context of available epigenomic datasets. Further, the obtained readouts can be the basis of iterative improvements to the selection strategy and will benefit future MPRA studies.
The derived training dataset will constitute a genome-wide gold standard of quantitative variant effects urgently needed by modeling groups. Integrated with comprehensive sets of publicly available datasets across multiple cell-types and tissues, it will enable us to establish a new generation of regulatory variant predictors. The integration of the new predictions in a genome-wide variant effect prediction framework (CADD) will improve the prioritization of disease-causing variants and make these predictions widely accessible to the clinical community. While initially limited to a small number of well-studied cell-types with comprehensive experimental data, we are confident that principles identified from our analyses and models will transfer to other cell-types, for which datasets are becoming available through recent single-cell epigenetic assays.
We previously developed ML methods for different variant classes and have a long-standing interest in addressing the variant interpretation problem. With our established collaborations and expertise, we are uniquely positioned to develop improved variant effect predictors of regulatory sequences using MPRAs and ML methods.
DFG Programme Research Grants
International Connection USA
 
 

