AI-GUSTUS: a cloud-native pipeline for accurate genome annotation

Subject Area Bioinformatics and Theoretical Biology
Term since 2024
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 552910312
The structural annotation of protein-coding genes in eukaryotic genomes is a challenging problem. The Earth BioGenome Project (EBP) intends to sequence up to 1.7 million eukaryotic genomes within this decade. To date, the majority of existing eukaryotic genomes lack an annotation of protein-coding genes, and this problem will grow by orders of magnitude if methods are not improved. This research project aims to improve the fully automated structural annotation of protein-coding genes in eukaryotes by connecting a promising deep learning approach with the established software framework surrounding the state-of-the-art gene finder AUGUSTUS. This connection is expected to establish a new state of the art in terms of accuracy and flexibility. We will tackle several open problems: the automatic compilation of balanced training sets for clades, the integration of extrinsic evidence into the deep learning architecture, and the prediction of alternative splicing isoforms with a deep learning-based gene finder. Building on the existing code base of AUGUSTUS will make it easier to connect to evidence-generating tools, such as spliced alignment tools. The resulting software, AI-GUSTUS, will be available as a user-friendly pipeline, thus directly supporting the global scientific community in its research. This research project addresses several challenges that have previously been identified by the EBP Committee on Annotation Standards. As a result of this project, large-scale genome annotation within and beyond the EBP will become more efficient and accurate. Indirectly, this project will contribute to the conservation of biodiversity, to monitoring and preventing the spread of pathogens, and to enhancing ecosystem services.
