Project Details
Automated characterization of microbial genomes and metagenomes by collection and verification of association rules
Applicant
Dr. Giorgio Gonnella
Subject Area
Bioinformatics and Theoretical Biology
Term
from 2019 to 2023
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 421071204
Thanks to the falling cost of DNA sequencing, the number of available genomes and metagenomes will likely continue to grow exponentially in the coming years. To cope with this large amount of data, analyses are increasingly automated. Despite this, identifying unexpected or untypical results from the sequence and annotation of genomes and metagenomes still requires considerable manual analysis effort.
This project aims to develop a system for the automatic verification of rules that describe the typical or expected contents of a genome or metagenome. These rules consist of associations of features (e.g. sequence statistics or gene contents) with metadata (e.g. habitat characteristics or taxonomic classifications). Associations are often mentioned informally in the scientific literature describing genomes and metagenomes, in sentences such as "the genome of a microbe living in symbiosis usually contains a reduced set of genes" or "obligate photosynthetic bacteria are usually not found next to deep-sea hydrothermal vents". When new datasets are analyzed, such associations are often verified by ad-hoc manual analyses. These serve two purposes: assessing the quality of the data (e.g. identifying probable contaminations) and investigating untypical or unexpected results. The latter may offer intriguing explanations of known peculiarities and help formulate new hypotheses.
The project will consist of three parts. First, conventions for the representation of associations will be developed (definition of a file format for their storage; integration of existing ontology systems to ensure lexical consistency). Second, databases of rules will be prepared using different approaches (data mining on sequences, annotations and metadata; text mining and manual collection from the scientific literature; collaborative definition by the scientific community). Third, a modular software system will be implemented for the verification of associations in genomic and metagenomic data.
To offer automated analyses, the system will draw on the databases of rules. However, it will also be possible to specify directly the rules to be verified.
The main goal will be to characterize genomes of microbial isolates and metagenomes by finding untypical, possibly unexpected and thus potentially scientifically interesting results. Furthermore, by complementing existing software for phenotype prediction, it will also be possible to apply the system to the characterization of genomes of uncultured organisms. Finally, the identification of unlikely results will help assess the quality of assemblies of microbial genomes.
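As an illustration only, the idea of a rule that associates a feature with metadata, and of verifying it against a dataset, could be sketched as follows. This is a minimal sketch under stated assumptions and not the project's file format or software: the names `AssociationRule` and `find_untypical`, the dictionary layout of metadata and features, and the numeric threshold are all hypothetical and chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AssociationRule:
    """Hypothetical rule: expected range of a feature for a given metadata value."""
    metadata_key: str                 # e.g. "lifestyle"
    metadata_value: str               # e.g. "obligate symbiont"
    feature: str                      # e.g. "gene_count"
    minimum: Optional[float] = None   # lower bound of the expected range, if any
    maximum: Optional[float] = None   # upper bound of the expected range, if any

    def applies_to(self, metadata: Dict[str, str]) -> bool:
        """A rule is relevant only if the sample carries the metadata value."""
        return metadata.get(self.metadata_key) == self.metadata_value

    def is_satisfied(self, features: Dict[str, float]) -> bool:
        """Check whether the observed feature value lies in the expected range."""
        value = features[self.feature]
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


def find_untypical(
    rules: List[AssociationRule],
    metadata: Dict[str, str],
    features: Dict[str, float],
) -> List[AssociationRule]:
    """Return the applicable rules that the sample violates."""
    return [
        rule
        for rule in rules
        if rule.applies_to(metadata)
        and rule.feature in features
        and not rule.is_satisfied(features)
    ]


# Illustrative placeholder values, not measured data.
rules = [
    AssociationRule("lifestyle", "obligate symbiont", "gene_count", maximum=1500.0),
]
sample_metadata = {"lifestyle": "obligate symbiont"}
sample_features = {"gene_count": 4200.0}

violations = find_untypical(rules, sample_metadata, sample_features)
for rule in violations:
    print(f"Untypical: {rule.feature} outside expected range for {rule.metadata_value}")
```

In this sketch a violated rule does not decide whether the result points to a data quality problem (e.g. contamination) or to a genuinely interesting organism; it only flags the dataset for closer inspection, which mirrors the two purposes described in the abstract.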
DFG Programme
Research Grants