Project Details
Transferable retention time prediction for Liquid Chromatography-Mass Spectrometry-based metabolomics
Subject Area
Bioinformatics and Theoretical Biology
Term
from 2019 to 2023
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 425789784
Metabolite identification still represents the major bottleneck in metabolomics. Liquid Chromatography-Mass Spectrometry (LC-MS) is currently the most widely used analytical technique in untargeted metabolomics. At present, fewer than 10% of spectra in a typical untargeted experiment can be annotated. Therefore, there is a strong need for improved tools for metabolite identification. While mass alone cannot identify molecules, tandem MS yields fragmentation spectra which can be used for structural elucidation. Recently, in silico approaches that allow searching molecular structure databases such as PubChem and ChemSpider have been developed and are increasingly used by the metabolomics community. Such structure databases are many orders of magnitude larger than any spectral library and, hence, have a much wider coverage of molecular structures. But even identification by tandem MS will result in numerous spurious identifications. To improve identification quality, two independent parameters, e.g. mass and retention time of a chemical reference standard, have to be reported. Today, retention time is mainly used at a later stage of the identification pipeline, and mainly based on comparison with chemical reference standards. However, it would clearly be beneficial if retention times were used at an early stage, in particular for in silico methods; here, we could filter candidates or, even better, modify the scores of candidates based on comparing predicted and observed retention times.
This project aims to make better use of retention times for the identification of small biomolecules in LC-MS-based untargeted metabolomics, using transferable retention time prediction. Prediction will be based on a two-step approach.
First, machine learning will be used to predict retention order numbers for given molecular structures; training will be based on an extensively curated collection of retention time data from publicly available datasets, as well as systematic in-house measurements of reference metabolite standards. In contrast to its mass, retention time is not a feature of a metabolite alone, but of the combination of metabolite, stationary phase and mobile phase. Therefore, we will use properties of the employed chromatographic system in addition to molecular fingerprints of metabolites for machine learning. In the second step, retention order numbers will be mapped to retention times, using known and identified substances as anchors of the mapping. Retention order and retention time prediction will be used to filter false positive reaction pairs, and will be applied to an independent biological dataset from C. elegans secondary metabolism.
All curated and acquired data, as well as open-source software for the prediction of retention order and retention times, will be made freely available to the metabolomics community. Finally, retention time prediction will be integrated into the CSI:FingerID scoring in order to improve its metabolite identification rates.
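The second step of the approach described above, together with the idea of modifying candidate scores, can be illustrated with a minimal sketch. It is not the project's software: the function names, the piecewise-linear interpolation between anchors, the Gaussian penalty and the `sigma` tolerance are all assumptions made for illustration, and the retention order numbers are taken as already predicted by the first step.

```python
from bisect import bisect_right
import math


def map_order_to_rt(anchors, order_index):
    """Map a retention order number to a retention time (minutes).

    anchors: (order_number, observed_retention_time) pairs for substances
    already identified on the target chromatographic system.
    Piecewise-linear interpolation is an illustrative assumption.
    """
    pts = sorted(anchors)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    if len(pts) < 2:
        raise ValueError("need at least two anchors")
    if any(x1 == x0 for x0, x1 in zip(xs, xs[1:])):
        raise ValueError("anchor order numbers must be distinct")
    if any(y1 < y0 for y0, y1 in zip(ys, ys[1:])):
        raise ValueError("anchor retention times must follow the retention order")
    # Outside the anchor range, clamp to the nearest anchor instead of extrapolating.
    if order_index <= xs[0]:
        return ys[0]
    if order_index >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, order_index)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (order_index - x0) / (x1 - x0)


def rescore(candidates, observed_rt, sigma=0.5):
    """Down-weight candidates whose predicted retention time is far from the observed one.

    candidates: (name, ms_score, predicted_rt) tuples.
    The Gaussian weight and sigma (minutes) are hypothetical choices.
    """
    rescored = [
        (name, score * math.exp(-0.5 * ((rt - observed_rt) / sigma) ** 2))
        for name, score, rt in candidates
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Hypothetical anchors: (retention order number, observed retention time in minutes).
anchors = [(10, 1.0), (20, 3.0), (40, 9.0)]
# Hypothetical candidates: (name, tandem MS score, predicted retention order number).
candidates = [("A", 0.9, 30), ("B", 0.8, 15)]
with_rt = [(n, s, map_order_to_rt(anchors, o)) for n, s, o in candidates]
ranked = rescore(with_rt, observed_rt=2.0, sigma=1.0)
```

In this toy example, candidate A has the higher tandem MS score, but its predicted retention time lies far from the observed one, so candidate B ranks first after rescoring. Multiplying the score by a smooth weight, rather than applying a hard cutoff, corresponds to modifying scores instead of filtering candidates outright.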
DFG Programme
Research Grants