Project Details
Lifespan AI - Project M1: Normalizing Flows for Lifespan Health Data
Subject Area
Epidemiology and Medical Biometry/Statistics
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term
since 2022
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 459360854
Health sciences use statistical models to quantify health status and outcomes as they evolve over time and are influenced by risk factors and/or treatments. The quantification of uncertainty is a crucial methodological contribution of statistics here. Since trend and uncertainty are best described in terms of distributions, a model for the joint distribution of all variables and features over all time points under consideration provides the most complete statistical picture. This permits, at least formally, all conclusions statistics can provide without untestable assumptions.
Such an approach is not straightforward and requires machine learning methods when the data are high dimensional. Further challenges arise in lifespan data, where the data are of different scales, are measured at individual-specific time points and come from different data sources with only partial overlap or from study cohorts whose variable sets change over time. In this project, we tackle these challenges by developing normalizing flows that utilise invertible residual neural networks and use generalised linear mixed models (GLMMs) for the base distributions. The use of GLMMs is particularly attractive for health scientists because these models are frequently used in applications. Normalizing flows based on invertible residual networks have the major advantage of providing analytical expressions for the joint distribution that can be readily utilised for the anticipated statistical and scientific conclusions. The key idea of our method is to start by modelling the conditional distribution of each variable given the other variables by nonlinear transformations of GLMMs and to combine them into a global joint distribution by averaging randomly selected sequential factorisations, which can be achieved by sequentially reducing the set of variables we condition on in the conditional distributions (“reverse marginalisations”).
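As a rough illustration of the two ingredients named above, the following sketch writes out the standard change-of-variables formula for a normalizing flow and one possible reading of the averaged sequential factorisation. All symbols are assumptions introduced here for illustration and do not appear in the project description: f is the flow, g_l the residual blocks, p_Z the base density (in this project a GLMM), pi_k a randomly drawn ordering of the d variables, and K the number of orderings averaged. The project's actual construction may differ.

```latex
% Change of variables for an invertible map f (standard result);
% p_Z is the base density, assumed here to be given by a GLMM.
\[
  p_X(x) \;=\; p_Z\bigl(f(x)\bigr)\,
  \left|\det \frac{\partial f(x)}{\partial x}\right|,
  \qquad
  f = f_L \circ \dots \circ f_1 .
\]

% Residual form of each layer; a Lipschitz constant below 1 for g_l
% is a sufficient condition for f_l to be invertible.
\[
  f_l(h) \;=\; h + g_l(h),
  \qquad
  \operatorname{Lip}(g_l) < 1 .
\]

% One possible reading of the averaged sequential factorisation:
% each ordering \pi_k yields a chain-rule factorisation of the joint
% density, and K randomly selected orderings are averaged.
\[
  \hat p(x_1,\dots,x_d) \;=\;
  \frac{1}{K}\sum_{k=1}^{K}\;\prod_{j=1}^{d}
  p\bigl(x_{\pi_k(j)} \,\big|\, x_{\pi_k(1)},\dots,x_{\pi_k(j-1)}\bigr).
\]
```

In this reading, the factor for j = d is a full conditional (one variable given all the others), and the factors for smaller j condition on successively fewer variables, which matches the description of sequentially reducing the conditioning set.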
After deriving an approach to fit a joint model with complete observations, we will develop an algorithm for updating the model with incomplete observations and extending it by additional variables. The joint distribution will then be utilised to obtain (overall regularised) estimates of scientifically interesting conditional distributions from which point and interval predictions can be derived. We will also develop methods to interpret the overall (black box) model and to investigate its internal and external validity. While the joint model already accounts for aleatoric uncertainty (by modelling distributions), we will work out methods to also account for epistemic uncertainty, i.e. approaches for quantifying the model fit and accounting for model uncertainty in prediction intervals. The new methods will be applied to and illustrated with data and variables from the IDEFICS/I.Family cohort, the NAKO Health Study and GePaRD data, all of which are collected and/or managed by BIPS.
DFG Programme
Research Units