Statistical theory on finite alphabet structures: inference, algorithms, and applications
Final Report Abstract
Modern data problems consist of increasingly complex, often high-dimensional data sources. In order to extract meaningful and interpretable information it is therefore often necessary to incorporate additional structural assumption or pre-knowledge into the analysis. Arguably, one of the most widely used assumptions in this context is sparsity. This means that, although, a model can include potentially many different variables, only a few of them have a non-zero influence in the model. In this project we focused on a related alternative simplifying assumption, namely, a given finite alphabet (FA) assumption. This means that, although many different variables can enter the model simultaneously, each of them can only take a few different values, represented via the giving FA. There are various domains where such a FA assumption appears naturally. A particularly suitable area of application for FA structures is the field of genetics, which we also mainly, but not exclusively, focused on in this project. For instance, the finitely many base pairs of the DNA constitute a natural FA, but there are various other examples, such as copy-number-variations and single-nucleotide polymorphisms, on both a diploid or haploid level. In the first part of this project we analyzed FA structures for specific matrix factorization problems, which are particularly relevant for genetic data from pooled sequencing. We discovered that FA’s are extremely powerful in this setting. In particular, we were able to show that the optimal statistical accuracy in such estimation problems is essentially the same as if the FA matrix was known exactly. We also derived a computationally efficient algorithm which can achieve this optimal estimation rate. Based on this, we develop a completely new approach for haplotype reconstruction from pool sequencing data of multiple samples, which is both, more accurate as well as computationally more efficient than other methods. In the second part of this project we focused on a high-dimensional Boolean interaction model (where the Boolean components relate to a FA structure), which can be used, for example, to describe higher-order gene interactions. In recent years, there have been several empirical works which found that tree based ensemble methods, such as the random forest algorithm, can be very successful in term of interaction discovery, in particular, for genetics data. To the best of our knowledge, we were the first to obtain a theoretical foundation for these empirical findings. In particular, we showed that consistent interaction discovery of such Boolean interaction components is feasible with random forest. In the third part of this projects we focused on FA structures that correspond to the finite set of inner nodes in a tree structure. On the one hand, we considered the situation where the tree structure is fixed and given. There we were the first to derived a precise statistical testing procedure, treeSeg, to detect sub-branches simultaneously on all hierarchy levels in the tree. treeSeg is applicable in various domains, where detection of sub-groups is of interest. For example, treeSeg can be applied to detect sub-lineages within phylogentic trees (e.g., of virus or bacteria strains), which show a significantly different outcome distribution. It allows for precise uncertainty quantification and therefore, increases interpretability and reproducibility across many fields of science. Besides settings where the tree structure is given in advance (independent of the outcome variables), we also considered settings where the tree is learned directly from the data in a supervised setting via decision tree algorithms. In this context we derived a pipeline, epiTree, which uses such tree structures in order to provide an estimation and testing procedure for gene interactions.
Publications
- Modeling epistatic polygenic contributions to phenotypes using machine learning with boolean interactions
M. Behr, K. Kumbier, M. Aguirre, R. Arnaout, E. Ashley, B. Brown, J. Priest, B. Yu
- Testing for dependence on tree structures, Proceedings of the National Academy of Sciences (PNAS), 117(18):9787-9792 (2020)
M. Behr, M. Azim Ansari, A. Munk, C. Holmes
(See online at https://doi.org/10.1073/pnas.1912957117) - Multiple haplotype reconstruction from allele frequency data, Nature Computational Science (2021)
M. Pelizzola, M. Behr, H. Li, A. Munk, A. Futschik
(See online at https://doi.org/10.1038/s43588-021-00056-5) - Provable Boolean Interaction Recovery from Tree Ensemble obtained via Random Forests
M. Behr, H. Wang, X. Li, B. Yu
(See online at https://doi.org/10.48550/arXiv.2102.11800)