Project Details
Data and Dimensionality Reduction for Large Scale Statistical and Machine Learning Problems
Applicant
Dr. Alexander Munteanu
Subject Area
Theoretical Computer Science
Mathematics
Term
since 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 535889065
The vision of this project is to develop a unified view of data reduction algorithms for statistical learning and machine learning problems. Many statistical learning algorithms were designed for efficient data analysis in a classical sense. With the advent of massive amounts of data, however, even linear-time algorithms can no longer be considered efficient. Clearly, the data must be read once, but all subsequent steps of a learning algorithm are severely constrained in both their memory and their computational resources. To enable viable statistical machine learning on massive data without discarding or reinventing the progress made over the past decades, our goal is to develop so-called \emph{coresets} and \emph{sketches}, which compress the data to a small summary that can be handled efficiently. In doing so, we preserve the statistical structure of the data in such a way that the result of analyzing the small summary is provably close to the result we would obtain from analyzing the full original data.

Coreset constructions and sketching algorithms have been developed for basic statistical problems such as linear regression and have been optimized in their key complexity measures, for instance their target size and their update time on insertions of new points. Comparatively little is known about how to construct coresets and sketches for generalized linear models, Bayesian regression models, and even more complex models, which are less well understood. This leaves a huge gap between the problems that are efficiently sketchable and the complex statistical machine learning methods that practitioners actually employ in modern data analysis tasks. The simple models often occur as basic building blocks, but even slight generalizations of the standard linear problems already run into impossibility results. Much work remains to overcome these limitations through fine-grained, parameterized theoretical analyses beyond the worst case.

To this end, we pursue the development and analysis of coresets and sketches 1) for generalized and Bayesian regression models, 2) for high-dimensional problems and kernel methods, and 3) for large-scale copula models. Studying these aspects will ultimately lead to a broad perspective on the possibilities and limitations of coreset and sketching methods. This will provide a unified view of the plethora of methodological approaches and allow us to develop guidelines for identifying methods that work well in different situations and problem settings.
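As a brief illustration of the approximation guarantee described above, one standard formalization from the coreset literature (our choice of notation, not quoted from the project description) reads as follows: given a dataset $X$, a set of candidate solutions (queries) $Q$, and a cost function $f$, a small weighted subset $C \subseteq X$ is a $(1\pm\varepsilon)$-coreset if the cost on $C$ approximates the cost on $X$ uniformly over all queries:
\[
  (1-\varepsilon)\, f(X, q) \;\le\; f(C, q) \;\le\; (1+\varepsilon)\, f(X, q) \qquad \text{for all } q \in Q .
\]
For example, for least-squares linear regression with data matrix $A \in \mathbb{R}^{n \times d}$ and labels $b \in \mathbb{R}^n$, one takes $f(X, q) = \lVert Aq - b \rVert_2^2$; an oblivious sketching matrix $S \in \mathbb{R}^{s \times n}$ with $s \ll n$ rows analogously satisfies $\lVert S(Aq - b) \rVert_2^2 = (1 \pm \varepsilon)\, \lVert Aq - b \rVert_2^2$ simultaneously for all $q$, so that solving the small sketched problem yields a provably near-optimal solution to the full problem.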
DFG Programme
Research Grants