FADeBaC Sentiment Analysis - Fully Automatic DEnsity-BAsed Clustering applied to Sentiment Analysis
Final Report Abstract
The project focused on three aspects of density-based clustering:
• Development and statistical analysis of adaptive density-based clustering algorithms
• Foundations of clusterings with ground truth
• Implementation of density-based clustering algorithms

Regarding the first aspect, we achieved a significant breakthrough by designing the first density-based clustering algorithm that is provably adaptive to various unknown, non-parametric properties of the data-generating distribution. In other words, we not only established the best possible convergence rates for a class of clustering algorithms that are given these properties of the distribution, but also designed a hyper-parameter selection strategy for these algorithms that achieves the same rates without knowing these properties. These algorithms are based on a generic clustering algorithm that only requires estimates of the density level sets with a certain uncertainty control. It turned out that a variety of density estimation methods provide such control.

Regarding the second aspect, we found a set of axioms that, on the one hand, makes it possible to consider various geometric notions of clustering for simple sets and, on the other hand, guarantees that each such notion can be uniquely extended to an axiom-preserving clustering notion for a large set of complicated distributions. As a consequence, we not only gave an axiomatic foundation of density-based clustering, but also identified several other notions of clustering that enjoy such an axiomatic foundation describing infinite-sample ground truth.

Finally, we implemented a new density-based clustering package in C/C++ that not only follows the statistical insights of the first aspect, but is also orders of magnitude faster than existing density-based clustering packages. Moreover, it contains a fully automated hyper-parameter selection routine, and bindings to standard languages such as Python, R, and Matlab are currently being written.
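To make the idea of clustering via estimated density level sets more concrete, the following is a minimal illustrative sketch. It is not the adaptive algorithm or the C/C++ package described above. It assumes a plain Gaussian kernel density estimate and hand-picked values for the bandwidth `h`, the density level `rho`, and the connectivity radius `tau`; all of these names and values are hypothetical, and no automatic hyper-parameter selection is attempted.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def kde(points: List[Point], x: Point, h: float) -> float:
    """Gaussian kernel density estimate at x with bandwidth h (2-D)."""
    norm = 1.0 / (len(points) * 2.0 * math.pi * h * h)
    total = 0.0
    for p in points:
        d2 = (p[0] - x[0]) ** 2 + (p[1] - x[1]) ** 2
        total += math.exp(-d2 / (2.0 * h * h))
    return norm * total


def level_set_clusters(points: List[Point], h: float, rho: float,
                       tau: float) -> List[List[Point]]:
    """Keep sample points whose estimated density is at least rho, then
    group them into components connected by steps of length <= tau."""
    dense = [p for p in points if kde(points, p, h) >= rho]
    labels = [-1] * len(dense)
    n_clusters = 0
    for i in range(len(dense)):
        if labels[i] != -1:
            continue
        # Depth-first search over the tau-neighbourhood graph.
        labels[i] = n_clusters
        stack = [i]
        while stack:
            j = stack.pop()
            for k in range(len(dense)):
                if labels[k] == -1 and math.dist(dense[j], dense[k]) <= tau:
                    labels[k] = n_clusters
                    stack.append(k)
        n_clusters += 1
    clusters: List[List[Point]] = [[] for _ in range(n_clusters)]
    for p, lab in zip(dense, labels):
        clusters[lab].append(p)
    return clusters


# Toy data (an assumption for illustration): two 5x5 grids plus one
# isolated point lying in the low-density region between them.
outlier = (2.5, 2.5)
points = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
points += [(5.0 + 0.1 * i, 5.0 + 0.1 * j) for i in range(5) for j in range(5)]
points.append(outlier)

clusters = level_set_clusters(points, h=0.3, rho=0.1, tau=0.15)
print([len(c) for c in clusters])  # prints [25, 25]
```

In this toy setting the isolated point falls below the level `rho` and is discarded, while the two grids form separate components. The result depends heavily on `h`, `rho`, and `tau`, which is the kind of manual tuning that an automated hyper-parameter selection strategy is meant to remove.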
Publications
- Fully adaptive density-based clustering. Ann. Statist., 43:2132–2167, 2015. + 2 supplements, 52 pages in total
I. Steinwart
(See online at https://doi.org/10.1214/15-AOS1331)
- Towards an axiomatic approach to hierarchical clustering of measures. J. Mach. Learn. Res., 16:1949–2002, 2015
P. Thomann, I. Steinwart, and N. Schmid
- Kernel density estimation for dynamical systems. Technical report, Fakultät für Mathematik und Physik, Universität Stuttgart, 2016
H. Hang, I. Steinwart, Y. Feng, and J.A.K. Suykens
- Adaptive clustering using kernel density estimators. Technical report, Fakultät für Mathematik und Physik, Universität Stuttgart, 2017
I. Steinwart, B.K. Sriperumbudur, and P. Thomann
- Sobolev norm learning rates for the regularized least-squares algorithm. Technical report, Fakultät für Mathematik und Physik, Universität Stuttgart, 2017
S. Fischer and I. Steinwart