A Library for Scalable Analytics and Mining in Stratosphere
Final Report Abstract
Year after year, the number of documents on the Web, as well as in non-public systems used in private or business contexts, grows significantly. This leads to a considerable increase in valuable textual information. Harvesting knowledge from these large text collections is a major challenge. In contrast to structured data, text lacks inherent metadata. Hence, it is necessary to analyze and characterize it further in order to locate information that satisfies a specific demand or need. This characterization requires various perspectives on the text data at differing granularities, e.g., identifying document types, discovering topic categories, or extracting mentioned locations and persons. In order to target an information need, relevance has to be defined so that an algorithm can be modeled to identify relevant texts. In the context of the Stratosphere II project, we utilize the knowledge represented in various social media platforms with user-generated content to target different information needs. The popularity of online platforms such as Twitter, Wikipedia, and StackOverflow grows continuously, and their inherent knowledge is ready to be harvested. In many cases, an essential feature of such platforms is that authors share information they consider relevant with a peer group. Thus, being human-generated, the information on these platforms is clean, focused, and already disambiguated. We examined how this knowledge can be used in three subprojects. First, we introduced an algorithm to tackle the entity linking problem, a necessity for harvesting entity knowledge from large text collections. The goal is to link mentions within the documents to the real-world entities they refer to. Second, we showed that when searching with ambiguous person names, the information from Wikipedia can be bootstrapped to group the results according to the individuals occurring in them. Third, we explored how categorizing texts according to community-generated folksonomies, which are subject to constant change, helps users identify new information related to their interests.