Classification and Intelligent Search on Information in XML
Final Report Abstract
The project has made major contributions to advancing the state of the art in XML IR. The most prominent achievements are the following.
• Foundations of query models and languages for ranked retrieval (WP1a, WP D): The models that underlie the query languages of XIRQL (the model of the HyREX engine at Duisburg), XXL (the first XML IR prototype at MPII), and TopX (the second XML IR prototype at MPII) have been fully formalized and have matured. Extensions that go beyond XML trees and capture interlinked graphs of XML data have been addressed as well. The collaboration between the Duisburg and MPII groups helped to identify key issues and to better understand important subtleties, most notably the notion of relevance weights for XML subtrees and their propagation among indexing units (a key issue for query-result scoring).
• Efficiency and scalability (WP1b, WP1c, WP1d, WP C): We have developed novel techniques for indexing and query processing that can accelerate ranked retrieval of XML data. The indexing methods make use of data structures that encode transitive paths in trees and graphs; these methods provide a good balance between space and time efficiency. For efficient query processing, various methods for top-k queries have been developed, including approximation techniques with probabilistic guarantees and methods that can flexibly trade off efficiency gains for a small loss in effectiveness (recall and precision). These methods have shown excellent performance in large-scale experiments on a variety of data collections.
• Coping with heterogeneous sources (WP2a, WP2b, WP3a, WP B): A highly versatile ontology service has been developed that supports a variety of ontological relationships between concepts and quantifies them, for scoring in queries, based on corpus statistics. This service can be used for query expansion and classification.
Queries are expanded in order to relax tag names and content terms using ontologically strongly related concepts. The expansion is performed in an incremental, on-demand manner, in order to minimize the run-time overhead of complex queries and to make the expansion self-throttling for robustness.
• Interactive retrieval (WP A): For interactive retrieval and exploration, various forms of guided browsing and relevance feedback have been developed. Relevance feedback can be gathered at different granularities, such as entire documents or specific subtrees, and it can be used to strengthen positively perceived paths in automatically refined queries or to generate path conditions from simpler content-only queries. Clustering approaches to support k-means clustering and scatter/gather browsing have been implemented and evaluated. Some studies have been conducted on the design of user interfaces for interactive retrieval of structured documents.
• Prototype implementation and experimentation (WP5, WP E): The prototype software of both groups, the XML IR engines HyREX and TopX (and XXL, the predecessor of TopX), has been fully completed and stress-tested. The software was demonstrated at various first-rate conferences (incl. SIGMOD and VLDB), and both groups participated in the INEX benchmark series. Both HyREX and TopX have been made available as open-source software. TopX has served as a reference engine for topic development in INEX 2006. With POLAR, the Duisburg group proposed and implemented a first prototype of a new system based on probabilistic logics which is capable of retrieving annotated and interlinked structured documents. The Duisburg group has played a key role in coordinating the INEX series and leading several INEX tracks over several years, and has provided various software tools for this purpose. The two project partners have also successfully cooperated on coupling two of their software prototypes, the BINGO! focused crawler and the DAFFODIL search tool for federations of digital libraries.
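To illustrate the general idea behind the top-k query processing mentioned under efficiency and scalability, the following is a minimal sketch of the textbook threshold algorithm over score-sorted index lists. It is not the algorithm implemented in TopX or HyREX, which the abstract does not specify; the function name `threshold_top_k`, the input layout, and the sample scores are assumptions made purely for illustration. The sketch shows how a scan can stop early once no unseen document can still enter the top k.

```python
import heapq


def threshold_top_k(index_lists, k):
    """Return the k best (doc_id, score) pairs under summed per-term scores.

    index_lists (hypothetical layout): dict mapping each query term to a
    list of (doc_id, score) pairs sorted by descending score.
    """
    if k <= 0:
        return []
    # Random-access tables: term -> {doc_id: score}
    lookup = {term: dict(lst) for term, lst in index_lists.items()}
    seen = set()
    top = []  # min-heap of (total_score, doc_id), at most k entries
    max_depth = max((len(lst) for lst in index_lists.values()), default=0)

    for depth in range(max_depth):
        threshold = 0.0  # best total score any unseen document could reach
        for lst in index_lists.values():
            if depth >= len(lst):
                continue  # exhausted list adds nothing to the threshold
            doc_id, score = lst[depth]
            threshold += score
            if doc_id in seen:
                continue
            seen.add(doc_id)
            # Random access: fetch this document's score in every list.
            total = sum(table.get(doc_id, 0.0) for table in lookup.values())
            if len(top) < k:
                heapq.heappush(top, (total, doc_id))
            elif total > top[0][0]:
                heapq.heapreplace(top, (total, doc_id))
        # Stop once the k-th best score is at least the threshold.
        if len(top) == k and top[0][0] >= threshold:
            break

    ranked = sorted(top, key=lambda entry: (-entry[0], entry[1]))
    return [(doc_id, total) for total, doc_id in ranked]


# Hypothetical sample data: two query terms, three documents.
sample_lists = {
    "xml": [("d1", 0.875), ("d2", 0.75), ("d3", 0.125)],
    "retrieval": [("d2", 0.75), ("d3", 0.5), ("d1", 0.25)],
}
best_two = threshold_top_k(sample_lists, 2)
```

With the sample data, `best_two` ranks `d2` ahead of `d1`. The approximation techniques with probabilistic guarantees described above would relax the stopping test, which this sketch does not attempt.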
Publications
- Martin Theobald, Claus-Peter Klas: BINGO! And Daffodil: Web Recommendation for Digital Libraries. 7th International Conference on Computer-Assisted Information Retrieval (RIAO), Avignon, France, 2004
- Martin Theobald, Ralf Schenkel, Gerhard Weikum: An Efficient and Versatile Query Engine for TopX Search. VLDB 2005: 625-636
- Martin Theobald, Ralf Schenkel, Gerhard Weikum: Efficient and self-tuning incremental query expansion for top-k query processing. SIGIR 2005: 242-249
- Ingo Frommholz, Norbert Fuhr: Probabilistic, object-oriented logics for annotation-based retrieval in digital libraries. JCDL 2006: 55-64. (See online at: http://news.zdnet.com/ )
- Norbert Fuhr, Norbert Gövert: Retrieval quality vs. effectiveness of specificity-oriented search in XML collections. Inf. Retr. 9(1): 55-70 (2006)
- Norbert Gövert, Norbert Fuhr, Mounia Lalmas, Gabriella Kazai: Evaluating the effectiveness of content-oriented XML retrieval methods. Inf. Retr. 9(6): 699-722 (2006)
- Ralf Schenkel, Martin Theobald: Feedback-Driven Structural Query Expansion for Ranked Retrieval of XML Data. EDBT 2006: 331-348