Project Details
Smart Harvesting 2
Subject Area
Security and Dependability, Operating-, Communication- and Distributed Systems
Term
from 2012 to 2020
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 217852844
Automatically extracting and editing bibliographic data is one of the major problems associated with the maintenance of bibliographic databases. The successor project Smart Harvesting II aims to intensify the productive collaboration between the database providers dblp (computer science) and GESIS (social science) in order to solve common problems.
The predecessor project focused on the development of a learning wrapper that uses the current database to automatically generate extraction rules. However, due to the multitude of technologies used on the web, this approach is not generally applicable. In particular, dynamically generated and updated content (e.g., via AJAX calls) still poses a substantial challenge.
Therefore, the current project prioritizes the development of a wrapper framework for rule-based data extraction that non-computer scientists can operate by means of simple extraction rules. Both navigation and extraction are to be carried out by parsing the underlying DOM trees of the HTML pages. In cooperation with the University of Oxford, we intend to integrate their addressing scheme OXPath (an extension of XPath) into the wrapper. Furthermore, we plan to create monitoring tools that enable non-programmers (e.g., librarians) to oversee the complete data extraction process and tap new data sources.
At the same time, we will revise and edit the existing data pool by means of author disambiguation in order to guarantee a more reliable data basis. The disambiguation software for new data, already established during the predecessor project, will be extended by a further component that detects homonyms and synonyms in the existing data.
Above all, the project is to respond to the discrepancies between the different publication cultures (computer science vs. social science) revealed in the predecessor project, because they require disparate strategies. At GESIS, author pages will be created, whereas at dblp, existing pages will be revised and additional information will be integrated into the disambiguation process.
DFG Programme
Research data and software (Scientific Library Services and Information Systems)
International Connection
United Kingdom
Co-Investigators
Professor Dr. Georg Gottlob; Dr. Cornelia Hedeler