Project Details

Smart Harvesting 2

Subject Area Security and Dependability, Operating-, Communication- and Distributed Systems
Term from 2012 to 2020
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 217852844
 
Automatically extracting and editing bibliographic data is one of the major problems associated with the maintenance of bibliographic databases. The successor project Smart Harvesting II aims to intensify the productive collaboration between the database providers dblp (computer science) and GESIS (social science) in order to solve common problems.
The predecessor project focused on the development of a learning wrapper which uses the current database to automatically generate extraction rules. However, due to the multitude of technologies used on the web, this approach is not generally applicable. In particular, dynamically generated and updated content (e.g., via AJAX calls) still poses a substantial challenge.
Therefore, the current project prioritizes the development of a wrapper framework for rule-based data extraction that non-computer scientists can operate by means of simple extraction rules. Navigation as well as extraction are to be performed by parsing the underlying DOM trees of the HTML pages. In cooperation with the University of Oxford, we intend to integrate their addressing scheme OXPath (an extension of XPath) into the wrapper. Furthermore, we plan to create monitoring tools enabling non-programmers (e.g., librarians) to oversee the complete data extraction process and tap new data sources.
At the same time, we shall revise and edit the existing data pool by means of author disambiguation in order to guarantee a more solid data basis. The disambiguation software for new data, already established during the predecessor project, shall be enhanced by a further component that detects homonyms and synonyms in the existing data.
Above all, the project is to respond to the discrepancies between the different publication cultures (computer science vs. social science) that were revealed in the predecessor project, because they require the use of disparate strategies. At GESIS, author pages shall be created, whereas at dblp, existing pages shall be revised and additional information shall be integrated into the disambiguation process.
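The rule-based extraction over a DOM tree described above can be illustrated with a minimal sketch. It is not the project's wrapper framework and does not use OXPath; it assumes a well-formed page and relies only on the limited XPath subset of Python's standard library. The page content, the `RULES` layout, and the `extract` function are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a publisher site.
PAGE = """
<html><body>
  <ul id="pubs">
    <li class="pub"><span class="author">A. Example</span><span class="title">First Hypothetical Paper</span><span class="year">2014</span></li>
    <li class="pub"><span class="author">B. Sample</span><span class="title">Second Hypothetical Paper</span><span class="year">2016</span></li>
  </ul>
</body></html>
"""

# Hypothetical declarative rules: one path selecting each record node,
# plus one relative path per field. No programming is needed to edit them.
RULES = {
    "record": ".//li[@class='pub']",
    "fields": {
        "author": "span[@class='author']",
        "title": "span[@class='title']",
        "year": "span[@class='year']",
    },
}


def extract(page, rules):
    """Parse the page into a tree and apply the rules to every record node."""
    root = ET.fromstring(page.strip())
    records = []
    for node in root.findall(rules["record"]):
        record = {}
        for name, path in rules["fields"].items():
            match = node.find(path)
            # A field whose path matches nothing is recorded as None.
            has_text = match is not None and match.text
            record[name] = match.text.strip() if has_text else None
        records.append(record)
    return records


records = extract(PAGE, RULES)
for record in records:
    print(record)
```

Real web pages are rarely well-formed XML, and content loaded via AJAX calls would not be present in a static parse at all, which is one reason the project looks to a browser-aware addressing scheme rather than plain XPath.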
DFG Programme Research data and software (Scientific Library Services and Information Systems)
International Connection United Kingdom
 
 

