Project Details
Training of machine-learning based procedures for automated postcorrection of OCRed historical printings
Applicant
Professor Dr. Klaus U. Schulz
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term
from 2020 to 2022
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 431091758
OCR results for historical printings typically contain many recognition errors, so postcorrection methods play an important role in this field. Some automated postcorrection systems developed "individually" for a particular historical OCR corpus have shown good results. However, developing an "omnipotent" general system for automated postcorrection of OCR results, one that offers good results for different OCR engines and arbitrary historical printings, remains an ambitious future goal. Within the OCR-D initiative, OCR postcorrection systems based on supervised machine learning are currently being developed. Ideally, these systems should be applicable to arbitrary OCR engines and historical texts. In this project we want to systematically study the influence of training data and training methods on the quality of the correction results. The long-term goal is the development of an "omnipotent" postcorrection model in the sense described above. As a first step, we look for training data and feature systems that lead to optimal correction results for specific OCR engines and classes of historical printings, and we analyze the correction problems that arise for other OCR engines and corpora. Using these results as a starting point, we search for methods that minimize the additional effort (in terms of ground truth preparation and post-training) needed to develop correction models for larger and inhomogeneous corpora. Specific points to be investigated include the combination of postcorrection models and the automated selection of a correction model for a given new OCR corpus.
DFG Programme
Research Grants