Project Details

InVenod - Interactive Processing of Documents Not Suited to OCR (interaktive Verarbeitung nicht OCR-geeigneter Dokumente)

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term from 2008 to 2014
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 62297683
 
Final Report Year 2014

Final Report Abstract

In recent years, numerous projects have addressed digitisation, with the main focus on preserving cultural heritage. To enable worldwide access, books and library collections have been digitised. A digitised book, however, is essentially a collection of images. To access the text contained in these images, the text passages must be extracted. Here, image processing reaches its limits: a single document may combine complex layouts or fonts, different languages that use different symbol systems, and other components. Books published before the twentieth century are a particular challenge. They contain complex fonts, many special characters, and even symbols that are unknown today, as well as non-standardised layouts, which makes them almost impossible to access by means of search engines. Ageing phenomena such as yellowed, distorted, or blotted pages cause further problems. Currently established optical character recognition software cannot process these documents successfully. Thus, not all books from the past that are available today can be digitised. To make the contents of such books available, it is essential to develop new methods that can also process historic fonts.
The basic idea is rather simple. For each document, which might be a single certificate, a letter, or a whole book, a document-specific font is generated from that document. This font is derived from visual features extracted from the document images. These features make it possible to group characters and symbols, since at this stage no assumptions are made about any underlying language. Because the visual features are mainly arbitrarily complex shape features, the font of a historic document can itself be arbitrarily complex.
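
The grouping of glyphs by shape alone, without any language model, can be illustrated with a minimal sketch. It is not the InVenod implementation: the report does not specify the shape features or the clustering method, so the sketch assumes glyphs are already segmented into equally sized binary bitmaps, uses a simple pixel-wise Hamming distance as a stand-in for the shape features, and clusters greedily. The names `hamming`, `cluster_glyphs`, and `max_dist` are hypothetical.

```python
from typing import List, Tuple

# Assumption: each glyph is a binary bitmap (rows of 0/1), all of the same size.
Glyph = Tuple[Tuple[int, ...], ...]


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def cluster_glyphs(glyphs: List[Glyph], max_dist: int) -> List[List[int]]:
    """Greedy clustering by shape similarity.

    Each glyph joins the first cluster whose representative (the first
    member of that cluster) differs in at most max_dist pixels; otherwise
    it opens a new cluster. Returns lists of glyph indices.
    """
    clusters: List[List[int]] = []
    for i, glyph in enumerate(glyphs):
        for members in clusters:
            if hamming(glyphs[members[0]], glyph) <= max_dist:
                members.append(i)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([i])
    return clusters


# Toy 3x3 glyphs: a vertical bar, a ring, and a slightly damaged bar.
bar = ((0, 1, 0), (0, 1, 0), (0, 1, 0))
ring = ((1, 1, 1), (1, 0, 1), (1, 1, 1))
bar_noisy = ((0, 1, 0), (0, 1, 0), (0, 1, 1))

clusters = cluster_glyphs([bar, ring, bar_noisy], max_dist=2)
print(clusters)  # [[0, 2], [1]]
```

Each resulting cluster plays the role of one entry in the document-specific font. A real system would need features that tolerate scaling, distortion, and the ageing effects mentioned above, which plain pixel comparison does not.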
For the InVenod project, shapes are the lingua franca. Recognition of the underlying characters is not part of the core InVenod workflow. However, a subsequent step that maps characters to glyph clusters makes it possible to recognise the extracted symbols with respect to a particular alphabet. The InVenod project thus represents the very first step towards applying advanced technologies, for instance digital content analysis, to ancient documents. The main goal of the InVenod project, the reproduction of ancient books, shows that the general approach works. Its performance has been demonstrated successfully on printings in Gothic type, ranging from several excerpts of the Gutenberg Bible to documents of the 19th century, and on typewriter fonts of the twentieth century.
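
The subsequent mapping of characters to glyph clusters can be sketched in the same spirit. This is an illustration under assumptions, not the project's code: it supposes that each glyph on a page has already been assigned a cluster id and that a person or tool has supplied a character for some of the clusters. The names `transcribe`, `cluster_sequence`, `labels`, and `unknown` are hypothetical.

```python
from typing import Dict, List


def transcribe(cluster_sequence: List[int],
               labels: Dict[int, str],
               unknown: str = "?") -> str:
    """Turn a sequence of glyph cluster ids into text.

    labels maps a cluster id to a character of the chosen alphabet;
    clusters without a label are rendered with the unknown marker.
    """
    return "".join(labels.get(cluster_id, unknown)
                   for cluster_id in cluster_sequence)


# Clusters 0 and 1 have been labelled; cluster 2 has not.
text = transcribe([0, 1, 0, 2], {0: "a", 1: "b"})
print(text)  # aba?
```

Because the labels sit on clusters rather than on individual glyphs, labelling one cluster transcribes every occurrence of that shape in the document at once.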

 
 

