Project Details

A Multilayer Corpus for Ancient Greek and Latin

Subject Area Applied Linguistics, Computational Linguistics
Term since 2018
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 408121292
 
Opera Graeca Adnotata (OGA) and Opera Latina Adnotata (OLA) are the largest open access and scalable morphosyntactically annotated corpora for Ancient Greek and Latin, respectively. Both adopt a standoff annotation approach in which tokens and their morphological and syntactic labels are connected to each other in a graph structure. The corpora build on the data of the Ancient Greek and Latin Dependency Treebank, which have been used to train a neural parser (COMBO) and subsequently to automate the morphosyntactic annotation of (most of) the Ancient Greek and Latin texts of the Perseus Digital Library. Currently, OGA contains 489 annotated files (6,488,472 tokens and 347,517 sentences), while OLA contains 316 (6,755,191 tokens and 411,329 sentences).

The present project aims to enrich these corpora with three further annotation layers, which are considered basic to any literary corpus: (i) an orthographic normalization layer, (ii) a phonemic transcription layer, and (iii) a full lemma layer. Both Ancient Greek and Latin orthographies have varied considerably over time because of differences in spelling conventions and dialects. This calls for the addition of an orthographic normalization layer that groups differently spelled tokens under a common form, thereby linking them and facilitating their retrieval. A phonemic transcription layer associates each token with a phonemic transcription. Since orthographic systems contain idiosyncrasies, phonemic transcriptions enable reliable comparison between words along both the synchronic and the diachronic axis. A full lemma layer pairs a token with a dictionary lemma consisting of its full paradigm and not only its first component, as is the current practice in most treebanks. Only a full lemma provides complete information about a token's morphology, in that it allows fast retrieval or generation of the related inflected word forms and avoids the ambiguities that one-word lemmas may raise.
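The standoff approach and the role of the normalization and full lemma layers can be illustrated with a minimal Python sketch. It does not reproduce the actual OGA/OLA data model or file format: the token identifiers, the layer names, the helper function `group_by_normalized_form`, and the sample Latin entries are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A base-layer token; annotations are stored elsewhere (standoff)."""
    id: str    # hypothetical identifier scheme
    form: str  # spelling as it appears in the text


# Base layer: tokens only (illustrative sample, not corpus data).
tokens = [
    Token("t1", "coelum"),  # variant spelling
    Token("t2", "caelum"),
    Token("t3", "terra"),
]

# Standoff layers: each maps a token id to a label and never alters the tokens.
normalization = {"t1": "caelum", "t2": "caelum", "t3": "terra"}
full_lemma = {
    "t1": "caelum, caeli, n.",
    "t2": "caelum, caeli, n.",
    "t3": "terra, terrae, f.",
}


def group_by_normalized_form(tokens, layer):
    """Group original spellings under their normalized form."""
    groups = defaultdict(list)
    for tok in tokens:
        groups[layer[tok.id]].append(tok.form)
    return dict(groups)


groups = group_by_normalized_form(tokens, normalization)
```

In this sketch, a search for the normalized form "caelum" retrieves both spellings, and the full lemma layer can be consulted through the same token identifiers without touching the base layer.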
DFG Programme Research Grants
 
 
