Welcome to the site of the Greek Dependency Treebank
This website contains information, publications, annotation guidelines and tools related to the Greek Dependency Treebank (GDT), a resource for Modern Greek manually annotated for morphology, syntax and semantics. GDT is an ongoing project led by researchers at the Institute for Language and Speech Processing (ILSP), with the help of students from the Texnoglwssia postgraduate program and the University of Athens.
GDT includes texts from open-content sources and from corpora collected at ILSP as part of research projects on multilingual, multimedia information extraction. The initial GDT edition (2005-2007) contained about 70K tokens; with the addition of new annotated material (2011-2014), this number rose to more than 175K tokens in approximately 7,000 sentences. The texts include:
- manually normalized transcripts of European parliamentary sessions
- articles from the Greek Wikipedia and
- web documents pertaining to the politics, health, and travel domains
The dependency-based annotation scheme used for the syntactic layer of the GDT allows for intuitive representations of structures common in languages with flexible word order. The current version of the resource is being harmonized with the Universal Dependencies guidelines. A subset of GDT (derived from primary texts that are in the public domain) can be downloaded from the UD_Greek page.
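Universal Dependencies treebanks are distributed as CoNLL-U files, so the downloadable subset can be inspected with a few lines of code. The following is a minimal sketch, assuming the downloaded file follows the standard ten-column CoNLL-U layout; the function names `read_conllu` and `summarize` are hypothetical, and the sample sentence is an illustrative example written for this sketch, not a sentence taken from GDT.

```python
from collections import Counter


def read_conllu(lines):
    """Yield sentences as lists of token dicts from CoNLL-U formatted lines."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id = cols[0]
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in tok_id or "." in tok_id:
            continue
        sentence.append({
            "id": int(tok_id),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


def summarize(sentences):
    """Return (sentence count, token count, dependency relation frequencies)."""
    n_sentences = 0
    n_tokens = 0
    relations = Counter()
    for sent in sentences:
        n_sentences += 1
        n_tokens += len(sent)
        relations.update(tok["deprel"] for tok in sent)
    return n_sentences, n_tokens, relations


# Illustrative sentence (assumption: made up for this sketch, not from GDT).
SAMPLE_ROWS = [
    ("1", "Η", "ο", "DET", "_", "_", "2", "det", "_", "_"),
    ("2", "Ελλάδα", "Ελλάδα", "PROPN", "_", "_", "4", "nsubj", "_", "_"),
    ("3", "είναι", "είμαι", "AUX", "_", "_", "4", "cop", "_", "_"),
    ("4", "χώρα", "χώρα", "NOUN", "_", "_", "0", "root", "_", "_"),
    ("5", ".", ".", "PUNCT", "_", "_", "4", "punct", "_", "_"),
]
SAMPLE = ["# text = Η Ελλάδα είναι χώρα."]
SAMPLE += ["\t".join(row) for row in SAMPLE_ROWS]
SAMPLE += [""]

if __name__ == "__main__":
    n_sent, n_tok, rels = summarize(read_conllu(SAMPLE))
    print(n_sent, n_tok, rels.most_common(3))
```

To run the same summary on a downloaded file, pass an open file handle instead of `SAMPLE`, for example `summarize(read_conllu(open(path, encoding="utf-8")))`, where `path` points to a local `.conllu` file.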
GDT has been used at ILSP to train dependency parsers for Greek. The most recent parser is described in Prokopidis and Papageorgiou (2017), "Universal Dependencies for Greek". The parser is also available as a web service.
Automatic preprocessing of GDT documents included sentence splitting, POS tagging and lemmatization with a suite of natural language processing tools developed at ILSP. The POS tags, which follow a tagset presented here, have been manually validated for all GDT texts. For subsets of GDT, the manual annotation of dependency relations is accompanied by annotation of semantic roles (70K tokens) and by event annotation based on a shallow, domain-specific ontology (31K tokens).