This website contains information, publications, annotation guidelines and tools related to the Greek Dependency Treebank (GDT), a resource for Modern Greek manually annotated for morphology, syntax and semantics. GDT is an ongoing project led by researchers at the Institute for Language and Speech Processing (ILSP), with the help of students from the Texnoglwssia postgraduate program and the University of Athens.
GDT includes texts from open-content sources and from corpora collected at ILSP in research projects on multilingual, multimedia information extraction. The initial GDT edition (2005-2007) contained about 70K tokens; with the addition of newly annotated material (2011-2014), the total rose to more than 160K tokens in approximately 7,000 sentences. The texts include:
- manually normalized transcripts of European parliamentary sessions
- articles from the Greek Wikipedia and
- web documents pertaining to the politics, health, and travel domains
The dependency-based annotation scheme used for the syntactic layer of the GDT allows for intuitive representations of structures common in languages with flexible word order. The annotation scheme is based on an adaptation of the guidelines for the Prague Dependency Treebank.
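To illustrate why dependency representations suit flexible word order, the following minimal sketch encodes a sentence as a list of head indices and finds arcs that cross over material not dominated by their head (non-projective arcs), which typically arise when a dependent is displaced from its head. The encoding (1-based token positions, 0 for the root) and the function names are assumptions made for this illustration; they are not the GDT file format or part of any GDT tool, and the example sentences are abstract toy structures rather than treebank data.

```python
from typing import List, Tuple


def is_dominated_by(heads: List[int], node: int, ancestor: int) -> bool:
    """Return True if `ancestor` equals `node` or dominates it.

    `heads[i - 1]` is the head of token i (tokens are numbered from 1,
    and 0 stands for the artificial root). Assumes a well-formed tree
    with no cycles.
    """
    while node != 0:
        if node == ancestor:
            return True
        node = heads[node - 1]
    return False


def non_projective_arcs(heads: List[int]) -> List[Tuple[int, int]]:
    """List (head, dependent) arcs that are non-projective.

    An arc is projective when every token lying strictly between the
    head and the dependent is dominated by the head.
    """
    arcs = []
    for dep, head in enumerate(heads, start=1):
        if head == 0:
            continue  # the root attachment spans nothing
        lo, hi = sorted((head, dep))
        for mid in range(lo + 1, hi):
            if not is_dominated_by(heads, mid, head):
                arcs.append((head, dep))
                break
    return arcs


# Toy structure: token 2 depends on token 4, but token 3 (the root)
# stands between them and is not dominated by token 4.
displaced = [3, 4, 0, 3]
# Toy structure with a fully projective tree.
contiguous = [2, 0, 2]

print(non_projective_arcs(displaced))
print(non_projective_arcs(contiguous))
```

A dependency tree stores such a displaced dependent as one ordinary head-dependent arc, so no extra machinery is needed to link it back to its head.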
Automatic preprocessing of GDT documents included sentence splitting, POS tagging and lemmatization with a suite of natural language processing tools developed at ILSP. The POS tags, which follow a tagset presented here, have been manually validated for all GDT texts. For subsets of the GDT, the manual annotation of dependency relations is accompanied by annotation of semantic roles (70K tokens) and by event annotation based on a shallow domain-specific ontology (31K tokens).
GDT has been used at ILSP to train dependency parsers for Greek. You can use the most recent one to process your own texts through this web service.