Annotation tools for syntax and named entities in the National Corpus of Polish

Jakub Waszczuk, Katarzyna Głowińska, Agata Savary, Adam Przepiórkowski and Michał Lenart

International Journal of Data Mining, Modelling and Management, 2013, vol. 5, issue 2, 103-122

Abstract: The ongoing National Corpus of Polish project assumes several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: the levels of syntactic words, syntactic groups and named entities. We show how knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus and discuss some particular problems faced during the preparation of the parser grammar, which contains over 1,000 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customised for manual post-editing of annotations and for further revision of discrepancies. Our XML format converters and customised archiving repository ensure an automatic data flow and efficient corpus file management. We discuss the inter-annotator agreement in the manually annotated data, and present the first results of a CRF classifier trained on these data.

Keywords: corpus annotation; National Corpus of Polish; shallow parsing; chunking grammars; named entity recognition; NER; syntax; named entities; linguistic annotation; syntactic words; syntactic groups; parser grammar; XML converters; customised archiving repository; automatic data flow; file management
Date: 2013

Downloads: (external link)
http://www.inderscience.com/link.php?id=53691 (text/html)
Access to full text is restricted to subscribers.

Persistent link: https://EconPapers.repec.org/RePEc:ids:ijdmmm:v:5:y:2013:i:2:p:103-122

More articles in International Journal of Data Mining, Modelling and Management from Inderscience Enterprises Ltd
Bibliographic data for series maintained by Sarah Parker.