Some Challenges of the West Circassian Polysynthetic Corpus
Timofey Arkhangelskiy and
Yury Lander
Additional contact information
Timofey Arkhangelskiy: National Research University Higher School of Economics
Yury Lander: National Research University Higher School of Economics
HSE Working papers from National Research University Higher School of Economics
Abstract:
Although comprehensive morphologically annotated corpora exist for many morphologically rich languages, no such corpus has so far been built for any polysynthetic language. Polysynthetic languages raise a variety of theoretical and practical challenges for corpus linguistics. Some of these challenges have been partly addressed in the development of corpora for, e.g., Turkic or Uralic languages, while others are unique to languages of this kind. Our paper identifies the most prominent challenges that we face in developing a West Circassian (Adyghe) corpus and offers possible solutions. These challenges include the tokenization problem, which involves delimiting morphology from syntax, the problems of lemmatization and part-of-speech tagging, and a number of glossing and search problems.
Keywords: language corpora; polysynthesis; West Circassian
JEL-codes: Z
Pages: 21
Date: 2015
New Economics Papers: this item is included in nep-cis
Published in WP BRP Series: Linguistics / LNG, December 2015, pages 1-21
Downloads:
http://www.hse.ru/data/2015/12/29/1136293697/37LNG2015.pdf (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:hig:wpaper:37/lng/2015