A HYBRID LEMMATISER FOR OLD CHURCH SLAVONIC
Ilia Afanasev
Ilia Afanasev: National Research University Higher School of Economics
HSE Working papers from National Research University Higher School of Economics
Abstract:
This article presents a lemmatiser developed specifically for Old Church Slavonic (OCS). The introduction highlights the lack of lemmatisers that can handle different OCS datasets. The review briefly describes previous attempts and current trends in lemmatisation. The lemmatiser is hybrid: it combines linguistic rules for specific cases (fragmentary tokens, punctuation, or digits), a dictionary for the most common tokens, and a sequence-to-sequence (seq2seq) neural network with an attention mechanism for the rest of the material. The model achieves an overall accuracy of 85%, which is lower than that of one of the previous models for the Universal Dependencies (UD) dataset. However, when specific tokens are taken into consideration, the model outperforms the previous ones thanks to its rule-based part. Possible directions for further research include the use of more sophisticated architectures, such as BART.
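The three-stage routing described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `lemmatise`, the regular expression, the toy dictionary entry, and the stubbed fallback are all assumptions made for illustration, and the handling of fragmentary tokens is left out because the abstract does not say how they are detected.

```python
import re
from typing import Callable, Dict

# Hypothetical rule: a token made up only of punctuation and/or digits.
# The actual rules used in the paper are not given in the abstract.
PUNCT_OR_DIGITS = re.compile(r"[\W\d_]+")


def lemmatise(token: str,
              dictionary: Dict[str, str],
              neural_fallback: Callable[[str], str]) -> str:
    """Route a token through rules, then a dictionary, then a fallback model."""
    # 1. Rule-based part: punctuation and digits are returned unchanged.
    if PUNCT_OR_DIGITS.fullmatch(token):
        return token
    # 2. Dictionary part: frequent tokens are looked up directly.
    key = token.lower()
    if key in dictionary:
        return dictionary[key]
    # 3. Remaining tokens go to the model (here any callable stands in
    #    for the seq2seq network with attention).
    return neural_fallback(token)


# Toy data for illustration only.
toy_dictionary = {"и": "и"}


def stub_model(token: str) -> str:
    # Placeholder for a trained seq2seq model; it only lowercases.
    return token.lower()
```

A call such as `lemmatise("42", toy_dictionary, stub_model)` never reaches the dictionary or the model, which is the point of putting the rule-based stage first: tokens with a predictable lemma do not depend on what the network has learned.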
Keywords: lemmatisation; Old Church Slavonic; hybrid approach; natural language processing; seq2seq
JEL-codes: Z
Pages: 19 pages
Date: 2021
Published in WP BRP Series: Linguistics / LNG, February 2021, pages 1-19
Downloads:
https://wp.hse.ru/data/2021/02/18/1393879077/106LNG2021.pdf (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:hig:wpaper:106/lng/2021