Morpheme Matching Based Text Tokenization for a Scarce Resourced Language

Rehman, Zobia; Anwar, Waqas; Bajwa, Usama Ijaz; Xuan, Wang; Chaoying, Zhou

PLOS ONE, 2013, vol. 8, issue 8, 1-8

Abstract: Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for scarce-resourced languages such as Urdu, because spaces are used inconsistently between words. This paper proposes a morpheme matching based approach to Urdu text tokenization, along with additional algorithms that address boundary detection for compound words, affixation, reduplication, names, and abbreviations. The approach achieved 97.28% precision, 93.71% recall, and a 95.46% F1-measure when tokenizing a corpus of 57000 words using a morpheme list with 6400 entries.
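
The abstract does not spell out the matching procedure, so the following is only a minimal sketch of the general idea of segmenting space-free text against a morpheme list. It assumes a greedy forward longest-match strategy, which may differ from the authors' algorithm, and it does not cover the paper's handling of compounds, affixation, reduplication, names, or abbreviations. The function name `segment` and the toy English morpheme list are hypothetical and chosen purely for illustration.

```python
def segment(text, morphemes):
    """Greedy forward longest-match segmentation of a space-free string.

    Illustrative assumption only: this is not the procedure from the paper.
    `morphemes` must be a non-empty collection of strings.
    """
    max_len = max(map(len, morphemes))
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a listed morpheme matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in morphemes:
                break
        else:
            # No listed morpheme starts here: emit one character as an unknown token.
            j = i + 1
        tokens.append(text[i:j])
        i = j
    return tokens


# Hypothetical toy morpheme list (the paper uses a 6400-entry Urdu list).
toy_morphemes = {"text", "token", "ization", "is", "hard"}
print(segment("texttokenizationishard", toy_morphemes))
# prints ['text', 'token', 'ization', 'is', 'hard']
```

Greedy matching is simple but can choose a wrong boundary when one morpheme is a prefix of another valid split, which is one reason a practical system would need additional rules of the kind the abstract lists.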

Date: 2013

Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0068178 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 68178&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0068178

DOI: 10.1371/journal.pone.0068178

More articles in PLOS ONE from Public Library of Science