Arabic word tokenization system using the maximum matching model
Shahab Ahmad Almaaytah
Edelweiss Applied Science and Technology, 2024, vol. 8, issue 6, 3210-3217
Abstract:
Word tokenization is the first stage for higher-order Natural Language Processing (NLP) tasks such as Part-of-Speech (PoS) tagging, parsing, and named entity recognition. The volume of text on the World Wide Web grows daily, which calls for more advanced processing tools. Because the number of Arabic speakers worldwide keeps growing, Arabic language processing systems must be improved. Word tokenization is difficult for Arabic because its writing style has no capitalization and makes use of compound words. This paper proposes a novel knowledge-based Arabic word tokenization system. The system is built on a maximum matching model in its two variations, forward and reverse maximum matching, and is implemented in Python. The evaluation results indicate high performance.
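The two variations named in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption for illustration rather than the paper's code: the function names, the toy Latin-letter lexicon, and the fallback that emits a single character when no lexicon entry matches.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry that starts there (hypothetical helper)."""
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - start), 0, -1):
            candidate = text[start:start + size]
            # Assumed fallback: an unmatched single character is its own token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                start += size
                break
    return tokens


def reverse_max_match(text, lexicon, max_len=None):
    """Greedy right-to-left segmentation: at each position take the
    longest lexicon entry that ends there (hypothetical helper)."""
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected from the end of the string
    return tokens


# Toy lexicon (an assumption, not data from the paper).
LEXICON = {"the", "them", "end", "mend"}

forward_tokens = forward_max_match("themend", LEXICON)
reverse_tokens = reverse_max_match("themend", LEXICON)
print(forward_tokens)  # ['them', 'end']
print(reverse_tokens)  # ['the', 'mend']
```

The same input yields different segmentations in the two directions, which is one reason a system might implement both variations and compare them. The functions slice by Unicode code point, so they accept Arabic strings unchanged given an Arabic lexicon.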
Keywords: Arabic language processing; Arabic word tokenization; Maximum matching model; Natural language processing; PoS tagging.
Date: 2024
Downloads: (external link)
https://learning-gate.com/index.php/2576-8484/article/view/2682/1014 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:ajp:edwast:v:8:y:2024:i:6:p:3210-3217:id:2682
More articles in Edelweiss Applied Science and Technology from Learning Gate