Feature reduction techniques for Arabic text categorization

Rehab Duwairi, Mohammad Nayef Al‐Refai and Natheer Khasawneh

Journal of the American Society for Information Science and Technology, 2009, vol. 60, issue 11, 2347-2352

Abstract: This paper presents and compares three feature reduction techniques that were applied to Arabic text. The techniques include stemming, light stemming, and word clusters. The effects of these techniques on the K‐nearest‐neighbor classifier were studied and analyzed. Stemming reduces words to their stems. Light stemming, by comparison, removes common affixes from words without reducing them to their stems. Word clusters group synonymous words into clusters and each cluster is represented by a single word. The purpose of employing these methods is to reduce the size of document vectors without affecting the accuracy of the classifiers. The comparison metrics include size of document vectors, classification time, and accuracy (in terms of precision and recall). Several experiments were carried out using four different representations of the same corpus: the first version uses stem‐vectors, the second uses light stem‐vectors, the third uses word clusters, and the fourth uses the original words (without any transformation) as representatives of documents. The corpus consists of 15,000 documents that fall into three categories: sports, economics, and politics. In terms of vector sizes and classification time, the stemmed vectors were the smallest and required the least time to classify a testing dataset that consists of 6,000 documents. The light stemmed vectors outperformed the other three representations in terms of classification accuracy.
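
As a rough illustration of how light stemming can shrink a document vocabulary, the sketch below strips at most one prefix and one suffix from each word and compares vocabulary sizes before and after. The affix lists, the minimum remaining length, and the toy documents are assumptions made for this example only; they are not the lists, thresholds, or corpus used in the paper.

```python
# Toy light stemmer: removes one common prefix and one common suffix.
# PREFIXES, SUFFIXES and MIN_LEN are illustrative assumptions,
# not the affix lists used by the paper's authors.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ات", "ون", "ين", "ها", "ية", "ة", "ه", "ي")
MIN_LEN = 3  # never shorten a word below this many characters


def light_stem(word):
    """Strip the first matching prefix, then the first matching suffix."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_LEN:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_LEN:
            word = word[:-len(suffix)]
            break
    return word


def vocabulary(docs, transform=lambda w: w):
    """Return the set of distinct (optionally transformed) words in docs."""
    return {transform(word) for doc in docs for word in doc.split()}


# Hypothetical three-document corpus, whitespace-tokenized.
docs = ["الكتاب والكتاب", "كتاب المعلمون", "معلمين"]

raw_vocab = vocabulary(docs)
stemmed_vocab = vocabulary(docs, light_stem)
print(len(raw_vocab), len(stemmed_vocab))
```

Each distinct word is one dimension of a document vector, so a smaller vocabulary means shorter vectors for a classifier such as K‐nearest‐neighbor to compare. Whether accuracy is preserved depends on the affix lists and the corpus, which this toy example does not attempt to reproduce.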

Date: 2009

Downloads: (external link)
https://doi.org/10.1002/asi.21173


Persistent link: https://EconPapers.repec.org/RePEc:bla:jamist:v:60:y:2009:i:11:p:2347-2352


More articles in Journal of the American Society for Information Science and Technology from Association for Information Science & Technology
Handle: RePEc:bla:jamist:v:60:y:2009:i:11:p:2347-2352