Albanian Text Classification: Bag of Words Model and Word Analogies
Kadriu Arbana, Abazi Lejla and Abazi Hyrije
Additional contact information
Kadriu Arbana: SEE University, Tetovo, Macedonia
Abazi Lejla: SEE University, Tetovo, Macedonia
Abazi Hyrije: SEE University, Tetovo, Macedonia
Business Systems Research, 2019, vol. 10, issue 1, 74-87
Background: Text classification is an important task in information retrieval. Its objective is to assign new text documents to a set of predefined classes, using supervised algorithms. Objectives: We focus on text classification for Albanian news articles using two approaches. Methods/Approach: In the first approach, the words in a collection are treated as independent components, and each word is assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on part of the news articles (80%) and testing their accuracy on the remaining articles. In the second approach, words are treated according to their semantic and syntactic similarities, on the assumption that a word is formed by character n-grams. In this case, we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed the training and testing time. Results: Our results show that the bag of words model does better than fastText when classification is tested on a text dataset that is not large. FastText shows better performance when classifying multi-label text. Conclusions: News articles can serve to create a benchmark for testing classification algorithms on Albanian texts. The best results are achieved with a bag of words model, with an accuracy of 94%.
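As a rough illustration of the two word representations contrasted in the abstract, the sketch below builds bag of words count vectors, extracts character n-grams from a word, and makes an 80/20 train/test split. It is not the authors' code: the paper relied on scikit-learn classifiers and fastText, whereas this uses only the Python standard library. The function names, the toy documents, the n-gram length of 3, and the `<`/`>` boundary markers are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def bag_of_words(docs):
    """Map each document to a count vector over a shared, sorted vocabulary.

    Word order is discarded; each word is an independent dimension.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers.

    The n=3 default and the '<' / '>' markers are illustrative assumptions.
    """
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of the items and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy documents (invented for this example, not taken from the paper's dataset).
docs = ["sport lajme sport", "lajme ekonomi"]
vocab, vectors = bag_of_words(docs)
trigrams = char_ngrams("lajme")
train, test = train_test_split(range(10))
```

In a full pipeline, count vectors like these would be passed to a supervised classifier trained on the 80% portion and scored on the remaining 20%, while the character n-grams show the kind of sub-word units that a sub-word model can draw on.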
Keywords: data mining; text classification; news articles; machine learning
Downloads: (external link)
https://www.degruyter.com/view/j/bsrj.2019.10.issu ... -0006.xml?format=INT (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:bit:bsrysr:v:10:y:2019:i:1:p:74-87:n:6
Business Systems Research is currently edited by Mirjana Pejić Bach
More articles in Business Systems Research from Sciendo
Bibliographic data for series maintained by Peter Golla.