Albanian Text Classification: Bag of Words Model and Word Analogies

Kadriu Arbana, Abazi Lejla and Abazi Hyrije
Additional contact information
Kadriu Arbana: SEE University, Tetovo, Macedonia
Abazi Lejla: SEE University, Tetovo, Macedonia
Abazi Hyrije: SEE University, Tetovo, Macedonia

Business Systems Research, 2019, vol. 10, issue 1, 74-87

Abstract: Background: Text classification is an important task in information retrieval. Its objective is to assign new text documents to a set of predefined classes using supervised algorithms. Objectives: We focus on text classification of Albanian news articles using two approaches. Methods/Approach: In the first approach, the words in a collection are treated as independent components, each assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on part of the news articles (80%) and testing their accuracy on the remaining articles. In the second approach, words are treated according to their semantic and syntactic similarities, assuming that a word is formed by character n-grams. In this case, we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed training and testing time. Results: Our results show that the bag of words model performs better than fastText when the text dataset is not large. FastText performs better when classifying multi-label text. Conclusions: News articles can serve as a benchmark for testing classification algorithms on Albanian texts. The best results are achieved with a bag of words model, with an accuracy of 94%.
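
The two text representations contrasted in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the paper used scikit-learn and fastText, whereas the sketch below uses only the Python standard library, and all function names and the toy Albanian sentences are hypothetical. The `<` and `>` boundary markers and the 3-to-6 n-gram range are assumptions borrowed from the usual description of fastText sub-word features, not settings reported in the abstract.

```python
import random
from collections import Counter


def bag_of_words(docs):
    """Map each document to a count vector over a shared, sorted vocabulary.

    Word order is discarded: each word is an independent dimension.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of a word wrapped in boundary markers.

    Assumption: '<' and '>' markers and the 3..6 range follow the common
    description of sub-word features; they are not taken from the abstract.
    """
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def split_80_20(items, seed=0):
    """Shuffle a copy of the items and split it 80% train / 20% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy corpus, for illustration only.
toy_docs = [
    "ekipi fitoi ndeshjen",
    "qeveria miratoi ligjin",
    "ekipi humbi ndeshjen",
]
vocab, vectors = bag_of_words(toy_docs)
print(vocab)
print(vectors[0])
print(char_ngrams("ekipi", 3, 3))
```

In the bag of words view, "fitoi" and "humbi" are unrelated dimensions. In the sub-word view, inflected forms of the same stem share many character n-grams, which is one plausible reason such features are of interest for a morphologically rich language like Albanian.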

Keywords: data mining; text classification; news articles; machine learning
Date: 2019

Downloads: (external link) ... -0006.xml?format=INT (text/html)

Business Systems Research is currently edited by Mirjana Pejić Bach

More articles in Business Systems Research from Sciendo
Bibliographic data for series maintained by Peter Golla.

Page updated 2019-06-07
Handle: RePEc:bit:bsrysr:v:10:y:2019:i:1:p:74-87:n:6