Dimensionality reduction in text classification using scatter method
Jyri Saarikoski, Jorma Laurikkala, Kalervo Järvelin, Markku Siermala and Martti Juhola
International Journal of Data Mining, Modelling and Management, 2014, vol. 6, issue 1, 1-21
Abstract:
Preprocessing of data is a vital part of any task involving machine learning. In the classification of text documents, the most important aspect of preprocessing is usually the dimensionality reduction of data vectors. This paper focuses on the use of a recent scatter method in the dimensionality reduction of text documents. The effectiveness of the method was tested with the classification of two datasets, the Reuters news collection and the Spanish CLEF 2003 news collection. The classification methods used were self-organising maps, the naïve Bayes method, k nearest neighbour searching and classification trees. For comparison, we also conducted the dimensionality reduction of the data with document frequency and mutual information approaches. The scatter method proved to be an effective dimensionality reduction method for text document data. The suggested approach outperformed the document frequency reduction and scored comparably against the mutual information method, except when only a very small set of features was selected, where mutual information was better, especially in the CLEF collection.
Keywords: text documents; dimensionality reduction; classification; mutual information; self-organising maps; SOMs; naïve Bayes; k nearest neighbour; kNN; classification tree; scatter method; machine learning
Date: 2014
Downloads:
http://www.inderscience.com/link.php?id=59978 (text/html)
Access to full text is restricted to subscribers.
Persistent link: https://EconPapers.repec.org/RePEc:ids:ijdmmm:v:6:y:2014:i:1:p:1-21
More articles in International Journal of Data Mining, Modelling and Management from Inderscience Enterprises Ltd
Bibliographic data for series maintained by Sarah Parker.