A New Variables Selection And Dimensionality Reduction Technique Coupled with Simca Method for the Classification of text Documents

Ahmed Abdelfattah Saleh and Li Weigang
Additional contact information
Ahmed Abdelfattah Saleh: University of Brasilia, Brasil
Li Weigang: University of Brasilia, Brasil

from ToKnowPress

Abstract: Classification of text documents is of significant importance in data mining and machine learning. However, the vector representation of documents in classification problems results in highly sparse data with an immense number of variables. This necessitates an efficient variables selection and dimensionality reduction technique that ensures the model’s selectivity, accuracy and robustness with fewer variables. This paper introduces a new coefficient, the Variables Strength Coefficient (VSC), which permits retaining variables with strong modeling and discriminatory powers. A variable with a VSC greater than a predefined threshold is considered to have strong power in both modeling data and discriminating classes and is thus retained, while weaker variables are discarded. This straightforward technique maximizes the differences between classes while preserving the modeling power of variables. This paper also proposes applying a classification technique that is widely used in the chemical analysis domain: the supervised learning algorithm SIMCA. The soft and independent nature of SIMCA allows multi-labeling of text documents, as well as the inclusion of new classes later on without affecting the created model. VSC-SIMCA was applied to the ‘CNAE-9’ data set, and the results were compared with classification and dimensionality reduction work done on the same data set in the literature. The VSC-SIMCA technique shows superior performance over other techniques, both in the amount of dimensionality reduction and in classification performance. The improved classification precision, with substantially fewer variables, demonstrates the contribution of the proposed approach.
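
The two ideas in the abstract (keep only variables whose score exceeds a threshold, then fit one independent model per class and let a document match several classes or none) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the VSC formula, so the `scores` array below is a made-up placeholder, and the acceptance rule (residual within `tolerance` times the mean training residual) is a simplified stand-in for the statistical test normally used in SIMCA. All names (`select_by_threshold`, `ClassPCAModel`, `soft_classify`, `tolerance`) and the toy data are hypothetical.

```python
import numpy as np

def select_by_threshold(scores, threshold):
    """Return indices of variables whose score exceeds the threshold."""
    return np.flatnonzero(np.asarray(scores) > threshold)

class ClassPCAModel:
    """One independent PCA model fitted on the samples of a single class."""

    def __init__(self, X, n_components):
        self.mean = X.mean(axis=0)
        # Principal axes come from the SVD of the centered class data.
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.components = vt[:n_components]
        # Mean training residual, used as a scale (assumed rule, not SIMCA's F-test).
        self.scale = max(self.residual(X).mean(), 1e-12)

    def residual(self, X):
        """Distance of each sample from the class subspace."""
        centered = np.atleast_2d(X) - self.mean
        reconstructed = centered @ self.components.T @ self.components
        return np.linalg.norm(centered - reconstructed, axis=1)

def soft_classify(models, x, tolerance=3.0):
    """Return every label whose model accepts x (zero, one, or several)."""
    return [label for label, model in models.items()
            if model.residual(x)[0] <= tolerance * model.scale]

# Toy data: three variables, the third carries no information.
X_a = np.array([[0, 0.0, 0], [1, 0.1, 0], [2, -0.1, 0], [3, 0.1, 0], [4, -0.1, 0]], dtype=float)
X_b = np.array([[10, 0, 0], [10.1, 1, 0], [9.9, 2, 0], [10.1, 3, 0], [9.9, 4, 0]], dtype=float)

# Placeholder per-variable scores standing in for VSC values (invented numbers).
scores = [0.9, 0.8, 0.0]
keep = select_by_threshold(scores, threshold=0.5)

# Each class gets its own model, fitted only on the retained variables.
models = {
    "A": ClassPCAModel(X_a[:, keep], n_components=1),
    "B": ClassPCAModel(X_b[:, keep], n_components=1),
}

print(soft_classify(models, np.array([2.5, 0.0])))
print(soft_classify(models, np.array([50.0, 50.0])))
```

Because each class model is fitted independently, a new class could be added by inserting one more entry into `models` without refitting the existing ones, which mirrors the property the abstract attributes to SIMCA.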

Keywords: VSC; SIMCA; text classification; variables selection; supervised learning
Date: 2015

Published in: Managing Intellectual Capital and Innovation for Sustainable and Inclusive Society: Managing Intellectual Capital and Innovation; Proceedings of the MakeLearn and TIIM Joint International Conference 2015, ToKnowPress
Bibliographic data for series maintained by Alen Jezovnik.

Page updated 2020-06-23
Handle: RePEc:tkp:mklp15:583-591