Feature selection methods for document clustering: a comparative study and a hybrid solution
Asmaa Benghabrit,
Brahim Ouhbi,
Bouchra Frikh,
El Moukhtar Zemmouri and
Hicham Behja
International Journal of Data Analysis Techniques and Strategies, 2019, vol. 11, issue 3, 246-272
Abstract:
The web proliferation makes the exploration and the use of the huge amount of available unstructured text documents challenged, which drives the need of document clustering. Hence, improving the performances of this mechanism by using feature selection seems worth investigation. Therefore, this paper proposes an efficient way to highly benefit from feature selection for document clustering. We first present a review and comparative studies of feature selection methods in order to extract efficient ones. Then we propose a sequential and hybrid combination modes of statistical and semantic techniques in order to benefit from crucial information that each of them provides for document clustering. Extensive experiments prove the benefit of the proposed combination approaches. The performance of document clustering is highest when the measures based on Chi-square statistic and the mutual information are linearly combined. Doing so, it avoids the unwanted correlation that the sequential approach creates between the two treatments.
Keywords: document clustering; feature selection; statistical and semantic data analysis; chi-square statistic; mutual information; k-means algorithm; comparative study; hybrid solution. (search for similar items in EconPapers)
Date: 2019
References: Add references at CitEc
Citations:
Downloads: (external link)
http://www.inderscience.com/link.php?id=101154 (text/html)
Access to full text is restricted to subscribers.
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:ids:injdan:v:11:y:2019:i:3:p:246-272
Access Statistics for this article
More articles in International Journal of Data Analysis Techniques and Strategies from Inderscience Enterprises Ltd
Bibliographic data for series maintained by Sarah Parker ().