Automatic generation of initial value k to apply k-means method for text documents clustering
Namita Gupta,
P.C. Saxena and
J.P. Gupta
International Journal of Data Mining, Modelling and Management, 2011, vol. 3, issue 1, 18-41
Abstract:
Retrieving the text documents relevant to a topic from a large document collection is a challenging task, and various clustering algorithms have been developed for it. Hierarchical clustering has quadratic time complexity, O(n²), for n text documents. The k-means algorithm has a time complexity of O(n), but it is sensitive to the randomly selected initial cluster centres and yields only a locally optimal solution. Global k-means uses the k-means algorithm as a local search procedure to produce a globally optimal solution, but has polynomial time complexity, O(nk), to produce k clusters. In this paper, we propose an approach to clustering text documents that overcomes the drawbacks of k-means and global k-means and gives a globally optimal solution with a time complexity of O(lk) to obtain k clusters from an initial set of l starting clusters. Experimental evaluation on the Reuters newsfeeds (Reuters-21578) shows that the clustering results (entropy, purity, F-measure) obtained by the proposed method are comparable with those of k-means and global k-means.
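The abstract reports entropy and purity as evaluation measures but does not define them. As a rough illustration only, the sketch below computes both using their common textbook definitions; the paper's exact formulas may differ. The function names, the toy topic labels and the cluster assignments are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def _class_counts_per_cluster(labels, clusters):
    """Group true class labels by assigned cluster (hypothetical helper)."""
    by_cluster = {}
    for label, cluster in zip(labels, clusters):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    return by_cluster


def purity(labels, clusters):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = _class_counts_per_cluster(labels, clusters)
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


def entropy(labels, clusters):
    """Size-weighted mean of the per-cluster class entropy, in bits (lower is better)."""
    by_cluster = _class_counts_per_cluster(labels, clusters)
    n = len(labels)
    total = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        h = -sum((c / size) * log2(c / size) for c in counts.values())
        total += (size / n) * h
    return total


# Toy data (assumed for illustration): six documents, two true topics, two clusters.
true_topics = ["grain", "grain", "grain", "trade", "trade", "trade"]
assigned = [0, 0, 1, 1, 1, 1]
print(purity(true_topics, assigned))
print(entropy(true_topics, assigned))
```

Higher purity and lower entropy both indicate clusters that align more closely with the known topic labels, which is how such measures are typically used to compare clustering methods.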
Keywords: document clustering; global k-means; information retrieval; data mining; text documents.
Date: 2011
Downloads: (external link)
http://www.inderscience.com/link.php?id=38810 (text/html)
Access to full text is restricted to subscribers.
Persistent link: https://EconPapers.repec.org/RePEc:ids:ijdmmm:v:3:y:2011:i:1:p:18-41
More articles in International Journal of Data Mining, Modelling and Management from Inderscience Enterprises Ltd