Automatic generation of initial value k to apply k-means method for text documents clustering

Namita Gupta, P.C. Saxena and J.P. Gupta

International Journal of Data Mining, Modelling and Management, 2011, vol. 3, issue 1, 18-41

Abstract: Retrieving relevant text documents on a topic from a large document collection is a challenging task. Various clustering algorithms have been developed to retrieve relevant documents of interest. Hierarchical clustering has quadratic time complexity, O(n²), for n text documents. The k-means algorithm has a time complexity of O(n), but it is sensitive to the randomly selected initial cluster centres and so gives a locally optimal solution. Global k-means employs the k-means algorithm as a local search procedure to produce a globally optimal solution, but it has polynomial time complexity of O(nk) to produce k clusters. In this paper, we propose an approach to clustering text documents that overcomes the drawbacks of k-means and global k-means and gives a globally optimal solution with time complexity of O(lk) to obtain k clusters from an initial set of l starting clusters. Experimental evaluation on Reuters newsfeeds (Reuters-21578) shows that the clustering results (entropy, purity, F-measure) obtained by the proposed method are comparable with those of k-means and global k-means.
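
The abstract names entropy and purity as evaluation measures but does not give their formulas. As a rough illustration only, the sketch below uses the commonly cited definitions: purity is the fraction of documents that belong to the majority class of their cluster, and entropy is the size-weighted average of the per-cluster class entropy. The function names and the toy labels are assumptions made for this example, not taken from the paper, and the paper's exact definitions may differ.

```python
import math
from collections import Counter, defaultdict


def group_by_cluster(labels, clusters):
    """Collect the true class labels of the documents in each cluster."""
    groups = defaultdict(list)
    for label, cluster in zip(labels, clusters):
        groups[cluster].append(label)
    return groups


def purity(labels, clusters):
    """Fraction of documents that fall in the majority class of their cluster."""
    groups = group_by_cluster(labels, clusters)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in groups.values()
    )
    return majority_total / len(labels)


def entropy(labels, clusters):
    """Size-weighted average of per-cluster class entropy (base 2)."""
    groups = group_by_cluster(labels, clusters)
    n = len(labels)
    total = 0.0
    for members in groups.values():
        size = len(members)
        cluster_entropy = -sum(
            (count / size) * math.log2(count / size)
            for count in Counter(members).values()
        )
        total += (size / n) * cluster_entropy
    return total


# Hypothetical toy data: six documents, two classes, two clusters.
true_labels = ["earn", "earn", "acq", "acq", "earn", "acq"]
assigned = [0, 0, 0, 1, 1, 1]

print(purity(true_labels, assigned))
print(entropy(true_labels, assigned))
```

Under these definitions, higher purity and lower entropy indicate clusters that agree more closely with the reference classes. F-measure is left out here because it needs a choice of how clusters are matched to classes, which the abstract does not specify.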

Keywords: document clustering; global k-means; information retrieval; data mining; text documents.
Date: 2011

Downloads: (external link)
http://www.inderscience.com/link.php?id=38810 (text/html)
Access to full text is restricted to subscribers.

Persistent link: https://EconPapers.repec.org/RePEc:ids:ijdmmm:v:3:y:2011:i:1:p:18-41

More articles in International Journal of Data Mining, Modelling and Management from Inderscience Enterprises Ltd

 
Page updated 2025-03-19
Handle: RePEc:ids:ijdmmm:v:3:y:2011:i:1:p:18-41