A method for K-Means seeds generation applied to text mining

Daniel Velez, Jorge Sueiras, Alejandro Ortega and Jose F. Velez
Additional contact information
Daniel Velez: Universidad Complutense
Jorge Sueiras: Universidad Rey Juan Carlos
Alejandro Ortega: Universidad Carlos III
Jose F. Velez: Universidad Rey Juan Carlos

Statistical Methods & Applications, 2016, vol. 25, issue 3, No 8, 477-499

Abstract: This paper proposes a methodology for producing a set of seeds that serve as the starting point for K-Means-type unsupervised classification algorithms in text mining. The proposal uses the eigenvectors obtained from principal component analysis to extract initial seeds, after appropriate treatment, in search of lightly overlapping clusters that are also clearly identified by keywords. The work is motivated by the authors' interest in identifying previously unknown topics and themes in short texts. To validate the method, it was applied to a sample of labeled e-mails (NG20) that represents a gold standard in the field of text mining. Specifically, several corpora referenced in the literature were used, each configured according to a mix of topics contained in the sample. The proposed method improves on the results of the other state-of-the-art methods with which it is compared.
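
The abstract does not spell out how seeds are derived from the eigenvectors, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' procedure. It assumes a toy document-term matrix, takes the principal axes from an SVD of the centered matrix, and uses a hypothetical heuristic (the documents at the two extremes of each axis become seeds) before running plain Lloyd iterations. The names `pca_seeds` and `kmeans`, the vocabulary, and the counts are all illustrative.

```python
import numpy as np

# Toy document-term count matrix (hypothetical data, not from the paper).
vocab = ["game", "team", "score", "disk", "drive", "memory"]
X = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [2, 2, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 2, 2, 2],
], dtype=float)


def pca_seeds(X, k):
    """Pick k documents as seeds from the extremes of the principal axes."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the eigenvectors of the covariance matrix of X,
    # ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    chosen = []
    for component in Vt:
        scores = Xc @ component
        # Assumed heuristic: the two documents projecting furthest apart
        # on this axis are candidate seeds.
        for idx in (int(scores.argmax()), int(scores.argmin())):
            if idx not in chosen and len(chosen) < k:
                chosen.append(idx)
        if len(chosen) == k:
            break
    if len(chosen) < k:
        raise ValueError("not enough distinct seed documents found")
    return X[chosen].copy()


def kmeans(X, seeds, n_iter=20):
    """Plain Lloyd iterations started from the given seeds."""
    centers = seeds.copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every document to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(len(centers)):
            members = X[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels, centers


seeds = pca_seeds(X, k=2)
labels, centers = kmeans(X, seeds)

# Describe each cluster by its two highest-weighted terms.
for c, center in enumerate(centers):
    top = [vocab[i] for i in np.argsort(-center)[:2]]
    print(f"cluster {c}: docs={np.where(labels == c)[0].tolist()} keywords={top}")
```

Because the sign of a singular vector is arbitrary, which topic ends up as cluster 0 may vary between environments; only the grouping is meaningful. The same seed array could also be passed to a library implementation, for example as the `init` argument of scikit-learn's `KMeans`.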

Keywords: Text mining; K-Means; PCA; Classification; Seeds; Eigenvectors
Date: 2016

Downloads: (external link)
http://link.springer.com/10.1007/s10260-015-0345-4 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:stmapp:v:25:y:2016:i:3:d:10.1007_s10260-015-0345-4

Ordering information: This journal article can be ordered from
http://www.springer. ... cs/journal/10260/PS2

DOI: 10.1007/s10260-015-0345-4

Statistical Methods & Applications is currently edited by Tommaso Proietti

More articles in Statistical Methods & Applications from Springer, Società Italiana di Statistica
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-03-20
Handle: RePEc:spr:stmapp:v:25:y:2016:i:3:d:10.1007_s10260-015-0345-4