Information-theoretic feature selection with discrete k-median clustering

Onur Şeref, Ya-Ju Fan, Elan Borenstein and Wanpracha A. Chaovalitwongse
Additional contact information
Onur Şeref: Virginia Polytechnic Institute and State University
Ya-Ju Fan: Lawrence Livermore National Laboratory
Elan Borenstein: Rutgers University
Wanpracha A. Chaovalitwongse: University of Washington

Annals of Operations Research, 2018, vol. 263, issue 1, No 6, 93-118

Abstract: We propose a novel computational framework that integrates information-theoretic feature selection with discrete k-median clustering (DKM). DKM is a domain-independent clustering algorithm which requires a pairwise distance matrix between samples that can be defined arbitrarily as input. In the proposed DKM clustering, the center of each cluster is represented by a set of samples, which induce a separate set of clusters for each feature dimension. We evaluate the relevance of each feature by the normalized mutual information (NMI) scores between the base clusters using all features and the induced clusters for that feature dimension. We propose a spectral cluster analysis (SCA) method to determine the number of clusters using the average of the relevance NMI scores. We introduce filter- and wrapper-based feature selection algorithms that produce a ranked list of features using the relevance NMI scores. We create an information gain curve and calculate the normalized area under this curve to quantify information gain and identify the contributing features. We study the properties of our information-theoretic framework for clustering, SCA and feature selection on simulated data. We demonstrate that SCA can accurately identify the number of clusters in simulated data and public benchmark datasets. We also compare the clustering and feature selection performance of our framework to other domain-dependent and domain-independent algorithms on public benchmark datasets and a real-life neural time series dataset. We show that DKM runs comparably fast with better performance.
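
As an illustration of the relevance scoring described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a common NMI normalization, I(U;V) / sqrt(H(U) H(V)), which the abstract does not specify, and it simplifies each cluster center to a single representative sample, whereas the paper represents a center by a set of samples. The names nmi, induced_labels, rank_features and the toy data are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(u, v):
    """NMI under an assumed normalization: I(U;V) / sqrt(H(U) * H(V))."""
    n = len(u)
    pu, pv, puv = Counter(u), Counter(v), Counter(zip(u, v))
    mi = sum((c / n) * log(c * n / (pu[a] * pv[b]))
             for (a, b), c in puv.items())
    hu, hv = entropy(u), entropy(v)
    if hu == 0 or hv == 0:
        return 0.0  # a single-cluster labeling carries no information
    return mi / sqrt(hu * hv)


def induced_labels(X, medoids, j):
    """Assign each sample to the nearest representative along feature j only.

    Simplification: one representative sample per cluster.
    """
    return [
        min(range(len(medoids)), key=lambda c: abs(row[j] - X[medoids[c]][j]))
        for row in X
    ]


def rank_features(X, base_labels, medoids):
    """Score each feature by NMI(base clusters, induced clusters), then rank."""
    scores = [nmi(base_labels, induced_labels(X, medoids, j))
              for j in range(len(X[0]))]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data: feature 0 separates the two groups, feature 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 3.0], [1.0, 1.2], [1.1, 4.8], [1.2, 3.1]]
base_labels = [0, 0, 0, 1, 1, 1]   # clusters obtained using all features
medoids = [1, 4]                   # index of one representative per cluster
order, scores = rank_features(X, base_labels, medoids)
print(order, scores)
```

In this toy case the clusters induced by feature 0 coincide with the base clusters, so feature 0 is ranked first. A filter-style selection in the spirit of the abstract would keep the top-ranked features from this list.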

Keywords: Discrete clustering; Information theory; Cluster analysis; Feature selection
Date: 2018

Downloads: (external link)
http://link.springer.com/10.1007/s10479-014-1589-3 Abstract (text/html)
Access to the full text of the articles in this series is restricted.


Persistent link: https://EconPapers.repec.org/RePEc:spr:annopr:v:263:y:2018:i:1:d:10.1007_s10479-014-1589-3

Ordering information: This journal article can be ordered from
http://www.springer.com/journal/10479

DOI: 10.1007/s10479-014-1589-3

Annals of Operations Research is currently edited by Endre Boros

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-03-20
Handle: RePEc:spr:annopr:v:263:y:2018:i:1:d:10.1007_s10479-014-1589-3