Clustering with or without the approximation
Frans Schalekamp,
Michael Yu and
Anke van Zuylen
Additional contact information
Frans Schalekamp: Tsinghua University
Michael Yu: MIT
Anke van Zuylen: Tsinghua University
Journal of Combinatorial Optimization, 2013, vol. 25, issue 3, No 4, 393-429
Abstract:
We study algorithms for clustering data that were recently proposed by Balcan et al. (SODA’09: 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1068–1077, 2009a) and that have already given rise to several follow-up papers. The input for the clustering problem consists of points in a metric space and a number k, specifying the desired number of clusters. The algorithms find a clustering that is provably close to a target clustering, provided that the instance has the “(1+α,ε)-property”, which means that every solution to the k-median problem whose objective value is at most (1+α) times the optimal objective value corresponds to a clustering that misclassifies at most an ε fraction of the points with respect to the target clustering. We investigate the theoretical and practical implications of their results. Our main contributions are as follows. First, we show that instances that have the (1+α,ε)-property and in which, additionally, the clusters in the target clustering are large are easier than general instances: the algorithm proposed by Balcan et al. (2009a) is a constant-factor approximation algorithm with an approximation guarantee that is better than the known hardness of approximation for general instances. Second, we show that it is NP-hard to check whether an instance satisfies the (1+α,ε)-property for a given (α,ε); the algorithms of Balcan et al. (2009a), however, require such α and ε as input parameters. We propose ways to use their algorithms even if we do not know values of α and ε for which the assumption holds. Finally, we implement these methods and other popular methods, and test them on real-world data sets. We find that on these data sets there are no α and ε for which the data set has both the (1+α,ε)-property and sufficiently large clusters in the target solution. For the general case, where no assumptions are made about the cluster sizes, we show that on our data sets the performance guarantee proved by Balcan et al. (2009a) is meaningless for the values of α and ε for which the data set has the (1+α,ε)-property. The algorithm nonetheless gives reasonable results, although it is outperformed by other methods.
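For concreteness, the k-median objective and the “(1+α,ε)-property” described above can be formalized as follows. This is only a sketch based on the abstract; the notation (point set X, metric d, center set S, target clustering C_T) is ours and not necessarily the paper's.

\[
  \mathrm{cost}(S) \;=\; \sum_{x \in X} \min_{c \in S} d(x, c),
  \qquad
  \mathrm{OPT} \;=\; \min_{S \subseteq X,\; |S| = k} \mathrm{cost}(S).
\]

The instance (X, d, k) with target clustering \(\mathcal{C}_T\) has the \((1+\alpha, \varepsilon)\)-property if every solution S with
\[
  \mathrm{cost}(S) \;\le\; (1+\alpha)\,\mathrm{OPT}
\]
induces a clustering \(\mathcal{C}(S)\) (each point assigned to its nearest center in S) that differs from \(\mathcal{C}_T\) on at most \(\varepsilon\,|X|\) points, where the number of misclassified points is measured under the best matching between the clusters of \(\mathcal{C}(S)\) and those of \(\mathcal{C}_T\).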
Keywords: Clustering; k-median; Algorithms; Approximation
Date: 2013
Downloads (external link): http://link.springer.com/10.1007/s10878-011-9382-6 (abstract, text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:jcomop:v:25:y:2013:i:3:d:10.1007_s10878-011-9382-6
Ordering information: This journal article can be ordered from https://www.springer.com/journal/10878
DOI: 10.1007/s10878-011-9382-6
Journal of Combinatorial Optimization is currently edited by Thai, My T.
More articles in Journal of Combinatorial Optimization from Springer
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.