The Effectiveness of Lloyd-Type Methods for the k-Means Problem

Creators: Ostrovsky, Rafail; Rabani, Yuval; Schulman, Leonard J.; Swamy, Chaitanya

Other:: Werner, Bob

Abstract

We investigate variants of Lloyd's heuristic for clustering high dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd's heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd's method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.

Additional Information

© 2012 ACM. Received March 2010; revised April 2012 and September 2012; accepted September 2012. A preliminary version of this article appeared in Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2006. R. Ostrovsky was supported in part by NSF grants 0830803, 09165174, 1065276, 1118126, and 1136174; US-Israel BSF grant 2008411; OKAWA Foundation Research Award; IBM Faculty Research Award; Xerox Faculty Research Award; B. John Garrick Foundation Research Award, Teradata Research Award; and Lockheed-Martin Corporation Research Award; and the Defense Advanced Research Projects Agency through the U.S. Office of Naval Research under Contract N00014-11-1-0392. The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government. Y. Rabani was supported by ISF grant 52/03 and BSF grant 2002282. L. J. Schulman was supported in part by NSF CCF-0515342, NSF CCF-1038578, NSA H98230-06-1-0074, and NSF ITR CCR-0326554. The research of C. Swamy was supported partially by NSERC grant 327620-09 and an Ontario Early Researcher Award. Part of this work was done while Y. Rabani was visiting UCLA and Caltech. We thank the referees for their feedback and various helpful suggestions.

Additional details

Views

Downloads

	All versions	This version
Views	0	0
Downloads	0	0
Data volume	0 Bytes	0 Bytes

More info on how stats are collected....

Resource type: Journal Article
Publisher: Association for Computing Machinery
Published in: Journal of the ACM, 59(6), Art. No. 28, ISSN: 0004-5411.
Imprint: 47th Annual IEEE Symposium on Foundations of Computer Science, Proceedings, Art. No. 28. Los Alamitos, CA. ISBN: 0-7695-2720-5.
Conference: 47th Annual IEEE Symposium on Foundations of Computer Science, Berkeley, CA, October 21-24, 2006