Published May 2018
Journal Article | Open Access

Training Gaussian Mixture Models at Scale via Coresets

Abstract

How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in the dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings, and do not impose restrictions on the data-generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, and on new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables a significant reduction in training time with negligible approximation error.
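To make the coreset idea concrete, here is a minimal Python sketch of coreset construction by importance (sensitivity) sampling: points are sampled with probability proportional to an upper bound on their influence on the clustering cost, then reweighted so that weighted costs remain unbiased. The function names (`kmeanspp_centers`, `gmm_coreset`) and the simple distance-based sensitivity proxy are illustrative assumptions, not the paper's actual construction or its guarantees.

```python
import numpy as np

def kmeanspp_centers(X, k, rng):
    """D^2-sampling (k-means++ seeding): each new center is drawn with
    probability proportional to its squared distance to the nearest
    center chosen so far."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

def gmm_coreset(X, k, m, seed=0):
    """Sample a weighted coreset of m points from X (illustrative
    sensitivity bound, not the paper's exact algorithm)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = kmeanspp_centers(X, k, rng)
    # Squared distance of every point to its nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    # Crude sensitivity proxy: a point's share of the clustering cost,
    # plus a uniform term so every point has nonzero probability.
    s = d2 / d2.sum() + 1.0 / n
    p = s / s.sum()
    idx = rng.choice(n, size=m, p=p)
    w = 1.0 / (m * p[idx])  # inverse-probability weights keep costs unbiased
    return X[idx], w

# Usage: 15,000 points from three well-separated Gaussians reduce to a
# 200-point weighted set; any weighted-EM GMM implementation trained on
# (C, w) then approximates a fit on all of X.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(mu, 1.0, size=(5000, 2))
               for mu in ([0.0, 0.0], [6.0, 6.0], [-6.0, 4.0])])
C, w = gmm_coreset(X, k=3, m=200)
```

Note that the coreset size m here is a free parameter; in the paper it is chosen as a polynomial in the dimension and the number of components, independent of the data set size.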

Additional Information

© 2018 Mario Lucic, Matthew Faulkner, Andreas Krause, Dan Feldman. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Submitted 9/15; Revised 1/18; Published 5/18. We thank Olivier Bachem for invaluable discussions, suggestions and comments. This research was partially supported by ONR grant N00014-09-1-1044, NSF grants CNS-0932392, IIS-0953413, DARPA MSEE grant FA8650-11-1-7156, and the Zurich Information Security Center.

Files

Published - 15-506.pdf (612.7 kB; md5:a1bc52b5d80044ab07a37007c6a173e2)

Additional details

Created: August 19, 2023
Modified: October 18, 2023