Learning mixtures of arbitrary distributions over large discrete domains
- Other:
- Naor, Moni
Abstract
We give an algorithm for learning a mixture of unstructured distributions. This problem arises in various unsupervised learning scenarios, for example in learning topic models from a corpus of documents spanning several topics. We show how to learn the constituents of a mixture of k arbitrary distributions over a large discrete domain [n]={1, 2, ...,n} and the mixture weights, using O(n polylog n) samples. (In the topic-model learning setting, the mixture constituents correspond to the topic distributions.) This task is information-theoretically impossible for k > 1 under the usual sampling process from a mixture distribution. However, there are situations (such as the above-mentioned topic model case) in which each sample point consists of several observations from the same mixture constituent. This number of observations, which we call the "sampling aperture", is a crucial parameter of the problem. We obtain the first bounds for this mixture-learning problem without imposing any assumptions on the mixture constituents. We show that efficient learning is possible exactly at the information-theoretically least-possible aperture of 2k-1. Thus, we achieve near-optimal dependence on n and optimal aperture. While the sample-size required by our algorithm depends exponentially on k, we prove that such a dependence is unavoidable when one considers general mixtures. A sequence of tools contribute to the algorithm, such as concentration results for random matrices, dimension reduction, moment estimations, and sensitivity analysis.
Additional Information
© 2014 ACM. Supported in part by NSF CCF-1038578, NSF CCF-0515342, NSA H98230-06-1-0074, and NSF ITR CCR-0326554. Supported in part by NSERC grant 32760-06, an NSERC Discovery Accelerator Supplement Award, and an Ontario Early Researcher Award.Additional details
- Eprint ID
- 72750
- DOI
- 10.1145/2554797.2554818
- Resolver ID
- CaltechAUTHORS:20161212-165501273
- NSF
- CCF-1038578
- NSF
- CCF-0515342
- National Security Agency (NSA)
- H98230-06-1-0074
- NSF
- ITR CCR-0326554
- Natural Sciences and Engineering Research Council of Canada (NSERC)
- 32760-06
- Ontario Early Researcher Award
- Created
-
2016-12-13Created from EPrint's datestamp field
- Updated
-
2021-11-11Created from EPrint's last_modified field