A Spectral Algorithm for Latent Dirichlet Allocation
Abstract
Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem: estimating the topic-word distributions when only the words are observed and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third-order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space.
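To make the role of low-order moments concrete, the following is a minimal sketch of why a spectral approach can work for LDA. It is not the paper's Excess Correlation Analysis procedure. It assumes NumPy, and the names `lda_pair_moments`, `O`, `alpha`, and `adjusted` are hypothetical choices for illustration. Using the standard second moment of a Dirichlet vector, it computes the exact population word-pair moment for a small LDA model and checks that, after subtracting a rescaled outer product of the mean word distribution, the result has rank k even though it is a d × d matrix.

```python
import numpy as np

def lda_pair_moments(O, alpha):
    """Exact population moments of an LDA model (illustrative helper).

    O     : d x k matrix whose columns are topic-word distributions.
    alpha : length-k Dirichlet parameter over topic mixtures.
    Returns the mean word distribution E[x1] and the pair moment E[x1 x2^T],
    where x1, x2 are one-hot encodings of two words in the same document.
    """
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    # Second moment of h ~ Dirichlet(alpha):
    # E[h h^T] = (alpha alpha^T + diag(alpha)) / (a0 * (a0 + 1))
    e_hh = (np.outer(alpha, alpha) + np.diag(alpha)) / (a0 * (a0 + 1.0))
    mean = O @ (alpha / a0)
    pairs = O @ e_hh @ O.T
    return mean, pairs

# Hypothetical toy model: d = 8 words, k = 3 topics.
rng = np.random.default_rng(0)
d, k = 8, 3
O = rng.dirichlet(np.ones(d), size=k).T  # columns sum to 1
alpha = np.array([0.5, 1.0, 1.5])
a0 = alpha.sum()

mean, pairs = lda_pair_moments(O, alpha)

# Subtracting the rescaled mean outer product cancels the alpha alpha^T term,
# leaving O diag(alpha / (a0 * (a0 + 1))) O^T.
adjusted = pairs - (a0 / (a0 + 1.0)) * np.outer(mean, mean)

singular_values = np.linalg.svd(adjusted, compute_uv=False)
numerical_rank = int(np.sum(singular_values > 1e-10 * singular_values[0]))
```

In this toy setting `numerical_rank` equals the number of topics, which is the kind of low-rank structure that lets the decomposition operate on k × k matrices rather than in the full word space. In practice the moments would be estimated from word co-occurrence counts rather than computed from known parameters.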
Additional Information
Contributions to this work by NIST, an agency of the US government, are not subject to copyright laws. We thank Kamalika Chaudhuri, Adam Kalai, Percy Liang, Chris Meek, David Sontag, and Tong Zhang for many invaluable insights. We also give warm thanks to Rong Ge for sharing preliminary results (in [23]) and early insights into this problem with us. Part of this work was completed while all authors were at Microsoft Research New England. AA is supported in part by the NSF Award CCF-1219234, AFOSR Award FA9550-10-1-0310, and the ARO Award W911NF-12-1-0404.
Additional details
- Eprint ID: 118593
- Resolver ID: CaltechAUTHORS:20221222-213700256
- Microsoft Research
- NSF: CCF-1219234
- Air Force Office of Scientific Research (AFOSR): FA9550-10-1-0310
- Army Research Office (ARO): W911NF-12-1-0404
- Created: 2022-12-22 (from EPrint's datestamp field)
- Updated: 2022-12-22 (from EPrint's last_modified field)