DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling
Rundong Du (),
Da Kuang (),
Barry Drake () and
Haesun Park ()
Additional contact information
Rundong Du: Georgia Institute of Technology
Da Kuang: University of California
Barry Drake: Georgia Institute of Technology
Haesun Park: Georgia Institute of Technology
Journal of Global Optimization, 2017, vol. 68, issue 4, No 5, 777-798
Abstract:
Abstract The importance of unsupervised clustering and topic modeling is well recognized with ever-increasing volumes of text data available from numerous sources. Nonnegative matrix factorization (NMF) has proven to be a successful method for cluster and topic discovery in unlabeled data sets. In this paper, we propose a fast algorithm for computing NMF using a divide-and-conquer strategy, called DC-NMF. Given an input matrix where the columns represent data items, we build a binary tree structure of the data items using a recently-proposed efficient algorithm for computing rank-2 NMF, and then gather information from the tree to initialize the rank-k NMF, which needs only a few iterations to reach a desired solution. We also investigate various criteria for selecting the node to split when growing the tree. We demonstrate the scalability of our algorithm for computing general rank-k NMF as well as its effectiveness in clustering and topic modeling for large-scale text data sets, by comparing it to other frequently utilized state-of-the-art algorithms. The value of the proposed approach lies in the highly efficient and accurate method for initializing rank-k NMF and the scalability achieved from the divide-and-conquer approach of the algorithm and properties of rank-2 NMF. In summary, we present efficient tools for analyzing large-scale data sets, and techniques that can be generalized to many other data analytics problem domains along with an open-source software library called SmallK.
Keywords: Constrained low rank approximation; Nonnegative matrix factorization; Divide and conquer; Clustering; Topic modeling; Text analysis; Scalable algorithms (search for similar items in EconPapers)
Date: 2017
References: View references in EconPapers View complete reference list from CitEc
Citations:
Downloads: (external link)
http://link.springer.com/10.1007/s10898-017-0515-z Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:spr:jglopt:v:68:y:2017:i:4:d:10.1007_s10898-017-0515-z
Ordering information: This journal article can be ordered from
http://www.springer. ... search/journal/10898
DOI: 10.1007/s10898-017-0515-z
Access Statistics for this article
Journal of Global Optimization is currently edited by Sergiy Butenko
More articles in Journal of Global Optimization from Springer
Bibliographic data for series maintained by Sonal Shukla () and Springer Nature Abstracting and Indexing ().