Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents
Zheng Tracy Ke and
Jingming Wang
Additional contact information
Zheng Tracy Ke: Department of Statistics, Harvard University, Cambridge, MA 02138, USA
Jingming Wang: Department of Statistics, Harvard University, Cambridge, MA 02138, USA
Mathematics, 2024, vol. 12, issue 11, 1-41
Abstract:
Topic modeling is a widely used tool in text analysis. We investigate the optimal rate for estimating a topic model. Specifically, we consider a setting with n documents, a vocabulary of size p, and document lengths of order N. When N ≥ c·p, referred to as the long-document case, the optimal rate has been established in the literature as p/(Nn). However, when N = o(p), referred to as the short-document case, the optimal rate has remained unknown. In this paper, we first provide new entry-wise large-deviation bounds for the empirical singular vectors of a topic model. We then apply these bounds to improve the error rate of a spectral algorithm, Topic-SCORE. Finally, by comparing the improved error rate with the minimax lower bound, we conclude that the optimal rate is still p/(Nn) in the short-document case.
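To make the roles of n, p, and N concrete, the following is a minimal sketch of how a corpus could be simulated. It assumes the common setup in which the words of each document are drawn independently from a distribution given by one column of a product AW; the names A, W, and simulate_word_frequencies are illustrative and do not come from the abstract. It is not an implementation of Topic-SCORE.

```python
import random


def simulate_word_frequencies(A, W, N, seed=0):
    """Simulate empirical word frequencies under an assumed topic model.

    A: p x K list of lists; each column is a distribution over the p words.
    W: K x n list of lists; each column is a distribution over the K topics.
    N: number of words drawn per document (the document length).
    Returns a p x n list of lists whose columns are empirical frequencies.
    """
    rng = random.Random(seed)
    p, K = len(A), len(A[0])
    n = len(W[0])
    D = [[0.0] * n for _ in range(p)]
    for i in range(n):
        # Population word distribution of document i: column i of A times W.
        probs = [sum(A[j][k] * W[k][i] for k in range(K)) for j in range(p)]
        # Draw N words independently from that distribution.
        words = rng.choices(range(p), weights=probs, k=N)
        for j in words:
            D[j][i] += 1.0 / N
    return D


# Toy example (hypothetical numbers): p = 6 words, K = 2 topics, n = 4 documents.
A = [
    [0.4, 0.0],
    [0.3, 0.1],
    [0.2, 0.1],
    [0.1, 0.2],
    [0.0, 0.3],
    [0.0, 0.3],
]
W = [
    [1.0, 0.5, 0.2, 0.0],
    [0.0, 0.5, 0.8, 1.0],
]

# N = 3 < p = 6 mimics the short-document case: most words in the
# vocabulary never appear in a given document.
D_short = simulate_word_frequencies(A, W, N=3)
```

With N smaller than p, each column of D_short has at most N nonzero entries, which is why the individual entries of the empirical frequency matrix are so noisy in the short-document case.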
Keywords: decoupling inequality; entry-wise eigenvector analysis; pre-SVD normalization; sine-theta theorem; topic-SCORE; word frequency heterogeneity
JEL-codes: C
Date: 2024
Downloads: (external link)
https://www.mdpi.com/2227-7390/12/11/1682/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/11/1682/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:11:p:1682-:d:1403981
Mathematics is currently edited by Ms. Emma He
Bibliographic data for series maintained by MDPI Indexing Manager.