Stochastic variational inference for clustering short text data with finite mixtures of Dirichlet-Multinomial distributions
Massimo Bilancia, Andrea Nigri and Samuele Magro
Additional contact information
Massimo Bilancia: University of Bari Aldo Moro, Polyclinic University Hospital
Andrea Nigri: University of Foggia
Samuele Magro: Grotte di Castellana srl
Statistical Papers, 2025, vol. 66, issue 4, No 7, 39 pages
Abstract:
Finite mixtures of Multinomial distributions are a valuable tool for analyzing discrete positive data, particularly in text analysis, where documents are represented as a Bag-of-Words (BOW). In this approach, only the frequencies of terms from a predefined vocabulary are considered, disregarding the positions of terms within the preprocessed document. Dirichlet-Multinomial mixture models, in particular, offer a straightforward yet effective method for text categorization, and they often outperform more complex latent variable models when documents are short. The combination of Dirichlet priors and Multinomial likelihoods lends itself naturally to a Bayesian treatment. Despite the model’s simplicity, however, the exact posterior distribution is intractable, necessitating numerical methods. Variational inference offers a promising approach by approximating the joint posterior with a distribution under which the model parameters are assumed to be independent a posteriori. Under certain conditions, a coordinate ascent variational algorithm can be constructed whose approximation closely matches the true posterior in terms of the reverse Kullback–Leibler divergence. A notable limitation of standard variational algorithms, however, is that each iteration requires the entire dataset to compute the updates of the local variational parameters, which poses a significant scalability issue for large text corpora. To address this, we employ stochastic variational inference within the exponential family to develop a scalable estimation algorithm. By leveraging straightforward assumptions about the full conditional distributions of the hierarchical model and the distributions of the variational parameters, we show that, under the Robbins–Monro conditions, a gradient ascent algorithm can be derived that converges to a local maximum of the approximated posterior surface. Crucially, instead of using all observations, each iteration relies on a noisy yet unbiased estimate of the gradient computed from a single randomly selected data point. Numerical simulations demonstrate the superior per-iteration computational efficiency of stochastic variational inference (SVI). While SVI typically requires more iterations to converge, its advantage extends beyond computational speed. Although preliminary and somewhat speculative, the results suggest that SVI yields higher-quality solutions, as evidenced by both text clustering accuracy and the implicit regularization of weakly identified components.
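To make the single-data-point update described above concrete, the following is a minimal sketch of an SVI step for a Bayesian mixture of Multinomial distributions with Dirichlet priors, written in Python/NumPy under the standard stochastic variational inference recipe for exponential-family models. Function and variable names (svi_dm_mixture, gamma, lam, and so on) are hypothetical illustration choices, not the authors' implementation. At each iteration the sketch samples one document, computes its responsibilities from the expected log-parameters, forms intermediate global parameters as if the corpus consisted of D replicas of that document (which makes the natural-gradient estimate unbiased), and takes a Robbins–Monro step.

import numpy as np
from scipy.special import digamma

def svi_dm_mixture(X, K, alpha0=1.0, beta0=0.1,
                   n_iter=5000, tau=1.0, kappa=0.7, seed=0):
    """X: (D, V) array of document-term counts; K: number of mixture components."""
    rng = np.random.default_rng(seed)
    D, V = X.shape
    # Global variational parameters: Dirichlet posteriors over the mixing weights
    # (gamma) and over each component's word distribution (lam), randomly initialized.
    gamma = alpha0 + rng.gamma(1.0, 1.0, size=K)
    lam = beta0 + rng.gamma(1.0, 1.0, size=(K, V))

    for t in range(n_iter):
        d = rng.integers(D)                 # sample one document uniformly at random
        x = X[d]

        # Expected log-parameters under the current variational distributions.
        Elog_pi = digamma(gamma) - digamma(gamma.sum())
        Elog_theta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))

        # Local step: responsibilities of the sampled document (log-sum-exp for stability).
        log_phi = Elog_pi + Elog_theta @ x
        log_phi -= log_phi.max()
        phi = np.exp(log_phi)
        phi /= phi.sum()

        # Intermediate global parameters, as if the corpus were D replicas of document d.
        gamma_hat = alpha0 + D * phi
        lam_hat = beta0 + D * np.outer(phi, x)

        # Robbins-Monro step size and noisy natural-gradient update.
        rho = (t + tau) ** (-kappa)
        gamma = (1.0 - rho) * gamma + rho * gamma_hat
        lam = (1.0 - rho) * lam + rho * lam_hat

    return gamma, lam

A fuller implementation would typically process small mini-batches rather than single documents and monitor a held-out or stochastic estimate of the ELBO to assess convergence.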
Keywords: Dirichlet-Multinomial mixture model; Text categorization; Variational inference; Stochastic variational inference; Numerical optimization
Date: 2025
Downloads: http://link.springer.com/10.1007/s00362-025-01702-0 (abstract, text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:stpapr:v:66:y:2025:i:4:d:10.1007_s00362-025-01702-0
DOI: 10.1007/s00362-025-01702-0