A probabilistic model for Latent Semantic Indexing

Ding, Chris H.Q.

A probabilistic model for Latent Semantic Indexing

Chris H.Q. Ding

Journal of the American Society for Information Science and Technology, 2005, vol. 56, issue 6, 597-608

Abstract: Latent Semantic Indexing (LSI), when applied to semantic space built on text collections, improves information retrieval, information filtering, and word sense disambiguation. A new dual probability model based on the similarity concepts is introduced to provide deeper understanding of LSI. Semantic associations can be quantitatively characterized by their statistical significance, the likelihood. Semantic dimensions containing redundant and noisy information can be separated out and should be ignored because their negative contribution to the overall statistical significance. LSI is the optimal solution of the model. The peak in the likelihood curve indicates the existence of an intrinsic semantic dimension. The importance of LSI dimensions follows the Zipf‐distribution, indicating that LSI dimensions represent latent concepts. Document frequency of words follows the Zipf distribution, and the number of distinct words follows log‐normal distribution. Experiments on five standard document collections confirm and illustrate the analysis.

Date: 2005
References: Add references at CitEc
Citations:

Downloads: (external link)
https://doi.org/10.1002/asi.20148

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:bla:jamist:v:56:y:2005:i:6:p:597-608

Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1532-2890

Access Statistics for this article

More articles in Journal of the American Society for Information Science and Technology from Association for Information Science & Technology
Bibliographic data for series maintained by Wiley Content Delivery ().