A probabilistic approach to automatic keyword indexing. Part I. On the Distribution of Specialty Words in a Technical Literature
Stephen P. Harter
Journal of the American Society for Information Science, 1975, vol. 26, issue 4, 197-206
Abstract:
The problem studied in this research is that of developing a set of formal statistical rules for the purpose of identifying the keywords of a document‐words likely to be useful as index terms for that document. The research was prompted by the observation, made by a number of writers, that non‐specialty words, words which possess little value for indexing purposes, tend to be distributed at random in a collection of documents. In contrast, specialty words are not so distributed. In Part I of the study, a mixture of two Poisson distributions is examined in detail as a model of specialty word distribution, and formulas expressing the three parameters of the model in terms of empirical frequency statistics are derived. The fit of the model is tested on an experimental document collection and found to be acceptable for the purposes of the study. A measure intended to identify specialty words, consistent with the 2‐Poisson model, is proposed and evaluated.
Date: 1975
References: Add references at CitEc
Citations: View citations in EconPapers (1)
Downloads: (external link)
https://doi.org/10.1002/asi.4630260402
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:bla:jamest:v:26:y:1975:i:4:p:197-206
Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1097-4571
Access Statistics for this article
More articles in Journal of the American Society for Information Science from Association for Information Science & Technology
Bibliographic data for series maintained by Wiley Content Delivery ().