
Retrieval testing with hypergeometric document models

W. John Wilbur

Journal of the American Society for Information Science, 1993, vol. 44, issue 6, 340-351

Abstract: If one could identify the source subject areas of documents and could compute the probability that any given document came from a given source, one could apply Bayes' theorem to compute the probability that a query document and any other document came from the same subject area (i.e., were related). Even correct prior probabilities could be assigned under this hypothesis by examining the whole database to obtain the probabilities with which different sources occur. While we do not know how to carry out this scheme in such a way as to account for all the information contained in documents, we show here how it may be realized in a limited way. A method of modeling the sources of documents is described which accounts for the information in global term weights. The methodology is based on the hypergeometric probability distribution. Such a source model may be fit closely to a real database and may be used to convert the real database to an abstract database in which document sources are known and model retrieval is the best retrieval possible based on model document content. We have constructed such an abstract model corresponding to a database of MEDLINE records. Tests of vector retrieval methods on the abstract model indicate they are near optimal but suggest minor improvement with correct parameter choices. Preliminary results based on a test set (human judged) from the real database support these results. © 1993 John Wiley & Sons, Inc.
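
Illustration (not from the paper): the abstract's central idea, applying Bayes' theorem to per-source document probabilities, can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration only: each hypothetical source is treated as a pool of N term occurrences of which K are one particular term, a document is a sample of n occurrences drawn without replacement (hence the hypergeometric likelihood), and the two documents' sources are assumed independent a priori. The paper's actual source model, which accounts for global term weights, is not given in the abstract and may well differ.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k) when n items are drawn without replacement from N items,
    K of which are 'successes' (here: occurrences of one term)."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def source_posterior(likelihoods, priors):
    """Bayes' theorem: P(source | doc) from P(doc | source) and P(source)."""
    joint = [lik * p for lik, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]

def prob_same_source(lik_query, lik_doc, priors):
    """P(both documents came from the same source), assuming the two
    sources are independent a priori (an illustrative assumption)."""
    post_q = source_posterior(lik_query, priors)
    post_d = source_posterior(lik_doc, priors)
    return sum(a * b for a, b in zip(post_q, post_d))

# Hypothetical toy setup: two sources, each a pool of 1000 term
# occurrences; the term of interest is common in A and rare in B.
sources = [(1000, 100), (1000, 10)]   # (N, K) per source, made-up values
priors = [0.5, 0.5]                   # made-up prior source probabilities
doc_len = 50                          # made-up document length n

def likelihoods(term_count):
    """P(doc | source) for a document containing the term term_count times."""
    return [hypergeom_pmf(term_count, N, K, doc_len) for N, K in sources]

query = likelihoods(6)       # query uses the term 6 times
similar = likelihoods(5)     # candidate with a similar term count
dissimilar = likelihoods(0)  # candidate that never uses the term

p_similar = prob_same_source(query, similar, priors)
p_dissimilar = prob_same_source(query, dissimilar, priors)
print(p_similar, p_dissimilar)
```

In this toy setting the candidate whose term count resembles the query's receives the higher same-source probability, which is the sense in which such a model could rank documents for retrieval.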

Date: 1993

Downloads: (external link)
https://doi.org/10.1002/(SICI)1097-4571(199307)44:63.0.CO;2-Z



Persistent link: https://EconPapers.repec.org/RePEc:bla:jamest:v:44:y:1993:i:6:p:340-351

Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1097-4571


More articles in Journal of the American Society for Information Science from Association for Information Science & Technology
Bibliographic data for series maintained by Wiley Content Delivery.

 
Page updated 2025-03-19
Handle: RePEc:bla:jamest:v:44:y:1993:i:6:p:340-351