Error evaluation for stemming algorithms as clustering algorithms

Julie B. Lovins

Journal of the American Society for Information Science, 1971, vol. 22, issue 1, 28-40

Abstract: This paper presents mathematical evaluation measures to characterize the effect of known erroneous performance by stemming routines, and generalizes these procedures to other types of nonstatistical clustering algorithms. When clusters, or groups of intrinsically related elements, are split into smaller groups (by under‐matching the elements), there is a loss in recall in information retrieval; larger groups (caused by over‐matching) induce a loss in precision or relevance. The magnitude of error is taken to be a function of frequencies of cluster elements. When these are words in a subject‐term index generated by a stemming algorithm, retrieval capability is also affected by the strength of the algorithm, the size and content of the stemmed index, and the number of words in a query. The present Project Intrex stemming algorithm has estimated stemming‐error losses of 4% in recall and 1% in relevance on one‐word queries; the former could be reduced to almost zero by straightforward corrections of known errors in the algorithm. An expanded probabilistic model is introduced to handle a more general case in which any element need not belong unambiguously to a single cluster. Error evaluation in document classification and thesauri also is discussed in broad terms.
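
The abstract's distinction between under-matching and over-matching can be illustrated with a toy calculation. The sketch below is not the paper's measure. It assumes one-word queries drawn in proportion to word frequency, a query that retrieves every index occurrence sharing its stem, and relevance defined by a hand-assigned "true" cluster. The function name, the words, their frequencies, and the stems are all hypothetical.

```python
from collections import defaultdict


def stemming_error_losses(freq, true_cluster, stem):
    """Return (recall_loss, precision_loss) for one-word queries.

    Illustrative model only (an assumption, not the paper's formulas):
    each word is queried with probability proportional to its frequency,
    retrieval returns all words sharing the query's stem, and the relevant
    set is the query's true cluster. Sets are weighted by word frequency.
    """
    by_cluster = defaultdict(set)
    by_stem = defaultdict(set)
    for word in freq:
        by_cluster[true_cluster[word]].add(word)
        by_stem[stem[word]].add(word)

    total = sum(freq.values())
    recall = 0.0
    precision = 0.0
    for word, f in freq.items():
        relevant = by_cluster[true_cluster[word]]
        retrieved = by_stem[stem[word]]
        hit = sum(freq[w] for w in relevant & retrieved)
        weight = f / total
        recall += weight * hit / sum(freq[w] for w in relevant)
        precision += weight * hit / sum(freq[w] for w in retrieved)
    return 1.0 - recall, 1.0 - precision


# Hypothetical index: word -> occurrence count.
freq = {"absorb": 6, "absorption": 4, "general": 5, "generate": 5}
# Hand-assigned clusters of intrinsically related words (assumed).
true_cluster = {"absorb": "A", "absorption": "A", "general": "B", "generate": "C"}
# Hypothetical stemmer output: splits cluster A (under-matching)
# and merges clusters B and C (over-matching).
stem = {"absorb": "absorb", "absorption": "absorpt",
        "general": "gener", "generate": "gener"}

recall_loss, precision_loss = stemming_error_losses(freq, true_cluster, stem)
print(f"recall loss: {recall_loss:.2f}, precision loss: {precision_loss:.2f}")
```

On this made-up data the split of absorb/absorption costs recall only, and the merge of general/generate costs precision only, which mirrors the direction of the losses described in the abstract.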

Date: 1971

Downloads: (external link)
https://doi.org/10.1002/asi.4630220105

Persistent link: https://EconPapers.repec.org/RePEc:bla:jamest:v:22:y:1971:i:1:p:28-40

Handle: RePEc:bla:jamest:v:22:y:1971:i:1:p:28-40