Using corpus statistics to remove redundant words in text categorization

Yiming Yang and John Wilbur

Journal of the American Society for Information Science, 1996, vol. 47, issue 5, 357-369

Abstract: This article studies aggressive word removal in text categorization to reduce the noise in free texts and to enhance the computational efficiency of categorization. We use a novel stop word identification method to automatically generate domain specific stoplists which are much larger than a conventional domain‐independent stoplist. In our tests with three categorization methods on text collections from different domains/applications, significant numbers of words were removed without sacrificing categorization effectiveness. In the test of the Expert Network method on CACM documents, for example, an 87% removal of unique words reduced the vocabulary of documents from 8,002 distinct words to 1,045 words, which resulted in a 63% time savings and a 74% memory savings in the computation of category ranking, with a 10% precision improvement on average over not using word removal. It is evident in this study that automated word removal based on corpus statistics has a practical and significant impact on the computational tractability of categorization methods in large databases. © 1996 John Wiley & Sons, Inc.
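The abstract does not describe how the stop word identification method works, so the following is only a minimal sketch of the general idea of deriving a domain-specific stoplist from corpus statistics. It assumes a plain document-frequency threshold as a stand-in, not the article's own method. The names `build_stoplist`, `max_doc_fraction`, and `removal_rate` are hypothetical. The last function simply checks the vocabulary arithmetic quoted for the CACM test.

```python
from collections import Counter


def build_stoplist(docs, max_doc_fraction=0.5):
    """Illustrative stand-in (an assumption, not the article's method):
    treat a word as a stop word when it occurs in more than
    max_doc_fraction of the documents in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    return {w for w, df in doc_freq.items() if df / n_docs > max_doc_fraction}


def removal_rate(vocab_before, vocab_after):
    """Fraction of distinct words removed from the vocabulary."""
    return 1 - vocab_after / vocab_before


# Toy corpus, invented for illustration only.
docs = ["the cat sat", "the dog ran", "the cat ran", "a bird flew"]
stoplist = build_stoplist(docs)

# Vocabulary sizes reported in the abstract for the CACM test.
cacm_rate = removal_rate(8002, 1045)
```

With the figures from the abstract, `removal_rate(8002, 1045)` is about 0.869, which agrees with the reported 87% removal of unique words.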

Date: 1996
Downloads: (external link)
https://doi.org/10.1002/(SICI)1097-4571(199605)47:53.0.CO;2-V

Persistent link: https://EconPapers.repec.org/RePEc:bla:jamest:v:47:y:1996:i:5:p:357-369

Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1097-4571

More articles in Journal of the American Society for Information Science from Association for Information Science & Technology
Bibliographic data for series maintained by Wiley Content Delivery.

 
Page updated 2025-03-19
Handle: RePEc:bla:jamest:v:47:y:1996:i:5:p:357-369