Using corpus statistics to remove redundant words in text categorization
Yiming Yang and John Wilbur
Journal of the American Society for Information Science, 1996, vol. 47, issue 5, 357-369
Abstract:
This article studies aggressive word removal in text categorization to reduce the noise in free texts and to enhance the computational efficiency of categorization. We use a novel stop word identification method to automatically generate domain specific stoplists which are much larger than a conventional domain‐independent stoplist. In our tests with three categorization methods on text collections from different domains/applications, significant numbers of words were removed without sacrificing categorization effectiveness. In the test of the Expert Network method on CACM documents, for example, an 87% removal of unique words reduced the vocabulary of documents from 8,002 distinct words to 1,045 words, which resulted in a 63% time savings and a 74% memory savings in the computation of category ranking, with a 10% precision improvement on average over not using word removal. It is evident in this study that automated word removal based on corpus statistics has a practical and significant impact on the computational tractability of categorization methods in large databases. © 1996 John Wiley & Sons, Inc.
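The following is a minimal sketch of corpus-driven stoplist generation, written only to illustrate the general idea in the abstract. The abstract does not say which corpus statistic the authors use, so the document-frequency criterion here is an assumption, not the article's method. The function names `build_stoplist`, `remove_words`, and `removal_rate`, both thresholds, and the toy documents are hypothetical.

```python
from collections import Counter


def build_stoplist(docs, max_doc_ratio=0.5, min_doc_count=2):
    """Collect words to remove, judged by document frequency.

    Assumption: a word is treated as redundant if it occurs in more than
    max_doc_ratio of the documents or in fewer than min_doc_count of them.
    The article's own statistic is not given in the abstract.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return {
        word
        for word, df in doc_freq.items()
        if df / n_docs > max_doc_ratio or df < min_doc_count
    }


def remove_words(tokens, stoplist):
    """Drop every token that appears in the stoplist."""
    return [t for t in tokens if t not in stoplist]


def removal_rate(before, after):
    """Fraction of the vocabulary removed, given sizes before and after."""
    return 1 - after / before


# Toy corpus (hypothetical), already tokenized.
docs = [
    "the parser reads the input file".split(),
    "the compiler reads the source file".split(),
    "the network routes the packet".split(),
    "the compiler emits the packet".split(),
]
stoplist = build_stoplist(docs)
filtered = [remove_words(d, stoplist) for d in docs]
```

With the vocabulary sizes reported in the abstract, `removal_rate(8002, 1045)` is about 0.87, which matches the stated 87% removal of unique words.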
Date: 1996
Downloads: (external link)
https://doi.org/10.1002/(SICI)1097-4571(199605)47:53.0.CO;2-V
Persistent link: https://EconPapers.repec.org/RePEc:bla:jamest:v:47:y:1996:i:5:p:357-369
Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1097-4571
More articles in Journal of the American Society for Information Science from Association for Information Science & Technology