Link‐based similarity measures for the classification of Web documents
Pável Calado,
Marco Cristo,
Marcos André Gonçalves,
Edleno S. de Moura,
Berthier Ribeiro‐Neto and
Nivio Ziviani
Journal of the American Society for Information Science and Technology, 2006, vol. 57, issue 2, 208-221
Abstract:
Traditional text‐based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text‐based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text‐based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.
Date: 2006
Downloads:
https://doi.org/10.1002/asi.20266
Persistent link: https://EconPapers.repec.org/RePEc:bla:jamist:v:57:y:2006:i:2:p:208-221
Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1532-2890
More articles in Journal of the American Society for Information Science and Technology from Association for Information Science & Technology
Bibliographic data for series maintained by Wiley Content Delivery.