EconPapers    
Economics at your fingertips  
 

A comparison of techniques to find mirrored hosts on the WWW

Krishna Bharat, Andrei Broder, Jeffrey Dean and Monika R. Henzinger

Journal of the American Society for Information Science, 2000, vol. 51, issue 12, 1114-1122

Abstract: We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information about Web pages easily available from Web proxies and crawlers. Identification of mirrored hosts can improve Web‐based information retrieval in several ways: first, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make inferences based on the explicit links among hypertext documents—mirroring perturbs their graph model and degrades performance. Third, mirroring information can be used to redirect users to alternate mirror sites to compensate for various failures, and can thus improve the performance of Web browsers and proxies. We evaluated four classes of “top‐down” algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information. Our best approach is one which combines five algorithms and achieved a precision of 0.57 for a recall of 0.86 considering 100,000 ranked host pairs.

Date: 2000
References: Add references at CitEc
Citations:

Downloads: (external link)
https://doi.org/10.1002/1097-4571(2000)9999:99993.0.CO;2-0

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:bla:jamest:v:51:y:2000:i:12:p:1114-1122

Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1097-4571

Access Statistics for this article

More articles in Journal of the American Society for Information Science from Association for Information Science & Technology
Bibliographic data for series maintained by Wiley Content Delivery ().

 
Page updated 2025-03-19
Handle: RePEc:bla:jamest:v:51:y:2000:i:12:p:1114-1122