ARCOMEM Crawling Architecture

Vassilis Plachouras, Florent Carpentier, Muhammad Faheem, Julien Masanès, Thomas Risse, Pierre Senellart, Patrick Siehndel and Yannis Stavrakas
Additional contact information
Vassilis Plachouras: Institute for the Management of Information Systems, Athena Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi 15125, Greece
Florent Carpentier: Internet Memory Foundation, 45 ter rue de la Révolution, 93100 Montreuil, France
Muhammad Faheem: CNRS LTCI, Institut Mines-Télécom, Télécom ParisTech, 46 rue Barrault, 75634 Paris Cedex 13, France
Julien Masanès: Internet Memory Foundation, 45 ter rue de la Révolution, 93100 Montreuil, France
Thomas Risse: L3S Research Center, University of Hannover, Appelstr. 9a, 30167 Hannover, Germany
Pierre Senellart: CNRS LTCI, Institut Mines-Télécom, Télécom ParisTech, 46 rue Barrault, 75634 Paris Cedex 13, France
Patrick Siehndel: L3S Research Center, University of Hannover, Appelstr. 9a, 30167 Hannover, Germany
Yannis Stavrakas: Institute for the Management of Information Systems, Athena Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi 15125, Greece

Future Internet, 2014, vol. 6, issue 3, 518-541

Abstract: The World Wide Web is the largest information repository available today. However, this information is highly volatile, and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limitations and to provide flexible, adaptive and intelligent content acquisition, relying on social media to create topical Web archives. In this article, we focus on ARCOMEM’s crawling architecture. We introduce the overall architecture and describe its modules, such as the online analysis module, which computes a priority for the Web pages to be crawled, and the Application-Aware Helper, which takes into account the type of Web sites and applications to extract structure from crawled content. We also describe a large-scale distributed crawler that has been developed, as well as the modifications we have implemented to adapt Heritrix, an open-source crawler, to the needs of the project. Our experimental results from real crawls show that ARCOMEM’s crawling architecture is effective in acquiring focused information about a topic and in leveraging information from social media.

Keywords: web archiving; crawling architecture; content acquisition
JEL-codes: O3
Date: 2014

Downloads:
https://www.mdpi.com/1999-5903/6/3/518/pdf (application/pdf)
https://www.mdpi.com/1999-5903/6/3/518/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:6:y:2014:i:3:p:518-541:d:39354

Future Internet is currently edited by Ms. Grace You

Bibliographic data for series maintained by MDPI Indexing Manager.

 
Page updated 2025-03-19
Handle: RePEc:gam:jftint:v:6:y:2014:i:3:p:518-541:d:39354