Publication

Bulk-Synchronous On-Line Crawling on Clusters of Computers

Source:

16th Euromicro International Conference on Parallel, Distributed and Network-based Processing (EuroPDP 2008), IEEE-CS (2008)

Abstract:

This paper describes the design of a software module devised to perform the periodic retrieval of Web documents for a search engine able to accept on-line updates in a concurrent manner. On-line updates come in the form of insertions of new documents or updates of existing ones, all of them mixed with the usual user queries. The search engine is bulk-synchronous, which allows it to deal efficiently with the concurrency control problem. The crawler is also bulk-synchronous so that it can be integrated into the same $P$-processor cluster executing the search engine. This paper describes the crawler and evaluates its practical feasibility. Document URLs are distributed onto processors by Web site: each processor is in charge of retrieving the documents belonging to a subset of all Web sites. We present an evaluation of the performance of the proposed scheme using a Web sample of 2.5 million documents.
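
The site-based distribution of URLs mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that each processor handles a subset of Web sites; it does not say how sites are assigned. The hash-based assignment below, and the names `site_of`, `processor_for`, and `partition`, are assumptions made for illustration, not the paper's method.

```python
import zlib
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Return the lowercased host name, used here as the Web-site key."""
    return (urlsplit(url).hostname or "").lower()


def processor_for(url: str, num_processors: int) -> int:
    """Map a URL to a processor index in [0, num_processors).

    Assumption: sites are assigned by hashing the host name. crc32 is used
    because it is deterministic across runs, unlike Python's built-in
    hash() on strings.
    """
    return zlib.crc32(site_of(url).encode("utf-8")) % num_processors


def partition(urls, num_processors):
    """Group URLs into one work list per processor."""
    buckets = [[] for _ in range(num_processors)]
    for url in urls:
        buckets[processor_for(url, num_processors)].append(url)
    return buckets
```

With this kind of assignment, all documents of one site land on the same processor, so per-site concerns such as politeness delays can be handled locally without coordination between processors.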

Download:

© 2009 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.