Bulk-Synchronous On-Line Crawling on Clusters of Computers
Source:
16th Euromicro International Conference on Parallel, Distributed and Network-based Processing (EuroPDP 2008), IEEE-CS (2008)
Abstract:
This paper describes the design of a software module that periodically retrieves Web documents for a search engine able to accept on-line updates concurrently. On-line updates come in the form of insertions of new documents or updates to existing ones, all of them interleaved with the usual user queries.
The search engine is bulk-synchronous, which allows it to deal efficiently with the concurrency control problem. The crawler is also bulk-synchronous so that it can be integrated into the same $P$-processor cluster that executes the search engine.
This paper describes and evaluates the practical feasibility of such a crawler.
Document URLs are distributed onto processors by Web site: each processor is in charge of retrieving the documents belonging to a subset of the Web sites. We present an evaluation of the performance of the proposed scheme using a Web sample of 2.5 million documents.
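
As a rough illustration of distributing URLs by Web site, the sketch below maps every URL of a given host to the same one of $P$ processors. The abstract does not say how sites are assigned to processors, so hashing the host name is an assumption made here for illustration only, and the names `site_of`, `processor_for` and `partition` are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Return the host name of a URL, used here as the Web-site key."""
    return (urlsplit(url).hostname or "").lower()


def processor_for(url: str, num_processors: int) -> int:
    """Map a URL to a processor index in [0, num_processors).

    Assumption: sites are assigned by hashing the host name. A stable
    hash (MD5) is used so that the mapping is identical on every node.
    """
    digest = hashlib.md5(site_of(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_processors


def partition(urls, num_processors):
    """Group URLs into one list per processor, keeping each site together."""
    buckets = [[] for _ in range(num_processors)]
    for url in urls:
        buckets[processor_for(url, num_processors)].append(url)
    return buckets
```

With this kind of mapping, all pages of one site are fetched by a single processor, which makes it straightforward to apply per-site rules such as request rate limits locally. A different assignment policy, for example one that balances sites by their number of documents, would only change `processor_for`.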