Do Not Crawl in the DUST: Different URLs with Similar Text

Publication

Jan 1, 2009

[Work published prior to Yahoo]

Abstract

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, translates URLs to some canonical form, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL into others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual web pages. Search engines can benefit from this information to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
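
To make the notion of a URL-transformation rule concrete, here is a minimal sketch. It is not the DustBuster algorithm. It assumes, purely for illustration, that a rule is a substring substitution (`old` replaced by `new`), and it uses a naive count of how many logged URLs map onto another logged URL as a rough signal that the rule may be worth verifying. The function names, the example rule, and the `example.com` URLs are all hypothetical.

```python
def apply_rule(url, old, new):
    """Apply a hypothetical substring-substitution rule to a URL.

    Returns the rewritten URL, or None if the rule does not apply.
    Only the first occurrence of `old` is replaced.
    """
    if old not in url:
        return None
    return url.replace(old, new, 1)


def rule_support(urls, old, new):
    """Count distinct logged URLs whose rewritten form also appears in the log.

    This looks only at URL strings, never at page contents. It is a naive
    heuristic for illustration, not the method described in the paper.
    """
    seen = set(urls)
    support = 0
    for url in seen:
        rewritten = apply_rule(url, old, new)
        if rewritten is not None and rewritten != url and rewritten in seen:
            support += 1
    return support


# Hypothetical log entries (illustrative only).
log_urls = [
    "http://example.com/index.html",
    "http://example.com/",
    "http://example.com/news/index.html",
    "http://example.com/news/",
    "http://example.com/about",
]

support = rule_support(log_urls, "/index.html", "/")
print(support)
```

A rule with high support in the log would still need to be checked by fetching a small sample of page pairs and comparing their contents, since matching URL patterns alone do not guarantee similar text.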

Venue:

ACM Transactions on the Web (TWEB) Volume 3 Issue 1

Type:

Journal

Authors:

Ziv Bar-Yossef
Idit Keidar
Uri Schonfeld

BibTeX

@article{baryossef2009dust, author = {Ziv Bar-Yossef and Idit Keidar and Uri Schonfeld}, title = {Do Not Crawl in the {DUST}: Different {URLs} with Similar Text}, journal = {ACM Transactions on the Web (TWEB)}, volume = {3}, number = {1}, year = {2009} }
