Publication

Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection

Source:

Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), ACM Press, Pennsylvania, USA (2006)

URL:

http://www.dcc.uchile.cl/~ccastill/papers/becchetti_06_automatic_link_spam_detection_rank_propagation.pdf

Keywords:

adversarial-ir

Abstract:

This paper describes a technique for automating the detection of Web link spam, that is, groups of pages that are linked together with the sole purpose of obtaining an undeservedly high score in search engines. The problem of Web spam is widespread and difficult to solve, mostly because the large size of Web collections makes many algorithms infeasible in practice. For spam detection we apply only link-based methods, that is, we study only the topology of the Web graph without looking at the contents of the pages. We compute Web page attributes by applying rank propagation and probabilistic counting over the Web graph. These attributes are used to build a classifier that is tested over a large collection of Web link spam. After ten-fold cross-validation, our best classifier can detect about 80% of the spam pages with a false-positive rate of 2%. This is competitive with state-of-the-art spam classifiers that use content attributes, and it is the first automatic classifier that achieves this precision using only link data.
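
As a rough illustration of what "rank propagation" over a link graph can look like, the sketch below runs a standard PageRank power iteration on a tiny hand-made graph. It is not taken from the paper and is not claimed to be the authors' exact method: the function name `pagerank`, the damping factor of 0.85, the iteration count, and the toy graph (three hypothetical pages `s1` to `s3` that exist only to link to `target`) are all assumptions made for illustration. Every linked page is assumed to appear as a key in the input dictionary.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration (illustrative sketch only).

    out_links maps each page to the list of pages it links to.
    Every page that is linked to must also appear as a key.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Score held by pages without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        # Each page passes an equal share of its damped score to its targets.
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: s1, s2, s3 form a small link farm around "target",
# while "a" and "b" simply link to each other.
graph = {
    "a": ["b"],
    "b": ["a"],
    "s1": ["target"],
    "s2": ["target"],
    "s3": ["target"],
    "target": ["s1", "s2", "s3"],
}
scores = pagerank(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph the page boosted by the link farm ends up with a higher score than the two ordinary pages, which is the kind of link-structure signal a link-based classifier could use as one attribute among several.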