Set Cover at Web Scale

Aug 11, 2015

The classic Set Cover problem requires selecting a minimum size subset $\mathcal{A} \subseteq \mathcal{F}$ from a family of finite subsets $\mathcal{F}$ of $\mathcal{U}$ such that the elements covered by $\mathcal{A}$ are the ones covered by $\mathcal{F}$. It naturally occurs in many settings in web search, web mining and web advertising. The greedy algorithm that iteratively selects a set in $\mathcal{F}$ that covers the most uncovered elements, yields an optimum $(1+\ln |\mathcal{U}|)$-approximation but is inherently sequential. In this work we give the first MapReduce Set Cover algorithm that scales to problem sizes of ~1 trillion elements and runs in $\log_p\Delta$ iterations for a nearly optimum approximation ratio of $p\ln\Delta$, where $\Delta$ is the cardinality of the largest set in $\mathcal{F}$.

A web crawler is a system for bulk downloading of web pages. Given a set of seed URLs, the crawler downloads and extracts the hyperlinks embedded in them and schedules the crawling of the pages addressed by those hyperlinks for a subsequent iteration. While the average page out-degree is ~50, the crawled corpus  grows at a much smaller rate, implying a significant outlink overlap. Using our MapReduce Set Cover heuristic as a building block, we present the first large-scale seed generation algorithm that scales to ~20 billion nodes and discovers new pages at a rate ~4x faster than that obtained by prior art heuristics.

  • ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2015)
  • Conference/Workshop Paper