Page-Level Template Detection via Isotonic Smoothing
Source:
16th International World Wide Web Conference (2007)
Abstract:
We develop a novel framework for the page-level template detection
problem. Our framework is built on two main ideas. The first is the
automatic generation of training data for a classifier that, given a
page, assigns a templateness score to every DOM node of the page. The
second is the global smoothing of these per-node classifier scores by
solving a regularized isotonic regression problem; the latter follows
from a simple yet powerful abstraction of templateness on a page. Our
extensive experiments on human-labeled test data show that our approach
detects templates effectively.
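
To make the smoothing idea concrete, the following is a minimal sketch of isotonic regression on per-node scores. It is not the method of the paper. It makes three assumptions that the abstract does not state: scores are smoothed along a single root-to-leaf path rather than over the whole DOM tree, templateness is taken to be non-decreasing from root to leaf, and the fit is plain least squares with no regularization term. The technique used here is the standard pool-adjacent-violators algorithm, and the function name isotonic_non_decreasing is hypothetical.

```python
from typing import List


def isotonic_non_decreasing(scores: List[float]) -> List[float]:
    """Least-squares fit of a non-decreasing sequence to `scores`.

    Uses pool-adjacent-violators (PAV). Assumption for illustration only:
    `scores` are raw classifier scores along one root-to-leaf path, ordered
    from root to leaf, and the smoothed scores must not decrease.
    """
    sums: List[float] = []   # sum of the raw scores in each block
    counts: List[int] = []   # number of nodes in each block

    for s in scores:
        sums.append(s)
        counts.append(1)
        # Merge the last two blocks while their means violate monotonicity.
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            last_sum = sums.pop()
            last_count = counts.pop()
            sums[-1] += last_sum
            counts[-1] += last_count

    # Every node in a block receives the block mean as its smoothed score.
    smoothed: List[float] = []
    for total, count in zip(sums, counts):
        smoothed.extend([total / count] * count)
    return smoothed


if __name__ == "__main__":
    # Hypothetical raw scores for four nodes on one path, root first.
    print(isotonic_non_decreasing([0.25, 0.75, 0.25, 1.0]))
```

In this example the second and third scores violate the assumed ordering, so they are pooled and both replaced by their mean. A tree-structured, regularized variant as described in the abstract would need a different solver.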