Hierarchical Topic Segmentation of Websites
Source:
12th International Conference on Knowledge Discovery and Data Mining (KDD), Philadelphia, USA (2006)
Abstract:
In this paper, we consider the problem of identifying and
segmenting topically cohesive regions in the URL tree of a large
website. Each page of the website is assumed to have a topic label or
a distribution over topic labels generated by a standard classifier.
We develop a set of cost measures characterizing the benefit accrued by
introducing a segmentation of the site based on the topic labels. We
propose a general framework to use these measures for describing the
quality of a segmentation; we also provide an efficient algorithm to find the
best segmentation in this framework. Extensive experiments on human-labeled
data confirm the soundness of our framework and suggest that a
judicious choice of cost measures allows the algorithm to produce
surprisingly accurate topical segmentations.
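
To make the setting concrete, the following is a minimal sketch of one way a segmentation of a URL tree could be scored and optimized. It is not the algorithm from the paper; the abstract does not specify the cost measures, so everything here is an assumption made for illustration: the `Node` class, the `segment` function, a fixed penalty `alpha` charged per segment, and a mismatch cost that counts pages whose label differs from the topic of their segment. Each page belongs to the segment rooted at its nearest ancestor chosen as a segment root, and a simple dynamic program over the tree finds the cheapest set of roots under these assumed costs.

```python
from collections import Counter


class Node:
    """A directory in the URL tree (hypothetical structure for this sketch)."""

    def __init__(self, name, labels=(), children=()):
        self.name = name
        self.labels = Counter(labels)  # topic labels of pages directly at this node
        self.children = list(children)


def segment(root, topics, alpha):
    """Return (total_cost, segments) under the assumed cost model.

    segments is a list of (node name, topic) pairs, one per segment root.
    alpha is the assumed fixed cost of opening a segment.
    """
    memo = {}

    def g(v, t):
        # Cost of v's subtree when v's own pages sit in a segment labeled t.
        key = (id(v), t)
        if key not in memo:
            mismatch = sum(v.labels.values()) - v.labels[t]
            memo[key] = mismatch + sum(f(c, t)[0] for c in v.children)
        return memo[key]

    def best_topic(v):
        # Cheapest topic for a new segment rooted at v.
        return min(topics, key=lambda t: g(v, t))

    def f(v, t):
        # Either inherit topic t from the enclosing segment,
        # or pay alpha to open a new segment at v.
        keep = g(v, t)
        t_new = best_topic(v)
        cut = alpha + g(v, t_new)
        if cut < keep:
            return cut, True, t_new
        return keep, False, t

    # The root always opens the first segment.
    t_root = best_topic(root)
    total = alpha + g(root, t_root)
    segments = [(root.name, t_root)]

    def walk(v, t):
        # Replay the optimal decisions to list the segment roots.
        for c in v.children:
            _, opened, topic = f(c, t)
            if opened:
                segments.append((c.name, topic))
            walk(c, topic)

    walk(root, t_root)
    return total, segments


# Toy site: a news site with one sports section.
site = Node("/", ["news", "news"], [
    Node("/sports", ["sports"] * 5),
    Node("/about", ["news"]),
])
print(segment(site, ["news", "sports"], alpha=2))
# (4, [('/', 'news'), ('/sports', 'sports')])
```

In this toy model, raising `alpha` makes new segments more expensive and yields coarser segmentations, which illustrates in a small way why the choice of cost measure matters.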