Mining search engine query logs via suggestion sampling

Publication
Jan 1, 2008
Abstract

Many search engines and other web applications suggest auto-completions as the user types in a query. The suggestions are generated from hidden underlying databases, such as query logs, directories, and lexicons. These databases consist of interesting and useful information, but they are typically not directly accessible.

In this paper we describe two algorithms for sampling suggestions using only the public suggestion interface. One of the algorithms samples suggestions uniformly at random and the other samples suggestions proportionally to their popularity. These algorithms can be used to mine the hidden suggestion databases. Example applications include comparison of popularity of given keywords within a search engine's query log, estimation of the volume of commercially-oriented queries in a query log, and evaluation of the extent to which a search engine exposes its users to negative content.

Our algorithms employ Monte Carlo methods in order to obtain unbiased samples from the suggestion database. Empirical analysis using a publicly available query log demonstrates that our algorithms are efficient and accurate. Results of experiments on two major suggestion services are also provided.
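
To make the sampling idea concrete, the following is a minimal sketch of one generic Monte Carlo approach: a random walk over query prefixes, corrected by importance weights. It is not the paper's algorithm. The toy database, the `suggest` function, the alphabet, and the result cap `K` are all hypothetical stand-ins for a real suggestion interface, and the walk assumes that no stored suggestion is itself a prefix with `K` or more completions.

```python
import random

# Hypothetical toy database standing in for a hidden suggestion database.
DATABASE = ["ab", "act", "at", "bat", "cab", "cat", "tab", "tact"]
ALPHABET = "abct"  # assumed toy alphabet
K = 2              # assumed cap on suggestions returned per prefix


def suggest(prefix):
    """Hypothetical public interface: at most K completions of `prefix`."""
    return sorted(s for s in DATABASE if s.startswith(prefix))[:K]


def random_walk(rng):
    """Extend the prefix one random character at a time until the interface
    returns fewer than K results, then pick one of them uniformly.
    Returns (suggestion, sampling probability), or None if the walk dead-ends.
    Assumes no stored suggestion is itself a prefix with K or more completions.
    """
    prefix, prob = "", 1.0
    while True:
        results = suggest(prefix)
        if not results:
            return None  # no suggestion starts with this prefix
        if len(results) < K:
            return rng.choice(results), prob / len(results)
        prefix += rng.choice(ALPHABET)
        prob /= len(ALPHABET)


def weighted_fraction(samples, predicate):
    """Self-normalized importance-sampling estimate of the fraction of
    suggestions satisfying `predicate`; each sample is weighted by 1/prob."""
    total = sum(1.0 / p for _, p in samples)
    hits = sum(1.0 / p for s, p in samples if predicate(s))
    return hits / total


rng = random.Random(0)
walks = (random_walk(rng) for _ in range(20000))
samples = [w for w in walks if w is not None]
starts_with_c = lambda s: s.startswith("c")
naive = sum(starts_with_c(s) for s, _ in samples) / len(samples)
print("naive estimate:   ", naive)
print("weighted estimate:", weighted_fraction(samples, starts_with_c))
print("true fraction:    ", sum(map(starts_with_c, DATABASE)) / len(DATABASE))
```

In this toy setup a single walk returns "bat" with probability 1/4 but "cab" with probability only 1/64, so counting raw samples under-represents suggestions that sit deep in the prefix tree. Weighting each sample by the inverse of its sampling probability compensates for that skew. The self-normalized ratio used here is consistent rather than exactly unbiased, which is one reason to treat it as an illustration only.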


  • VLDB, Auckland, NZ
