Ad Indexing & Retrieval

Jun 16, 2009

The sponsored search advertiser ecosystem is highly dynamic, with advertisers continually modifying their advertising campaigns. The index of currently active ads must respond quickly to advertiser changes while still allowing fast retrieval of the most relevant ads for a given search query. Solving this problem draws on many areas, including indexing, feature generation, NLP and relevance modeling. The ad indexing and retrieval (AIR) project aims to develop new algorithms that generate a first-pass set of highly relevant candidate ads given the user query and other user context. To satisfy performance constraints, first-pass retrieval of ads uses the efficient WAND algorithm developed at Yahoo Research.

To improve the relevance of the candidate set, the project is actively working on four components: developing a scoring framework that uses query-dependent term importance weighting; using query expansions to improve advanced match coverage on tail queries; annotating the query and ad with important semantic concepts such as geographic locations (geo) and named entities; and generating features for a second-pass relevance model that filters low-relevance ads out of the candidate set.

The first pass of retrieval typically requires scoring frameworks that separate the query-dependent part of the score from the document weights, so that indexing and retrieval stay efficient. In this context, we are looking at improving traditional vector space and language modeling scoring techniques. However, in most standard formulations of vector similarity, query likelihood via language models, or the probabilistic ranking approach, the weight of a query term is typically independent of the rest of the query.
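As a rough illustration of what separating query-dependence from document weights can look like, the sketch below scores each ad as a sum, over matching query terms, of a query-side weight times a document-side weight stored in an inverted index. It is a minimal sketch under stated assumptions, not the AIR implementation and not WAND itself: the names build_index and first_pass, the toy ads and all weights are hypothetical, and the loop scores every posting exhaustively, whereas WAND would use per-term upper bounds to skip ads that cannot reach the top k.

```python
import heapq
from collections import defaultdict


def build_index(ads):
    """Map each term to a list of (ad_id, document-side weight) postings."""
    index = defaultdict(list)
    for ad_id, term_weights in ads.items():
        for term, weight in term_weights.items():
            index[term].append((ad_id, weight))
    return index


def first_pass(index, query_weights, k=2):
    """Score ads as sum(query_weight * doc_weight) over shared terms; return the top k."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        # Document weights come straight from the index; only q_weight
        # depends on the query, so the index never has to be rebuilt per query.
        for ad_id, d_weight in index.get(term, []):
            scores[ad_id] += q_weight * d_weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical ads with precomputed document-side term weights.
ADS = {
    "ad1": {"cheap": 0.5, "flights": 1.0},
    "ad2": {"flights": 0.75, "paris": 1.0},
    "ad3": {"hotel": 1.0, "paris": 0.5},
}
INDEX = build_index(ADS)

# Hypothetical query-side weights: "flights" is judged more important than "paris".
QUERY = {"flights": 2.0, "paris": 1.0}
TOP = first_pass(INDEX, QUERY, k=2)
print(TOP)
```

Because the query weights are supplied at retrieval time, a query-dependent weighting scheme can change them per query without touching the indexed document weights.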
We are exploring an alternative scoring framework based on the probabilistic ranking principle. It can use query-dependent term importance weighting effectively while providing a statistical framework for modeling several practical aspects of retrieval, such as stop words, coordination level ranking, and required words, where query terms deemed important are automatically enforced in the document.

Documents in sponsored search are typically more concise than standard web documents, especially if we rely only on the advertiser-provided creative and keyword text. With such short documents, retrieval using a relatively short user query or a tail query can produce an empty candidate set unless we effectively expand the query using query rewrites, translations, or even search documents retrieved by the Yahoo search engine. We are investigating approaches that expand the query so as to reinforce rather than dilute the original query intent.

Increasing ad coverage on tail queries can be especially difficult for geo queries and queries with named entities. We would like to identify the entities in the original query using Yahoo Search query annotation tools and, given those entities, generate new features and scoring algorithms that customize retrieval for query slices containing typical entity types such as geo, named entities, lyrics and product names.

Finally, much of our focus in the first retrieval pass is on keeping recall high while maintaining the relevance or quality of the candidate set. It is also important, however, to generate features that can be used effectively by the second-pass relevance scoring framework, which does not require features to reside in an inverted index of ads.
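To make the query expansion idea above concrete, the following sketch shows one simple way an expansion could reinforce rather than dilute the original intent: original query terms keep full weight, while terms added from a rewrite table receive a smaller weight. This is an illustrative assumption, not the project's method; expand_query, score_ads, the rewrite table, the toy ads and the 0.5 expansion weight are all hypothetical.

```python
def expand_query(query_terms, rewrites, expansion_weight=0.5):
    """Return term -> weight: original terms get 1.0, rewrite terms a smaller weight."""
    weights = {term: 1.0 for term in query_terms}
    for term in query_terms:
        for alt in rewrites.get(term, []):
            # setdefault never overrides the full weight of an original term.
            weights.setdefault(alt, expansion_weight)
    return weights


def score_ads(ads, weights):
    """Sum the weights of matched terms; ads that match no term are left out."""
    scores = {}
    for ad_id, ad_terms in ads.items():
        score = sum(w for term, w in weights.items() if term in ad_terms)
        if score > 0:
            scores[ad_id] = score
    return scores


# Hypothetical short ad documents, represented as sets of keywords.
ADS = {
    "ad1": {"sneakers", "sale"},
    "ad2": {"trainers", "running"},
    "ad3": {"laptop", "sale"},
}
# Hypothetical rewrite table mapping a tail term to more common alternatives.
REWRITES = {"kicks": ["sneakers", "trainers"]}

QUERY = ["kicks"]
BEFORE = score_ads(ADS, {term: 1.0 for term in QUERY})
AFTER = score_ads(ADS, expand_query(QUERY, REWRITES))
print(BEFORE, AFTER)
```

On this toy data the unexpanded tail query matches no ad, while the expanded query recovers the two shoe ads. Since an expansion term can never outweigh an original term, an ad that matches the user's own words still ranks above one that matches only a rewrite.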