For contextual advertising, one of the goals is to place relevant ads on Web pages, because relevant ads are expected to draw more clicks and give users a better experience.
To achieve this goal, we first need to understand what a Web page is about. A typical page consists of multiple sections, including title, body, and navigation. For example, a Yahoo Answers page usually has the question summary in the title, a short description and all the answers in the body, links to the categories that the question belongs to, and links to international Yahoo Answers sites, where the anchor texts are the country names.
With a diverse page like this, we need to confine the search for relevant ads to those that match the main content of the page. In the above example, it is probably not a good idea to display a travel ad for a particular country just because it matches a country name in the international links. However, not all links should be ignored: for instance, the category links provide useful information about the page content. Simple heuristics, such as discarding all link text, may therefore not work well. One interesting research problem is how to leverage visual and semantic features to reliably determine the main content of the page.
After we extract the main content of a page, the next question is whether we should treat all sections of the main content in the same way or weight some sections more heavily. It is generally believed that the title, headings, and emphasized or strong text on a page carry more important information than plain text in the body. Similarly, an ad can be divided into multiple sections, such as a title, a short description, and a display URL. In the past, the relative importance weights for these sections were decided by heuristics and manual tuning. To address this issue more systematically, we have developed a machine learning framework that automatically learns the weights from training data to maximize relevance or other utility functions.
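To make the idea of section weights concrete, the following is a minimal sketch and not the framework described above. It assumes that pages and ads are dictionaries of section text, that similarity is a plain term-frequency cosine, and that the weights are hand-picked illustrative numbers rather than learned values. All function names, section names, and example texts are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term frequencies (whitespace tokenization, an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_similarity(page, ad, weights):
    """Sum of per-section-pair cosine similarities, each scaled by its weight.

    `weights` maps (page_section, ad_section) -> weight. In a learned setting
    these would be fitted from training data; here they are illustrative.
    """
    score = 0.0
    for (page_section, ad_section), weight in weights.items():
        page_vec = tf_vector(page.get(page_section, ""))
        ad_vec = tf_vector(ad.get(ad_section, ""))
        score += weight * cosine(page_vec, ad_vec)
    return score


# Hypothetical page and ads, for illustration only.
page = {
    "title": "cheap flights to paris",
    "body": "compare airline fares and book hotel rooms",
}
ad_flights = {"title": "cheap flights paris", "description": "book now"}
ad_hotels = {"title": "hotel deals", "description": "compare hotel rooms and book"}

# Two hand-picked weightings: one favours the page title, one the body.
title_heavy = {("title", "title"): 0.8, ("body", "description"): 0.2}
body_heavy = {("title", "title"): 0.1, ("body", "description"): 0.9}
```

With the title-heavy weights the flights ad scores higher, and with the body-heavy weights the hotels ad does, which is why the choice of weights matters and why learning them from data is attractive.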
However, retrieval based purely on semantic similarity between pages and ads is sometimes unsatisfactory, because user response to an ad is not perfectly correlated with semantic similarity. Clicks are biased by the position and size of the ad unit on the page and by the other ads in the same ad unit. It is therefore helpful to consider click feedback in addition to the relevance measure based on semantic similarity. We have developed a number of ranking algorithms that maximize relevance and/or click-through rate using both editorially judged data and click data.
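One simple way to combine the two signals is sketched below. It is not one of the ranking algorithms mentioned above. It assumes a per-position examination probability (`position_bias`) is known, discounts impressions by that probability to reduce position bias, smooths the resulting click-through rate toward a prior, and blends it linearly with a relevance score. Every name, default value, and constant here is a hypothetical choice for illustration.

```python
def position_adjusted_ctr(clicks, impressions_by_position, position_bias,
                          prior_ctr=0.01, prior_weight=100.0):
    """Smoothed click-through rate with impressions discounted by position.

    `impressions_by_position` maps position -> impression count.
    `position_bias` maps position -> assumed probability that a user looks
    at that position. The prior pulls ads with little data toward `prior_ctr`.
    """
    exposure = sum(count * position_bias[position]
                   for position, count in impressions_by_position.items())
    return (clicks + prior_ctr * prior_weight) / (exposure + prior_weight)


def blended_score(relevance, ctr, alpha=0.5, ctr_scale=0.05):
    """Linear blend of a relevance score in [0, 1] and a rescaled CTR.

    `ctr_scale` is an assumed CTR that maps to a click term of 1.0;
    higher CTRs are capped at 1.0.
    """
    click_term = min(ctr / ctr_scale, 1.0)
    return alpha * relevance + (1.0 - alpha) * click_term


# Assumed examination probabilities: position 2 is seen half as often.
position_bias = {1: 1.0, 2: 0.5}

# Same raw clicks and impressions, but shown at different positions.
ctr_top = position_adjusted_ctr(10, {1: 1000}, position_bias)
ctr_lower = position_adjusted_ctr(10, {2: 1000}, position_bias)
```

In this sketch the ad shown at the lower position receives the higher adjusted rate, because it earned the same number of clicks from less effective exposure. The `alpha` parameter controls how much the final score depends on semantic relevance versus click feedback.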