Project

Squeeze Every Drop of Meaning from Data

What are the most appropriate advertisements to maximize click-through rates on a particular web page? What are the most relevant search results for a particular query?

These questions may seem simple enough, but coming up with the perfect answer is a problem of staggering proportions. Why? Because the data is often too high-dimensional, too sparse, or too noisy to support intelligent decisions.

For instance, billions of interactions take place between web pages and advertisements, but the vast majority of individual page-ad pairings occur so infrequently that it is very difficult to learn anything from them.

An advertisement for a mom and pop pizza joint, for example, may appear on a particular page only once or twice. And what if no one clicked on that ad? Does this really mean the click-through rate is zero? Or is it possible to learn more about these types of interactions even when the data is so limited?

Yahoo! Research scientists Deepak Agarwal, Srujana Merugu and Deepayan Chakrabarti have dedicated themselves to solving the puzzle. "The goal is to do a better job with all these mom and pop pizza stores by learning more about their common characteristics and global behavior," Agarwal says. "For us, this means figuring out how to aggregate data in a more intelligent way."
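One common way to picture "aggregating data more intelligently" is to pull an ad's raw click-through rate toward the pooled rate of a group of similar ads, so that an ad with two impressions and no clicks is not assigned a rate of exactly zero. The sketch below illustrates that general idea only and is not the researchers' method; the function name `smoothed_ctr`, the `prior_strength` parameter, and all the numbers are assumptions made for the example.

```python
def smoothed_ctr(clicks, impressions, group_clicks, group_impressions,
                 prior_strength=20.0):
    """Shrink an ad's raw click-through rate toward its group's pooled rate.

    prior_strength is a hypothetical tuning knob: it acts like that many
    extra impressions observed at the group's rate.
    """
    if group_impressions <= 0:
        raise ValueError("group_impressions must be positive")
    group_rate = group_clicks / group_impressions
    return (clicks + prior_strength * group_rate) / (impressions + prior_strength)


# Made-up numbers: a pizza ad shown twice with no clicks, in a group of
# similar local-restaurant ads with 50 clicks over 10,000 impressions.
estimate = smoothed_ctr(clicks=0, impressions=2,
                        group_clicks=50, group_impressions=10_000)
print(f"smoothed click-through rate: {estimate:.4f}")
```

With so few impressions, the estimate stays close to the group's rate instead of collapsing to zero; as an ad accumulates impressions, its own data increasingly dominates.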

It also means developing statistical learning models that do a good job of understanding extremely rare events—events that on their own may not matter a whole lot, but in a larger context are very important.

The researchers have created a lightweight tool that could allow different product groups within Yahoo! to squeeze the most insight from their data, including data that was previously seen as too sparse or noisy.

The Sponsored Search group, for instance, may initially use machine learning techniques and regression models to predict click-through rates for a particular ad on a particular page. It will factor in such contextual data as what time of day an ad was clicked on, which state or country the user was from, and what words were used in the title of the ad. In fact, the machine learning model will try to extract as many features as it can from the interaction between web pages and ads to predict future actions.
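To make the baseline step concrete, here is a minimal sketch of how contextual data of this kind might be turned into features and scored with a logistic regression model. It is an illustration under stated assumptions, not the Sponsored Search group's actual system: the function names, the feature naming scheme, and the weights are all hypothetical, and in practice the weights would be learned from logged clicks.

```python
import math


def extract_features(hour, region, ad_title):
    """Encode one page-ad interaction as sparse binary features."""
    features = {"bias": 1.0, f"hour={hour}": 1.0, f"region={region}": 1.0}
    for word in ad_title.lower().split():
        features[f"title_word={word}"] = 1.0
    return features


def predict_ctr(features, weights):
    """Logistic regression score: sigmoid of the weighted feature sum."""
    score = sum(weights.get(name, 0.0) * value
                for name, value in features.items())
    # Numerically stable sigmoid for both signs of the score.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


# Hypothetical weights, for illustration only.
weights = {"bias": -4.0, "hour=20": 0.3, "title_word=pizza": 0.5}
features = extract_features(hour=20, region="CA", ad_title="Fresh Pizza Delivery")
print(f"predicted click-through rate: {predict_ctr(features, weights):.4f}")
```

Features the model has never seen simply get a weight of zero here, which is one reason rare interactions are so hard to predict from features alone.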

"But after the most obvious features are extracted, the question becomes: Is there still some structure left in the data or is it more like salt and pepper?" asks Agarwal.

The tool lets groups see quickly and easily whether their data does contain additional structure. It works by clustering the data into new and sensible groupings, teasing additional intelligence out of the residual data without anyone having to manually plug thousands of extra features into the algorithm.
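One simple way to check for leftover structure is to compare observed clicks with the clicks a baseline model expected, aggregated over blocks of grouped pages and grouped ads. The sketch below assumes the groupings are already given, whereas the tool described here discovers them on its own; the function name `block_residual_ratios` and the toy data are hypothetical. A ratio near 1 suggests the baseline has explained that block, while a ratio far from 1 suggests structure the baseline missed.

```python
from collections import defaultdict


def block_residual_ratios(observed, expected, page_cluster, ad_cluster):
    """Observed / expected clicks per (page cluster, ad cluster) block."""
    obs_sum = defaultdict(float)
    exp_sum = defaultdict(float)
    for (page, ad), exp_clicks in expected.items():
        block = (page_cluster[page], ad_cluster[ad])
        # Page-ad pairs with no recorded clicks count as zero observed clicks.
        obs_sum[block] += observed.get((page, ad), 0.0)
        exp_sum[block] += exp_clicks
    return {block: obs_sum[block] / total
            for block, total in exp_sum.items() if total > 0}


# Toy data: expected clicks come from some baseline model.
expected = {("p1", "a1"): 1.0, ("p2", "a1"): 1.0,
            ("p1", "a2"): 2.0, ("p2", "a2"): 2.0}
observed = {("p1", "a1"): 3.0, ("p2", "a1"): 1.0, ("p1", "a2"): 1.0}
page_cluster = {"p1": "local", "p2": "local"}
ad_cluster = {"a1": "food", "a2": "autos"}

for block, ratio in block_residual_ratios(
        observed, expected, page_cluster, ad_cluster).items():
    print(block, round(ratio, 2))
```

Pooling sparse cells into blocks is what makes the comparison meaningful: a single page-ad pair rarely has enough data to say anything, but a block of similar pairs often does.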

Ultimately, the tool can save a tremendous amount of time and effort while dramatically improving predictive models. "We can take output from any baseline model and modify them by finding structure in the residuals," concludes Agarwal. "You can now get much more out of your data by discovering these clusters in an unsupervised fashion."