Key Scientific Challenges, Entry #5: Web Information Management
On January 27 we announced
the kick-off of our 2010 Key Scientific Challenges Program
. To highlight the scientific challenge areas included in the program, we launched a series of guest blog posts earlier this month on Yodel Anecdotal
. Read our previous post on online advertising, “The Art and Science of Advertising
Another big challenge our Yahoo’s research scientists are continually examining is Web Information Management. In this entry, Brian Cooper from Yahoo Labs shares some thoughts on how Yahoo is driving research into information management and why it’s a fascinating field.
A Torrent of Data
Sites like Facebook and Twitter demonstrate that users really like to know what's going on with their friends. At the same time, the popularity of blog readers and personalized RSS aggregators like MyYahoo
and iGoogle show that users also like up-to-date news. Although social networking and content aggregation seem like different applications, at the core they share a key mechanism: collecting the most recent content from a set of producers, and distributing it to a set of consumers. In the case of social networking, producers write status updates or post links, and consumers (their friends) get a list of the updates. In a news aggregator, the producers are news sites or blogs, and the consumers are the readers that want the content.
The core mechanism that takes content from producers and redistributes it to users is actually quite hard to get right, and is an example of the really tough problems that fall under the Key Scientific Challenge area on Web Information Management
(and its sub-topic Data Management
The main problem is the multiplicative explosion of updates. Consider a popular user of a site like Twitter. Ashton Kutcher, for example, has over 4.5 million followers as of February 18, 2010
. Every time Ashton tweets, his words of wisdom have to be propagated to the feeds of all those millions of followers. Even if the average user has far less followers, their tweets may have to be propagated to hundreds of users or more. Suddenly, even a short 140-character message can consume a lot of server resources. It’s for this reason that Twitter notoriously found it difficult to scale and stay up at the same time. Some of the challenges in doing so have been described by Twitter engineers
At Yahoo, we’ve been aggregating and distributing content for a long time, and we’re increasingly distributing social updates too. Products such as Yahoo Mail and Messenger show you recent updates and activities by your friends on Yahoo. Plus, services like Yahoo Updates
provide tools for developers to feed a variety of different user activities and updates that happen on their sites into Yahoo’s products (and vice versa). As we build our infrastructure to support both content and user updates, we run into many of the same scaling problems that other companies have been reporting. To address these challenges, we took a step back and looked at the whole architecture needed to support these "user event feeds".
The Push or Pull Dilemma
I'd like to talk about one aspect in particular, as it directly addresses the intersection of building Internet-scale databases and dealing with social data. When distributing updates from producers to consumers, you have a fundamental choice: should you “push” updates to consumers, so that when a user logs in their feed is already constructed and ready to go? Or should you wait until a consumer logs in, and then “pull” updates from producers to construct the feed on the fly?
Pull ensures that we construct feeds only for users that actually log in, but it has a downside: the time a user spends waiting for their feed can be quite long because the system needs to collect and sort all of the updates from all of the people or news sources that the user follows. Many sites face this tradeoff between doing potentially unneeded work to always be ready on the one hand, and the strain of delivering on demand performance on the other. Digg recently reported
moving from a pull to a push model for one part of their site, resulting in a huge increase in the amount of "pushed" data that had to be stored, but a significant decrease in response time for users.
Threading the Needle
At Yahoo Labs, when we thought about this problem and tried to figure out whether push or pull was best, we realized something that should have been obvious all along: neither push nor pull is best.
Consider a producer named Alice that posts a new status update once an hour. We need to include her updates in the feeds we produce for her friends. Thus, when her friend Bob logs in, he should see some of Alice's recent events. But we only really need to show Bob some of his friends’ events, because there is only room on the screen to show him 10 at a time. Now imagine that Bob only logs in once a week. In that week, Alice may have produced 80 or more updates. If we pushed all 80 to Bob's feed, at least 70 will have fallen off the front page (replaced by newer Alice events or newer events from other friends) by the time Bob logs in, and all the resources (network bandwidth, disk I/O, CPU time) spent on pushing those 70 updates would be wasted. Multiply that waste by millions of users, and you can see where this is heading.
Alternatively, consider Carl, another one of Alice's friends. Carl loves to check his feed and updates it every five minutes (he’s an over-sharer). In this scenario, a push model makes more sense: rather than querying the database every five minutes for Alice’s updates, we should push Alice's events to a pre-computed page for Carl, and just serve him that same page over and over until a new Alice event arrives and updates the page. That way, Carl's experience is great for him AND no servers meltdown!
The key insight is that we should do push for some producer/consumer pairs and pull for others. Deciding whether to do push depends on the relative frequency of the producer's events
compared to the consumer's logins, as well as the relative cost of push versus pull. We can minimize resource usage by only pushing for producer/consumer pairs where the consumer logs in frequently enough that the cost of pushing, plus serving the pre-computed page, is less than the cost of retrieving the event repeatedly from the producer's queue.
It’s a challenging model to develop and put into place. It can be complicated by an almost endless array of variables, and the challenge of implementing this approach in the midst of hugely popular events, like Barack Obama's inauguration, or Michael Jackson's death, which caused huge surges in status updates, makes it even more difficult. It’s also what makes it utterly fascinating.
More details about this challenge, the techniques, arguments and the real-world experiments we’re developing at Yahoo Labs will be available in our upcoming SIGMOD paper, "Feeding Frenzy: Selectively Materializing Users' Event Feeds" written by Adam Silberstein, Brian Cooper and Raghu Ramakrishnan, as well as Jeffrey Terrace from Princeton, who was an intern here at Yahoo Labs last summer. You’ll be able to find the paper on the Yahoo Labs site later this year and take a look.