Internet use is a twenty-four-hour-a-day, worldwide activity. This puts incredible demands on the machines that support the network and its applications. When a system is down or not functioning correctly, users get frustrated and revenue can be lost. Making the system reliability challenge even more daunting is the fact that almost every request involves applications that run on multiple machines, frequently residing in multiple data centers. UC Berkeley Computer Science Professor Ion Stoica is trying to find ways to help us cope with the challenge of getting these distributed systems to perform continuously.
Stoica, who also founded startup Conviva, recently visited Yahoo as part of the Yahoo Labs Big Thinkers series. He gave a talk called “Continuous Profiling and Debugging in Distributed Systems.”
Stoica first discussed the explosion of the Internet Cloud and how users depend more and more on services such as email, instant messaging and even CRM systems for enterprise users. These applications need to perform reliably, because outages and failures can have damaging consequences. “Despite all the efforts to improve the availability and the performance, high-profile outages still happen, and they have a big impact,” said Stoica.
Stoica gave several examples of outages and failures that have hit some of the Internet giants, and the impact felt by each company. For example, in June 1999, eBay suffered a corrupt database that took 50+ engineers a full day to resolve, resulting in a 9.2% decline in its stock price for the day. “Outages are unpredictable and hard to fix,” said Stoica, who went on to review several other examples of failures in large, complex systems, such as the 2007 bridge collapse in Minneapolis, the 2003 power outages in the Northeast, and various airplane crashes.
What do people do? Stoica stressed the lack of a simple solution. The first step is to avoid failures, but doing so usually involves heavy and expensive processes: sophisticated modeling, careful design and review, and the implementation of safety standards. Another step is to understand failures, which can include extensive testing, investigation and recording (such as an airplane’s “black box” – “which is actually orange,” noted Stoica).
An airplane’s black box records the cockpit voice and radio communication, along with various flight parameters. Stoica compared this recording function to debugging. But a black box can also measure material degradation and engine performance, a function he referred to as profiling.
In the case of airplanes, it’s possible to build and test a small number of them, but according to Stoica, large distributed systems, such as thousands of servers hosted at various data centers, are much more complex. And these systems evolve continuously in ways that airplanes do not. “How do you test something you’re going to deploy at your biggest data center?” asked Stoica. There is also the issue of 24x7 availability – systems can’t really be taken offline for testing or debugging.
What can be done is to continuously log all system events, hope to catch rare and unpredictable bugs, and use the logs to reproduce bugs. “Ideally, log everything,” said Stoica. “Get around the ‘you don’t know what you don’t know’ syndrome.” Stoica introduced X-Trace and Output Deterministic Replaying as systems exemplifying his point.
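The idea of logging events so that a failure can be reproduced later can be sketched in a few lines of Python. This is only a generic record-and-replay illustration, not the mechanism of either system named above; the `Recorder`, `Replayer` and `handle_requests` names are hypothetical and exist only for this example.

```python
import random


class Recorder:
    """Runs live and appends every nondeterministic value to a log."""

    def __init__(self, source):
        self.source = source
        self.log = []

    def next_value(self):
        value = self.source()
        self.log.append(value)
        return value


class Replayer:
    """Feeds logged values back in order instead of calling a live source."""

    def __init__(self, log):
        self._values = iter(log)

    def next_value(self):
        return next(self._values)


def handle_requests(env, count):
    """Hypothetical service logic that fails on a rare input value."""
    results = []
    for _ in range(count):
        value = env.next_value()
        results.append("error" if value % 97 == 0 else "ok")
    return results


# Live run: inputs are random, so any failure is unpredictable.
recorder = Recorder(lambda: random.randint(1, 1000))
live_results = handle_requests(recorder, 500)

# Replay run: the logged inputs reproduce the live run exactly.
replayed_results = handle_requests(Replayer(recorder.log), 500)
```

Because every nondeterministic input was captured, the replayed run matches the live one step for step, including any rare errors. The cost is the one Stoica pointed to: the log grows with every event.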
X-Trace is a framework for path-based tracing. It annotates network requests with metadata that is carried along as a request moves through the network, so that the path the request took can be reconstructed afterward. By tracing the path of requests, the circumstances under which bugs arise can be identified. Output Deterministic Replaying addresses the other half of the problem: reproducing a failed run from what was logged so that it can be debugged.
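As a rough illustration of path-based tracing, the Python sketch below attaches a small piece of metadata to a request, passes it along at every hop, and later rebuilds the paths from the logged events. It is a minimal sketch under assumed names (`new_trace`, `TraceLog`, the `trace_id` and `parent_id` fields, and the service names are all hypothetical) and does not reflect the actual metadata format or API of X-Trace.

```python
import uuid
from collections import defaultdict


def new_trace():
    """Create metadata for a request entering the system."""
    return {"trace_id": uuid.uuid4().hex, "parent_id": None}


class TraceLog:
    """Collects one event per hop and rebuilds request paths from them."""

    def __init__(self):
        self.events = []
        self._counter = 0

    def record(self, metadata, service):
        """Log one hop and return the metadata to pass downstream."""
        self._counter += 1
        self.events.append({
            "trace_id": metadata["trace_id"],
            "event_id": self._counter,
            "parent_id": metadata["parent_id"],
            "service": service,
        })
        return {"trace_id": metadata["trace_id"], "parent_id": self._counter}

    def paths(self, trace_id):
        """Return every root-to-leaf path of services for one request."""
        children = defaultdict(list)
        names = {}
        for event in self.events:
            if event["trace_id"] != trace_id:
                continue
            names[event["event_id"]] = event["service"]
            children[event["parent_id"]].append(event["event_id"])

        def walk(event_id, prefix):
            prefix = prefix + [names[event_id]]
            kids = children.get(event_id, [])
            if not kids:
                return [prefix]
            found = []
            for kid in kids:
                found.extend(walk(kid, prefix))
            return found

        found = []
        for root in children.get(None, []):
            found.extend(walk(root, []))
        return found


log = TraceLog()
request = new_trace()
at_frontend = log.record(request, "frontend")
log.record(at_frontend, "auth")
at_storage = log.record(at_frontend, "storage")
log.record(at_storage, "database")
paths = log.paths(request["trace_id"])
```

Here the frontend fans out to two services, so `paths` holds two routes: one ending at `auth` and one passing through `storage` to `database`. Each service only needs to forward the metadata it received, which is why this style of tracing requires instrumenting the application.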
Stoica concluded his talk by discussing the huge research opportunity in ensuring the performance of distributed services and applications. Continuous logging generates a huge amount of data and processing the logs can be daunting. Retrieving the desired logs is usually more expensive than storing them – and deciding what logs to retrieve poses yet another problem. There is also a danger in assuming that all logs are complete, which isn’t always the case given administrator apathy or network failures.
Stoica stressed the need for robust algorithms to extract as much information as possible from these logs.
Lastly, many of the logging and tracing systems, such as X-Trace, require instrumentation of the application, which should be as transparent as possible. “For the future, there should be focus on developing new applications and protocols with tracing in mind,” said Stoica.