Archive for the ‘Tracing’ Category

Distributed Systems Tracing with Zipkin [Sampling @ Twitter w/ UI]

Saturday, June 9th, 2012

Distributed Systems Tracing with Zipkin

From the post:

Zipkin is a distributed tracing system that we created to help us gather timing data for all the disparate services involved in managing a request to the Twitter API. As an analogy, think of it as a performance profiler, like Firebug, but tailored for a website backend instead of a browser. In short, it makes Twitter faster. Today we’re open sourcing Zipkin under the APLv2 license to share a useful piece of our infrastructure with the open source community and gather feedback.

Hmmm, tracing based on the Dapper paper that comes with a web-based UI for a number of requests. Hard to beat that!

Thinking more about the sampling issue, what if I were to sample a very large stream of proxies and decided to only merge a certain percentage and pipe the rest to /dev/null?

For example, I have an UPI feed and that is my base set of “news” proxies. I have feeds from the various newspaper, radio and TV outlets around the United States. If the proxies from the non-UPI feeds are without some distance of the UPI feed proxies, they are simply discarded.

True, I am losing the information of which newspapers carried the stories, whose bylines consisted of changing the order of the words or dumbing them down, but those may not fall under my requirements.

I would rather than a few dozen very good sources than say 70,000 sources that say the same thing.

If you were testing for news coverage or the spread of news stories, your requirements might be different.

I first saw this at Alex Popescu’s myNoSQL.

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure [Data Sampling Lessons For “Big Data”]

Saturday, June 9th, 2012

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure by Benjamin H. Sigelman, Luiz Andr´e Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag.


Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment.

Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie [3] and X-Trace [12], but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.

The main goal of this paper is to report on our experience building, deploying and using the system for over two years, since Dapper’s foremost measure of success has been its usefulness to developer and operations teams. Dapper began as a self-contained tracing tool but evolved into a monitoring platform which has enabled the creation of many different tools, some of which were not anticipated by its designers. We describe a few of the analysis tools that have been built using Dapper, share statistics about its usage within Google, present some example use cases, and discuss lessons learned so far.

A very important paper for anyone working with large and complex systems.

With lessons on data sampling as well:

we have found that a sample of just one out of thousands of requests provides sufficient information for many common uses of the tracing data.

You have to wonder in “data in the petabyte range” cases, how many of them could be reduced to gigabyte (or smaller) size with no loss in accuracy?

Which would reduce storage requirements, increase analysis speed, increase the complexity of analysis, etc.

Have you sampled your “big data” recently?

I first saw this at Alex Popescu’s myNoSQL.