From the post:
Zipkin is a distributed tracing system that we created to help us gather timing data for all the disparate services involved in managing a request to the Twitter API. As an analogy, think of it as a performance profiler, like Firebug, but tailored for a website backend instead of a browser. In short, it makes Twitter faster. Today we’re open sourcing Zipkin under the APLv2 license to share a useful piece of our infrastructure with the open source community and gather feedback.
Hmmm, tracing based on the Dapper paper that comes with a web-based UI for a number of requests. Hard to beat that!
Thinking more about the sampling issue, what if I were to sample a very large stream of proxies and decided to only merge a certain percentage and pipe the rest to /dev/null?
For example, I have an UPI feed and that is my base set of “news” proxies. I have feeds from the various newspaper, radio and TV outlets around the United States. If the proxies from the non-UPI feeds are without some distance of the UPI feed proxies, they are simply discarded.
True, I am losing the information of which newspapers carried the stories, whose bylines consisted of changing the order of the words or dumbing them down, but those may not fall under my requirements.
I would rather than a few dozen very good sources than say 70,000 sources that say the same thing.
If you were testing for news coverage or the spread of news stories, your requirements might be different.
I first saw this at Alex Popescu’s myNoSQL.