Archive for the ‘Hydra’ Category

Hadoop Alternative Hydra Re-Spawns as Open Source

Sunday, March 16th, 2014

Hadoop Alternative Hydra Re-Spawns as Open Source by Alex Woodie.

From the post:

It may not have the name recognition or momentum of Hadoop. But Hydra, the distributed task processing system first developed six years ago by the social bookmarking service maker AddThis, is now available under an open source Apache license, just like Hadoop. And according to Hydra’s creator, the multi-headed platform is very good at some big data tasks that the yellow pachyderm struggles with–namely real-time processing of very big data sets.

Hydra is a big data storage and processing platform developed by Matt Abrams and his colleagues at AddThis (formerly Clearspring), the company that develops the Web server widgets that allow visitors to easily share something via their Twitter, Facebook, Pintrest, Google+, or Instagram accounts.

When AddThis started scaling up its business in the mid-2000s, it got flooded with data about what users were sharing. The company needed a scalable, distributed system that could deliver real-time analysis of that data to its customers. Hadoop wasn’t a feasible option at that time. So it built Hydra instead.

So, what is Hydra? In short, it’s a distributed task processing system that supports streaming and batch operations. It utilizes a tree-based data structure to store and process data across clusters with thousands of individual nodes. It features a Linux-based file system, which makes it compatible with ext3, ext4, or even ZFS. It also features a job/cluster management component that automatically allocates new jobs to the cluster and rebalance existing jobs. The system automatically replicates data and handles node failures automatically.

The tree-based structure allows it to handle streaming and batch jobs at the same time. In his January 23 blog post announcing that Hydra is now open source, Chris Burroughs, a member of AddThis’ engineering department, provided this useful description of Hydra: “It ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries).”

To learn a lot more about Hydra, see its GitHub page.

Another candidate for “real-time processing of very big data sets.”

The reflex is to applaud improvements in processing speed. But what sort of problems require that kind of speed? I know the usual suspects, modeling the weather, nuclear explosions, chemical reactions, but at some point, the processing ends and a human reader has to comprehend the results.

Better to get information to the human reader sooner rather than later, but there is a limit to the speed at which a user can understand the results of a computational process.

From a UI perspective, what research is there on how fast/slow information should be pushed at a user?

Could make the difference between an app that is just annoying and one that is truly useful.

I first saw this in a tweet by Joe Crobak.