Data distillation with Hadoop and R by David Smith.
From the post:
We’re definitely in the age of Big Data: today, there are many more sources of data readily available to us to analyze than there were even a couple of years ago. But what about extracting useful information from novel data streams that are often noisy and minutely transactional … aye, there’s the rub.
One of the great things about Hadoop is that it offers a reliable, inexpensive and relatively simple framework for capturing and storing data streams that just a few years ago we would have let slip through our grasp. It doesn’t matter what format the data comes in: without having to worry about schemas or tables, you can just dump unformatted text (chat logs, tweets, email), device “exhaust” (binary, text or XML packets), flat data files, network traffic packets … all can be stored in HDFS pretty easily. The tricky bit is making sense of all this unstructured data: the downside to not having a schema is that you can’t simply make an SQL-style query to extract a ready-to-analyze table. That’s where Map-Reduce comes in.
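To make the Map-Reduce idea concrete, here is a minimal sketch in plain Python (no Hadoop involved) of how a map step and a reduce step can turn noisy, schema-less text into a small ready-to-analyze table. The chat-log line format, the sample data, and the function names (map_line, shuffle, reduce_counts, distill) are all assumptions made for illustration; they are not from the original post or from any Hadoop API.

```python
from collections import defaultdict

# Hypothetical raw chat-log lines; the format is an assumption for this sketch.
RAW_LINES = [
    "2012-03-01 10:00:01 alice: hello there",
    "garbage line with no structure",
    "2012-03-01 10:00:05 bob: hi",
    "2012-03-01 10:01:10 alice: how are you?",
]


def map_line(line):
    """Map step: emit (user, 1) for a parseable line, nothing for noise."""
    parts = line.split(" ", 3)
    # Expect: date, time, "user:", message. Anything else is treated as noise.
    if len(parts) < 4 or not parts[2].endswith(":"):
        return []
    return [(parts[2][:-1], 1)]


def shuffle(pairs):
    """Group the emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single table row."""
    return (key, sum(values))


def distill(lines):
    """Run map, shuffle, and reduce in sequence and return sorted rows."""
    pairs = [pair for line in lines for pair in map_line(line)]
    groups = shuffle(pairs)
    return sorted(reduce_counts(key, values) for key, values in groups.items())


table = distill(RAW_LINES)
for user, message_count in table:
    print(user, message_count)
```

The result is a list of (user, message count) rows, which is the kind of structured table that could then be handed to R for analysis. In a real Hadoop job the map and reduce functions would run in parallel across the cluster, and the grouping done here by shuffle would be handled by the framework.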
Think of unstructured data in Hadoop as being a bit like crude oil: it’s a valuable raw material, but before you can extract useful gasoline from Brent Sweet Light Crude or Dubai Sour Crude you have to put it through a distillation process in a refinery to remove impurities, and extract the useful hydrocarbons.
I may find this a useful metaphor because I grew up in Louisiana, where land-based oil wells were abundant and there was an oil refinery only a couple of miles from my home.
Not a metaphor that will work for everyone, but one you should keep in mind.