Hadoop: Is There a Metadata Mess Waiting in the Wings? by Robin Bloor.
From the post:
Why is Hadoop so popular? There are many reasons. First of all it is not so much a product as an ecosystem, with many components: MapReduce, HBase, HCatalog, Pig, Hive, Sqoop, Mahout and quite a few more. That makes it versatile, and all these components are open source, so most of them improve with each release cycle.
But, as far as I can tell, the most important feature of Hadoop is its file system: HDFS. This has two notable features: it is a key-value store, and it is built for scale-out use. The IT industry seemed to have forgotten about key-value stores. They used to be called ISAM files and came with every operating system until Unix, then Windows and Linux took over. These operating systems didn’t provide general purpose key-value stores, and nobody seemed to care much because there was a plethora of databases that you could use to store data, and there were even inexpensive open source ones. So, that seemed to take care of the data layer.
But it didn’t. The convenience of a key-value store is that you can put anything you want into it as long as you can think of a suitable index for it, and that is usually a simple choice. With a database you have to create a catalog or schema to identify what’s in every table. And, if you are going to use the data coherently, you have to model the data and determine what tables to hold and what attributes are in each table. This puts a delay into importing data from new sources into the database.
Now you can, if you want, treat a database table as a key-value store and define only the index. But that is regarded as bad practice, and it usually is. Add to this the fact that the legacy databases were never built to scale out and you quickly conclude that Hadoop can do something that a database cannot. It can become a data lake – a vast and very scalable data staging area that will accommodate any data you want, no matter how “unstructured” it is.
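To make the contrast concrete, here is a minimal sketch of what "define only the index" looks like in practice, using the HBase client API. The table name, row-key layout, and column names are hypothetical, not taken from Robin's post; the point is only that the row key and a column family are fixed up front, while each record can carry whatever columns it happens to have. A relational load of the same data would first require modeling a schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ClickstreamPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("clicks"))) {   // hypothetical table
            // The only "schema" fixed in advance: a row key and the column family "raw".
            Put p = new Put(Bytes.toBytes("user42#2014-07-01T12:00:00Z"));
            p.addColumn(Bytes.toBytes("raw"), Bytes.toBytes("url"), Bytes.toBytes("/home"));
            p.addColumn(Bytes.toBytes("raw"), Bytes.toBytes("referrer"), Bytes.toBytes("news.example.com"));
            table.put(p);
            // A database would insist on CREATE TABLE clicks (user_id ..., ts ..., url ..., referrer ...)
            // before the first row could be loaded.
        }
    }
}
```

That flexibility is exactly what makes the "data lake" possible, and also what defers every question about what the data means until later.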
I rather like that imagery: unadorned Hadoop as a "data lake."

But the data sitting in that lake isn't the only undocumented data in a Hadoop ecosystem.
What about the Pig scripts? The MapReduce routines? Or Mahout, Hive, HBase, etc., etc.?

Do you think all the other members of the Hadoop ecosystem also have undocumented data? And undocumented variables of their own?
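As an illustration of what that undocumented data looks like inside those routines, here is a sketch of a MapReduce mapper over a hypothetical tab-delimited sales file (the file, its field positions, and the class name are my assumptions, not anything Robin describes). The input's schema, which column holds the region, which holds the amount, what the delimiter is, exists only inside this code, where no catalog will ever see it.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The "schema" of the input file lives only here, as field positions and
// implicit types buried in the mapper; nothing in HDFS records it.
public class SalesMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        String region = fields[2];                      // field 3 assumed to be a region code
        double amount = Double.parseDouble(fields[5]);  // field 6 assumed to be a sale amount
        context.write(new Text(region), new DoubleWritable(amount));
    }
}
```

Multiply that by every Pig script and Hive query running against the same files and the scale of the metadata problem starts to come into focus.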
When Robin mentions Revelytix as having a solution, I assume he means Loom.
Looking at Loom, ask yourself how well it documents the other parts of the Hadoop ecosystem.

Robin has isolated a weakness in the current Hadoop ecosystem, one that will make itself known suddenly and without warning.
Will you be ready?