Hadoop and Metadata (Removing the Impedance Mis-match) by Alan Gates, Russell Jurney.
From the post:
Apache Hadoop enables a revolution in how organizations process data. The freedom and scale Hadoop provides enable new kinds of applications that build new kinds of value and deliver results from big data on shorter timelines than ever before. The shift toward a Hadoop-centric mode of data processing in the enterprise has, however, posed a challenge: how do we collaborate in the context of the freedom that Hadoop provides us? How do we share data that can be stored and processed in any format the user desires? Furthermore, how do we integrate different tools with each other, and with the other systems that make up the data center as a computer?
For Hadoop users, the need for a metadata directory is clear. Users don’t want to ‘reinvent the wheel’ and repeat the work of others. They want to share results and intermediate data sets and collaborate with colleagues. Given these needs, the case for a generic metadata mechanism on top of Hadoop is easy to make: registering data assets with a metadata registry makes them visible for discovery and sharing, which increases efficiency. Less work for the user.
Users also want to be able to use different tool sets and systems together – Hadoop and non-Hadoop alike. On today’s Hadoop cluster there is a clear need for interoperability among diverse tools – Hive, Pig, Cascading, Java MapReduce, and streaming Python, C/C++, Perl, and Ruby – with data stored in formats ranging from CSV, TSV, Thrift, Protobuf, Avro, SequenceFiles, and Hive’s RCFile to proprietary formats.
Finally, raw data does not usually originate on the Hadoop Distributed Filesystem (HDFS). There is a clear need for a central point at which to register resources from different kinds of systems for ETL onto HDFS, and to publish the results of analyses on Hadoop to other systems.
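To make the “register data assets for discovery and sharing” idea from the excerpt concrete, here is a minimal, purely illustrative sketch. It is not HCatalog’s actual API; the `MetadataRegistry` and `DatasetEntry` names, the column names, and the HDFS path are all invented for this example.

```python
# Hypothetical sketch only -- NOT the HCatalog API, just an illustration
# of what "registering a data asset for discovery" means.
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    name: str          # logical name colleagues search for
    location: str      # e.g. an HDFS path
    data_format: str   # e.g. "tsv", "avro", "rcfile"
    schema: dict       # column name -> type
    owner: str


class MetadataRegistry:
    """A toy in-memory registry of shared data sets."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Publish a data set so others can discover and reuse it."""
        self._entries[entry.name] = entry

    def lookup(self, name):
        """Find a previously published data set by name."""
        return self._entries[name]


# Analyst A publishes an intermediate result once...
registry = MetadataRegistry()
registry.register(DatasetEntry(
    name="web_logs_cleaned",
    location="hdfs:///data/web_logs/cleaned/2012-05-01",
    data_format="tsv",
    schema={"ip": "string", "url": "string", "bytes": "bigint"},
    owner="analyst_a",
))

# ...and Analyst B discovers it instead of repeating the cleaning job.
entry = registry.lookup("web_logs_cleaned")
print(entry.location, entry.schema)
```

The interoperability point is easy to see from the streaming side as well. A Hadoop Streaming mapper sees only raw lines on stdin, so it can consume the same records as Hive or Pig only if everyone agrees on the column layout – exactly the schema a shared metadata layer would supply. The sketch below hard-codes hypothetical column names purely for illustration.

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper. Streaming code reads raw
# tab-separated lines from stdin; the column layout below is
# hypothetical and hard-coded here, whereas with a metadata registry
# it would be looked up rather than baked into every script.
import sys

COLUMNS = ["ip", "url", "bytes"]

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        continue  # skip malformed records
    record = dict(zip(COLUMNS, fields))
    # Emit (url, bytes) pairs for a downstream reducer to aggregate.
    print("%s\t%s" % (record["url"], record["bytes"]))
```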
Sounds topic mappish, doesn’t it?
Marketable HCatalog data products anyone?
I first saw this at Hortonworks.