Introducing Morphlines: The Easy Way to Build and Integrate ETL Apps for Hadoop by Wolfgang Hoschek.
From the post:
Cloudera Morphlines is a new open source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.
A “morphline” is a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any kind of data source, processes the data, and loads the results into a Hadoop component. It replaces Java programming with simple configuration steps, and correspondingly reduces the cost and integration effort associated with developing, maintaining, or integrating custom ETL projects.
Morphlines is a library, embeddable in any Java codebase. A morphline is an in-memory container of transformation commands. Commands are plugins to a morphline that perform tasks such as loading, parsing, transforming, or otherwise processing a single record. A record is an in-memory data structure of name-value pairs with optional blob attachments or POJO attachments. The framework is extensible and integrates existing functionality and third-party systems in a simple and straightforward manner.
The Morphlines library was developed as part of Cloudera Search. It powers a variety of ETL data flows from Apache Flume and MapReduce into Solr. Flume covers the real time case, whereas MapReduce covers the batch processing case.
Since the launch of Cloudera Search, Morphlines development has graduated into the Cloudera Development Kit (CDK) in order to make the technology accessible to a wider range of users, contributors, integrators, and products beyond Search. The CDK is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem (and hence a perfect home for Morphlines). The CDK is hosted on GitHub and encourages involvement by the community.
(…)
The sidebar promises: Morphlines replaces Java programming with simple configuration steps, reducing the cost and effort of doing custom ETL.
Sounds great!
But how do I search one or more morphlines for the semantics of the records/fields that are being processed or the semantics of that processing?
If I want to save “cost and effort,” shouldn’t I be able to search for existing morphlines that have transformed particular records/fields?
True, morphlines have “#” comments, but that seems like a poor way to document transformations.
How would you test for field documentation?
Or make sure transformations of particular fields always use the same semantics?
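To make the first question concrete, here is a minimal sketch of what a "test for field documentation" could look like if "#" comments are all we have. Everything in it is an assumption for illustration: the sample morphline fragment, the source_system field, the helper name undocumented_commands, and the convention that each command starts on its own line as "{ name {". It is not part of Morphlines or the CDK.

```python
import re

# Illustrative morphline fragment (HOCON). The field values and comments are
# made up for this sketch; only the "#" comment syntax comes from the post.
SAMPLE_MORPHLINE = """
morphlines : [
  {
    id : example
    commands : [
      # Read each input line as UTF-8 text.
      { readLine { charset : UTF-8 } }

      { addValues { source_system : "billing" } }

      # Log the record for debugging.
      { logInfo { format : "record: {}", args : ["@{}"] } }
    ]
  }
]
"""

# Assumption: each command starts on its own line as "{ name {".
COMMAND_START = re.compile(r"^\s*\{\s*(\w+)\s*\{")


def undocumented_commands(morphline_text):
    """Return names of commands whose nearest preceding non-blank line
    is not a '#' comment."""
    missing = []
    previous = ""  # last non-blank line seen, stripped
    for line in morphline_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not break the comment/command pairing
        match = COMMAND_START.match(line)
        if match and not previous.startswith("#"):
            missing.append(match.group(1))
        previous = stripped
    return missing


print(undocumented_commands(SAMPLE_MORPHLINE))
```

On the sample above, the check flags only addValues. Notice what it cannot do: it confirms that a comment exists, not that the comment says anything about what source_system means or whether another morphline uses the same field the same way.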
Ponder those questions while you are reading:
Cloudera Morphlines Reference Guide
and,
Syntax – HOCON GitHub page.
If we don’t capture semantics at the point of authoring, subsequent searches are mechanized guessing.