How-to: Process Data using Morphlines (in Kite SDK)

Friday, April 11th, 2014

How-to: Process Data using Morphlines (in Kite SDK) by Janos Matyas.

From the post:

SequenceIQ has an Apache Hadoop-based platform and API that consume and ingest various types of data from different sources to offer predictive analytics and actionable insights. Our datasets are structured, unstructured, log files, and communication records, and they require constant refining, cleaning, and transformation.

These datasets come from different sources (industry-standard and proprietary adapters, Apache Flume, MQTT, iBeacon, and so on), so we need a flexible, embeddable framework to support our ETL process chain. Hello, Morphlines! (As you may know, originally the Morphlines library was developed as part of Cloudera Search; eventually, it graduated into the Kite SDK as a general-purpose framework.)

To define a Morphline transformation chain, you need to describe the steps in a configuration file, and the framework will then turn [it] into an in-memory container for transformation commands. Commands perform tasks such as transforming, loading, parsing, and processing records, and they can be linked in a processing chain.

In this blog post, I’ll demonstrate such an ETL process chain containing custom Morphlines commands (defined via config file and Java), and use the framework within MapReduce jobs and Flume. For the sample ETL with Morphlines use case, we have picked a publicly available “million song” dataset from Last.fm. The raw data consist of one JSON file/entry for each track; the dictionary contains the following keywords:

A welcome demonstration of Morphlines, but I do wonder about the statement:

Our datasets are structured, unstructured, log files, and communication records, and they require constant refining, cleaning, and transformation. (Emphasis added.)

If you don’t have experience with S3 and this pipeline, it is a good starting point for your investigations.

Introducing Morphlines:…

Friday, July 12th, 2013

Introducing Morphlines: The Easy Way to Build and Integrate ETL Apps for Hadoop by Wolfgang Hoschek.

From the post:

Cloudera Morphlines is a new open source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.

A “morphline” is a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any kind of data source, processes the data, and loads the results into a Hadoop component. It replaces Java programming with simple configuration steps, and correspondingly reduces the cost and integration effort associated with developing, maintaining, or integrating custom ETL projects.

Morphlines is a library, embeddable in any Java codebase. A morphline is an in-memory container of transformation commands. Commands are plugins to a morphline that perform tasks such as loading, parsing, transforming, or otherwise processing a single record. A record is an in-memory data structure of name-value pairs with optional blob attachments or POJO attachments. The framework is extensible and integrates existing functionality and third-party systems in a simple and straightforward manner.

The Morphlines library was developed as part of Cloudera Search. It powers a variety of ETL data flows from Apache Flume and MapReduce into Solr. Flume covers the real time case, whereas MapReduce covers the batch processing case.

Since the launch of Cloudera Search, Morphlines development has graduated into the Cloudera Development Kit (CDK) in order to make the technology accessible to a wider range of users, contributors, integrators, and products beyond Search. The CDK is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem (and hence a perfect home for Morphlines). The CDK is hosted on GitHub and encourages involvement by the community.

(…)
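To make the quoted description concrete, here is a minimal Python sketch of the idea: a record as name-value pairs, passed through a chain of commands. It is an illustration only, not the Morphlines API (which is a Java library driven by a HOCON config file). Every name in it (`parse_pairs`, `require_field`, `lowercase_field`, `run_chain`, and the `message` field) is hypothetical.

```python
from typing import Callable, Dict, List, Optional

# A record: field name -> list of values (a simplified stand-in for a
# morphline record; attachments are left out of this sketch).
Record = Dict[str, List[str]]
# A command takes a record and returns a record, or None to drop it.
Command = Callable[[Record], Optional[Record]]


def parse_pairs(record: Record) -> Optional[Record]:
    """Parse 'k=v;k=v' text in the hypothetical 'message' field into fields."""
    out = {k: list(v) for k, v in record.items() if k != "message"}
    for message in record.get("message", []):
        for pair in message.split(";"):
            if "=" not in pair:
                continue  # skip fragments that are not name=value
            name, value = pair.split("=", 1)
            out.setdefault(name.strip(), []).append(value.strip())
    return out


def require_field(name: str) -> Command:
    """Build a command that drops records lacking a value for `name`."""
    def command(record: Record) -> Optional[Record]:
        return record if record.get(name) else None
    return command


def lowercase_field(name: str) -> Command:
    """Build a command that lowercases every value of `name`."""
    def command(record: Record) -> Optional[Record]:
        out = dict(record)
        out[name] = [v.lower() for v in record.get(name, [])]
        return out
    return command


def run_chain(commands: List[Command], record: Record) -> Optional[Record]:
    """Pass one record through each command in order; stop if it is dropped."""
    current: Optional[Record] = record
    for command in commands:
        current = command(current)
        if current is None:
            return None
    return current


# The "morphline": an ordered, in-memory list of commands.
chain = [parse_pairs, require_field("artist"), lowercase_field("artist")]

if __name__ == "__main__":
    print(run_chain(chain, {"message": ["artist=Low; title=Lullaby"]}))
    print(run_chain(chain, {"message": ["title=Untitled"]}))
```

Note that nothing in the chain itself says what `artist` means or why it is lowercased, which is the gap the questions below are about.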

The sidebar promises: Morphlines replaces Java programming with simple configuration steps, reducing the cost and effort of doing custom ETL.

Sounds great!

But how do I search one or more morphlines for the semantics of the records/fields that are being processed or the semantics of that processing?

If I want to save “cost and effort,” shouldn’t I be able to search for existing morphlines that have transformed particular records/fields?

True, morphlines have “#” comments but that seems like a poor way to document transformations.

How would you test for field documentation?

Or make sure transformations of particular fields always use the same semantics?

Ponder those questions while you are reading:

Cloudera Morphlines Reference Guide

and,

Syntax – HOCON GitHub page.

If we don’t capture semantics at the point of authoring, subsequent searches are mechanized guessing.