How-to: Process Data using Morphlines (in Kite SDK) by Janos Matyas.
From the post:
SequenceIQ has an Apache Hadoop-based platform and API that consume and ingest various types of data from different sources to offer predictive analytics and actionable insights. Our datasets are structured, unstructured, log files, and communication records, and they require constant refining, cleaning, and transformation.
These datasets come from different sources (industry-standard and proprietary adapters, Apache Flume, MQTT, iBeacon, and so on), so we need a flexible, embeddable framework to support our ETL process chain. Hello, Morphlines! (As you may know, originally the Morphlines library was developed as part of Cloudera Search; eventually, it graduated into the Kite SDK as a general-purpose framework.)
To define a Morphline transformation chain, you need to describe the steps in a configuration file, and the framework will then turn it into an in-memory container for transformation commands. Commands perform tasks such as transforming, loading, parsing, and processing records, and they can be linked in a processing chain.
In this blog post, I’ll demonstrate such an ETL process chain containing custom Morphlines commands (defined via config file and Java), and use the framework within MapReduce jobs and Flume. For the sample ETL with Morphlines use case, we have picked a publicly available “million song” dataset from Last.fm. The raw data consist of one JSON file/entry for each track; the dictionary contains the following keywords:
A welcome demonstration of Morphlines, but I do wonder about the statement:
Our datasets are structured, unstructured, log files, and communication records, and they require constant refining, cleaning, and transformation. (Emphasis added.)
If you don’t have experience with S3 and this pipeline, it is a good starting point for your investigations.
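
For readers who have not met the idea before, the "commands linked in a processing chain" described in the excerpt can be sketched in a few lines of plain Python. This is only a conceptual illustration, not the Kite SDK API: the helper names (`read_json`, `keep_fields`, `build_chain`), the `raw` field, and the `artist`/`title` keys are all assumptions made up for the example, and a real Morphline would be declared in its configuration file rather than built in code.

```python
import json
from typing import Callable, Dict, List, Optional

# A record is a plain dictionary; a command maps a record to a new record,
# or returns None to signal that the record should be dropped.
Record = Dict[str, object]
Command = Callable[[Record], Optional[Record]]


def read_json(field: str = "raw") -> Command:
    """Hypothetical command: parse the JSON text in `field` into record keys."""
    def run(record: Record) -> Optional[Record]:
        try:
            parsed = json.loads(str(record[field]))
        except (KeyError, ValueError):
            return None  # missing field or malformed JSON: drop the record
        if not isinstance(parsed, dict):
            return None  # only JSON objects can be merged into a record
        merged = {k: v for k, v in record.items() if k != field}
        merged.update(parsed)
        return merged
    return run


def keep_fields(*names: str) -> Command:
    """Hypothetical command: keep only the named keys of the record."""
    def run(record: Record) -> Optional[Record]:
        return {k: record[k] for k in names if k in record}
    return run


def build_chain(commands: List[Command]) -> Command:
    """Link commands so each one receives the previous command's output."""
    def run(record: Record) -> Optional[Record]:
        current: Optional[Record] = record
        for command in commands:
            current = command(current)
            if current is None:
                return None  # a command rejected the record; stop the chain
        return current
    return run


# The list plays the role of the configuration file: an ordered set of steps.
chain = build_chain([read_json("raw"), keep_fields("artist", "title")])
result = chain({"raw": '{"artist": "A", "title": "T", "timestamp": "x"}'})
print(result)
```

The point of the design is that each step knows nothing about its neighbours, so steps can be reordered or swapped by editing the list alone, which is roughly the flexibility the excerpt attributes to a configuration-driven chain.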