Define and Process Data Pipelines in Hadoop with Apache Falcon
From the webpage:
Apache Falcon simplifies the configuration of data motion with replication, lifecycle management, and lineage and traceability, providing consistent data governance across Hadoop components.
Scenario
In this tutorial we will walk through a scenario where email data lands hourly on a cluster. In our example:
- This cluster is the primary cluster located in the Oregon data center.
- Data arrives from all the West Coast production servers. The input data feeds are often late by up to 4 hours.
The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis.
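The late-arrival tolerance described above maps directly onto Falcon's feed entity definition. A minimal sketch of such a feed is shown below; the feed name, cluster name, paths, and dates are illustrative assumptions, not taken from the tutorial:

```xml
<feed name="rawEmailFeed" description="Raw email data landing hourly"
      xmlns="uri:falcon:feed:0.1">
    <!-- data lands hourly on the primary cluster -->
    <frequency>hours(1)</frequency>
    <!-- tolerate input arriving up to 4 hours late, per the scenario -->
    <late-arrival cut-off="hours(4)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <!-- lifecycle management: drop raw data after 90 days -->
            <retention limit="days(90)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <!-- hypothetical HDFS path; Falcon substitutes the time variables -->
        <location type="data"
                  path="/data/enron/input/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>
    <ACL owner="falcon" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>
```

A process entity that consumes this feed would then declare it as an input, and Falcon handles the retry and late-data reprocessing rather than the pipeline author.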
To simulate this scenario, we have a Pig script grabbing the freely available Enron emails from the internet and feeding them into the pipeline.
…
Not only is this a great tutorial on Falcon, it's also a great example of how to write a tutorial!