Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 11, 2015

Define and Process Data Pipelines in Hadoop with Apache Falcon

Filed under: Falcon,Hadoop,Pig — Patrick Durusau @ 9:49 am

Define and Process Data Pipelines in Hadoop with Apache Falcon

From the webpage:

Apache Falcon simplifies the configuration of data motion with: replication; lifecycle management; lineage and traceability. This provides data governance consistency across Hadoop components.

Scenario

In this tutorial we will walk through a scenario where email data lands hourly on a cluster. In our example:

  • This cluster is the primary cluster located in the Oregon data center.
  • Data arrives from all the West Coast production servers. The input data feeds are often late for up to 4 hrs.

The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis.

To simulate this scenario, we have a pig script grabbing the freely available Enron emails from the internet and feeding it into the pipeline.
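The cleaning step in the scenario (removing sensitive information such as credit card numbers from raw email text) can be sketched in a few lines. This is only an illustration in Python, not the tutorial's own code: the function names, the `[REDACTED]` mask, and the 13 to 16 digit pattern are all assumptions made for the example. A Luhn checksum is used to limit false positives on other long digit runs.

```python
import re

# Assumed pattern: 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scrub_card_numbers(text: str, mask: str = "[REDACTED]") -> str:
    """Replace candidate card numbers that pass the Luhn check with a mask."""
    def replace(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return mask if luhn_valid(digits) else match.group()

    return CARD_PATTERN.sub(replace, text)


if __name__ == "__main__":
    # 4111 1111 1111 1111 is a widely used test number, not a real card.
    print(scrub_card_numbers("card 4111 1111 1111 1111 on file"))
```

In a pipeline like the one described, a function of this kind would run on each hourly batch before the cleaned data is handed to the analysis team; how it is wired into Falcon is left to the tutorial itself.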

Not only a great tutorial on Falcon, this tutorial is a great example of writing a tutorial!

August 13, 2014

HDP 2.1 Tutorials

Filed under: Falcon,Hadoop,Hive,Hortonworks,Knox Gateway,Storm,Tez — Patrick Durusau @ 11:17 am

HDP 2.1 tutorials from Hortonworks:

  1. Securing your Data Lake Resource & Auditing User Access with HDP Security
  2. Searching Data with Apache Solr
  3. Define and Process Data Pipelines in Hadoop with Apache Falcon
  4. Interactive Query for Hadoop with Apache Hive on Apache Tez
  5. Processing streaming data in Hadoop with Apache Storm
  6. Securing your Hadoop Infrastructure with Apache Knox

The quality you have come to expect from Hortonworks tutorials, but the data sets are a bit dull.

What data sets would you suggest to spice up these tutorials?

April 2, 2014

Hortonworks Data Platform 2.1

Filed under: Apache Ambari,Falcon,Hadoop,Hadoop YARN,Hive,Hortonworks,Knox Gateway,Solr,Storm,Tez — Patrick Durusau @ 2:49 pm

Hortonworks Data Platform 2.1 by Jim Walker.

From the post:

The pace of innovation within the Apache Hadoop community is truly remarkable, enabling us to announce the availability of Hortonworks Data Platform 2.1, incorporating the very latest innovations from the Hadoop community in an integrated, tested, and completely open enterprise data platform.

A VM is available now, with full releases to follow later in April.

Just grabbing the headings from Jim’s post:

The Stinger Initiative: Apache Hive, Tez and YARN for Interactive Query

Data Governance with Apache Falcon

Security with Apache Knox

Stream Processing with Apache Storm

Searching Hadoop Data with Apache Solr

Advanced Operations with Apache Ambari

See Jim’s post for some of the details and the VM for others.

April 3, 2013

Project Falcon…

Filed under: Data Management,Falcon,Workflow — Patrick Durusau @ 9:16 am

Project Falcon: Tackling Hadoop Data Lifecycle Management via Community Driven Open Source by Venkatesh Seetharam.

From the post:

Today we are excited to see another example of the power of community at work as we highlight the newly approved Apache Software Foundation incubator project named Falcon. This incubation project was initiated by the team at InMobi together with engineers from Hortonworks. Falcon is useful to anyone building apps on Hadoop as it simplifies data management through the introduction of a data lifecycle management framework.

All About Falcon and Data Lifecycle Management

Falcon is a data lifecycle management framework for Apache Hadoop that enables users to configure, manage and orchestrate data motion, disaster recovery, and data retention workflows in support of business continuity and data governance use cases.

Falcon workflow

I am certain a topic map based workflow solution could be created.

However, using a solution being promoted by others removes one thing from the topic map “to do” list.

Not to mention giving topic maps an introduction to other communities.
