The Definitive “Getting Started” Tutorial for Apache Hadoop + Your Own Demo Cluster by Justin Kestelyn.
From the post:
…
Most Hadoop tutorials take a piecemeal approach: they either focus on one or two components, or at best a segment of the end-to-end process (just data ingestion, just batch processing, or just analytics). Furthermore, few if any provide a business context that makes the exercise pragmatic.
This new tutorial closes both gaps. It takes the reader through the complete Hadoop data lifecycle—from data ingestion through interactive data discovery—and does so while emphasizing the business questions at hand: What products do customers view on the Web, what do they like to buy, and is there a relationship between the two?
Answering those questions is something organizations with traditional infrastructure have been doing for years. The ones that have bought into Hadoop, however, do the same thing at greater scale, at lower cost, and on a single storage substrate (no ETL required) on which many other types of analysis can be run.
To learn how to do that, in this tutorial (and assuming you are using our sample dataset) you will:
- Load relational and clickstream data into HDFS (via Apache Sqoop and Apache Flume respectively)
- Use Apache Avro to serialize/prepare that data for analysis
- Create Apache Hive tables
- Query those tables using Hive or Impala (via the Hue GUI)
- Index the clickstream data using Flume, Cloudera Search, and Morphlines, and expose a search GUI for business users/analysts
…
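The ingestion and query steps quoted above can be sketched at the command line. This is a hypothetical outline, not the tutorial’s actual commands: the database host, credentials, table name (`orders`), and HDFS paths are all placeholder assumptions.

```shell
# Hypothetical sketch of the quoted workflow; hosts, credentials, table
# names, and paths are placeholders, not taken from the tutorial.

# 1. Ingest relational data into HDFS with Sqoop, serialized as Avro:
sqoop import \
  --connect jdbc:mysql://dbhost/retail \
  --username retail_user -P \
  --table orders \
  --as-avrodatafile \
  --target-dir /user/hive/warehouse/orders

# 2. Create a Hive table over the Avro files (readable by Hive and Impala):
hive -e "
CREATE EXTERNAL TABLE orders
  STORED AS AVRO
  LOCATION '/user/hive/warehouse/orders'
  TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/orders.avsc');
"

# 3. Query the table from Impala (via impala-shell rather than the Hue GUI):
impala-shell -q "SELECT COUNT(*) FROM orders;"
```

Clickstream ingestion via Flume and indexing with Cloudera Search/Morphlines are configuration-file driven rather than one-liners, so they are omitted here; the point is only that each bullet maps to a concrete tool invocation.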
I can’t imagine what “other” tutorials Justin has in mind. 😉
To be fair, I haven’t taken this particular tutorial. Which Hadoop tutorials would you suggest as comparisons? And how do you think this one stacks up against them?