Archive for the ‘MapReduce’ Category

Getting Started with Spark (in Python)

Sunday, April 26th, 2015

Getting Started with Spark (in Python) by Benjamin Bengfort.

From the post:

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (The Google File System), which is implemented as HDFS in Hadoop, and a framework for distributed computing (MapReduce).

These two ideas have been the prime drivers for the advent of scaling analytics, large scale machine learning, and other big data appliances for the last ten years! However, in technology terms, ten years is an incredibly long time, and there are some well-known limitations that exist, with MapReduce in particular. Notably, programming MapReduce is difficult. You have to chain Map and Reduce tasks together in multiple steps for most analytics. This has resulted in specialized systems for performing SQL-like computations or machine learning. Worse, MapReduce requires data to be serialized to disk between each step, which means that the I/O cost of a MapReduce job is high, making interactive analysis and iterative algorithms very expensive; and the thing is, almost all optimization and machine learning is iterative.

To address these problems, Hadoop has been moving to a more general resource management framework for computation, YARN (Yet Another Resource Negotiator). YARN implements the next generation of MapReduce, but also allows applications to leverage distributed resources without having to compute with MapReduce. By generalizing the management of the cluster, research has moved toward generalizations of distributed computation, expanding the ideas first imagined in MapReduce.

Spark is the first fast, general purpose distributed computing paradigm resulting from this shift and is gaining popularity rapidly. Spark extends the MapReduce model to support more types of computations using a functional programming paradigm, and it can cover a wide range of workflows that previously were implemented as specialized systems built on top of Hadoop. Spark uses in-memory caching to improve performance and, therefore, is fast enough to allow for interactive analysis (as though you were sitting on the Python interpreter, interacting with the cluster). Caching also improves the performance of iterative algorithms, which makes it great for data theoretic tasks, especially machine learning.

In this post we will first discuss how to set up Spark to start easily performing analytics, either simply on your local machine or in a cluster on EC2. We then will explore Spark at an introductory level, moving towards an understanding of what Spark is and how it works (hopefully motivating further exploration). In the last two sections we will start to interact with Spark on the command line and then demo how to write a Spark application in Python and submit it to the cluster as a Spark job.

Be forewarned, this post uses the “F” word (functional) to describe the programming paradigm of Spark. Just so you know. 😉

If you aren’t already using Spark, this is about as easy a learning curve as can be expected.


I first saw this in a tweet by DataMining.

Association Rule Mining – Not Your Typical Data Science Algorithm

Monday, March 23rd, 2015

Association Rule Mining – Not Your Typical Data Science Algorithm by Dr. Kirk Borne.

From the post:

Many machine learning algorithms that are used for data mining and data science work with numeric data. And many algorithms tend to be very mathematical (such as Support Vector Machines, which we previously discussed). But, association rule mining is perfect for categorical (non-numeric) data and it involves little more than simple counting! That’s the kind of algorithm that MapReduce is really good at, and it can also lead to some really interesting discoveries.

Association rule mining is primarily focused on finding frequent co-occurring associations among a collection of items. It is sometimes referred to as “Market Basket Analysis”, since that was the original application area of association mining. The goal is to find associations of items that occur together more often than you would expect from a random sampling of all possibilities. The classic example of this is the famous Beer and Diapers association that is often mentioned in data mining books. The story goes like this: men who go to the store to buy diapers will also tend to buy beer at the same time. Let us illustrate this with a simple example. Suppose that a store’s retail transactions database includes the following information:

If you aren’t familiar with association rule mining, I think you will find Dr. Borne’s post an entertaining introduction.

I would not go quite as far as Dr. Borne with “explanations” for the pop-tart purchases before hurricanes. For retail purposes, so long as we spot the pattern, they could be building dikes out of them. The same is the case for other purchases. Take advantage of the patterns and try to avoid second guessing consumers. You can read more about testing patterns Selling Blue Elephants.


Convolutional Neural Nets in Net#

Wednesday, March 11th, 2015

Convolutional Neural Nets in Net# by by Alexey Kamenev.

From the post:

After introducing Net# in the previous post, we continue with our overview of the language and examples of convolutional neural nets or convnets.

Convnets have become very popular in recent years as they consistently produce great results on hard problems in computer vision, automatic speech recognition and various natural language processing tasks. In most such problems, the features have some geometric relationship, like pixels in an image or samples in audio stream. An excellent introduction to convnets can be found here: (Lecture 5)

Before we start discussing convnets, let’s introduce one definition that is important to understand when working with Net#. In a neural net structure, each trainable layer (a hidden or an output layer) has one or more connection bundles. A connection bundle consists of a source layer and a specification of the connections from that source layer. All the connections in a given bundle share the same source layer and the same destination layer. In Net#, a connection bundle is considered as belonging to the bundle’s destination layer. Net# supports various kinds of bundles like fully connected, convolutional, pooling and so on. A layer might have multiple bundles which connect it to different source layers.

BTW, the previous post was: Neural Nets in Azure ML – Introduction to Net#. Not exactly what I was expecting by the Net# reference.

Machine Learning Blog needs to be added to your RSS feed.

If you need more information: Guide to Net# neural network specification language.


I first saw this in a tweet by Michael Cavaretta.

Po’ Boy MapReduce

Friday, February 27th, 2015


Posted by Mirko Krivanek as What Is MapReduce?, credit @Tgrall

Working with Small Files in Hadoop – Part 1, Part 2, Part 3

Wednesday, February 25th, 2015

Working with Small Files in Hadoop – Part 1, Part 2, Part 3 by Chris Deptula.

From the post:

Why do small files occur?

The small file problem is an issue Inquidia Consulting frequently sees on Hadoop projects. There are a variety of reasons why companies may have small files in Hadoop, including:

  • Companies are increasingly hungry for data to be available near real time, causing Hadoop ingestion processes to run every hour/day/week with only, say, 10MB of new data generated per period.
  • The source system generates thousands of small files which are copied directly into Hadoop without modification.
  • The configuration of MapReduce jobs using more than the necessary number of reducers, each outputting its own file. Along the same lines, if there is a skew in the data that causes the majority of the data to go to one reducer, then the remaining reducers will process very little data and produce small output files.

Does it sound like you have small files? If so, this series by Chris is what you are looking for.

Google open sources a MapReduce framework for C/C++

Monday, February 23rd, 2015

Google open sources a MapReduce framework for C/C++ by Derrick Harris,

From the post:

Google announced on Wednesday that the company is open sourcing a MapReduce framework that will let users run native C and C++ code in their Hadoop environments. Depending on how much traction MapReduce for C, or MR4C, gets and by whom, it could turn out to be a pretty big deal.

Hadoop is famously, or infamously, written in Java and as such can suffer from performance issues compared with native C++ code. That’s why Google’s original MapReduce system was written in C++, as is the Quantcast File System, that company’s homegrown alternative for the Hadoop Distributed File System. And, as the blog post announcing MR4C notes, “many software companies that deal with large datasets have built proprietary systems to execute native code in MapReduce frameworks.”

Great news but be aware that “performance” is a tricky issue. If “performance” had a single meaning, the TIOBE Index for February 2015 (a rough gauge of programming language popularity) to look quite different over the years.

I remember a conference story where a programmer had written an application using Python, reasoning that resource limitations would compel the client to return for a fuller, enterprise solution. To their chagrin, the customer never exhausted the potential of the first solution. 😉

Review of Large-Scale RDF Data Processing in MapReduce

Wednesday, January 7th, 2015

Review of Large-Scale RDF Data Processing in MapReduce by Ke Hou, Jing Zhang and Xing Fang.


Resource Description Framework (RDF) is an important data presenting standard of semantic web and how to process, the increasing RDF data is a key problem for development of semantic web. MapReduce is a widely-used parallel programming model which can provide a solution to large-scale RDF data processing. This study reviews the recent literatures on RDF data processing in MapReduce framework in aspects of the forward-chaining reasoning, the simple querying and the storage mode determined by the related querying method. Finally, it is proposed that the future research direction of RDF data processing should aim at the scalable, increasing and complex RDF data query.

I count twenty-nine (29) projects with two to three sentence summaries of each one. Great starting point for an in-depth review of RDF data processing using mapreduce.

I first saw this in a tweet by Marin Dimitrov.

Tesser: Another Level of Indirection

Saturday, December 6th, 2014

Tesser: Another Level of Indirection by Kyle Kingsbury.

Slides for Kyle’s presentation at

From the presentation description:

Clojure’s sequence library and the threading macro make lazy sequence operations like map, filter, and reduce composable, and their immutable semantics allow efficient re-use of intermediate results. Core.reducers combine multiple map, filter, takes, et al into a single *fold*, taking advantage of stream fusion–and in the upcoming Clojure 1.7, transducers abstract away the underlying collection entirely.

I’ve been working on concurrent folds, where we sacrifice some order in exchange for parallelism. Tesser generalizes reducers to a two-dimensional fold: concurrent reductions over independent chunks of a sequence, and a second reduction over those values. Higher-order fold combinators allow us to build up faceted data structures which compute many properties of a dataset in a single pass. The same fold can be run efficiently on multicore systems or transparently distributed–e.g. over Hadoop.

Heavy wading but definitely worth the effort.

BTW, how do you like the hand drawn slides? I ask because I am toying with the idea of a graphics tablet for the Linux box.

Announcing Apache Hadoop 2.6.0

Tuesday, December 2nd, 2014

Announcing Apache Hadoop 2.6.0 by Arun Murthy.

From the post:

It gives me great pleasure to announce that the Apache Hadoop community has released Apache Hadoop 2.6.0 !

In particular, we are excited about three major pieces in this release: heterogeneous storage in HDFS with SSD & Memory tiers, support for long-running services in YARN and rolling upgrades—the ability to upgrade your cluster software and restart upgraded nodes without taking the cluster down or losing work in progress. With YARN as its architectural center, Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it simultaneously in different ways.

Many thanks to all of the contributors and committers who collaborated on this version and resolved a total of nearly 900 JIRA issues across four areas:

  • Hadoop Common: 231 JIRAs resolved
  • Hadoop HDFS: 305 JIRAs resolved
  • Hadoop YARN: 290 JIRAs resolved
  • Hadoop MapReduce: 70 JIRAs resolved

Highlights for Apache Hadoop 2.6.0

Here are some details about the most important features. For the complete list of features, improvements and bug fixes, see the sidebar and the release notes.

The post includes a nifty PNG file that lists the major issues with what I think were intended to be links to JIRA issues. Unfortunately the links point to the PNG file. I suspect a missing map directive. I posted a comment on same and hopefully it will be fixed soon.

In the meantime, enjoy the new release of Hadoop! (And thank the many contributors to the project this holiday season!)

Massively Parallel Clustering: Overview

Tuesday, November 11th, 2014

Massively Parallel Clustering: Overview by Grigory Yaroslavtsev.

From the post:

Clustering is one of the main vechicles of machine learning and data analysis.
In this post I will describe how to make three very popular sequential clustering algorithms (k-means, single-linkage clustering and correlation clustering) work for big data. The first two algorithms can be used for clustering a collection of feature vectors in \(d\)-dimensional Euclidean space (like the two-dimensional set of points on the picture below, while they also work for high-dimensional data). The last one can be used for arbitrary objects as long as for any pair of them one can define some measure of similarity.

mapreduce clustering

Besides optimizing different objective functions these algorithms also give qualitatively different types of clusterings.
K-means produces a set of exactly k clusters. Single-linkage clustering gives a hierarchical partitioning of the data, which one can zoom into at different levels and get any desired number of clusters.
Finally, in correlation clustering the number of clusters is not known in advance and is chosen by the algorithm itself in order to optimize a certain objective function.

All algorithms described in this post use the model for massively parallel computation that I described before.

I thought you might be interested in parallel clustering algorithms after the post on OSM-France. Don’t skip model for massively parallel computation. It and the discussion that follows is rich in resources on parallel clustering. Lots of links.

I take heart from the line:

The last one [Correlation Clustering] can be used for arbitrary objects as long as for any pair of them one can define some measure of similarity.

The words “some measure of similarity” should be taken as a warning the any particular “measure of similarity” should be examined closely and tested against the data so processed. It could be that the “measure of similarity” produces a desired result on a particular data set. You won’t know until you look.

Best Map/Reduce Explanation

Wednesday, September 3rd, 2014

Michael Klishin tweeted today: “The best map/reduce explanation ever:

For your viewing pleasure:


It does have “side effects” though.

Is it lunch time yet?

Scalding Source Code

Thursday, June 12th, 2014

Programming MapReduce With Scalding

Source code for PACKT book: Programming MapReduce With Scalding (June 2014).

I haven’t seen the book and there are no sample chapters, yet.

Ping me if you post comments about it.


Lock and Load Hadoop

Sunday, May 4th, 2014

How to Load Data for Hadoop into the Hortonworks Sandbox


This tutorial describes how to load data into the Hortonworks sandbox.

The Hortonworks sandbox is a fully contained Hortonworks Data Platform (HDP) environment. The sandbox includes the core Hadoop components (HDFS and MapReduce), as well as all the tools needed for data ingestion and processing. You can access and analyze sandbox data with many Business Intelligence (BI) applications.

In this tutorial, we will load and review data for a fictitious web retail store in what has become an established use case for Hadoop: deriving insights from large data sources such as web logs. By combining web logs with more traditional customer data, we can better understand our customers, and also understand how to optimize future promotions and advertising.

“Big data” applications are fun to read about but aren’t really interesting until your data has been loaded.

If you don’t have the Hortonworks Sandbox you need to get it: Hortonworks Sandbox.

How-to: Process Data using Morphlines (in Kite SDK)

Friday, April 11th, 2014

How-to: Process Data using Morphlines (in Kite SDK) by Janos Matyas.

From the post:

SequenceIQ has an Apache Hadoop-based platform and API that consume and ingest various types of data from different sources to offer predictive analytics and actionable insights. Our datasets are structured, unstructured, log files, and communication records, and they require constant refining, cleaning, and transformation.

These datasets come from different sources (industry-standard and proprietary adapters, Apache Flume, MQTT, iBeacon, and so on), so we need a flexible, embeddable framework to support our ETL process chain. Hello, Morphlines! (As you may know, originally the Morphlines library was developed as part of Cloudera Search; eventually, it graduated into the Kite SDK as a general-purpose framework.)

To define a Morphline transformation chain, you need to describe the steps in a configuration file, and the framework will then turn into an in-memory container for transformation commands. Commands perform tasks such as transforming, loading, parsing, and processing records, and they can be linked in a processing chain.

In this blog post, I’ll demonstrate such an ETL process chain containing custom Morphlines commands (defined via config file and Java), and use the framework within MapReduce jobs and Flume. For the sample ETL with Morphlines use case, we have picked a publicly available “million song” dataset from The raw data consist of one JSON file/entry for each track; the dictionary contains the following keywords:

A welcome demonstration of Morphines but I do wonder about the statement:

Our datasets are structured, unstructured, log files, and communication records, and they require constant refining, cleaning, and transformation. (Emphasis added.)

If you don’t have experience with S3 and this pipleine, it is a good starting point for your investigations.

Apache Hadoop 2.4.0 Released!

Thursday, April 10th, 2014

Apache Hadoop 2.4.0 Released! by Arun Murthy.

From the post:

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.4.0! Thank you to every single one of the contributors, reviewers and testers!

Hadoop 2.4.0 continues that momentum, with additional enhancements to both HDFS & YARN:

  • Support for Access Control Lists in HDFS (HDFS-4685)
  • Native support for Rolling Upgrades in HDFS (HDFS-5535)
  • Smooth operational upgrades with protocol buffers for HDFS FSImage (HDFS-5698)
  • Full HTTPS support for HDFS (HDFS-5305)
  • Support for Automatic Failover of the YARN ResourceManager (YARN-149) (a.k.a Phase 1 of YARN ResourceManager High Availability)
  • Enhanced support for new applications on YARN with Application History Server (YARN-321) and Application Timeline Server (YARN-1530)
  • Support for strong SLAs in YARN CapacityScheduler via Preemption (YARN-185)

And of course:


See Arun’s post for more details or just jump to the downloads links.

Scalding 0.9: Get it while it’s hot!

Thursday, April 10th, 2014

Scalding 0.9: Get it while it’s hot! by P. Oscar Boykin.

From the post:

It’s been just over two years since we open sourced Scalding and today we are very excited to release the 0.9 version. Scalding at Twitter powers everything from internal and external facing dashboards, to custom relevance and ad targeting algorithms, including many graph algorithms such as PageRank, approximate user cosine similarity and many more.

Oscar covers:

  • Joins
  • Input/output
    • Parquet Format
    • Avro
    • TemplateTap
  • Hadoop counters
  • Typed API
  • Matrix API

Or if you want something a bit more visual and just as enthusiastic, see:

Basically the same content but with Oscar live!

Apache Mahout, “…Ya Gotta Hit The Road”

Thursday, March 27th, 2014

The news in Derrick Harris’ “Apache Mahout, Hadoop’s original machine learning project, is moving on from MapReduce” reminded of a line from Tommy, “Just as the gypsy queen must do, ya gotta hit the road.”

From the post:

Apache Mahout, a machine learning library for Hadoop since 2009, is joining the exodus away from MapReduce. The project’s community has decided to rework Mahout to support the increasingly popular Apache Spark in-memory data-processing framework, as well as the H2O engine for running machine learning and mathematical workloads at scale.

While data processing in Hadoop has traditionally been done using MapReduce, the batch-oriented framework has fallen out of vogue as users began demanding lower-latency processing for certain types of workloads — such as machine learning. However, nobody really wants to abandon Hadoop entirely because it’s still great for storing lots of data and many still use MapReduce for most of their workloads. Spark, which was developed at the University of California, Berkeley, has stepped in to fill that void in a growing number of cases where speed and ease of programming really matter.

H2O was developed separately by a startup called 0xadata (pronounced hexadata), although it’s also available as open source software. It’s an in-memory data engine specifically designed for running various types of types of statisical computations — including deep learning models — on data stored in the Hadoop Distributed File System.

Support for multiple data frameworks is yet another reason to learn Mahout.

Use Parquet with Impala, Hive, Pig, and MapReduce

Saturday, March 22nd, 2014

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce by John Russell.

From the post:

The CDH software stack lets you use your tool of choice with the Parquet file format – – offering the benefits of columnar storage at each phase of data processing.

An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:

  • Columnar storage layout: A query can examine and perform calculations on all values for a column while reading only a small fraction of the data from a data file or table.
  • Flexible compression options: The data can be compressed with any of several codecs. Different data files can be compressed differently. The compression is transparent to applications that read the data files.
  • Innovative encoding schemes: Sequences of identical, similar, or related data values can be represented in ways that save disk space and memory, yet require little effort to decode. The encoding schemes provide an extra level of space savings beyond the overall compression for each data file.
  • Large file size: The layout of Parquet data files is optimized for queries that process large volumes of data, with individual files in the multi-megabyte or even gigabyte range.

Impala can create Parquet tables, insert data into them, convert data from other file formats to Parquet, and then perform SQL queries on the resulting data files. Parquet tables created by Impala can be accessed by Apache Hive, and vice versa.

That said, the CDH software stack lets you use the tool of your choice with the Parquet file format, for each phase of data processing. For example, you can read and write Parquet files using Apache Pig and MapReduce jobs. You can convert, transform, and query Parquet tables through Impala and Hive. And you can interchange data files between all of those components — including ones external to CDH, such as Cascading and Apache Tajo.

In this blog post, you will learn the most important principles involved.

Since I mentioned ROOT files yesterday, I am curious what you make of the use of Thrift metadata definitions to read Parquet files?

It’s great that data can be documented for reading, but reading doesn’t imply to me that its semantics have been captured.

A wide variety of products read data, less certain they can document data semantics.


I first saw this in a tweet by Patrick Hunt.

Kite Software Development Kit

Thursday, March 13th, 2014

Kite Software Development Kit

From the webpage:

The Kite Software Development Kit (Apache License, Version 2.0), or Kite for short, is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.

  • Codifies expert patterns and practices for building data-oriented systems and applications
  • Lets developers focus on business logic, not plumbing or infrastructure
  • Provides smart defaults for platform choices
  • Supports gradual adoption via loosely-coupled modules

Version 0.12.0 was released March 10, 2014.

Do note that unlike some “pattern languages,” these are legitimate patterns are based on expert patterns and practices. (There are “patterns” produced like Uncle Bilius (Harry Potter and the Deathly Hallows, Chapter Eight) after downing a bottle of firewhiskey. You should avoid such patterns.)

Data Science Challenge

Tuesday, March 11th, 2014

Data Science Challenge

Some details from the registration page:

Prerequisite: Data Science Essentials (DS-200)
Schedule: Twice per year
Duration: Three months from launch date
Next Challenge Date: March 31, 2014
Language: English
Price: USD $600

From the webpage:

Cloudera will release a Data Science Challenge twice each year. Each bi-quarterly project is based on a real-world data science problem involving a large data set and is open to candidates for three months to complete. During the open period, candidates may work on their project individually and at their own pace.

Current Data Science Challenge

The new Data Science Challenge: Detecting Anomalies in Medicare Claims will be available starting March 31, 2014, and will cost USD $600.

In the U.S., Medicare reimburses private providers for medical procedures performed for covered individuals. As such, it needs to verify that the type of procedures performed and the cost of those procedures are consistent and reasonable. Finally, it needs to detect possible errors or fraud in claims for reimbursement from providers. You have been hired to analyze a large amount of data from Medicare and try to detect abnormal data — providers, areas, or patients with unusual procedures and/or claims.

Register for the challenge.

Build a Winning Model

CCP candidates compete against each other and against a benchmark set by a committee including some of the world’s elite data scientists. Participants who surpass evaluation benchmarks receive the CCP: Data Scientist credential.

Lead the Field

Those with the highest scores from each Challenge will have an opportunity to share their solutions and promote their work on and via press and social media outlets. All candidates retain the full rights to their own work and may leverage their models outside of the Challenge as they choose.

Useful way to develop some street cred in data science.

Apache Tez 0.3 Released!

Monday, March 10th, 2014

Apache Tez 0.3 Released! by Bikas Saha.

From the post:

The Apache Tez community has voted to release 0.3 of the software.

Apache™ Tez is a replacement of MapReduce that provides a powerful framework for executing a complex topology of tasks. Tez 0.3.0 is an important release towards making the software ready for wider adoption by focusing on fundamentals and ironing out several key functions. The major action areas in this release were

  1. Security. Apache Tez now works on secure Hadoop 2.x clusters using the built-in security mechanisms of the Hadoop ecosystem.
  2. Scalability. We tested the software on large clusters, very large data sets and large applications processing tens of TB each to make sure it scales well with both data-sets and machines.
  3. Fault Tolerance. Apache Tez executes a complex DAG workflow that can be subject to multiple failure conditions in clusters of commodity hardware and is highly resilient to these and other sorts of failures.
  4. Stability. A large number of bug fixes went into this release as early adopters and testers put the software through its paces and reported issues.

To prove the stability and performance of Tez, we executed complex jobs comprised of more than 50 different stages and tens of thousands of tasks on a fairly large cluster (> 300 Nodes, > 30TB data). Tez passed all our tests and we are certain that new adopters can integrate confidently with Tez and enjoy the same benefits as Apache Hive & Apache Pig have already.

I am curious how the Hadoop community is going to top 2013. I suspect Tez is going to be part of that answer!

Merge Mahout item based recommendations…

Saturday, March 8th, 2014

Merge Mahout item based recommendations results from different algorithms

From the post:

Apache Mahout is a machine learning library that leverages the power of Hadoop to implement machine learning through the MapReduce paradigm. One of the implemented algorithms is collaborative filtering, the most successful recommendation technique to date. The basic idea behind collaborative filtering is to analyze the actions or opinions of users to recommend items similar to the one the user is interacting with.

Similarity isn’t restricted to a particular measure or metric.

How similar is enough to be considered the same?

That is a question topic map designers must answer on a case by case basis.

Mortar PIG Cheat Sheet

Thursday, February 27th, 2014

Mortar PIG Cheat Sheet

From the cheatsheet:

We love Apache Pig for data processing—it’s easy to learn, it works with all kinds of data, and it plays well with Python, Java, and other popular languages. And, of course, Pig runs on Hadoop, so it’s built for high-scale data science.

Whether you’re just getting started with Pig or you’ve already written a variety of Pig scripts, this compact reference gathers in one place many of the tools you’ll need to make the most of your data using Pig 0.12

Easier on the eyes than a one pager!

Not to mention being a good example of how to write and format a cheat sheet.

Secrets of Cloudera Support:…

Wednesday, February 26th, 2014

Secrets of Cloudera Support: Inside Our Own Enterprise Data Hub by Adam Warrington.

From the post:

Here at Cloudera, we are constantly pushing the envelope to give our customers world-class support. One of the cornerstones of this effort is the Cloudera Support Interface (CSI), which we’ve described in prior blog posts (here and here). Through CSI, our support team is able to quickly reason about a customer’s environment, search for information related to a case currently being worked, and much more.

In this post, I’m happy to write about a new feature in CSI, which we call Monocle Stack Trace.

Stack Trace Exploration with Search

Hadoop log messages and the stack traces in those logs are critical information in many of the support cases Cloudera handles. We find that our customer operation engineers (COEs) will regularly search for stack traces they find referenced in support cases to try to determine where else that stack trace has shown up, and in what context it would occur. This could be in the many sources we were already indexing as part of Monocle Search in CSI: Apache JIRAs, Apache mailing lists, internal Cloudera JIRAs, internal Cloudera mailing lists, support cases, Knowledge Base articles, Cloudera Community Forums, and the customer diagnostic bundles we get from Cloudera Manager.

It turns out that doing routine document searches for stack traces doesn’t always yield the best results. Stack traces are relatively long compared to normal search terms, so search indexes won’t always return the relevant results in the order you would expect. It’s also hard for a user to churn through the search results to figure out if the stack trace was actually an exact match in the document to figure out how relevant it actually is.

To solve this problem, we took an approach similar to what Google does when it wants to allow searching over a type that isn’t best suited for normal document search (such as images): we created an independent index and search result page for stack-trace searches. In Monocle Stack Trace, the search results show a list of unique stack traces grouped with every source of data in which unique stack trace was discovered. Each source can be viewed in-line in the search result page, or the user can go to it directly by following a link.

We also give visual hints as to how the stack trace for which the user searched differs from the stack traces that show up in the search results. A green highlighted line in a search result indicates a matching call stack line. Yellow indicates a call stack line that only differs in line number, something that may indicate the same stack trace on a different version of the source code. A screenshot showing the grouping of sources and visual highlighting is below:

See Adam’s post for the details.

I like the imaginative modification of standard search.

Not all data is the same and searching it as if it were, leaves a lot of useful data unfound.

Apache Hadoop 2.3.0 Released!

Tuesday, February 25th, 2014

Apache Hadoop 2.3.0 Released! by Arun Murthy.

From the post:

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.3.0!

hadoop-2.3.0 is the first release for the year 2014, and brings a number of enhancements to the core platform, in particular to HDFS.

With this release, there are two significant enhancements to HDFS:

  • Support for Heterogeneous Storage Hierarchy in HDFS (HDFS-2832)
  • In-memory Cache for data resident in HDFS via Datanodes (HDFS-4949)

With support for heterogeneous storage classes in HDFS, we now can take advantage of different storage types on the same Hadoop clusters. Hence, we can now make better cost/benefit tradeoffs with different storage media such as commodity disks, enterprise-grade disks, SSDs, Memory etc. More details on this major enhancement are available here.

Along similar lines, it is now possible to use memory available in the Hadoop cluster to centrally cache and administer data-sets in-memory in the Datanode’s address space. Applications such as MapReduce, Hive, Pig etc. can now request for memory to be cached (for the curios, we use a combination of mmap, mlock to achieve this) and then read it directly off the Datanode’s address space for extremely efficient scans by avoiding disk altogether. As an example, Hive is taking advantage of this feature by implementing an extremely efficient zero-copy read path for ORC files – see HIVE-6347 for details.

See Arun’s post for more details.

I guess there really is a downside to open source development.

It’s so much faster than commercial product development cycles. 😉 (Hard to keep up.)

Spring XD – Tweets – Hadoop – Sentiment Analysis

Saturday, February 15th, 2014

Using Spring XD to stream Tweets to Hadoop for Sentiment Analysis

From the webpage:

This tutorial will build on the previous tutorial – 13 – Refining and Visualizing Sentiment Data – by using Spring XD to stream in tweets to HDFS. Once in HDFS, we’ll use Apache Hive to process and analyze them, before visualizing in a tool.

I re-ordered the text:

This tutorial is from the Community part of tutorial for Hortonworks Sandbox (1.3) – a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series.

This community tutorial submitted by mehzer with source available at Github. Feel free to contribute edits or your own tutorial and help the community learn Hadoop.

not to take anything away from Spring XD or Sentiment Analysis but to emphasize the community tutorial aspects of the Hortonworks Sandbox.

At present count on tutorials:

Hortonworks: 14

Partners: 12

Community: 6

Thoughts on what the next community tutorial should be?

Mining of Massive Datasets 2.0

Thursday, February 13th, 2014

Mining of Massive Datasets 2.0

From the webpage:

The following is the second edition of the book, which we expect to be published soon. We have added Jure Leskovec as a coauthor. There are three new chapters, on mining large graphs, dimensionality reduction, and machine learning.

There is a revised Chapter 2 that treats map-reduce programming in a manner closer to how it is used in practice, rather than how it was described in the original paper. Chapter 2 also has new material on algorithm design techniques for map-reduce.

Aren’t you wishing for more winter now? 😉

I first saw this in a tweet by Gregory Piatetsky.

Introducing PigPen: Map-Reduce for Clojure

Sunday, February 9th, 2014

Introducing PigPen: Map-Reduce for Clojure by Matt Bossenbroek.


From the post:

It is our pleasure to release PigPen to the world today. PigPen is map-reduce for Clojure. It compiles to Apache Pig, but you don’t need to know much about Pig to use it.

What is PigPen?

  • A map-reduce language that looks and behaves like clojure.core
  • The ability to write map-reduce queries as programs, not scripts
  • Strong support for unit tests and iterative development

Note: If you are not familiar at all with Clojure, we strongly recommend that you try a tutorial here, here, or here to understand some of the basics.

Not a quick read but certainly worth the effort!

Write and Run Giraph Jobs on Hadoop

Sunday, February 9th, 2014

Write and Run Giraph Jobs on Hadoop by Mirko Kämpf.

From the post:

Create a test environment for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.

Apache Giraph is a scalable, fault-tolerant implementation of graph-processing algorithms in Apache Hadoop clusters of up to thousands of computing nodes. Giraph is in use at companies like Facebook and PayPal, for example, to help represent and analyze the billions (or even trillions) of connections across massive datasets. Giraph was inspired by Google’s Pregel framework and integrates well with Apache Accumulo, Apache HBase, Apache Hive, and Cloudera Impala.

Currently, the upstream “quick start” document explains how to deploy Giraph on a Hadoop cluster with two nodes running Ubuntu Linux. Although this setup is appropriate for lightweight development and testing, using Giraph with an enterprise-grade CDH-based cluster requires a slightly more robust approach.

In this how-to, you will learn how to use Giraph 1.0.0 on top of CDH 4.x using a simple example dataset, and run example jobs that are already implemented in Giraph. You will also learn how to set up your own Giraph-based development environment. The end result will be a setup (not intended for production) for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets. (In future posts, I will explain how to implement your own graph algorithms and graph generators as well as how to export your results to Gephi, the “Adobe Photoshop for graphs”, through Impala and JDBC for further inspection.)

The first in a series of posts on Giraph.

This is great stuff!

It should keep you busy during your first conference call and/or staff meeting on Monday morning.

Monday won’t seem so bad. 😉

Create a Simple Hadoop Cluster with VirtualBox ( < 1 Hour)

Wednesday, January 29th, 2014

How-to: Create a Simple Hadoop Cluster with VirtualBox by Christian Javet.

From the post:

I wanted to get familiar with the big data world, and decided to test Hadoop. Initially, I used Cloudera’s pre-built virtual machine with its full Apache Hadoop suite pre-configured (called Cloudera QuickStart VM), and gave it a try. It was a really interesting and informative experience. The QuickStart VM is fully functional and you can test many Hadoop services, even though it is running as a single-node cluster.

I wondered what it would take to install a small four-node cluster…

I did some research and I found this excellent video on YouTube presenting a step by step explanation on how to setup a cluster with VMware and Cloudera. I adapted this tutorial to use VirtualBox instead, and this article describes the steps used.

Watch for the line:

Overall we will allocate 14GB of memory, so ensure that the host machine has sufficient memory, otherwise this will impact your experience negatively.

Yes, “…impact your experience negatively.”