Archive for the ‘HDFS’ Category

Using Apache Spark and Neo4j for Big Data Graph Analytics

Monday, November 3rd, 2014

Using Apache Spark and Neo4j for Big Data Graph Analytics by Kenny Bastani.

From the post:

Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.

Still, where does all that data come from? Where does it go when the analysis is done?

Graph databases

I’ve been working with graph database technologies for the last few years and I have yet to become jaded by its powerful ability to combine both the transformation of data with analysis. Graph databases like Neo4j are solving problems that relational databases cannot.

Graph processing at scale from a graph database like Neo4j is a tremendously valuable power.

But if you wanted to run PageRank on a dump of Wikipedia articles in less than 2 hours on a laptop, you’d be hard pressed to be successful. More so, what if you wanted the power of a high-performance transactional database that seamlessly handled graph analysis at this scale?

Mazerunner for Neo4j

Mazerunner is a Neo4j unmanaged extension and distributed graph processing platform that extends Neo4j to do big data graph processing jobs while persisting the results back to Neo4j.

Mazerunner uses a message broker to distribute graph processing jobs to Apache Spark’s GraphX module. When an agent job is dispatched, a subgraph is exported from Neo4j and written to Apache Hadoop HDFS.

Mazerunner is an alpha release with page rank as its only algorithm.

It has a great deal of potential so worth your time to investigate further.

Why Extended Attributes are Coming to HDFS

Saturday, June 28th, 2014

Why Extended Attributes are Coming to HDFS by Charles Lamb.

From the post:

Extended attributes in HDFS will facilitate at-rest encryption for Project Rhino, but they have many other uses, too.

Many mainstream Linux filesystems implement extended attributes, which let you associate metadata with a file or directory beyond common “fixed” attributes like filesize, permissions, modification dates, and so on. Extended attributes are key/value pairs in which the values are optional; generally, the key and value sizes are limited to some implementation-specific limit. A filesystem that implements extended attributes also provides system calls and shell commands to get, list, set, and remove attributes (and values) to/from a file or directory.

Recently, my Intel colleague Yi Liu led the implementation of extended attributes for HDFS (HDFS-2006). This work is largely motivated by Cloudera and Intel contributions to bringing at-rest encryption to Apache Hadoop (HDFS-6134; also see this post) under Project Rhino – extended attributes will be the mechanism for associating encryption key metadata with files and encryption zones — but it’s easy to imagine lots of other places where they could be useful.

For instance, you might want to store a document’s author and subject in sometime like and user.subject=HDFS. You could store a file checksum in an attribute called user.checksum. Even just comments about a particular file or directory can be saved in an extended attribute.

In this post, you’ll learn some of the details of this feature from an HDFS user’s point of view.

Extended attributes sound like an interesting place to tuck away additional information about a file.

Such as the legend to be used to interpret it?

Diving into HDFS

Thursday, May 15th, 2014

Diving into HDFS by Julia Evans.

From the post:

Yesterday I wanted to start learning about how HDFS (the Hadoop Distributed File System) works internally. I knew that

  • It’s distributed, so one file may be stored across many different machines
  • There’s a namenode, which keeps track of where all the files are stored
  • There are data nodes, which contain the actual file data

But I wasn’t quite sure how to get started! I knew how to navigate the filesystem from the command line (hadoop fs -ls /, and friends), but not how to figure out how it works internally.

Colin Marc pointed me to this great library called snakebite which is a Python HDFS client. In particular he pointed me to the part of the code that reads file contents from HDFS. We’re going to tear it apart a bit and see what exactly it does!

Be cautious reading Julia’s post!

Her enthusiasm can be infectious. 😉

Seriously, I take Julia’s posts as the way CS topics are supposed to be explored. While there is hard work, there is also the thrill of discovery. Not a bad approach to have.

6X Performance with Impala

Sunday, May 4th, 2014

In-memory Caching in HDFS: Lower latency, same great taste by Andrew Wang.

From the post:

My coworker Colin McCabe and I recently gave a talk at Hadoop Summit Amsterdam titled “In-memory Caching in HDFS: Lower latency, same great taste.” I’m very pleased with how this feature turned out, since it was approximately a year-long effort going from initial design to production system. Combined with Impala, we showed up to a 6x performance improvement by running on cached data, and that number will only improve with time. Slides and video of our presentation are available online.

Finding data the person who signs the checks will be interested in seeing with 6X performance is left as an exercise for the reader. 😉

Hadoop – 100x Faster… [With NO ETL!]

Monday, November 11th, 2013

Hadoop – 100x Faster. How we did it… by Nikita Ivanov.

From the post:

Almost two years ago, Dmitriy and I stood in front of a white board at GridGain’s office thinking: “How can we deliver the real-time performance of GridGain’s in-memory technology to Hadoop customers without asking them rip and replace their systems and without asking them to move their datasets off Hadoop?”.

Given Hadoop’s architecture – the task seemed daunting; and it proved to be one of the more challenging engineering puzzles we have had to solve.

After two years of development, tens of thousands of lines of Java, Scala and C++ code, multiple design iterations, several releases and dozens of benchmarks later, we finally built a product that can deliver real-time performance to Hadoop customers with seamless integration and no tedious ETL. Actual customers deployments can now prove our performance claims and validate our product’s architecture.

Here’s how we did it.

The Idea – In-Memory Hadoop Accelerator

Hadoop is based on two primary technologies: HDFS for storing data, and MapReduce for processing these data in parallel. Everything else in Hadoop and the Hadoop ecosystem sits atop these foundation blocks.

Originally, neither HDFS nor MapReduce were designed with real-time performance in mind. In order to deliver real-time processing without moving data out of Hadoop onto another platform, we had to improve the performance of both of these subsystems. (emphasis added)

The highlighted phrase is the key isn’t it?

In order to deliver real-time processing without moving data out of Hadoop onto another platform

ETL is down time, expense and risk of data corruption.

Given a choice between making your current data platform (of whatever type) more robust or risking a migration to a new data platform, which one would you choose?

Bear in mind those 2.5 million spreadsheets that Felienne mentions in her presentation.

Are you really sure you want to ETL on all you data?

As opposed to making your most critical data more robust and enhanced by other data? All while residing where it lives right now.

Are you ready to get off the ETL merry-go-round?

Applying the Big Data Lambda Architecture

Sunday, October 27th, 2013

Applying the Big Data Lambda Architecture by Michael Hausenblas.

From the article:

Based on his experience working on distributed data processing systems at Twitter, Nathan Marz recently designed a generic architecture addressing common requirements, which he called the Lambda Architecture. Marz is well-known in Big Data: He’s the driving force behind Storm and at Twitter he  led the streaming compute team, which provides and develops shared infrastructure to support critical real-time applications.

Marz and his team described the underlying motivation for building systems with the lambda architecture as:

  • The need for a robust system that is fault-tolerant, both against hardware failures and human mistakes.
  • To serve a wide range of workloads and use cases, in which low-latency reads and updates are required. Related to this point, the system should support ad-hoc queries.
  • The system should be linearly scalable, and it should scale out rather than up, meaning that throwing more machines at the problem will do the job.
  • The system should be extensible so that features can be added easily, and it should be easily debuggable and require minimal maintenance.

From a bird’s eye view the lambda architecture has three major components that interact with new data coming in and responds to queries, which in this article are driven from the command line:

The goal of the article:

In this article, I employ the lambda architecture to implement what I call UberSocialNet (USN). This open-source project enables users to store and query acquaintanceship data. That is, I want to be able to capture whether I happen to know someone from multiple social networks, such as Twitter or LinkedIn, or from real-life circumstances. The aim is to scale out to several billions of users while providing low-latency access to the stored information. To keep the system simple and comprehensible, I limit myself to bulk import of the data (no capabilities to live-stream data from social networks) and provide only a very simple a command-line user interface. The guts, however, use the lambda architecture.

Something a bit challenging for the start of the week. 😉

Hadoop Tutorials – Hortonworks

Wednesday, October 16th, 2013

With the GA release of Hadoop 2, it seems appropriate to list a set of tutorials for the Hortonworks Sandbox.

Tutorial 1: Hello World – An Overview of Hadoop with HCatalog, Hive and Pig

Tutorial 2: How To Process Data with Apache Pig

Tutorial 3: How to Process Data with Apache Hive

Tutorial 4: How to Use HCatalog, Pig & Hive Commands

Tutorial 5: How to Use Basic Pig Commands

Tutorial 6: How to Load Data for Hadoop into the Hortonworks Sandbox

Tutorial 7: How to Install and Configure the Hortonworks ODBC driver on Windows 7

Tutorial 8: How to Use Excel 2013 to Access Hadoop Data

Tutorial 9: How to Use Excel 2013 to Analyze Hadoop Data

Tutorial 10: How to Visualize Website Clickstream Data

Tutorial 11: How to Install and Configure the Hortonworks ODBC driver on Mac OS X

Tutorial 12: How to Refine and Visualize Server Log Data

Tutorial 13: How To Refine and Visualize Sentiment Data

Tutorial 14: How To Analyze Machine and Sensor Data

By the time you finish these, I am sure there will be more tutorials or even proposed additions to the Hadoop stack!

(Updated December 3, 2013 to add #13 and #14.)

Hadoop: Is There a Metadata Mess Waiting in the Wings?

Tuesday, October 8th, 2013

Hadoop: Is There a Metadata Mess Waiting in the Wings? by Robin Bloor.

From the post:

Why is Hadoop so popular? There are many reasons. First of all it is not so much a product as an ecosystem, with many components: MapReduce, HBase, HCatalog, Pig, Hive, Sqoop, Mahout and quite a few more. That makes it versatile, and all these components are open source, so most of them improve with each release cycle.

But, as far as I can tell, the most important feature of Hadoop is its file system: HDFS. This has two notable features: it is a key-value store, and it is built for scale-out use. The IT industry seemed to have forgotten about key-value stores. They used to be called ISAM files and came with every operating system until Unix, then Windows and Linux took over. These operating systems didn’t provide general purpose key-value stores, and nobody seemed to care much because there was a plethora of databases that you could use to store data, and there were even inexpensive open source ones. So, that seemed to take care of the data layer.

But it didn’t. The convenience of a key-value store is that you can put anything you want into it as long as you can think of a suitable index for it, and that is usually a simple choice. With a database you have to create a catalog or schema to identify what’s in every table. And, if you are going to use the data coherently, you have to model the data and determine what tables to hold and what attributes are in each table. This puts a delay into importing data from new sources into the database.

Now you can, if you want, treat a database table as a key-value store and define only the index. But that is regarded as bad practice, and it usually is. Add to this the fact that the legacy databases were never built to scale out and you quickly conclude that Hadoop can do something that a database cannot. It can become a data lake – a vast and very scalable data staging area that will accommodate any data you want, no matter how “unstructured” it is.

I rather like that imagery, unadorned Hadoop as a “data lake.”

But that’s not the only undocumented data in a Hadoop ecosystem.

What about the PIG scripts? The MapReduce routines? Or Mahout, Hive, Hbase, etc., etc.

Do you think all the other members of the Hadoop ecosystem also have undocumented data? And other variables?

When Robin mentions Revelytix as having a solution, I assume he means Loom.

Looking at Loom, ask yourself how well it documents other parts of the Hadoop ecosystem?

Robin has isolated a weakness in the current Hadoop system that will unexpectedly and suddenly make itself known.

Will you be ready?

Indexing use cases and technical strategies [Hadoop]

Thursday, August 22nd, 2013

Indexing use cases and technical strategies

From the post:

In this post, let us look at 3 real life indexing use cases. While Hadoop is commonly used for distributed batch index building, it is desirable to optimize the index capability in near real time. We look at some practical real life implementations where the engineers have successfully worked out their technology stack combinations using different products.

Resources on:

  1. Near Real Time index at eBay
  2. Distributed indexing strategy at Trovit
  3. Incremental Processing by Google’s Percolator

Presentations and a paper for the weekend!

Visualization on Impala: Big, Real-Time, and Raw

Tuesday, August 20th, 2013

Visualization on Impala: Big, Real-Time, and Raw by Justin Kestelyn.

From the post:

What if you could affordably manage billions of rows of raw Big Data and let typical business people analyze it at the speed of thought in beautiful, interactive visuals? What if you could do all the above without worrying about structuring that data in a data warehouse schema, moving it, and pre-defining reports and dashboards? With the approach I’ll describe below, you can.

The traditional Apache Hadoop approach — in which you store all your data in HDFS and do batch processing through MapReduce — works well for data geeks and data scientists, who can write MapReduce jobs and wait hours for them to run before asking the next question. But many businesses have never even heard of Hadoop, don’t employ a data scientist, and want their data questions answered in a second or two — not in hours.

We at Zoomdata, working with the Cloudera team, have figured out how to make Big Data simple, useful, and instantly accessible across an organization, with Cloudera Impala being a key element. Zoomdata is a next-generation user interface for data, and addresses streams of data as opposed to sets. Zoomdata performs continuous math across data streams in real-time to drive visualizations on touch, gestural, and legacy web interfaces. As new data points come in, it re-computes their values and turns them into visuals in milliseconds.

To handle historical data, Zoomdata re-streams the historical raw data through the same stream-processing engine, the same way you’d rewind a television show on your home DVR. The amount of the data involved can grow rapidly, so the ability to crunch billions of rows of raw data in a couple seconds is important –- which is where Impala comes in.

With Impala on top of raw HDFS data, we can run flights of tiny queries, each to do a tiny fraction of the overall work. Zoomdata adds the ability to process the resulting stream of micro-result sets instead of processing the raw data. We call this approach “micro-aggregate delegation”; it enables users to see results immediately, allowing for instantaneous analysis of arbitrarily large amounts of raw data. The approach also allows for joining micro-aggregate streams from disparate Hadoop, NoSQL, and legacy sources together while they are in-flight, an approach we call the “Death Star Join” (more on that in a future blog post).

The demo below shows how this works, by visualizing a dataset of 1 billion raw records per day nearly instantaneously, with no pre-aggregation, no indexing, no database, no star schema, no pre-built reports, and no data movement — just billions of rows of raw data in HDFS with Impala and Zoomdata on top.

The demo is very impressive! A must see.

The riff that Zoomdata is a “dvr” for data will resonate with many users.

My only caveat is caution with regard to the cleanliness of your data. The demo presumes that the underlying data is clean and the relationships displayed are relevant to the user’s query.

Neither of those assumptions may be correct in your particular case. Not the fault of Zoomdata because no software can correct a poor choice of data for analysis.

See Zoomdata.


Sunday, June 2nd, 2013

Hadoop REST API – WebHDFS by Istvan Szegedi.

From the post:

Hadoop provides a Java native API to support file system operations such as create, rename or delete files and directories, open, read or write files, set permissions, etc. A very basic example can be found on Apache wiki about how to read and write files from Hadoop.

This is great for applications running within the Hadoop cluster but there may be use cases where an external application needs to manipulate HDFS like it needs to create directories and write files to that directory or read the content of a file stored on HDFS. Hortonworks developed an additional API to support these requirements based on standard REST functionalities.

Something to add to your Hadoop toolbelt.

How Hadoop Works? HDFS case study

Thursday, April 18th, 2013

How Hadoop Works? HDFS case study by Dane Dennis.

From the post:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. The Hadoop library contains two major components HDFS and MapReduce, in this post we will go inside each HDFS part and discover how it works internally.

Knowing how to use Hadoop is one level of expertise.

Knowing how Hadoop works takes you to the next level.

One where you can better adapt Hadoop to your needs.

Understanding HDFS is a step in that direction.


Wednesday, April 17th, 2013

Tachyon by UC Berkeley AMP Lab.

From the webpage:

Tachyon is a fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce.It offers up to 300 times higher throughput than HDFS, by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed. Thus, Tachyon avoids going to disk to load datasets that is frequently read.

Since we aren’t quite to in-memory computing just yet, you may want to review Tachyon.

The numbers are very impressive.

HDFS File Operations Made Easy with Hue (demo)

Monday, April 8th, 2013

HDFS File Operations Made Easy with Hue by Romain Rigaux.

From the post:

Managing and viewing data in HDFS is an important part of Big Data analytics. Hue, the open source web-based interface that makes Apache Hadoop easier to use, helps you do that through a GUI in your browser — instead of logging into a Hadoop gateway host with a terminal program and using the command line.

The first episode in a new series of Hue demos, the video below demonstrates how to get up and running quickly with HDFS file operations via Hue’s File Browser application.

Very nice 2:18 video.

Brings the usual graphical file interface to Hadoop (no small feat) but reminds me of every other graphical file interface.

To step beyond the common graphical file interface, why not:

  • Links to scripts that call a file
  • File ownership – show all files owned by a user
  • Navigation of files by content type(s)
  • Grouping of files by common scripts
  • Navigation of files by content
  • Grouping of files by script owners calling the files

are just a few of the possibilities that come to mind.

I would make the roles in those relationships explicit but that is probably my topic map background showing through.

Apache Tajo

Wednesday, March 27th, 2013

Apache Tajo

From the webpage:


Tajo is a relational and distributed data warehouse system for Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation and ETL on large-data sets by leveraging advanced database techniques. It supports SQL standards. Tajo uses HDFS as a primary storage layer and has its own query engine which allows direct control of distributed execution and data flow. As a result, Tajo has a variety of query evaluation strategies and more optimization opportunities. In addition, Tajo will have a native columnar execution and and its optimizer.


  • Fast and low-latency query processing on SQL queries including projection, filter, group-by, sort, and join.
  • Rudiment ETL that transforms one data format to another data format.
  • Support various file formats, such as CSV, RCFile, RowFile (a row store file), and Trevni.
  • Command line interface to allow users to submit SQL queries
  • Java API to enable clients to submit SQL queries to Tajo

If you ever wanted to get in on the ground floor of a data warehouse project, this could be your chance!

I first saw this at ‎Apache Incubator: Tajo – a Relational and Distributed Data Warehouse for Hadoop by Alex Popescu.


Wednesday, March 6th, 2013


From the webpage:

PolyBase is a fundamental breakthrough in data processing used in SQL Server 2012 Parallel Data Warehouse to enable truly integrated query across Hadoop and relational data.

Complementing Microsoft’s overall Big Data strategy, PolyBase is a breakthrough new technology on the data processing engine in SQL Server 2012 Parallel Data Warehouse designed as the simplest way to combine non-relational data and traditional relational data in your analysis. While customers would normally burden IT to pre-populate the warehouse with Hadoop data or undergo an extensive training on MapReduce in order to query non-relational data, PolyBase does this all seamlessly giving you the benefits of “Big Data” without the complexities.

I must admit I had my hopes up for the videos labeled: “Watch informative videos to understand PolyBase.”

But the first one was only 2:52 in length and the second was about the Jim Gray Systems Lab (2:13).

So, fair to say it was short on details. 😉

The closest thing I found to a clue was in the PolyBase datasheet that reads (under PolyBase Use Cases, if you are reading along) where it says:

PolyBase introduces the concept of external tables to represent data residing in HDFS. An external table defines a schema (that is, columns and their types) for data residing in HDFS. The table’s metadata lives in the context of a SQL Server database and the actual table data resides in HDFS.

I assume that means that the data in HDFS could have multiple external tables for the same data? Depending upon the query?

Curious if the external tables and/or data types are going to have mapreduce capabilities built-in? To take advantage of parallel processing of the data?

BTW, for topic map types, subject identities for the keys and data types would be the same as with more traditional “internal” tables. In case you want to merge data.

Just out of curiosity, any thoughts on possible IP on external schemas being applied to data?

I first saw this at Alex Popescu’s Microsoft PolyBase: Unifying Relational and Non-Relational Data.

Data deduplication tactics with HDFS and MapReduce [Contractor Plagiarism?]

Wednesday, February 13th, 2013

Data deduplication tactics with HDFS and MapReduce

From the post:

As the amount of data continues to grow exponentially, there has been increased focus on stored data reduction methods. Data compression, single instance store and data deduplication are among the common techniques employed for stored data reduction.

Deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, data is not changed and eliminates storage capacity for identical data. Data deduplication offers significant advantage in terms of reduction in storage, network bandwidth and promises increased scalability.

From a simplistic use case perspective, we can see application in removing duplicates in Call Detail Record (CDR) for a Telecom carrier. Similarly, we may apply the technique to optimize on network traffic carrying the same data packets.

Covers five (5) tactics:

  1. Using HDFS and MapReduce only
  2. Using HDFS and HBase
  3. Using HDFS, MapReduce and a Storage Controller
  4. Using Streaming, HDFS and MapReduce
  5. Using MapReduce with Blocking techniques

In these times of “Great Sequestration,” how much you are spending on duplicated contractor documentation?

You do get electronic forms of documentation. Yes?

Not that difficult to document prior contractor self-plagiarism. Teasing out what you “mistakenly” paid for it may be harder.

Question: Would you rather find out now and correct or have someone else find out?

PS: For the ambitious in government employment. You might want to consider how discovery of contractor self-plagiarism reflects on your initiative and dedication to “good” government.

Exploring Apache Hama

Friday, November 16th, 2012

Exploring Apache Hama

From the post:

Apache Hama is one of the under-hyped projects in the Hadoop ecosystem but gaining a lot of traction steadily with the efforts of its committers. “Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph and network algorithms.

A summary of resources on Apache Hama.

You won’t learn Hama over the weekend but you can get a start towards a new skill to list at LinkedIn.

Twitter Flies by Hadoop on Search Quest

Friday, November 9th, 2012

Twitter Flies by Hadoop on Search Quest by Ian Armas Foster.

From the post:

People who use Twitter may not give a second thought to the search bar at the top of the page. It’s pretty basic, right? You type something into the nifty little box and, like the marvel of efficient search that it is, it offers suggestions for things the user might wish to search during the typing process.

On the surface, it operates like any decent search engine. Except, of course, this is Twitter we’re talking about. There is no basic functionality at the core here. As it turns out, a significant amount of effort went into designing the Twitter search suggestion engine and the network is still just getting started refining this engine.

A recent Twitter-published scientific paper tells the tale of Twitter’s journey through their previously existing Hadoop infrastructure to a custom combined infrastructure. This connects the HDFS to a frontend cache (to deal with queries and responses) and a backend (which houses algorithms that rank relevance).

The latency of the Hadoop solution was too high.

Makes me think about topic map authoring with a real time “merging” interface. One that displays the results of a current topic, association or occurrence that is being authored on the map.

Or at least the option to choose to see such a display with some reasonable response time.

CDH4.1 Now Released!

Wednesday, October 3rd, 2012

CDH4.1 Now Released! by Charles Zedlewski.

From the post:

We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:

  • Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
  • Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
  • Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
  • Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
  • FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink as well as metrics for monitoring as well as a number of performance improvements.
  • Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
  • Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.

It’s releases like this that make me wish I spent more time writing documentation for software. To try out all the cool features with no real goal other than trying them out.


Analyzing Twitter Data with Hadoop [Hiding in a Public Data Stream]

Wednesday, September 19th, 2012

Analyzing Twitter Data with Hadoop by Jon Natkins

From the post:

Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing we may not be able to talk to everyone we want to target directly, marketing departments can be more efficient by being selective about whom we reach out to.

In this post, we’ll learn how we can use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data. This will be the first post in a series. The posts to follow to will describe, in more depth, how each component is involved and how the custom code operates. All the code and instructions necessary to reproduce this pipeline is available on the Cloudera Github.

Looking forward to more posts in this series!

Social media is a focus for marketing teams for obvious reasons.

Analysis of snaps, crackles and pops en masse.

What if you wanted to communicate securely with others using social media?

Thinking of something more robust and larger than two (or three) lovers agreeing on code words.

How would you hide in a public data stream?

Or the converse, how would you hunt for someone in a public data stream?

How would you use topic maps to manage the semantic side of such a process?

Automating Your Cluster with Cloudera Manager API

Monday, September 10th, 2012

Automating Your Cluster with Cloudera Manager API

From the post:

API access was a new feature introduced in Cloudera Manager 4.0 (download free edition here.). Although not visible in the UI, this feature is very powerful, providing programmatic access to cluster operations (such as configuration and restart) and monitoring information (such as health and metrics). This article walks through an example of setting up a 4-node HDFS and MapReduce cluster via the Cloudera Manager (CM) API.

Cloudera Manager API Basics

The CM API is an HTTP REST API, using JSON serialization. The API is served on the same host and port as the CM web UI, and does not require an extra process or extra configuration. The API supports HTTP Basic Authentication, accepting the same users and credentials as the Web UI. API users have the same privileges as they do in the web UI world.

You can read the full API documentation here.

We are nearing mid-September so the holiday season will be here before long. It isn’t too early to start planning on price/hardware break points.

This will help configure a HDFS and MapReduce cluster on your holiday hardware.

CDH3 update 5 is now available

Monday, August 13th, 2012

CDH3 update 5 is now available by Arvind Prabhakar

From the post:

We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:

  • Flume 1.2.0 – Provides a durable file channel and many more features over the previous release.
  • Hive AvroSerDe – Replaces the Haivvreo SerDe and provides robust support for Avro data format.
  • WebHDFS – A full read/write REST API to HDFS.

Maintenance release. Installation is good practice before major releases.

Introducing Apache Hadoop YARN

Friday, August 3rd, 2012

Introducing Apache Hadoop YARN by Arun Murthy.

From the post:

I’m thrilled to announce that the Apache Hadoop community has decided to promote the next-generation Hadoop data-processing framework, i.e. YARN, to be a sub-project of Apache Hadoop in the ASF!

Apache Hadoop YARN joins Hadoop Common (core libraries), Hadoop HDFS (storage) and Hadoop MapReduce (the MapReduce implementation) as the sub-projects of the Apache Hadoop which, itself, is a Top Level Project in the Apache Software Foundation. Until this milestone, YARN was a part of the Hadoop MapReduce project and now is poised to stand up on it’s own as a sub-project of Hadoop.

In a nutshell, Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data-processing.

As folks are aware, Hadoop HDFS is the data storage layer for Hadoop and MapReduce was the data-processing layer. However, the MapReduce algorithm, by itself, isn’t sufficient for the very wide variety of use-cases we see Hadoop being employed to solve. With YARN, Hadoop now has a generic resource-management and distributed application framework, where by, one can implement multiple data processing applications customized for the task at hand. Hadoop MapReduce is now one such application for YARN and I see several others given my vantage point – in future you will see MPI, graph-processing, simple services etc.; all co-existing with MapReduce applications in a Hadoop YARN cluster.

Considering the explosive growth of Hadoop, what new data processing applications do you see emerging first in YARN?

Why we build our platform on HDFS

Thursday, July 26th, 2012

Why we build our platform on HDFS by Charles Zedlewski

Charles Zedlewski pushes the number of Hadoop competitors up to twelve:

It’s not often the case that I have a chance to concur with my colleague E14 over at Hortonworks but his recent blog post gave the perfect opportunity. I wanted to build on a few of E14’s points and add some of my own.

A recent GigaOm article presented 8 alternatives to HDFS. They actually missed at least 4 others. For over a year, Parascale marketed itself as an HDFS alternative (until it became an asset sale to Hitachi). Appistry continues to market its HDFS alternative. I’m not sure if it’s released yet but it is very evident that Symantec’s Veritas unit is proposing its Clustered Filesystem (CFS) as an alternative to HDFS as well. HP Ibrix has also supported the HDFS API for some years now.

The GigaOm article implies that the presence of twelve other vendors promoting alternatives must speak to some deficiencies in HDFS for what else would motivate so many offerings? This really draws the incorrect conclusion. I would ask this:

What can we conclude from the fact that there are:

Best links I have for Hadoop competitors (for your convenience and additions):

  1. Appistry
  2. Cassandra (DataStax)
  3. Ceph (Inktrack)
  4. Clustered Filesystem (CFS)
  5. Dispersed Storage Network (Cleversafe)
  6. GPFS (IBM)
  7. Ibrix
  8. Isilon (EMC)
  9. Lustre
  10. MapR File System
  11. NetApp Open Solution for Hadoop
  12. Parascale

Thinking about the HDFS vs. Other Storage Technologies

Wednesday, July 25th, 2012

Thinking about the HDFS vs. Other Storage Technologies by Eric Baldeschwieler.

Just to whet your interest (see Eric’s post for the details):

As Apache Hadoop has risen in visibility and ubiquity we’ve seen a lot of other technologies and vendors put forth as replacements for some or all of the Hadoop stack. Recently, GigaOM listed eight technologies that can be used to replace HDFS (Hadoop Distributed File System) in some use cases. HDFS is not without flaws, but I predict a rosy future for HDFS. Here is why…

To compare HDFS to other technologies one must first ask the question, what is HDFS good at:

  • Extreme low cost per byte….
  • Very high bandwidth to support MapReduce workloads….
  • Rock solid data reliability….

A lively storage competition is a good thing.

A good opportunity to experiment with different storage strategies.

Hortonworks Data Platform v1.0 Download Now Available

Thursday, June 21st, 2012

Hortonworks Data Platform v1.0 Download Now Available

From the post:

If you haven’t yet noticed, we have made Hortonworks Data Platform v1.0 available for download from our website. Previously, Hortonworks Data Platform was only available for evaluation for members of the Technology Preview Program or via our Virtual Sandbox (hosted on Amazon Web Services). Moving forward and effective immediately, Hortonworks Data Platform is available to the general public.

Hortonworks Data Platform is a 100% open source data management platform, built on Apache Hadoop. As we have stated on many occasions, we are absolutely committed to the Apache Hadoop community and the Apache development process. As such, all code developed by Hortonworks has been contributed back to the respective Apache projects.

Version 1.0 of Hortonworks Data Platform includes Apache Hadoop-1.0.3, the latest stable line of Hadoop as defined by the Apache Hadoop community. In addition to the core Hadoop components (including MapReduce and HDFS), we have included the latest stable releases of essential projects including HBase 0.92.1, Hive 0.9.0, Pig 0.9.2, Sqoop 1.4.1, Oozie 3.1.3 and Zookeeper 3.3.4. All of the components have been tested and certified to work together. We have also added tools that simplify the installation and configuration steps in order to improve the experience of getting started with Apache Hadoop.

I’m a member of the general public! And you probably are too! 😉

See the rest of the post for more goodies that are included with this release.

HBase Write Path

Wednesday, June 20th, 2012

HBase Write Path by Jimmy Xiang.

From the post:

Apache HBase is the Hadoop database, and is based on the Hadoop Distributed File System (HDFS). HBase makes it possible to randomly access and update data stored in HDFS, but files in HDFS can only be appended to and are immutable after they are created. So you may ask, how does HBase provide low-latency reads and writes? In this blog post, we explain this by describing the write path of HBase — how data is updated in HBase.

The write path is how an HBase completes put or delete operations. This path begins at a client, moves to a region server, and ends when data eventually is written to an HBase data file called an HFile. Included in the design of the write path are features that HBase uses to prevent data loss in the event of a region server failure. Therefore understanding the write path can provide insight into HBase’s native data loss prevention mechanism.

Whether you intend to use Hadoop for topic map processing or not, this will be a good introduction to updating data in HBase. Not all applications using Hadoop are topic maps so this may serve you in other contexts as well.

Cloudera Manager 3.7.6 released!

Monday, June 4th, 2012

Cloudera Manager 3.7.6 released! by Jon Zuanich.

Jon writes:

We are pleased to announce that Cloudera Manager 3.7.6 is now available! The most notable updates in this release are:

  • Support for multiple Hue service instances
  • Separating RPC queue and processing time metrics for HDFS
  • Performance tuning of the Resource Manager components
  • Several bug fixes and performance improvements

The detailed Cloudera Manager 3.7.6 release notes are available at:

Cloudera Manager 3.7.6 is available to download from:

Only fair since I mentioned the Cray earlier that I get a post about Cloudera out today as well.