Archive for the ‘Hadoop’ Category

Hadoop® v3.0.0, Pre-1990 Documentation Practice

Saturday, December 16th, 2017

Apache® Hadoop® v3.0.0 General Availability

From the post:

Ubiquitous Open Source enterprise framework maintains decade-long leading role in $100B annual Big Data market

The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, today announced Apache® Hadoop® v3.0.0, the latest version of the Open Source software framework for reliable, scalable, distributed computing.

Over the past decade, Apache Hadoop has become ubiquitous within the greater Big Data ecosystem by enabling firms to run and manage data applications on large hardware clusters in a distributed computing environment.

"This latest release unlocks several years of development from the Apache community," said Chris Douglas, Vice President of Apache Hadoop. "The platform continues to evolve with hardware trends and to accommodate new workloads beyond batch analytics, particularly real-time queries and long-running services. At the same time, our Open Source contributors have adapted Apache Hadoop to a wide range of deployment environments, including the Cloud."

"Hadoop 3 is a major milestone for the project, and our biggest release ever," said Andrew Wang, Apache Hadoop 3 release manager. "It represents the combined efforts of hundreds of contributors over the five years since Hadoop 2. I'm looking forward to how our users will benefit from new features in the release that improve the efficiency, scalability, and reliability of the platform."

Apache Hadoop 3.0.0 highlights include:

  • HDFS erasure coding —halves the storage cost of HDFS while also improving data durability;
  • YARN Timeline Service v.2 (preview) —improves the scalability, reliability, and usability of the Timeline Service;
  • YARN resource types —enables scheduling of additional resources, such as disks and GPUs, for better integration with machine learning and container workloads;
  • Federation of YARN and HDFS subclusters transparently scales Hadoop to tens of thousands of machines;
  • Opportunistic container execution improves resource utilization and increases task throughput for short-lived containers. In addition to its traditional, central scheduler, YARN also supports distributed scheduling of opportunistic containers; and 
  • Improved capabilities and performance improvements for cloud storage systems such as Amazon S3 (S3Guard), Microsoft Azure Data Lake, and Aliyun Object Storage System.

… (emphasis in original)

Ah, the Hadoop link.

Do you find it odd use of the leader in the “$100B annual Big Data market” is documented by string comments in scripts and code?

Do you think non-technical management benefits from the documentation so captured?

Or that documentation for field names, routines, etc., can be easily extracted?

If software is maturing in a $100B market, shouldn’t it have mature documentation capabilities as well?

Agile Data Science [Free Download]

Tuesday, February 9th, 2016

Agile Data Science by Russell Jurney.

From the preface:

I wrote this book to get over a failed project and to ensure that others do not repeat my mistakes. In this book, I draw from and reflect upon my experience building analytics applications at two Hadoop shops.

Agile Data Science has three goals: to provide a how-to guide for building analyticsapplications with big data using Hadoop; to help teams collaborate on big data projectsin an agile manner; and to give structure to the practice of applying Agile Big Data analytics in a way that advances the field.

From 2013 and data science has moved quite a bit in the meantime but the principles Russell illustrates remain sound and people do still use Hadoop.

Depending on what you gave up for Lent, you should have enough non-work time to work through Agile Data Science by the end of Lent.

Maybe this year you will have something to show for the forty days of Lent. 😉

What is this Hadoop thing? [What’s Missing From This Picture?]

Friday, January 29th, 2016

Kirk Borne posted this image to Twitter today:


You have seen this image or similar ones. About Hadoop, Big Data, non-IT stuff, etc. You can probably recite the story with your eyes closed, even when you are drunk or stoned. 😉

But today I realized not only who is in the image but who’s missing. Especially in a Hadoop/Big Data context.

Who’s in the image? Customers. They are the blind actors who would not recognize Hadoop in a closet with the light on. They have no idea what relevance Hadoop has to their data and/or any possible benefit to their business problems.

Who’s not in the image? Marketers. They are just out of view in this image. Once they learn a customer has data, they have the solution, Hadoop. “What do you want Hadoop to do exactly?” marketers ask before directly a customer to a particular part of the Hadoop/elephant.

Lo and behold, data salvation is at hand! May the IT gods be praised! We are going to have big data with Hadoop, err, ah, pushing, dragging, well, we’ll get to the specifics later.

The crux of every business problem is a business and not technological need.

You may not be able to store full bandwidth teleconference videos but if you don’t do any video teleconferencing, that’s not really your problem.

If you are already running three shifts making as many HellFire missiles as you can, there isn’t much point in building a recommendation system for your sales department to survey your customers.

Go into every IT conversation with a list of your business needs and require that proposed solutions address those needs, in defined and measurable ways.

You can avoid feeling up an elephant while someone loots your wallet.

Spinning up a Spark Cluster on Spot Instances: Step by Step [$0.69 for 6 hours]

Thursday, October 29th, 2015

Spinning up a Spark Cluster on Spot Instances: Step by Step by Austin Ouyang.

From the post:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to deploy a Spark Standalone cluster on AWS Spot Instances for less than $1. In a follow up post, we will show you how to use a Jupyter notebook on Spark for ad hoc analysis of reddit comment data on Amazon S3.

One of the significant hurdles in learning to build distributed systems is understanding how these various technologies are installed and their inter-dependencies. In our experience, the best way to get started with these technologies is to roll up your sleeves and build projects you are passionate about.

This following tutorial shows how you can deploy your own Spark cluster in standalone mode on top of Hadoop. Due to Spark’s memory demand, we recommend using m4.large spot instances with 200GB of magnetic hard drive space each.

m4.large spot instances are not within the free-tier package on AWS, so this tutorial will incur a small cost. The tutorial should not take any longer than a couple hours, but if we allot 6 hours for your 4 node spot cluster, the total cost should run around $0.69 depending on the region of your cluster. If you run this cluster for an entire month we can look at a bill of around $80, so be sure to spin down you cluster after you are finished using it.

How does $0.69 to improve your experience with distributed systems sound?

It’s hard to imagine a better deal.

The only reason to lack experience with distributed systems is lack of interest.

Odd I know but it does happen (or so I have heard). 😉

I first saw this in a tweet by Kirk Borne.

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop

Monday, June 1st, 2015

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop by Ted Malaska.

From the post:

Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment.

The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together but in reality they tend to break down into a few different architectural patterns, with different components of the ecosystem better suited for different problems.

In this post, I will outline the four major streaming patterns that we have encountered with customers running enterprise data hubs in production, and explain how to implement those patterns architecturally on Hadoop.

Streaming Patterns

The four basic streaming patterns (often used in tandem) are:

  • Stream ingestion: Involves low-latency persisting of events to HDFS, Apache HBase, and Apache Solr.
  • Near Real-Time (NRT) Event Processing with External Context: Takes actions like alerting, flagging, transforming, and filtering of events as they arrive. Actions might be taken based on sophisticated criteria, such as anomaly detection models. Common use cases, such as NRT fraud detection and recommendation, often demand low latencies under 100 milliseconds.
  • NRT Event Partitioned Processing: Similar to NRT event processing, but deriving benefits from partitioning the data—like storing more relevant external information in memory. This pattern also requires processing latencies under 100 milliseconds.
  • Complex Topology for Aggregations or ML: The holy grail of stream processing: gets real-time answers from data with a complex and flexible set of operations. Here, because results often depend on windowed computations and require more active data, the focus shifts from ultra-low latency to functionality and accuracy.

In the following sections, we’ll get into recommended ways for implementing such patterns in a tested, proven, and maintainable way.

Great post on patterns for near real-time data processing.

What I have always wondered about is how much of a use case is there for “near real-time processing” of data? If human decision makers are in the loop, that is outside of ecommerce and algorithmic trading, what is the value-add of “near real-time processing” of data?

For example, Kai Wähner in Real-Time Stream Processing as Game Changer in a Big Data World with Hadoop and Data Warehouse gives the following as common use cases for “near real-time processing” of data”

  • Network monitoring
  • Intelligence and surveillance
  • Risk management
  • E-commerce
  • Fraud detection
  • Smart order routing
  • Transaction cost analysis
  • Pricing and analytics
  • Market data management
  • Algorithmic trading
  • Data warehouse augmentation

Ecommerce, smart order routing, algorithmic trading all fall into the category of no human involved so those may need real-time processing.

But take network monitoring for example. From the news reports I understand that hackers had free run of the Sony network for months. I suppose you must have network monitoring at all before real-time network monitoring would be useful at all.

I would probe to make sure that “real-time” was necessary for the use cases at hand before simply assuming it. In smaller organizations, access to data and “real-time” results are more often a symptom of control issues as opposed to any actual use case for the data.

MapR on Open Data Platform: Why we declined

Wednesday, April 29th, 2015

MapR on Open Data Platform: Why we declined by John Schroeder.

From the post:

Open Data Platform is “solving” problems that don’t need solving

Companies implementing Hadoop applications do not need to be concerned about vendor lock-in or interoperability issues. Gartner analysts Merv Adrian and Nick Heudecker disclosed in a recent blog that less than 1% of companies surveyed thought that vendor lock-in or interoperability was an issue—dead last on the list of customer concerns. Project and sub-project interoperability are very good and guaranteed by both free and paid-for distributions. Applications built on one distribution can be migrated with virtually zero switching costs to the other distributions.

Open Data Platform participation lacks participation by the Hadoop leaders

~75% of Hadoop implementations run on MapR and Cloudera. MapR and Cloudera have both chosen not to participate. The Open Data Platform without MapR and Cloudera is a bit like one of the Big Three automakers pushing for a standards initiative without the involvement of the other two.

I mention this post because it touches on two issues that should concern all users of Hadoop applications.

On “vendor lock-in” you will find the question that was asked was “…how many attendees considered vendor lock-in a barrier to investment in Hadoop. It came in dead last. With around 1% selecting it.” Who Asked for an Open Data Platform?. Considering that it was in the context of a Gartner webinar, it could have been only one person selected it. Not what I would call a representative sample.

Still, I think John in right in saying that vendor lock-in isn’t a real issue with Hadoop. Hadoop applications aren’t off the shelf items and are custom constructs for your needs and data. Not much opportunity for vendor lock-in. You’re in greater danger of IT lock-in due to poor or non-existent documentation for your Hadoop application. If anyone tells you a Hadoop application doesn’t need documentation because you can “…read the code…,” they are building up job security, quite possibly at your future expense.

John is spot on about the Open Data Platform not including all of the Hadoop market leaders. As John says, Open Data Platform does not include those responsible for 75% of the existing Hadoop implementations.

I have seen that situation before in standards work and it never leads to a happy conclusion, for the participants, non-participants and especially the consumers, who are supposed to benefit from the creation of standards. Non-standards for a minority of the market only serve to confuse not overly clever consumers. To say nothing of the popular IT press.

The Open Data Platform also raises questions about how one goes about creating a standard. One approach is to create a standard based on your projection of market needs and to campaign for its adoption. Another is to create a definition of an “ODP Core” and see if it is used by customers in development contracts and purchase orders. If consumers find it useful, they will no doubt adopt it as a de facto standard. Formalization can follow in due course.

So long as we are talking about possible future standards, a practice of documentation more advanced than C style comments for Hadoop ecosystems would be a useful Hadoop standard in the future.

Getting Started with Spark (in Python)

Sunday, April 26th, 2015

Getting Started with Spark (in Python) by Benjamin Bengfort.

From the post:

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (The Google File System), which is implemented as HDFS in Hadoop, and a framework for distributed computing (MapReduce).

These two ideas have been the prime drivers for the advent of scaling analytics, large scale machine learning, and other big data appliances for the last ten years! However, in technology terms, ten years is an incredibly long time, and there are some well-known limitations that exist, with MapReduce in particular. Notably, programming MapReduce is difficult. You have to chain Map and Reduce tasks together in multiple steps for most analytics. This has resulted in specialized systems for performing SQL-like computations or machine learning. Worse, MapReduce requires data to be serialized to disk between each step, which means that the I/O cost of a MapReduce job is high, making interactive analysis and iterative algorithms very expensive; and the thing is, almost all optimization and machine learning is iterative.

To address these problems, Hadoop has been moving to a more general resource management framework for computation, YARN (Yet Another Resource Negotiator). YARN implements the next generation of MapReduce, but also allows applications to leverage distributed resources without having to compute with MapReduce. By generalizing the management of the cluster, research has moved toward generalizations of distributed computation, expanding the ideas first imagined in MapReduce.

Spark is the first fast, general purpose distributed computing paradigm resulting from this shift and is gaining popularity rapidly. Spark extends the MapReduce model to support more types of computations using a functional programming paradigm, and it can cover a wide range of workflows that previously were implemented as specialized systems built on top of Hadoop. Spark uses in-memory caching to improve performance and, therefore, is fast enough to allow for interactive analysis (as though you were sitting on the Python interpreter, interacting with the cluster). Caching also improves the performance of iterative algorithms, which makes it great for data theoretic tasks, especially machine learning.

In this post we will first discuss how to set up Spark to start easily performing analytics, either simply on your local machine or in a cluster on EC2. We then will explore Spark at an introductory level, moving towards an understanding of what Spark is and how it works (hopefully motivating further exploration). In the last two sections we will start to interact with Spark on the command line and then demo how to write a Spark application in Python and submit it to the cluster as a Spark job.

Be forewarned, this post uses the “F” word (functional) to describe the programming paradigm of Spark. Just so you know. 😉

If you aren’t already using Spark, this is about as easy a learning curve as can be expected.


I first saw this in a tweet by DataMining.

New Spatial Aggregation Tutorial for GIS Tools for Hadoop

Sunday, March 29th, 2015

New Spatial Aggregation Tutorial for GIS Tools for Hadoop by Sarah Ambrose.

Sarah motivates you to learn about spatial aggregation, aka spatial binning, with two visualizations of New York taxi data:

No aggregation:




Now that I have your attention, ;-), from the post:

This spatial analysis ability is available using the GIS Tools for Hadoop. There is a new tutorial posted that takes you through the steps of aggregating taxi data. You can find the tutorial here.


Association Rule Mining – Not Your Typical Data Science Algorithm

Monday, March 23rd, 2015

Association Rule Mining – Not Your Typical Data Science Algorithm by Dr. Kirk Borne.

From the post:

Many machine learning algorithms that are used for data mining and data science work with numeric data. And many algorithms tend to be very mathematical (such as Support Vector Machines, which we previously discussed). But, association rule mining is perfect for categorical (non-numeric) data and it involves little more than simple counting! That’s the kind of algorithm that MapReduce is really good at, and it can also lead to some really interesting discoveries.

Association rule mining is primarily focused on finding frequent co-occurring associations among a collection of items. It is sometimes referred to as “Market Basket Analysis”, since that was the original application area of association mining. The goal is to find associations of items that occur together more often than you would expect from a random sampling of all possibilities. The classic example of this is the famous Beer and Diapers association that is often mentioned in data mining books. The story goes like this: men who go to the store to buy diapers will also tend to buy beer at the same time. Let us illustrate this with a simple example. Suppose that a store’s retail transactions database includes the following information:

If you aren’t familiar with association rule mining, I think you will find Dr. Borne’s post an entertaining introduction.

I would not go quite as far as Dr. Borne with “explanations” for the pop-tart purchases before hurricanes. For retail purposes, so long as we spot the pattern, they could be building dikes out of them. The same is the case for other purchases. Take advantage of the patterns and try to avoid second guessing consumers. You can read more about testing patterns Selling Blue Elephants.


MapR Sandbox Fastest On-Ramp to Hadoop

Monday, March 23rd, 2015

MapR Sandbox Fastest On-Ramp to Hadoop

From the webpage:

The MapR Sandbox for Hadoop provides tutorials, demo applications, and browser-based user interfaces to let developers and administrators get started quickly with Hadoop. It is a fully functional Hadoop cluster running in a virtual machine. You can try our Sandbox now – it is completely free and available as a VMware or VirtualBox VM.

If you are a business intelligence analyst or a developer interested in self-service data exploration on Hadoop using SQL and BI Tools, the MapR Sandbox including Apache Drill will get you started quickly. You can download the Drill Sandbox here.

You of course know about the Hortonworks and Cloudera (at the very bottom of the page) sandboxes as well.

Don’t expect a detailed comparison of all three because the features and distributions change too quickly for that to be useful. And my interest is more in capturing the style or approach that may make a difference to a starting user.


I first saw this in a tweet by Kirk Borne.

Jump-Start Big Data with Hortonworks Sandbox on Azure

Thursday, March 19th, 2015

Jump-Start Big Data with Hortonworks Sandbox on Azure by Saptak Sen.

From the post:

We’re excited to announce the general availability of Hortonworks Sandbox for Hortonworks Data Platform 2.2 on Azure.

Hortonworks Sandbox is already a very popular environment in which developers, data scientists, and administrators can learn and experiment with the latest innovations in the Hortonworks Data Platform.

The hundreds of innovations span Hadoop, Kafka, Storm, Hive, Pig, YARN, Ambari, Falcon, Ranger, and other components of which HDP is composed. Now you can deploy this environment for your learning and experimentation in a few clicks on Microsoft Azure.

Follow the guide to Getting Started with Hortonworks Sandbox with HDP 2.2 on Azure to set up your own dev-ops environment on the cloud in a few clicks.

We also provide step by step tutorials to help you get a jump-start on how to use HDP to implement a Modern Data Architecture at your organization.

The Hadoop Sandbox is an excellent way to explore the Hadoop ecosystem. If you trash the setup, just open another sandbox.

Add Hortonworks tutorials to the sandbox and you are less likely to do something really dumb. Or at least you will understand what happened and how to avoid it before you go into production. Always nice to keep the dumb mistakes on your desktop.

Now the Hortonworks Sandbox is on Azure. Same safe learning environment but the power to scale when you are really to go live!

Convolutional Neural Nets in Net#

Wednesday, March 11th, 2015

Convolutional Neural Nets in Net# by by Alexey Kamenev.

From the post:

After introducing Net# in the previous post, we continue with our overview of the language and examples of convolutional neural nets or convnets.

Convnets have become very popular in recent years as they consistently produce great results on hard problems in computer vision, automatic speech recognition and various natural language processing tasks. In most such problems, the features have some geometric relationship, like pixels in an image or samples in audio stream. An excellent introduction to convnets can be found here: (Lecture 5)

Before we start discussing convnets, let’s introduce one definition that is important to understand when working with Net#. In a neural net structure, each trainable layer (a hidden or an output layer) has one or more connection bundles. A connection bundle consists of a source layer and a specification of the connections from that source layer. All the connections in a given bundle share the same source layer and the same destination layer. In Net#, a connection bundle is considered as belonging to the bundle’s destination layer. Net# supports various kinds of bundles like fully connected, convolutional, pooling and so on. A layer might have multiple bundles which connect it to different source layers.

BTW, the previous post was: Neural Nets in Azure ML – Introduction to Net#. Not exactly what I was expecting by the Net# reference.

Machine Learning Blog needs to be added to your RSS feed.

If you need more information: Guide to Net# neural network specification language.


I first saw this in a tweet by Michael Cavaretta.

Apache Tajo brings data warehousing to Hadoop

Tuesday, March 10th, 2015

Apache Tajo brings data warehousing to Hadoop by Joab Jackson.

From the post:

Organizations that want to extract more intelligence from their Hadoop deployments might find help from the relatively little known Tajo open source data warehouse software, which the Apache Software Foundation has pronounced as ready for commercial use.

The new version of Tajo, Apache software for running a data warehouse over Hadoop data sets, has been updated to provide greater connectivity to Java programs and third party databases such as Oracle and PostGreSQL.

While less well-known than other Apache big data projects such as Spark or Hive, Tajo could be a good fit for organizations outgrowing their commercial data warehouses. It could also be a good fit for companies wishing to analyze large sets of data stored on Hadoop data processing platforms using familiar commercial business intelligence tools instead of Hadoop’s MapReduce framework.

Tajo performs the necessary ETL (extract-transform-load process) operations to summarize large data sets stored on an HDFS (Hadoop Distributed File System). Users and external programs can then query the data through SQL.

The latest version of the software, issued Monday, comes with a newly improved JDBC (Java Database Connectivity) driver that its project managers say makes Tajo as easy to use as a standard relational database management system. The driver has been tested against a variety of commercial business intelligence software packages and other SQL-based tools. (Just so you know, I took out the click following stuff and inserted the link to the Tajo project page only.)

Being surprised by Apache Tajo I looked at the list of the top level projects at Apache and while I recognized a fair number of them by name, I could tell you the status only of those I actively follow. Hard to say what other jewels are hidden there.

Joab cites several large data consumers who have found Apache Tajo faster than Hive for their purposes. Certainly an option to keep in mind.

Operationalizing a Hadoop Eco-System

Monday, March 2nd, 2015

(Part 1: Installing & Configuring a 3-node Cluster) by Louis Frolio.

From the post:

The objective of DataTechBlog is to bring the many facets of data, data tools, and the theory of data to those curious about data science and big data. The relationship between these disciplines and data can be complex. However, if careful consideration is given to a tutorial, it is a practical expectation that the layman can be brought online quickly. With that said, I am extremely excited to bring this tutorial on the Hadoop Eco-system. Hadoop & MapReduce (at a high level) are not complicated ideas. Basically, you take a large volume of data and spread it across many servers (HDFS). Once at rest, the data can be acted upon by the many CPU’s in the cluster (MapReduce). What makes this so cool is that the traditional approach to processing data (bring data to cpu) is flipped. With MapReduce, CPU is brought to the data. This “divide-and-conquer” approach makes Hadoop and MapReduce indispensable when processing massive volumes of data. In part 1 of this multi-part series, I am going to demonstrate how to install, configure and run a 3-node Hadoop cluster. Finally, at the end I will run a simple MapReduce job to perform a unique word count of Shakespeare’s Hamlet. Future installments of this series will include topics such as: 1. Creating an advanced word count with MapReduce, 2. Installing and running Hive, 3. Installing and running Pig, 4. Using Sqoop to extract and import structured data into HDFS. The goal is to illuminate all the popular and useful tools that support Hadoop.

Operationalizing a Hadoop Eco-System (Part 2: Customizing Map Reduce)

Operationalizing a Hadoop Eco-System (Part 3: Installing and using Hive)

Be forewarned that Louis suggests hosting three Linux VMs on a fairly robust machine. He worked on a Windows 7 x64 machine with 1 TB of storage and 24G of RAM. (How much of that was used by Windows and Office he doesn’t say. 😉 )

The last post in this series was in April 2014 so you may have to look elsewhere for tutorials on Pig and Sqoop.


Working with Small Files in Hadoop – Part 1, Part 2, Part 3

Wednesday, February 25th, 2015

Working with Small Files in Hadoop – Part 1, Part 2, Part 3 by Chris Deptula.

From the post:

Why do small files occur?

The small file problem is an issue Inquidia Consulting frequently sees on Hadoop projects. There are a variety of reasons why companies may have small files in Hadoop, including:

  • Companies are increasingly hungry for data to be available near real time, causing Hadoop ingestion processes to run every hour/day/week with only, say, 10MB of new data generated per period.
  • The source system generates thousands of small files which are copied directly into Hadoop without modification.
  • The configuration of MapReduce jobs using more than the necessary number of reducers, each outputting its own file. Along the same lines, if there is a skew in the data that causes the majority of the data to go to one reducer, then the remaining reducers will process very little data and produce small output files.

Does it sound like you have small files? If so, this series by Chris is what you are looking for.

Start of a new era: Apache HBase™ 1.0

Wednesday, February 25th, 2015

Start of a new era: Apache HBase™ 1.0

From the post:

The Apache HBase community has released Apache HBase 1.0.0. Seven years in the making, it marks a major milestone in the Apache HBase project’s development, offers some exciting features and new API’s without sacrificing stability, and is both on-wire and on-disk compatible with HBase 0.98.x.

In this blog, we look at the past, present and future of Apache HBase project. 

The 1.0.0 release has three goals:

1) to lay a stable foundation for future 1.x releases;

2) to stabilize running HBase cluster and its clients; and

3) make versioning and compatibility dimensions explicit 

Seven (7) years is a long time so kudos to everyone who contributed to getting Apache HBase to this point!

For those of you who like documentation, see the Apache HBase™ Reference Guide.

Define and Process Data Pipelines in Hadoop with Apache Falcon

Wednesday, February 11th, 2015

Define and Process Data Pipelines in Hadoop with Apache Falcon

From the webpage:

Apache Falcon simplifies the configuration of data motion with: replication; lifecycle management; lineage and traceability. This provides data governance consistency across Hadoop components.


In this tutorial we will walk through a scenario where email data lands hourly on a cluster. In our example:

  • This cluster is the primary cluster located in the Oregon data center.
  • Data arrives from all the West Coast production servers. The input data feeds are often late for up to 4 hrs.

The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis.

To simulate this scenario, we have a pig script grabbing the freely available Enron emails from the internet and feeding it into the pipeline.

Not only a great tutorial on Falcon, this tutorial is a great example of writing a tuturial!

Hortonworks Establishes Data Governance Initiative

Monday, February 2nd, 2015

Hortonworks Establishes Data Governance Initiative

From the post:

Hortonworks® (NASDAQ:HDP), the leading contributor to and provider of enterprise Apache™ Hadoop®, today announced the creation of the Data Governance Initiative (DGI). DGI will develop an extensible foundation that addresses enterprise requirements for comprehensive data governance. In addition to Hortonworks, the founding members of DGI are Aetna, Merck, and Target and Hortonworks’ technology partner SAS.

Enterprises adopting a modern data architecture must address certain realities when legacy and new data from disparate platforms are brought under management. DGI members will work with the open source community to deliver a comprehensive solution; offering fast, flexible and powerful metadata services, deep audit store and an advanced policy rules engine. It will also feature deep integration with Apache Falcon for data lifecycle management and Apache Ranger for global security policies. Additionally, the DGI solution will interoperate with and extend existing third-party data governance and management tools by shedding light on the access of data within Hadoop. Further DGI investment roadmap phases will be released in the coming weeks.

Supporting quotes

“This joint engineering initiative is another pillar in our unique open source development model,” said Tim Hall, vice president, product management at Hortonworks. “We are excited to partner with the other DGI members to build a completely open data governance foundation that meets enterprise requirements.”

“As customers are moving Hadoop into corporate data and processing environments, metadata and data governance are much needed capabilities. SAS participation in this initiative strengthens the integration of SAS data management, analytics and visualization into the HDP environment and more broadly it helps advance the Apache Hadoop project. This additional integration will give customers better ability to manage big data governance within the Hadoop framework,” said SAS Vice President of Product Management Randy Guard.

Further reading

Enterprise Hadoop:

Apache Falcon:

Hadoop and a Modern Data Architecture:

For more information:

Mike Haro

Quite possibly an opportunity to push for topic map like capabilities in an enterprise setting.

That will require affirmative action on the part of members of the TM community as it is unlikely Hortonworks and others will educate themselves on topic maps.


MapR Offers Free Hadoop Training and Certifications

Thursday, January 29th, 2015

MapR Offers Free Hadoop Training and Certifications by Thor Olavsrud.

From the post:

In an effort to make Hadoop training for developers, analysts and administrators more accessible, Hadoop distribution specialist MapR Technologies Tuesday unveiled a free on-demand training program. Another track for HBase developers will be added later this quarter.

“This represents a $50 million, in-kind contribution to the Hadoop community,” says Jack Norris, CMO of MapR. “The focus is overcoming what many people consider the major obstacle to the adoption of big data, particularly Hadoop.”

The developer track is about building big data applications in Hadoop. The topics range from the basics of Hadoop and related technologies to advanced topics like designing and developing MapReduce and HBase applications with hands-on labs. The courses include:

  • Hadoop Essentials. This course, which is immediately available, provides an introduction to Hadoop, the ecosystem, common solutions and use cases.
  • Developing Hadoop Applications. This course is also immediately available and focuses on designing and writing effective Hadoop applications with MapReduce and YARN.
  • HBase Schema Design and Modeling. This course will become available in February and will focus on architecture, schema design and data modeling on HBase.
  • Developing HBase Applications. This course will also debut in February and focuses on real-world application design in HBase (Time Series and Social Application examples).
  • Hadoop Data Analysis – Drill. Slated for debut in March, this course covers interactive SQL on Hadoop for structured, semi-structured and nested data.

I remember how expensive the Novell training classes were back in the Netware 4.11 days. (Yes, that has been a while.)

I wonder whose software will come to mind after completing the MapR training courses and passing the certification exams?

That’s what I think too. Send kudos to MapR for this effort!

Looking forward to seeing some of you at Hadoop certification exams later this year!

I first saw this in a tweet by Kirk Borne.

Data Science and Hadoop: Predicting Airline Delays – Part 3

Tuesday, January 27th, 2015

Data Science and Hadoop: Predicting Airline Delays – Part 3 by Ofer Mendelevitch and Beau Plath.

From the post:

In our series on Data Science and Hadoop, predicting airline delays, we demonstrated how to build predictive models with Apache Hadoop, using existing tools. In part 1, we employed Pig and Python; part 2 explored Spark, ML-Lib and Scala.

Throughout the series, the thesis, theme, topic, and algorithms were similar. That is, we wanted to dismiss the misconception that data scientists – when applying predictive learning algorithms, like Linear Regression, Random Forest or Neural Networks to large datasets – require dramatic changes to the tooling; that they need dedicated clusters; and that existing tools will not suffice.

Instead, we used the same HDP cluster configuration, the same machine learning techniques, the same data sets, and the same familiar tools like PIG, Python and Scikit-learn and Spark.

For the final part, we resort to Scalding and R. R is a very popular, robust and mature environment for data exploration, statistical analysis, plotting and machine learning. We will use R for data exploration, graphics as well as for building our predictive models with Random Forest and Gradient Boosted Trees. Scalding, on the other hand, provides Scala libraries that abstract Hadoop MapReduce and implement data pipelines. We demonstrate how to pre-process the data into a feature matrix using the Scalding framework.

For brevity I shall spare summarizing the methodology here, since both previous posts (and their accompanying IPython Notebooks) expound the steps, iteration and implementation code. Instead, I would urge that you read all parts as well as try the accompanying IPython Notebooks.

Finally, for this last installment in the series in Scaling and R, read its IPython Notebook for implementation details.

Given the brevity of this post, you are definitely going to need Part 1 and Part 2.

The data science world could use more demonstrations like this series.

Command-line tools can be 235x faster than your Hadoop cluster

Wednesday, January 21st, 2015

Command-line tools can be 235x faster than your Hadoop cluster by Adam Drake.

From the post:

As I was browsing the web and catching up on some sites I visit periodically, I found a cool article from Tom Hayden about using Amazon Elastic Map Reduce (EMR) and mrjob in order to compute some statistics on win/loss ratios for chess games he downloaded from the millionbase archive, and generally have fun with EMR. Since the data volume was only about 1.75GB containing around 2 million chess games, I was skeptical of using Hadoop for the task, but I can understand his goal of learning and having fun with mrjob and EMR. Since the problem is basically just to look at the result lines of each file and aggregate the different results, it seems ideally suited to stream processing with shell commands. I tried this out, and for the same amount of data I was able to use my laptop to get the results in about 12 seconds (processing speed of about 270MB/sec), while the Hadoop processing took about 26 minutes (processing speed of about 1.14MB/sec). (emphasis added)

BTW, Adam was using twice as much data as Tom in his analysis.

The lesson here is to not be a one-trick pony as a data scientist. Most solutions, Hadoop, Spark, Titan, can solve most problems. However, anyone who merits the moniker “data scientist” should be able to choose the “best” solution for a given set of circumstances. In some cases that maybe simple shell scripts.

I first saw this in a tweet by Atabey Kaygun.

New in CDH 5.3: Transparent Encryption in HDFS

Wednesday, January 7th, 2015

New in CDH 5.3: Transparent Encryption in HDFS by Charles Lamb, Yi Liu & Andrew Wang

From the post:

Apache Hadoop 2.6 adds support for transparent encryption to HDFS. Once configured, data read from and written to specified HDFS directories will be transparently encrypted and decrypted, without requiring any changes to user application code. This encryption is also end-to-end, meaning that data can only be encrypted and decrypted by the client. HDFS itself never handles unencrypted data or data encryption keys. All these characteristics improve security, and HDFS encryption can be an important part of an organization-wide data protection story.

Cloudera’s HDFS and Cloudera Navigator Key Trustee (formerly Gazzang zTrustee) engineering teams did this work under HDFS-6134 in collaboration with engineers at Intel as an extension of earlier Project Rhino work. In this post, we’ll explain how it works, and how to use it.

Excellent news! Especially for data centers who are responsible for the data of others.

The authors do mention the problem of rogue users, that is on the client side:

Finally, since each file is encrypted with a unique DEK and each EZ can have a different key, the potential damage from a single rogue user is limited. A rogue user can only access EDEKs and ciphertext of files for which they have HDFS permissions, and can only decrypt EDEKs for which they have KMS permissions. Their ability to access plaintext is limited to the intersection of the two. In a secure setup, both sets of permissions will be heavily restricted.

Just so you know, it won’t be a security problem with Hadoop 2.6 if Sony is hacked while running on a Hadoop 2.6 at a data center. Anyone who copies the master access codes from sticky notes will be able to do a lot of damage. North Korea, will be the whipping boy for major future cyberhacks. That’s policy, not facts talking.

For users who do understand what secure environments should look like, this a great advance.

Cascalog 2.0 In Depth

Sunday, January 4th, 2015

Cascalog 2.0 In Depth by Sam Ritchie.

From the post:

Cascalog 2.0 has been out for over a year now, and outside of a post to the mailing list and a talk at Clojure/Conj 2013 (slides here), I’ve never written up the
startingly long list of new features brought by that release. So shameful.

This post fixes that. 2.0 was a big deal. Anonymous functions make it easy to reuse your existing, non Cascalog code. The interop story with vanilla Clojure is much better, which is huge for testing. Finally, users can access the JobConf, Cascading’s counters and other Cascading guts during operations.

Here’s a list of the features I’ll cover in this post:

  • new def*ops,
  • Anonymous function support
  • Higher order functions
  • Lifting Clojure functions into Cascalog
  • expand-query
  • Using functions as implicit filters in queries
  • prepared functions, and access to Cascading’s guts

As if that weren’t enough, 2.0 adds a standalone Cascading DSL with an API similar to Scalding’s. You can move between this Cascading API and Cascalog. This makes it easy to use Cascading’s new features, like optimized joins, that haven’t bubbled up to the Cascalog DSL.

I’ll go over the Cascading DSL and the support for non-Cascading execution environments in a later post. For now, let’s get into it.

If you want to follow along, go ahead and clone the Cascalog repo, cd into the “cascalog-core” subdirectory and run “lein repl”. To try this code out in other projects, run “lein sub install” in the root directory. This will install [cascalog/cascalog-core "3.0.0-SNAPSHOT"] locally, so you can add it to your project.clj and give the code a whirl.

Belated but welcome review of the features of Cascalog 2.0!

I particularly liked the suggested “follow along” approach of the post.


Cloudera Live (Update)

Thursday, December 25th, 2014

Cloudera Live (Update)

I thought I had updated: Cloudera Live (beta) but apparently not!

Let me correct that today:

Cloudera Live is the fastest and easiest way to get started with Apache Hadoop and it now includes self­-guided, interactive demos and tutorials. With a one-­button deployment option, you can spin up a four-­node cluster of CDH, Cloudera’s open source Hadoop platform, within minutes. This free, cloud­-based Hadoop environment lets you:

  • Learn the basics of Hadoop (and CDH) through pre-­loaded, hands-­on tutorials
  • Plan your Hadoop project using your own datasets
  • Explore the latest features in CDH
  • Extend the capabilities of Hadoop and CDH through familiar partner tools, including Tableau and Zoomdata

Caution: The free trial is for fourteen (14) days only. To prevent billing to your account, you must delete the four machine cluster that you create.

I understand the need for a time limit but fourteen (14) days seems rather short to me, considering the number of options in the Hadoop ecosystem.

There is a read-only CDH option which is limited to three hour sessions.


Apache Ranger Audit Framework

Wednesday, December 24th, 2014

Apache Ranger Audit Framework by Madhan Neethiraj.

From the post:

Apache Ranger provides centralized security for the Enterprise Hadoop ecosystem, including fine-grained access control and centralized audit mechanism, all essential for Enterprise Hadoop. This blog covers various details of Apache Ranger’s audit framework options available with Apache Ranger Release 0.4.0 in HDP 2.2 and how they can be configured.

From the Ranger homepage:

Apache Ranger offers a centralized security framework to manage fine-grained access control over Hadoop data access components like Apache Hive and Apache HBase. Using the Apache Ranger console, security administrators can easily manage policies for access to files, folders, databases, tables, or column. These policies can be set for individual users or groups and then enforced within Hadoop.

Security administrators can also use Apache Ranger to manage audit tracking and policy analytics for deeper control of the environment. The solution also provides an option to delegate administration of certain data to other group owners, with the aim of securely decentralizing data ownership.

Apache Ranger currently supports authorization, auditing and security administration of following HDP components:

And you are going to document the semantics of the settings, events and other log information….where?

Oh, aha, you know what those settings, events and other log information mean and…, not planning on getting hit by a bus are we? Or planning to stay in your present position forever?

No joke. I know someone training their replacements in ten year old markup technologies. Systems built on top of other systems. And they kept records. Lots of records.

Test your logs on a visiting Hadoop systems administrator. If they don’t get 100% correct on your logging, using whatever documentation you have, you had better start writing.

I hadn’t thought about the visiting Hadoop systems administrator idea before but that would be a great way to test the documentation for Hadoop ecosystems. Better to test it that way instead of after a natural or unnatural disaster.

Call it the Hadoop Ecosystem Documentation Audit. Give a tester tasks to perform, which must be accomplished with existing documentation. No verbal assistance. I suspect a standard set of tasks could be useful in defining such a process.

Spark 1.2.0 released

Monday, December 22nd, 2014

Spark 1.2.0 released

From the post:

We are happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is the third release on the API-compatible 1.X line. It is Spark’s largest release ever, with contributions from 172 developers and more than 1,000 commits!

This release brings operational and performance improvements in Spark core including a new network transport subsytem designed for very large shuffles. Spark SQL introduces an API for external data sources along with Hive 13 support, dynamic partitioning, and the fixed-precision decimal type. MLlib adds a new pipeline-oriented package ( for composing multiple algorithms. Spark Streaming adds a Python API and a write ahead log for fault tolerance. Finally, GraphX has graduated from alpha and introduces a stable API.

Visit the release notes to read about the new features, or download the release today.

It looks like Christmas came a bit early this year. 😉

Lots of goodies to try out!

The Top 10 Posts of 2014 from the Cloudera Engineering Blog

Thursday, December 18th, 2014

The Top 10 Posts of 2014 from the Cloudera Engineering Blog by Justin Kestelyn.

From the post:

Our “Top 10″ list of blog posts published during a calendar year is a crowd favorite (see the 2013 version here), in particular because it serves as informal, crowdsourced research about popular interests. Page views don’t lie (although skew for publishing date—clearly, posts that publish earlier in the year have pole position—has to be taken into account).

In 2014, a strong interest in various new components that bring real time or near-real time capabilities to the Apache Hadoop ecosystem is apparent. And we’re particularly proud that the most popular post was authored by a non-employee.

See Justin’s post for the top ten (10) list!

The Cloudera blog always has high quality content so this the cream of the crop!



Saturday, December 13th, 2014

Hadoop: What it is and how people use it: my own summary by Bob DuCharme.

From the post:

The web offers plenty of introductions to what Hadoop is about. After reading up on it and trying it out a bit, I wanted to see if I could sum up what I see as the main points as concisely as possible. Corrections welcome.

Hadoop is an open source Apache project consisting of several modules. The key ones are the Hadoop Distributed File System (whose acronym is trademarked, apparently) and MapReduce. The HDFS lets you distribute storage across multiple systems and MapReduce lets you distribute processing across multiple systems by performing your “Map” logic on the distributed nodes and then the “Reduce” logic to gather up the results of the map processes on the master node that’s driving it all.

This ability to spread out storage and processing makes it easier to do large-scale processing without requiring large-scale hardware. You can spread the processing across whatever boxes you have lying around or across virtual machines on a cloud platform that you spin up for only as long as you need them. This ability to inexpensively scale up has made Hadoop one of the most popular technologies associated with the buzzphrase “Big Data.”

If you aren’t already familiar with Hadoop or if you are up to your elbows in Hadoop and need a literate summary to forward to others, I think this post does the trick.

Bob covers the major components of the Hadoop ecosystem without getting lost in the weeds.

Recommended reading.

The Impala Cookbook

Thursday, December 11th, 2014

The Impala Cookbook by Justin Kestelyn.

From the post:

Impala, the open source MPP analytic database for Apache Hadoop, is now firmly entrenched in the Big Data mainstream. How do we know this? For one, Impala is now the standard against which alternatives measure themselves, based on a proliferation of new benchmark testing. Furthermore, Impala has been adopted by multiple vendors as their solution for letting customers do exploratory analysis on Big Data, natively and in place (without the need for redundant architecture or ETL). Also significant, we’re seeing the emergence of best practices and patterns out of customer experiences.

As an effort to streamline deployments and shorten the path to success, Cloudera’s Impala team has compiled a “cookbook” based on those experiences, covering:

  • Physical and Schema Design
  • Memory Usage
  • Cluster Sizing and Hardware Recommendations
  • Benchmarking
  • Multi-tenancy Best Practices
  • Query Tuning Basics
  • Interaction with Apache Hive, Apache Sentry, and Apache Parquet

By using these recommendations, Impala users will be assured of proper configuration, sizing, management, and measurement practices to provide an optimal experience. Happy cooking!

I must confess to some confusion when I first read Justin’s post. I thought the slide set was a rather long description of the cookbook and not the cookbook itself. I was searching for the cookbook and kept finding the slides. 😉

Oh, the slides are very much worth your time but I would reserve the term “cookbook” for something a bit more substantive.

Although O’Reilly thinks a few more than 800 responses constitutes a “survey” of data scientists. Survey results that are free from any mention of Impala. Another reason to use that “survey” with caution.

Data Science with Hadoop: Predicting Airline Delays – Part 2

Tuesday, December 9th, 2014

Using machine learning algorithms, Spark and Scala – Part 2 by Ofer Mendelevitch and Beau Plath.

From the post:

In this 2nd part of the blog post and its accompanying IPython Notebook in our series on Data Science and Apache Hadoop, we continue to demonstrate how to build a predictive model with Apache Hadoop, using existing modeling tools. And this time we’ll use Apache Spark and ML-Lib.

Apache Spark is a relatively new entrant to the Hadoop ecosystem. Now running natively on Apache Hadoop YARN, the architectural center of Hadoop, Apache Spark is an in-memory data processing API and execution engine that is effective for machine learning and data science use cases. And with Spark on YARN, data workers can simultaneously use Spark for data science workloads alongside other data access engines–all accessing the same shared dataset on the same cluster.


The next installment in this series continues the analysis with the same dataset but then with R!

The bar for user introductions to technology is getting higher even as we speak!