Archive for the ‘Hadoop’ Category

Hadoop, Hadoop, Hurrah! HDP for Windows is Now GA!

Tuesday, May 21st, 2013

Hadoop, Hadoop, Hurrah! HDP for Windows is Now GA! by John Kreisa.

From the post:

Today we are very excited to announce that Hortonworks Data Platform for Windows (HDP for Windows) is now generally available and ready to support the most demanding production workloads.

We have been blown away with the number and size of organizations who have downloaded the beta bits of this 100% open source, and native to Windows distribution of Hadoop and engaged Hortonworks and Microsoft around evolving their data architecture to respond to the challenges of enterprise big data.

With this key milestone HDP for Windows offers the millions of customers running their business on Microsoft technologies an ecosystem-friendly Hadoop-based solution that is built for the enterprise and purpose built for Windows. This release cements Apache Hadoop’s role as a key component of the next generation enterprise data architecture, across the broadest set of datacenter configurations as HDP becomes the first production-ready Apache Hadoop distribution to run on both Windows and Linux.

Additionally, customers now also have complete portability of their Hadoop applications between on-premise and cloud deployments via HDP for Windows and Microsoft’s HDInsight Service.

Two lessons here:

First, Hadoop is a very popular way to address enterprise big data.

Second, going where users are, not where they ought to be, is a smart business move.

Apache Hive 0.11: Stinger Phase 1 Delivered

Saturday, May 18th, 2013

Apache Hive 0.11: Stinger Phase 1 Delivered by Owen O’Malley.

From the post:

In February, we announced the Stinger Initiative, which outlined an approach to bring interactive SQL-query into Hadoop. Simply put, our choice was to double down on Hive to extend it so that it could address human-time use cases (i.e. queries in the 5-30 second range). So, with input and participation from the broader community we established a fairly audacious goal of 100X performance improvement and SQL compatibility.

Introducing Apache Hive 0.11 – 386 JIRA tickets closed

As representatives of this open, community led effort we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, of which there were 28 new features and 276 bug fixes. There were FIFTY-FIVE developers involved in this and I would like to thank every one of them. See below for a full list.

Delivering on the promise of Stinger Phase 1

As promised we have delivered phase 1 of the Stinger Initiative in late spring. This release is another proof point that the open community can innovate at a rate unequaled by any proprietary vendor. As part of phase 1 we promised windowing, new data types, the optimized RC (ORC) file and base optimizations to the Hive Query engine and the community has delivered these key features.

Welcome news for the Hive and SQL communities alike!

Hadoop SDK and Tutorials for Microsoft .NET Developers

Friday, May 17th, 2013

Hadoop SDK and Tutorials for Microsoft .NET Developers by Marc Holmes.

From the post:

Microsoft has begun to treat its developer community to a number of Hadoop-y releases related to its HDInsight (Hadoop in the cloud) service, and it’s worth rounding up the material. It’s all Alpha and Preview so YMMV but looks like fun:

  • Microsoft .NET SDK for Hadoop. This kit provides .NET API access to aspects of HDInsight including HDFS, HCatalog, Oozie and Ambari, and also some PowerShell scripts for cluster management. There are also libraries for MapReduce and LINQ to Hive. The latter is really interesting as it builds on the established technology for .NET developers to access most data sources to deliver the capabilities of the de facto standard for Hadoop data query.
  • HDInsight Labs Preview. Up on Github, there is a series of 5 labs covering C#, JavaScript and F# coding for MapReduce jobs, using Hive, and then bringing that data into Excel. It also covers some Mahout use to build a recommendation engine.
  • Microsoft Hive ODBC Driver. The examples above use this preview driver to enable the connection from Hive to Excel.

If all of the above excites you, our Hadoop on Windows for Developers training course also covers similar content in a lot of depth.

Hadoop is coming to an office/data center near you.

Will you be ready?

Hadoop Toolbox: When to Use What

Friday, May 17th, 2013

Hadoop Toolbox: When to Use What by Mohammad Tariq.

From the post:

Eight years ago not even Doug Cutting would have thought that the tool which he’s naming after his kid’s soft toy would so soon become a rage and change the way people and organizations look at their data. Today Hadoop and Big Data have almost become synonyms to each other. But Hadoop is not just Hadoop now. Over time it has evolved into one big herd of various tools, each meant to serve a different purpose. But glued together they give you a powerpacked combo.

Having said that, one must be careful while choosing these tools for their specific use case as one size doesn’t fit all. What is working for someone might not be that productive for you. So, here I will show you which tool should be picked in which scenario. It’s not a big comparative study but a short intro to some very useful tools. And, this is based totally on my experience so there is always some scope of suggestions. Please feel free to comment or suggest if you have any. I would love to hear from you. Let’s get started :

Not shallow enough to be useful for the c-suite types, not deep enough for decision making.

Nice to use in a survey context, where users need an overview of the Hadoop ecosystem.

How-to: Configure Eclipse for Hadoop Contributions

Thursday, May 16th, 2013

How-to: Configure Eclipse for Hadoop Contributions by Karthik Kambatla.

From the post:

Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.

This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)

A post to ease your way towards contributing to the Hadoop project!

Or if you simply want to know the code you are running cold.

Or something in between!

Graph processing platform Apache Giraph reaches 1.0

Friday, May 10th, 2013

Graph processing platform Apache Giraph reaches 1.0

From the post:

Used by Facebook and Yahoo, the Apache Giraph project for distributed graph processing has released version 1.0. This is the first new version since the project left incubation and became a top-level project in May 2012, though for some reason it has yet to make it to the Apache index of top level projects.

Giraph allows social graphs and other richly interconnected data structures with many billions of edges to be analysed using hundreds of machines. It is inspired by the Bulk Synchronous Parallel abstract computer model and the Google Pregel system for large scale graph-processing. The developers of Giraph say that unlike those systems, Giraph is an open source, scalable platform built atop the Apache Hadoop infrastructure which has no single point of failure by design. The documentation includes an introduction to Giraph’s iterative graph processing and how to implement graph processing functions in Java. The Giraph project has seen contributions from Yahoo!, Twitter, Facebook and LinkedIn and from academic institutions around the world.

It’s a little early to be downloading software for the weekend but why not? ;-)

Enjoy!
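
For readers who have not met the Bulk Synchronous Parallel model mentioned in the excerpt, here is a minimal single-process sketch of the vertex-centric idea. It is not Giraph's Java API. The function name `propagate_max` and the toy graph are my own illustrations: every vertex repeatedly adopts the largest value it hears about from a neighbor, and the loop ends when no messages are in flight.

```python
def propagate_max(values, edges):
    """Pregel-style loop: each vertex ends up holding the largest value
    that can reach it along the directed edges."""
    values = dict(values)
    out = {v: [] for v in values}
    for src, dst in edges:
        out[src].append(dst)

    # Superstep 0: every vertex sends its own value to its out-neighbors.
    inbox = {v: [] for v in values}
    for v, value in values.items():
        for neighbor in out[v]:
            inbox[neighbor].append(value)

    # Later supersteps: a vertex only acts (and only sends) when a message
    # improves on what it already holds; otherwise it stays halted.
    while any(inbox.values()):
        next_inbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for neighbor in out[v]:
                    next_inbox[neighbor].append(values[v])
        # Swapping inboxes plays the role of the barrier between supersteps.
        inbox = next_inbox
    return values


# Toy ring a-b-c-d-a, with each link listed in both directions.
ring = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"),
        ("c", "d"), ("d", "c"), ("d", "a"), ("a", "d")]
result = propagate_max({"a": 3, "b": 6, "c": 2, "d": 1}, ring)
```

On a real cluster the vertices are spread over many workers and the barrier is a network-wide synchronization, which is where a platform like Giraph earns its keep.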

Spatially Visualize and Analyze Vast Data Stores…

Wednesday, May 8th, 2013

Spatially Visualize and Analyze Vast Data Stores with Esri’s GIS Tools for Hadoop

From the post:

Perhaps the greatest untapped IT resource available today is the ability to spatially analyze and visualize Big Data. As part of its continuing effort to expand the use of geographic information system (GIS) technology among web, mobile, and other developers, Esri has launched GIS Tools for Hadoop. The toolkit removes the obstacles of building map applications for developers to truly capitalize on geoenabling Big Data within Hadoop—the popular open source data management framework. Developers now will be able to answer the where questions in their large data stores.

“Hadoop’s method of processing volumes of information directly addresses the most significant challenge facing IT today,” says Marwa Mabrouk, product manager at Esri. “Enabling Hadoop with spatial capabilities is part of Esri’s continued effort to derive more value from Big Data through spatial analysis.”

Processing and displaying Big Data on maps requires functionality that core Hadoop lacks. GIS Tools for Hadoop extends the Hadoop platform with a series of libraries and utilities that connect Esri ArcGIS to the Hadoop environment. It allows ArcGIS users to export map data in HDFS format—Hadoop’s native file system—and intersect it with billions of records stored in Hadoop. Results can be either directly saved to the Hadoop database or reimported back to ArcGIS for higher-level geoprocessing and visualization.

GIS Tools for Hadoop includes the following:

  • Sample tools and templates that demonstrate the power of GIS
  • Spatial querying inside Hadoop using Hive—Hadoop’s ad hoc querying module
  • Geometry Library to build spatial applications in Hadoop

“GIS Tools for Hadoop not only introduces spatial analysis to Hadoop but creates a looping workflow that pulls Big Data into the ArcGIS environment,” says Mansour Raad, senior software architect at Esri. “It provides tools for Hadoop users who need to visualize Big Data on maps.”

Esri recognizes Big Data as a challenge that community-level involvement can help solve. As such, Esri provides GIS Tools for Hadoop as an open source product available on GitHub. Esri encourages users to download the toolkit, report issues, and actively contribute to improving the tools through the GitHub system.

To download GIS Tools for Hadoop, visit http://esri.github.com/gis-tools-for-hadoop.

Once you have where, your topic map can merge in who, what, why and how.

Natural Language Processing and Big Data…

Wednesday, May 8th, 2013

Natural Language Processing and Big Data: Using NLTK and Hadoop – Talk Overview by Benjamin Bengfort.

From the post:

My previous startup, Unbound Concepts, created a machine learning algorithm that determined the textual complexity (e.g. reading level) of children’s literature. Our approach started as a natural language processing problem — designed to pull out language features to train our algorithms, and then quickly became a big data problem when we realized how much literature we had to go through in order to come up with meaningful representations. We chose to combine NLTK and Hadoop to create our Big Data NLP architecture, and we learned some useful lessons along the way. This series of posts is based on a talk done at the April Data Science DC meetup.

Think of this post as the Cliff Notes of the talk and the upcoming series of posts so you don’t have to read every word … but trust me, it’s worth it.

If you can’t wait for the future posts, Benjamin’s presentation from April is here. Amusing but fairly sparse slides.

Looking forward to more posts in this series!


Big Data and Natural Language Processing – Part 1

The “Foo” of Big Data – Part 2

Python’s Natural Language Toolkit (NLTK) and Hadoop – Part 3

Hadoop for Preprocessing Language – Part 4

Beyond Preprocessing – Weakly Inferred Meanings – Part 5

Cloudera Development Kit (CDK)…

Tuesday, May 7th, 2013

Cloudera Development Kit (CDK): Hadoop Application Development Made Easier by Eric Sammer & Tom White.

From the post:

At Cloudera, we have the privilege of helping thousands of developers learn Apache Hadoop, as well as build and deploy systems and applications on top of Hadoop. While we (and many of you) believe that platform is fast becoming a staple system in the data center, we’re also acutely aware of its complexities. In fact, this is the entire motivation behind Cloudera Manager: to make the Hadoop platform easy for operations staff to deploy and manage.

So, we’ve made Hadoop much easier to “consume” for admins and other operators — but what about for developers, whether working for ISVs, SIs, or users? Until now, they’ve largely been on their own.

That’s why we’re really excited to announce the Cloudera Development Kit (CDK), a new open source project designed to help developers get up and running to build applications on CDH, Cloudera’s open source distribution including Hadoop, faster and easier than before. The CDK is a collection of libraries, tools, examples, and documentation engineered to simplify the most common tasks when working with the platform. Just like CDH, the CDK is 100% free, open source, and licensed under the same permissive Apache License v2, so you can use the code any way you choose in your existing commercial code base or open source project.

The CDK lives on GitHub where users can freely browse, download, fork, and contribute back to the source. Community contributions are not only welcome but strongly encouraged. Since most Java developers use tools such as Maven (or tools that are compatible with Maven repositories), artifacts are also available from the Cloudera Maven Repository for easy project integration.

What’s In There Today

Our goal is to release a number of CDK modules over time. The first module that can be found in the current release is the CDK Data module; a set of APIs to drastically simplify working with datasets in Hadoop filesystems such as HDFS and the local filesystem. The Data module handles automatic serialization and deserialization of Java POJOs as well as Avro Records, automatic compression, file and directory layout and management, automatic partitioning based on configurable functions, and a metadata provider plugin interface to integrate with centralized metadata management systems (including HCatalog). All Data APIs are fully documented with javadoc. A reference guide is available to walk you through the important parts of the module, as well. Additionally, a set of examples is provided to help you see the APIs in action immediately.

Here’s to hoping that vendor support as shown for Hadoop, Lucene/Solr, R, (who am I missing?), continues and spreads to other areas of software development.

Hadoop Webinars (WANdisco)

Tuesday, May 7th, 2013

Hadoop Webinars (WANdisco)

From May to July, webinars on Hadoop:

A Hadoop Overview

Wednesday, May 15
10:00 a.m. PT
1:00 p.m. ET

In this webinar, we'll provide an overview of Hadoop’s history and architecture.

This session will highlight: 

  • Major components such as HDFS, MapReduce, and HBase – the NoSQL database management system used with Hadoop for real-time applications
  • A summary of Hadoop’s ecosystem
  • A review of public and private cloud deployment options
  • Common business use cases
  • And more…

Hadoop: A Deep Dive

Wednesday, June 5
10:00 a.m. PT
1:00 p.m. ET

This session will present: 

  • Various Hadoop misconceptions (not all clusters are comprised of thousands of machines)
  • Information about real world Hadoop deployments
  • A detailed review of Hadoop’s ecosystem (Sqoop, Flume, Nutch, Oozie, etc.)
  • An in-depth look at HDFS
  • An explanation of MapReduce in relation to latency and dependence on other Hadoop activities
  • An introduction to concepts attendees will need as a prerequisite for subsequent training webinars covering MapReduce, HBase and other major components at a deeper technical level

Hadoop: A MapReduce Tutorial

Wednesday, June 19
10:00 a.m. PT
1:00 p.m. ET

This session will cover: 

  • MapReduce at a deep technical level
  • The history of MapReduce
  • How a MapReduce job works, its logical flow, and the rules and types of MapReduce jobs
  • Writing, debugging and testing MapReduce jobs
  • Various available workflow tools
  • And more…

Hadoop: HBase In-Depth

Wednesday, July 10
10:00 a.m. PT
1:00 p.m. ET

This session is a deep technical review covering:

  • Flexibility
  • Scalability
  • Components (cells, rows, columns, qualifiers)
  • Schema samples
  • Hardware requirements
  • And more…

Hard to say how “deep” the webinars will be able to get in only one (1) hour.

I have registered for all four (4) and will be reporting back on my experience.

Ambari for provisioning, managing and monitoring Hadoop

Saturday, May 4th, 2013

Ambari for provisioning, managing and monitoring Hadoop

From the post:

Ambari is 100% open source and included in HDP, greatly simplifying installation and initial configuration of Hadoop clusters. In this article we’ll be running through some installation steps to get started with Ambari. Most of the steps here are covered in the main HDP documentation here.

The first order of business is getting Ambari Server itself installed. There are different approaches to this, but for the purposes of this short tour, we’ll assume Ambari is already installed on its own dedicated node somewhere or on one of the nodes on the (future) cluster itself. Instructions can be found under the installation steps linked above. Once Ambari Server is running, the hard work is actually done. Ambari simplifies cluster install and initial configuration with a wizard interface, taking care of it with but a few clicks and decisions from the end user. Hit http://:8080 and log in with admin/admin. Upon logging in, we are greeted with a user-friendly, wizard interface. Welcome to Apache Ambari! Name that cluster and let’s get going.

Even if you are working on the bleeding edge of big data, somebody has got to mind the cluster.

This will help you discuss the process of building a cluster with confidence. (Even if you are chary of taking on the task alone.)

Impala 1.0

Wednesday, May 1st, 2013

Impala 1.0: Industry’s First Production-Ready SQL-on-Hadoop Solution

From the post:

Cloudera, the category leader that sets the standard for Apache Hadoop in the enterprise, today announced the general availability of Cloudera Impala, its open source, interactive SQL query engine for analyzing data stored in Hadoop clusters in real time. Cloudera was first-to-market with its SQL-on-Hadoop offering, releasing Impala to open source as a public beta offering in October 2012. Since that time, it has worked closely with customers and open source users, rigorously testing and refining the platform in real world applications to deliver today’s production-hardened and customer validated release, designed from the ground-up for enterprise workloads. The company noted that adoption of the platform has been strong: over 40 enterprise customers and open source users are using Impala today, including 37signals, Expedia, Six3 Systems, Stripe, and Trion Worlds. With its 1.0 release, Impala extends Cloudera’s unified Platform for Big Data, which is designed specifically to bring different computation frameworks and applications to a single pool of data, using a common set of system resources.

The bigger data pools get, the more opportunity there is for semantic confusion.

Or to put that more positively, the greater the market for tools to lessen or avoid semantic confusion.

;-)

Have you used Lua for MapReduce?

Wednesday, May 1st, 2013

Have you used Lua for MapReduce?

From the post:

Lua as a cross platform programming language has been popularly used in games and embedded systems. However, due to its excellent use for configuration, it has found wider acceptance in other use cases as well.

Lua was inspired by SOL (Simple Object Language) and DEL (Data-Entry Language) and created by Roberto Ierusalimschy, Waldemar Celes, and Luiz Henrique de Figueiredo at the Pontifical Catholic University of Rio de Janeiro, Brazil. Roughly translated to ‘Moon’ in Portuguese, it has found many big takers like Adobe, Nginx, Wikipedia.

Another scripting language to use with MapReduce and Hadoop.

Have you ever noticed the Tower of Babel seems to follow human activity around?

First, it was building a tower to heaven – confuse the workforce.

Then it was other community efforts.

And many, many thens later, it has arrived at MapReduce/Hadoop configuration languages.

Like a kaleidoscope, it just gets richer the more semantic diversity we add.

Do you wonder what the opposite of semantic diversity must look like?

Or if we are the cause, what would it mean to eliminate semantic diversity?

What’s New in Hue 2.3

Sunday, April 28th, 2013

What’s New in Hue 2.3

From the post:

We’re very happy to announce the 2.3 release of Hue, the open source Web UI that makes Apache Hadoop easier to use.

Hue 2.3 comes only two months after 2.2 but contains more than 100 improvements and fixes. In particular, two new apps were added (including an Apache Pig editor) and the query editors are now easier to use.

Here’s the new features list:

  • Pig Editor: new application for editing and running Apache Pig scripts with UDFs and parameters
  • Table Browser: new application for managing Apache Hive databases, viewing table schemas and sample of content
  • Apache Oozie Bundles are now supported
  • SQL highlighting and auto-completion for Hive/Impala apps
  • Multi-query and highlight/run a portion of a query
  • Job Designer was totally restyled and now supports all Oozie actions
  • Oracle databases (11.2 and later) are now supported

Time to upgrade!

Hadoop Summit North America (June 26-27, 2013)

Thursday, April 25th, 2013

Hadoop Summit North America

From the webpage:

Hortonworks and Yahoo! are pleased to host the 6th Annual Hadoop Summit, the leading conference for the Apache Hadoop community. This two-day event will feature many of the Apache Hadoop thought leaders who will showcase successful Hadoop use cases, share development and administration tips and tricks, and educate organizations about how best to leverage Apache Hadoop as a key component in their enterprise data architecture. It will also be an excellent networking event for developers, architects, administrators, data analysts, data scientists and vendors interested in advancing, extending or implementing Apache Hadoop.

Community Choice Selectees:

  • Application and Data Science Track: Watching Pigs Fly with the Netflix Hadoop Toolkit (Netflix)
  • Deployment and Operations Track: Continuous Integration for the Applications on top of Hadoop (Yahoo!)
  • Enterprise Data Architecture Track: Next Generation Analytics: A Reference Architecture (Mu Sigma)
  • Future of Apache Hadoop Track: Jubatus: Real-time and Highly-scalable Machine Learning Platform (Preferred Infrastructure, Inc.)
  • Hadoop (Disruptive) Economics Track: Move to Hadoop, Go Fast and Save Millions: Mainframe Legacy Modernization (Sears Holding Corp.)
  • Hadoop-driven Business / BI Track: Big Data, Easy BI (Yahoo!)
  • Reference Architecture Track: Genie – Hadoop Platformed as a Service at Netflix (Netflix)

If you need another reason to attend, it’s located in San Jose, California.

2nd best US location for a conference. #1 being New Orleans.

Beginner Tips For Elastic MapReduce

Thursday, April 25th, 2013

Beginner Tips For Elastic MapReduce by John Berryman.

From the post:

By this point everyone is well acquainted with the power of Hadoop’s MapReduce. But what you’re also probably well acquainted with is the pain that must be suffered when setting up your own Hadoop cluster. Sure, there are some really good tutorials online if you know where to look:

However, I’m not much of a dev ops guy so I decided I’d take a look at Amazon’s Elastic MapReduce (EMR) and for the most part I’ve been very pleased. However, I did run into a couple of difficulties, and hopefully this short article will help you avoid my pitfalls.

I often dream of setting up a cluster that requires a newspaper hat because of the oil from cooling the coils, wait!, that was a replica of the early cyclotron, sorry, wrong experiment. ;-)

I mean a cluster of computers humming and driving up my cooling bills.

But there are alternatives.

Amazon’s Elastic Map Reduce (EMR) is one.

You can learn Hadoop with Hortonworks Sandbox and when you need production power, EMR awaits.

From a cost effectiveness standpoint, that sounds like a good deal to me.

You?

PS: Someone told me today that Amazon isn’t a reliable cloud because they have downtime. It is true that Amazon does have downtime but that isn’t a deciding factor.

You have to consider the relationship between Amazon’s aggressive pricing and how much reliability you need.

If you are running flight control for a moon launch, you probably should not use a public cloud.

Or for a heart surgery theater. And a few other places like that.

If you mean the webservices for your < 4,000 member NGO, 100% guaranteed uptime is a recipe for someone making money, off of you.

MRQL – a SQL on Hadoop Miracle

Tuesday, April 23rd, 2013

MRQL – a SQL on Hadoop Miracle by Edward J. Yoon.

From the post:

Recently, the Apache Incubator accepted a new query engine for Hadoop and Hama, called MRQL (pronounced miracle), which was initially developed in 2011 by Leonidas Fegaras.

MRQL (MapReduce Query Language) is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop and Hama. MRQL has some overlapping functionality with Hive, Impala and Drill, but one major difference is that it can capture many complex data analysis algorithms that cannot be done easily in those systems in declarative form. So, complex data analysis tasks, such as PageRank, k-means clustering, and matrix multiplication and factorization, can be expressed as short SQL-like queries, while the MRQL system is able to evaluate these queries efficiently.

Another difference from these systems is that the MRQL system can run these queries in BSP (Bulk Synchronous Parallel) mode, in addition to the MapReduce mode. With BSP mode, it achieves lower latency and higher speed. According to the MRQL team, “In near future, MRQL will also be able to process very large data effectively fast without memory limitation and significant performance degradation in the BSP mode”.

Maybe I should turn my back on the newsfeed more often. ;-)

I suspect the announcement and my distraction were unrelated.

This looks very important.

I can feel another Apache list subscription in the offing.

Schema on Read? [The virtues of schema on write]

Friday, April 19th, 2013

Apache Hadoop and Data Agility by Ofer Mendelevitch.

From the post:

In a recent blog post I mentioned the 4 reasons for using Hadoop for data science. In this blog post I would like to dive deeper into the last of these reasons: data agility.

In most existing data architectures, based on relational database systems, the data schema is of central importance, and needs to be designed and maintained carefully over the lifetime of the project. Furthermore, whatever data fits into the schema will be stored, and everything else typically gets ignored and lost. Changing the schema is a significant undertaking, one that most IT organizations don’t take lightly. In fact, it is not uncommon for a schema change in an operational RDBMS system to take 6-12 months if not more.

Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.

If a schema is supplied “on read,” how is data validation accomplished?

I don’t mean in terms of datatypes such as string, integer, double, etc. Those are trivial forms of data validation.

How do we validate the semantics of data when a schema is supplied on read?

Mistakes do happen in RDBMS systems but with a schema, which defines data semantics, applications can attempt to police those semantics.

I don’t doubt that schema “on read” supplies a lot of useful flexibility, but how do we limit the damage that flexibility can cause?

For example, many years ago, area codes (for telephones) in the USA were tied to geographic exchanges. Data from the era still exists in the bowels of some data stores. That is no longer true in many cases.

Let’s assume I have older data that has area codes tied to geographic areas and newer data that has area codes that are not. Without a schema to define the area code data in both cases, how would I know to treat the area code data differently?

I concede that schema “on read” can be quite flexible.

On the other hand, let’s not discount the value of schema “on write” as well.
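
To make the area code question concrete, here is a small sketch of what a reader has to do when the schema arrives "on read." Everything in it is assumed for illustration: the record layout, the field names, and above all the cutoff date, which is a placeholder and not a historical fact.

```python
from datetime import date

# Hypothetical cutoff, chosen only for illustration; not a historical date.
PORTABILITY_CUTOFF = date(2000, 1, 1)

# Raw lines as they might sit in a data store: no schema was written with them.
RAW_RECORDS = [
    "1985-03-14,404,5550101",
    "2012-09-30,404,5550102",
]


def read_with_schema(line):
    """Apply a schema at read time, including a semantic flag that the
    stored bytes themselves do not carry."""
    recorded, area_code, number = line.split(",")
    recorded_on = date.fromisoformat(recorded)
    return {
        "recorded_on": recorded_on,
        "area_code": area_code,
        "number": number,
        # The reader, not the data, decides what the area code means.
        "area_code_is_geographic": recorded_on < PORTABILITY_CUTOFF,
    }


parsed = [read_with_schema(line) for line in RAW_RECORDS]
```

Both records carry the same area code and both parse cleanly as strings. The difference in meaning lives entirely in the reader's code, so a second reader with a different cutoff, or none at all, gets a different answer from the same data.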

Analyzing Data with Hue and Hive

Friday, April 19th, 2013

Analyzing Data with Hue and Hive by Romain Rigaux.

From the post:

In the first installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how file operations are simplified via the File Browser application. In this installment, we’ll focus on analyzing data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications (based on Hue 2.3 and later).

The Yelp Dataset Challenge provides a good use case. This post explains, through a video and tutorial, how you can get started doing some analysis and exploration of Yelp data with Hue. The goal is to find the coolest restaurants in Phoenix!

I think the demo would be more effective if a city known for good food, New Orleans, for example, had been chosen for the challenge.

But given the complexity of the cuisine, that would be a stress test for human experts.

What chance would Apache Hadoop have? ;-)

Hadoop: The Lay of the Land

Thursday, April 18th, 2013

Hadoop: The Lay of the Land by Tom White.

From the post:

The core map-reduce framework for big data consists of several interlocking technologies. This first installment of our tutorial explains what Hadoop does and how the pieces fit together.

Big Data is in the news these days, and Apache Hadoop is one of the most popular platforms for working with Big Data. Hadoop itself is undergoing tremendous growth as new features and components are added, and for this reason alone, it can be difficult to know how to start working with it. In this three-part series, I explain what Hadoop is and how to use it, presenting simple, hands-on examples that you can try yourself. First, though, let’s look at the problem that Hadoop was designed to solve.

Much later:

Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. He is an engineer at Cloudera, a company set up to offer Hadoop tools, support, and training. He is the author of the best-selling O’Reilly book, Hadoop: The Definitive Guide.

If you are getting started with Hadoop or need a good explanation for others, start here.

I first saw this at: Learn How To Hadoop from Tom White in Dr. Dobb’s by Justin Kestelyn.
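
If you want a feel for the map-reduce model before reading the series, here is a single-process sketch of a word count. It is not the Hadoop API and not taken from Tom White's tutorial. The function names are mine, and the "shuffle" is just an in-memory dictionary standing in for the work a cluster does across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: collapse the values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(pairs).items())


counts = word_count(["big data", "big cluster"])
```

The appeal of the model is that `map_phase` and `reduce_phase` never need to know how many machines are involved. The framework handles the grouping and the distribution.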

How Hadoop Works? HDFS case study

Thursday, April 18th, 2013

How Hadoop Works? HDFS case study by Dane Dennis.

From the post:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. The Hadoop library contains two major components, HDFS and MapReduce. In this post we will go inside each part of HDFS and discover how it works internally.

Knowing how to use Hadoop is one level of expertise.

Knowing how Hadoop works takes you to the next level.

One where you can better adapt Hadoop to your needs.

Understanding HDFS is a step in that direction.

Hadoop, The Perfect App for OpenStack

Tuesday, April 16th, 2013

Hadoop, The Perfect App for OpenStack by Shaun Connolly.

From the post:

The convergence of big data and cloud is a disruptive market force that we at Hortonworks not only want to encourage but also accelerate. Our partnerships with Microsoft and Rackspace have been perfect examples of bringing Hadoop to the cloud in a way that enables choice and delivers meaningful value to enterprise customers. In January, Hortonworks joined the OpenStack Foundation in support of our efforts with Rackspace (i.e. OpenStack-based Hadoop solution for the public and private cloud).

Today, we announced our plans to work with engineers from Red Hat and Mirantis within the OpenStack community on open source Project Savanna to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds.

Why is this news important?

Because big data and cloud computing are two of the top priorities in enterprise IT today, and it’s our intention to work diligently within the Hadoop and OpenStack open source communities to deliver solutions in support of these market needs. By bringing our Hadoop expertise to the OpenStack community in concert with Red Hat (the leading contributor to OpenStack), Mirantis (the leading system integrator for OpenStack), and Rackspace (a founding member of OpenStack), we feel we can speed the delivery of operational agility and efficient sharing of infrastructure that deploying elastic Hadoop on OpenStack can provide.

Why is this news important for topic maps?

Have you noticed that none, read none, of the big data or cloud efforts say anything about data semantics?

As if, when big data and the cloud arrive, all your data integration problems will magically melt away.

I don’t think so.

What I think is going to happen is discordant data sets are going to start rubbing and binding on each other. Perhaps not a lot at first but as data explorers get bolder, the squeaks are going to get louder.

So loud, in fact, that the squeaks (now tearing metal sounds) are going to attract the attention of… (drum roll)… the CEO.

What’s your answer for discordant data?

  • Ear plugs?
  • Job with another company?
  • Job in another country?
  • Job under an assumed name?

I would say none of the above.

Iterative Map Reduce – Prior Art

Tuesday, April 16th, 2013

Iterative Map Reduce – Prior Art

From the post:

There have been several attempts in the recent past at extending Hadoop to support efficient iterative data processing on clusters. To facilitate understanding this problem better here is a collection of some prior art relating to this problem space.

Short summaries of:

Other proposals to add to this list?

Apache Hadoop Patterns of Use: Refine, Enrich and Explore

Tuesday, April 9th, 2013

Apache Hadoop Patterns of Use: Refine, Enrich and Explore by Jim Walter.

From the post:

“OK, Hadoop is pretty cool, but exactly where does it fit and how are other people using it?” Here at Hortonworks, this has got to be the most common question we get from the community… well that and “what is the airspeed velocity of an unladen swallow?”

We think about this (where Hadoop fits) a lot and have gathered a fair amount of expertise on the topic. The core team at Hortonworks includes the original architects, developers and operators of Apache Hadoop and its use at Yahoo, and through this experience and working within the larger community they have been privileged to see Hadoop emerge as the technological underpinning for so many big data projects. That has allowed us to observe certain patterns that we’ve found greatly simplify the concepts associated with Hadoop, and our aim is to share some of those patterns here.

As an organization laser focused on developing, distributing and supporting Apache Hadoop for enterprise customers, we have been fortunate to have a unique vantage point.

With that, we’re delighted to share with you our new whitepaper ‘Apache Hadoop Patterns of Use’. The patterns discussed in the whitepaper are:

Refine: Collect data and apply a known algorithm to it in a trusted operational process.
Enrich: Collect data, analyze and present salient results for online apps.
Explore: Collect data and perform iterative investigation for value.

You can download it here, and we hope you enjoy it.

If you are looking for detailed patterns of use, you will be disappointed.

It runs about nine (9) pages, in very high-level summary mode.

What remains to be written (to my knowledge) is a collection of use patterns with a realistic amount of detail from a cross-section of Hadoop users.

That would truly be a compelling resource for the community.

One Hour Hadoop Cluster

Tuesday, April 9th, 2013

How to setup a Hadoop cluster in one hour using Ambari?

A guide to setting up a 3-node Hadoop cluster using Oracle’s VirtualBox and Apache Ambari.

HPC may not be the key to semantics but it can still be useful. ;-)

WTF: [footnote 1]

Monday, April 8th, 2013

WTF: The Who to Follow Service at Twitter by Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh.

Abstract:

WTF (“Who to Follow”) is Twitter’s user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in building and running the service over the past few years. Particularly noteworthy was our design decision to process the entire Twitter graph in memory on a single server, which significantly reduced architectural complexity and allowed us to develop and deploy the service in only a few months. At the core of our architecture is Cassovary, an open-source in-memory graph processing engine we built from scratch for WTF. Besides powering Twitter’s user recommendations, Cassovary is also used for search, discovery, promoted products, and other services as well. We describe and evaluate a few graph recommendation algorithms implemented in Cassovary, including a novel approach based on a combination of random walks and SALSA. Looking into the future, we revisit the design of our architecture and comment on its limitations, which are presently being addressed in a second-generation system under development.

You know it is going to be an amusing paper when footnote 1 reads:

The confusion with the more conventional expansion of the acronym is intentional and the butt of many internal jokes. Also, it has not escaped our attention that the name of the service is actually ungrammatical; the pronoun should properly be in the objective case, as in “whom to follow”.

;-)

Algorithmic recommendations may miss the mark for an end user.

On the other hand, what about an authoring interface that supplies recommendations of associations and other subjects?

A paper definitely worth a slow read!
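If "random walks" as a recommendation technique is new to you, here is a minimal sketch of the general idea: a random walk with restart over a tiny follow graph, ranking accounts by how often the walk lands on them. This is not Cassovary and not the algorithm from the paper (which combines random walks with SALSA). The graph, the names, and the `recommend` function with its parameters are all hypothetical, made up for illustration.

```python
import random
from collections import Counter

# Hypothetical follow graph: each key follows the accounts in its list.
FOLLOWS = {
    "ann": ["bob", "cat"],
    "bob": ["cat", "dan"],
    "cat": ["dan"],
    "dan": ["bob"],
}


def recommend(graph, user, steps=10000, restart=0.15, top_n=2, seed=0):
    """Rank accounts by visit count of a random walk with restart from `user`.

    Illustrative only: accounts the user already follows are filtered out.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same walk
    visits = Counter()
    current = user
    for _ in range(steps):
        neighbors = graph.get(current, [])
        # Jump back to the start on a restart, or when the walk hits a dead end.
        if not neighbors or rng.random() < restart:
            current = user
            continue
        current = rng.choice(neighbors)
        visits[current] += 1
    already = set(graph.get(user, [])) | {user}
    ranked = [node for node, _ in visits.most_common() if node not in already]
    return ranked[:top_n]


if __name__ == "__main__":
    print(recommend(FOLLOWS, "ann"))
```

In this toy graph "ann" already follows "bob" and "cat", so the only candidate left to recommend is "dan". The same visit counts could just as easily rank candidate associations or subjects in an authoring interface.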

I first saw this at: WTF: The Who to Follow Service at Twitter (Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh).

HDFS File Operations Made Easy with Hue (demo)

Monday, April 8th, 2013

HDFS File Operations Made Easy with Hue by Romain Rigaux.

From the post:

Managing and viewing data in HDFS is an important part of Big Data analytics. Hue, the open source web-based interface that makes Apache Hadoop easier to use, helps you do that through a GUI in your browser — instead of logging into a Hadoop gateway host with a terminal program and using the command line.

The first episode in a new series of Hue demos, the video below demonstrates how to get up and running quickly with HDFS file operations via Hue’s File Browser application.

Very nice 2:18 video.

Brings the usual graphical file interface to Hadoop (no small feat) but reminds me of every other graphical file interface.

To step beyond the common graphical file interface, why not:

  • Links to scripts that call a file
  • File ownership – show all files owned by a user
  • Navigation of files by content type(s)
  • Grouping of files by common scripts
  • Navigation of files by content
  • Grouping of files by script owners calling the files

Those are just a few of the possibilities that come to mind.

I would make the roles in those relationships explicit but that is probably my topic map background showing through.

MapR and Ubuntu

Wednesday, April 3rd, 2013

MapR has posted all of its Hadoop ecosystem source code to Github: MapR Technologies.

MapR has also partnered with Canonical to release the entire Hadoop stack for 12.04 LTS and 12.10 releases of Ubuntu on www.ubuntu.com starting April 25, 2013.

For details see: MapR Teams with Canonical to Deliver Hadoop on Ubuntu.

I first saw this at: MapR Turns to Ubuntu in Bid to Increase Footprint by Isaac Lopez.

Data accounts for up to 75 percent of value in half of businesses

Sunday, March 31st, 2013

Data accounts for up to 75 percent of value in half of businesses

From the post:

As the volume of data stored in the enterprise continues to grow, organizations see this information as representing a substantial portion of their assets. With tools such as Hadoop for Windows, businesses are unlocking the value of this data, Anthony Saxby, Microsoft U.K.’s data platform product marketing manager, said in a recent talk at Computing’s Big Data Summit 2013. According to Microsoft’s research, half of all organizations think their data represents 50 to 75 percent of their total value.

The challenge in unlocking this value is technology, Saxby said, according to Computing. Much of this information is internally siloed or separated from the external data sources that it could be combined with to create more effective, monetized results. Today’s businesses want to bring together unstructured and structured data to create new insights. With tools such as Hadoop, this type of analysis is increasingly possible. For instance, record label EMI uses a variety of data types across 25 countries to determine how to market music artists in different geographies.

The headline reminded me of Bilbo Baggins:

I don’t know half of you half as well as I should like; and I like less than half of you half as well as you deserve.

As the narrator notes:

This was unexpected and rather difficult.

I don’t follow the WSJ as closely as some but what of inventories, brick and mortar assets, accounts receivable, employees, IP, etc.?

Not that I doubt the value of data.

I do doubt the ability of businesses that manage by catch phrases like “big data,” “silos,” “unstructured and structured data,” and “Hadoop” to realize its value.

Hadoop will figure in successful projects to “unlock data,” but only where it is used as a tool and not a magic bullet.

A clear understanding of data and its sources, and of how to measure ROI from its use, are only two of the keys to successful use of any data tool.

Piling up data freed from internal silos upon data from external sources results in a big heap of data.

Impressive to the uninformed but it won’t increase your bottom line.

Analyzing Twitter Data with Apache Hadoop, Part 3:…

Tuesday, March 26th, 2013

Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive by Jon Natkins.

From the post:

This is the third article in a series about analyzing Twitter data using some of the components of the Apache Hadoop ecosystem that are available in CDH (Cloudera’s open-source distribution of Apache Hadoop and related projects). If you’re looking for an introduction to the application and a high-level view, check out the first article in the series.

In the previous article in this series, we saw how Flume can be utilized to ingest data into Hadoop. However, that data is useless without some way to analyze the data. Personally, I come from the relational world, and SQL is a language that I speak fluently. Apache Hive provides an interface that allows users to easily access data in Hadoop via SQL. Hive compiles SQL statements into MapReduce jobs, and then executes them across a Hadoop cluster.

In this article, we’ll learn more about Hive, its strengths and weaknesses, and why Hive is the right choice for analyzing tweets in this application.
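To make "Hive compiles SQL statements into MapReduce jobs" a little more concrete, here is a rough sketch, in plain Python, of the map, shuffle, and reduce steps that a query shaped like SELECT user, COUNT(*) ... GROUP BY user breaks down into. It is not Hive's planner and does not use Hadoop at all. The `TWEETS` rows, the `user` column, and the function names are hypothetical stand-ins, not taken from the article.

```python
from collections import defaultdict

# Hypothetical rows standing in for a "tweets" table; not data from the article.
TWEETS = [
    {"user": "alice", "text": "hadoop"},
    {"user": "bob", "text": "hive"},
    {"user": "alice", "text": "flume"},
]


def map_phase(rows):
    """Emit one (user, 1) pair per row, as a mapper for COUNT(*) grouped by user would."""
    for row in rows:
        yield row["user"], 1


def shuffle(pairs):
    """Group mapper output by key, mimicking the shuffle step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_user(rows):
    """Run the three phases in order on an in-memory list of rows."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_user(TWEETS))  # {'alice': 2, 'bob': 1}
```

On a real cluster the mappers and reducers run on different machines and the shuffle moves data across the network, which is where much of the latency of a Hive query comes from.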

I didn’t realize I had missed this part of the Hive series until I saw it mentioned in the Hue post.

Good introduction to Hive.

BTW, is Twitter data becoming the “hello world” of data mining?