Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 27, 2014

…NFL’s ‘Play by Play’ Dataset

Filed under: Hadoop,MapReduce — Patrick Durusau @ 9:28 pm

Data Insights from the NFL’s ‘Play by Play’ Dataset by Jesse Anderson.

From the post:

In a recent GigaOM article, I shared insights from my analysis of the NFL’s Play by Play Dataset, which is a great metaphor for how enterprises can use big data to gain valuable insights into their own businesses. In this follow-up post, I will explain the methodology I used and offer advice for how to get started using Hadoop with your own data.

To see how my NFL data analysis was done, you can view and clone all of the source code for this project on my GitHub account. I am using Hadoop and its ecosystem for this processing. All of the data for this project covers the NFL 2002 season through the 4th week of the 2013 season.

Two MapReduce programs do the initial processing. These programs process the Play by Play data and parse out the play description. Each play has unstructured or handwritten data that describes what happened in the play. Using Regular Expressions, I figured out what type of play it was and what happened during the play. Was there a fumble, was it a run or was it a missed field goal? Those scenarios are all accounted for in the MapReduce program.
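Jesse's actual code is Java MapReduce and lives in his GitHub repo. Purely as a sketch of the pattern, here is roughly what that kind of regex-based play classification looks like as a Hadoop Streaming mapper in Python. The column layout and the patterns are my assumptions for illustration, not his.

#!/usr/bin/env python
# Illustrative Hadoop Streaming mapper: classify NFL plays from their
# free-text descriptions. Field position and regexes are assumptions,
# not Jesse Anderson's actual code.
import re
import sys

PLAY_PATTERNS = [
    ("FUMBLE", re.compile(r"\bFUMBLES?\b", re.I)),
    ("FIELD_GOAL_MISSED", re.compile(r"field goal is (?:no good|blocked)", re.I)),
    ("FIELD_GOAL_GOOD", re.compile(r"field goal is good", re.I)),
    ("PASS", re.compile(r"\bpass\b", re.I)),
    ("RUN", re.compile(r"\b(?:left end|right end|up the middle)\b", re.I)),
]

def classify(description):
    """Return the first play type whose pattern matches, else UNKNOWN."""
    for play_type, pattern in PLAY_PATTERNS:
        if pattern.search(description):
            return play_type
    return "UNKNOWN"

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) < 6:          # skip malformed rows
        continue
    description = fields[5]      # assumed column for the play description
    # Emit "play_type <TAB> 1"; a summing reducer then produces counts per type.
    print("%s\t%d" % (classify(description), 1))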

Just in case you aren’t interested in winning $1 billion at basketball or you just want to warm up for that challenge, try some NFL data on for size.

Could be useful in teaching you the limits of analysis. For all the stats that can be collected and crunched, games don’t always turn out as predicted.

On any given Monday morning you may win or lose a few dollars in the office betting pool, but number crunching is used for more important decisions as well.

January 16, 2014

Apache Crunch User Guide (new and improved)

Filed under: Apache Crunch,Hadoop,MapReduce — Patrick Durusau @ 10:13 am

Apache Crunch User Guide

From the motivation section:

Let’s start with a basic question: why should you use any high-level tool for writing data pipelines, as opposed to developing against the MapReduce, Spark, or Tez APIs directly? Doesn’t adding another layer of abstraction just increase the number of moving pieces you need to worry about, à la the Law of Leaky Abstractions?

As with any decision like this, the answer is “it depends.” For a long time, the primary payoff of using a high-level tool was being able to take advantage of the work done by other developers to support common MapReduce patterns, such as joins and aggregations, without having to learn and rewrite them yourself. If you were going to need to take advantage of these patterns often in your work, it was worth the investment to learn about how to use the tool and deal with the inevitable leaks in the tool’s abstractions.

With Hadoop 2.0, we’re beginning to see the emergence of new engines for executing data pipelines on top of data stored in HDFS. In addition to MapReduce, there are new projects like Apache Spark and Apache Tez. Developers now have more choices for how to implement and execute their pipelines, and it can be difficult to know in advance which engine is best for your problem, especially since pipelines tend to evolve over time to process more data sources and larger data volumes. This choice means that there is a new reason to use a high-level tool for expressing your data pipeline: as the tools add support for new execution frameworks, you can test the performance of your pipeline on the new framework without having to rewrite your logic against new APIs.

There are many high-level tools available for creating data pipelines on top of Apache Hadoop, and they each have pros and cons depending on the developer and the use case. Apache Hive and Apache Pig define domain-specific languages (DSLs) that are intended to make it easy for data analysts to work with data stored in Hadoop, while Cascading and Apache Crunch develop Java libraries that are aimed at developers who are building pipelines and applications with a focus on performance and testability.

So which tool is right for your problem? If most of your pipeline work involves relational data and operations, then Hive, Pig, or Cascading provide lots of high-level functionality and tools that will make your life easier. If your problem involves working with non-relational data (complex records, HBase tables, vectors, geospatial data, etc.) or requires that you write lots of custom logic via user-defined functions (UDFs), then Crunch is most likely the right choice.

As topic mappers you are likely to work with both relational and complex non-relational data, so this should be on your reading list.

I didn’t read the prior Apache Crunch documentation so I will have to take Josh Wills at his word that:

A (largely) new and (vastly) improved user guide for Apache Crunch, including details on the new Spark-based impl:

It reads well and makes a good case for investing time in learning Apache Crunch.

I first saw this in a tweet by Josh Wills.

January 12, 2014

The Road to Summingbird:…

Filed under: Hadoop,MapReduce,Summingbird,Tweets — Patrick Durusau @ 8:37 pm

The Road to Summingbird: Stream Processing at (Every) Scale by Sam Ritchie.

Description:

Twitter’s Summingbird library allows developers and data scientists to build massive streaming MapReduce pipelines without worrying about the usual mess of systems issues that come with realtime systems at scale.

But what if your project is not quite at “scale” yet? Should you ignore scale until it becomes a problem, or swallow the pill ahead of time? Is using Summingbird overkill for small projects? I argue that it’s not. This talk will discuss the ideas and components of Summingbird that you could, and SHOULD, use in your startup’s code from day one. You’ll come away with a new appreciation for monoids and semigroups and a thirst for abstract algebra.
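Summingbird itself is Scala, but the monoid idea at the heart of the talk is easy to sketch. A toy Python illustration (mine, not Summingbird code): any aggregate with an associative merge and an identity element can be summed the same way in a batch job and in a stream.

# Toy illustration of the monoid idea behind Summingbird-style aggregation.
class WordCounts:
    """A mergeable aggregate: word counts with an associative merge."""
    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def merge(self, other):
        merged = dict(self.counts)
        for word, n in other.counts.items():
            merged[word] = merged.get(word, 0) + n
        return WordCounts(merged)

ZERO = WordCounts()   # identity element: merging with it changes nothing

def aggregate(partials):
    # Works the same whether 'partials' is a finite batch or a stream
    # consumed incrementally, because merge is associative: grouping
    # and order of the merges do not matter.
    total = ZERO
    for p in partials:
        total = total.merge(p)
    return total

batch  = [WordCounts({"hadoop": 2}), WordCounts({"storm": 1})]
stream = iter([WordCounts({"hadoop": 5})])
print(aggregate(batch).merge(aggregate(stream)).counts)
# {'hadoop': 7, 'storm': 1}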

A slide deck that will make you regret missing the presentation.

I wasn’t able to find a video of Sam’s presentation at Data Day Texas 2014, but I did find a collection of his presentations, including some videos, at: http://sritchie.github.io/.

Valuable lessons for startups and others.

January 3, 2014

Hadoop Map Reduce For Google web graph

Filed under: Graphs,Hadoop,MapReduce — Patrick Durusau @ 2:56 pm

Hadoop Map Reduce For Google web graph

A good question from Stackoverflow:

we have been given as an assignment the task of creating MapReduce functions that will output, for each node n in the Google web graph, the list of nodes that you can reach from node n in 3 hops. (The actual data can be found here: http://snap.stanford.edu/data/web-Google.html)

The answer on Stackoverflow does not provide a solution (it is a homework assignment) but does walk through an explanation of using MapReduce for graph computations.
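Without giving the homework away, the standard pattern the answer describes is one breadth-first hop per MapReduce pass, repeated three times. A toy local simulation of that pattern (my sketch, not the Stack Overflow answer, and not cluster code):

# Local simulation of the k-hop pattern: each pass joins the current
# frontier against the adjacency lists, so three passes give 3-hop
# neighborhoods. Toy graph; web-Google.txt has ~875k nodes.
from collections import defaultdict

graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}

def one_hop(frontier):
    """frontier: {start: set of nodes reached on the previous hop}"""
    # "map" side: ask each frontier node for its out-neighbors, on behalf of start
    requests = defaultdict(set)
    for start, nodes in frontier.items():
        for n in nodes:
            requests[n].add(start)
    # "reduce" side: each frontier node answers with its adjacency list
    next_frontier = defaultdict(set)
    for node, starts in requests.items():
        for start in starts:
            next_frontier[start].update(graph.get(node, []))
    return next_frontier

reachable = {s: set() for s in graph}
frontier = {s: {s} for s in graph}
for _ in range(3):                           # three hops, three passes
    frontier = one_hop(frontier)
    for s in graph:
        reachable[s] |= frontier.get(s, set())

print({s: sorted(nodes) for s, nodes in reachable.items()})
# {'A': ['B', 'C', 'D'], 'B': ['C', 'D'], 'C': ['D'], 'D': []}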

If you are thinking about using Hadoop for large graph processing, you are likely to find this very useful.

December 30, 2013

Ready to learn Hadoop?

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 5:59 pm

Ready to learn Hadoop?

From the webpage:

Sign up for the challenge of learning the basics of Hadoop in two weeks! You will get one email every day for the next 14 days.

  • Hello World: Overview of Hadoop
  • Data Processing Using Apache Hadoop
  • Setting up ODBC Connections
  • Connecting to Enterprise Applications
  • Data Integration and ETL
  • Data Analytics
  • Data Visualization
  • Hadoop Use Cases: Web
  • Hadoop Use Cases: Business
  • Recap

You could do this entirely on your own but the daily email may help.

If nothing else, it will be a reminder that something fun is waiting for you after work.

Enjoy!

December 24, 2013

‘Hadoop Illuminated’ Book

Filed under: Hadoop,MapReduce — Patrick Durusau @ 9:41 am

‘Hadoop Illuminated’ Book by Mark Kerzner and Sujee Maniyam.

From the webpage:

Gentle Introduction of Hadoop and Big Data

Get the book…

  • HTML (multi-page)
  • HTML (single page)
  • PDF

We are writing a book on Hadoop with the following goals and principles.

More of a great outline for a Hadoop book than a great Hadoop book at present.

However, it is also the perfect opportunity for you to try your hand at clear, readable introductory prose on Hadoop. (That isn’t as easy as it sounds.)

As a special treat, there is a Hadoop Coloring Book for Kids. (Send more art for the coloring book as well.)

I especially appreciate the coloring book because I don’t have any coloring books. Did I mention I have a small child coming to visit during the holidays? 😉

PS: Has anyone produced a sort algorithm coloring book?

December 18, 2013

Building Hadoop-based Apps on YARN

Filed under: Hadoop YARN,MapReduce — Patrick Durusau @ 5:36 pm

Building Hadoop-based Apps on YARN

Hortonworks has put together resources that may ease your way to your first Hadoop-based app on YARN.

The resources are organized in steps:

  • STEP 1. Understand the motivations and architecture for YARN.
  • STEP 2. Explore example applications on YARN.
  • STEP 3. Examine real-world applications on YARN.

Further examples and real-world applications would be welcomed by anyone studying YARN.

November 26, 2013

CDH 4.5, Manager 4.8, Impala 1.2.1, Search 1.1

Filed under: Cloudera,Hadoop,Impala,MapReduce — Patrick Durusau @ 3:13 pm

Announcing: CDH 4.5, Cloudera Manager 4.8, Cloudera Impala 1.2.1, and Cloudera Search 1.1

Before your nieces and nephews (at least in the U.S.) start chewing up your bandwidth over the Thanksgiving Holidays, you may want to grab the most recent releases from Cloudera.

If you are traveling, it will give you something to do during airport delays. 😉

November 21, 2013

Putting Spark to Use:…

Filed under: Hadoop,MapReduce,Spark — Patrick Durusau @ 5:43 pm

Putting Spark to Use: Fast In-Memory Computing for Your Big Data Applications by Justin Kestelyn.

From the post:

Apache Hadoop has revolutionized big data processing, enabling users to store and process huge amounts of data at very low costs. MapReduce has proven to be an ideal platform to implement complex batch applications as diverse as sifting through system logs, running ETL, computing web indexes, and powering personal recommendation systems. However, its reliance on persistent storage to provide fault tolerance and its one-pass computation model make MapReduce a poor fit for low-latency applications and iterative computations, such as machine learning and graph algorithms.

Apache Spark addresses these limitations by generalizing the MapReduce computation model, while dramatically improving performance and ease of use.

Fast and Easy Big Data Processing with Spark

At its core, Spark provides a general programming model that enables developers to write applications by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. This composition makes it easy to express a wide array of computations, including iterative machine learning, streaming, complex queries, and batch.

In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory. This is the key to Spark’s performance, as it allows applications to avoid costly disk accesses. As illustrated in the figure below, this feature enables:

I would not use the following example to promote Spark:

One of Spark’s most useful features is the interactive shell, bringing Spark’s capabilities to the user immediately – no IDE and code compilation required. The shell can be used as the primary tool for exploring data interactively, or as means to test portions of an application you’re developing.

The screenshot below shows a Spark Python shell in which the user loads a file and then counts the number of lines that contain “Holiday”.

(Screenshot: Spark Python shell example.)
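The screenshot is not reproduced here, but in the PySpark shell the example amounts to roughly this (the file name and HDFS path are my guesses):

# In the pyspark shell, sc (the SparkContext) is already defined.
file = sc.textFile("hdfs:///user/demo/WarAndPeace.txt")
print(file.filter(lambda line: "Holiday" in line).count())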

Isn’t that just:

grep holiday WarAndPeace.txt | wc -l
15

?

Grep doesn’t require an IDE or compilation either. Of course, grep isn’t reading from an HDFS file.

The “file.filter(lambda line: "Holiday" in line).count()” works, but some of us prefer the terseness of Unix.

Unix text tools for HDFS?

November 20, 2013

Learning MapReduce:…[Of Ethics and Self-Interest]

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:57 pm

Learning MapReduce: Everywhere and For Everyone

From the post:

Tom White, author of Hadoop: The Definitive Guide, recently celebrated his five-year anniversary at Cloudera with a blog post reflecting on the early days of Big Data and what has changed and remained since 2008. Having just seen Tom in New York at the biggest and best Hadoop World to date, I’m struck by the poignancy of his earliest memories. Even then, Cloudera’s projects were focused on broadening adoption and building the community by writing effective training material, integrating with other systems, and building on the core open source. The founding team had a vision to make Apache Hadoop the focal point of an accessible, powerful, enterprise-ready Big Data platform.

Today, Cloudera is working harder than ever to help companies deploy Hadoop as part of an Enterprise Data Hub. We're just as committed to a healthy and vibrant open-source community, have a lively partner ecosystem of over 700 partners, and have contributed innovations that make data access and analysis faster, more secure, more relevant, and, ultimately, more profitable.

However, with all these successes in driving Hadoop towards the mainstream and providing a new and dynamic data engine, the fact remains that broadening adoption at the end-user level remains job one. Even as Cloudera unifies the Big Data stack, the availability of talent to drive operations and derive full value from massive data falls well short of the enormous demand. As more companies across industries adopt Hadoop and build out their Big Data strategies focused on the Enterprise Data Hub, Cloudera has expanded its commitment to educating technologists of all backgrounds on Hadoop, its applications, and its systems.

A Partnership to Cultivate Hadoop Talent

We at Cloudera University are proud to announce a new partnership with Udacity, a leader in open, online professional education. We believe in Udacity’s vision to democratize professional development by making technical training affordable and accessible to everyone, and this model will enable us to reach aspiring Big Data practitioners around the world who want to expand their skills into Hadoop.

Our first Udacity course, Introduction to Hadoop and MapReduce, guides learners from an understanding of Big Data to the basics of Hadoop, all the way through writing your first MapReduce program. We partnered directly with Udacity’s development team to build the most engaging online Hadoop course available, including demonstrative instruction, interactive quizzes, an interview with Hadoop co-founder Doug Cutting, and a hands-on project using live data. Most importantly, the lessons are self-paced, open, and based on Cloudera’s insights into industry best practices and professional requirements.

Cloudera (and, to be fair, others) have adopted a strategy of self-interest that is also ethical.

They are literally giving away the knowledge and training to use a free product. Think of it as a rising tide that floats all boats higher.

The more popular and widely used Hadoop/MapReduce become, the greater the demand for professional training and services from Cloudera (and others).

You may experiment or even run a local cluster, but if you are a Hadoop newbie, who are you going to call when it is a mission-critical application? (Hopefully professionals but there’s no guarantee on that.)

You don’t have to build silos or closed communities to be economically viable.

Delivering professional services for a popular technology seems to do the trick.

November 14, 2013

Cloudera + Udacity = Hadoop MOOC!

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 1:54 pm

Cloudera and Udacity partner to offer Data Science training courses by Lauren Hockenson.

From the post:

After launching the Open Education Alliance with some of the biggest tech companies in Silicon Valley, Udacity has forged a partnership with Cloudera to bring comprehensive Data Science curriculum to a massive open online course (MOOC) format in a program called Cloudera University — allowing anyone to learn the intricacies of Hadoop and other Data Science methods.

“Recognizing the growing demand for skilled data professionals, more students are seeking instruction in Hadoop and data science in order to prepare themselves to take advantage of the rapidly expanding data economy,” said Sebastian Thrun, founder of Udacity, in a press release. “As the leader in Hadoop solutions, training, and services, Cloudera's insights and technical guidance are in high demand, so we are pleased to be leveraging that experience and expertise as their partner in online open courseware.”

The first offering to come via Cloudera University will be “Introduction to Hadoop and MapReduce,” a three-lesson course that serves as a precursor to the program's larger, track-based training already in place. While Cloudera already offers many of these courses in Data Science, as well as intensive certificate training programs, in an in-person setting, it seems that the partnership with Udacity will translate curriculum that Cloudera has developed into a more palatable format for online learning.

Looking forward to Cloudera University reflecting all of the Hadoop ecosystem.

In the meantime, there are a number of online training resources already available at Cloudera.

November 11, 2013

Hadoop – 100x Faster… [With NO ETL!]

Filed under: ETL,Hadoop,HDFS,MapReduce,Topic Maps — Patrick Durusau @ 8:32 pm

Hadoop – 100x Faster. How we did it… by Nikita Ivanov.

From the post:

Almost two years ago, Dmitriy and I stood in front of a whiteboard at GridGain's office thinking: “How can we deliver the real-time performance of GridGain's in-memory technology to Hadoop customers without asking them to rip and replace their systems and without asking them to move their datasets off Hadoop?”

Given Hadoop’s architecture – the task seemed daunting; and it proved to be one of the more challenging engineering puzzles we have had to solve.

After two years of development, tens of thousands of lines of Java, Scala and C++ code, multiple design iterations, several releases and dozens of benchmarks later, we finally built a product that can deliver real-time performance to Hadoop customers with seamless integration and no tedious ETL. Actual customer deployments can now prove our performance claims and validate our product's architecture.

Here’s how we did it.

The Idea – In-Memory Hadoop Accelerator

Hadoop is based on two primary technologies: HDFS for storing data, and MapReduce for processing these data in parallel. Everything else in Hadoop and the Hadoop ecosystem sits atop these foundation blocks.

Originally, neither HDFS nor MapReduce were designed with real-time performance in mind. In order to deliver real-time processing without moving data out of Hadoop onto another platform, we had to improve the performance of both of these subsystems. (emphasis added)

The highlighted phrase is the key, isn't it?

In order to deliver real-time processing without moving data out of Hadoop onto another platform

ETL is downtime, expense and risk of data corruption.

Given a choice between making your current data platform (of whatever type) more robust or risking a migration to a new data platform, which one would you choose?

Bear in mind those 2.5 million spreadsheets that Felienne mentions in her presentation.

Are you really sure you want to run ETL on all your data?

As opposed to making your most critical data more robust and enhanced by other data? All while residing where it lives right now.

Are you ready to get off the ETL merry-go-round?

November 3, 2013

Mizan

Filed under: Graphs,MapReduce,Mizan,Pregel — Patrick Durusau @ 3:53 pm

Mizan

From the webpage:

What is Mizan?

Mizan is an advanced clone of Google's graph processing system Pregel that utilizes online graph vertex migrations to dynamically optimize the execution of graph algorithms. You can use our Mizan system to develop any vertex-centric graph algorithm and run it in parallel over a local cluster or over cloud infrastructure. Mizan is compatible with Pregel's API, written in C++ and uses MPICH2 for communication. You can download a copy of Mizan and start using it today on your local machine or try Mizan on EC2. We also welcome programmers who are interested in going deeper into our Mizan code to optimize or tweak it.

Mizan is published in EuroSys 13 as “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”. We have an earlier work, “Mizan: Optimizing Graph Mining in Large Parallel Systems”, which we recently renamed Libra to avoid confusion. We show below the abstract for Mizan's EuroSys 13 paper. We also include Mizan's general architecture and its API available for users.

Abstract

Pregel was recently introduced as a scalable graph mining system that can provide significant performance improvements over traditional MapReduce implementations. Existing implementations focus primarily on graph partitioning as a preprocessing step to balance computation across compute nodes. In this paper, we examine the runtime characteristics of a Pregel system. We show that graph partitioning alone is insufficient for minimizing end-to-end computation. Especially where data is very large or the runtime behavior of the algorithm is unknown, an adaptive approach is needed. To this end, we introduce Mizan, a Pregel system that achieves efficient load balancing to better adapt to changes in computing needs. Unlike known implementations of Pregel, Mizan does not assume any a priori knowledge of the structure of the graph or behavior of the algorithm. Instead, it monitors the runtime characteristics of the system. Mizan then performs efficient fine-grained vertex migration to balance computation and communication. We have fully implemented Mizan; using extensive evaluation we show that—especially for highly-dynamic workloads— Mizan provides up to 84% improvement over techniques leveraging static graph pre-partitioning.

Posts like this one make me want to build a local cluster at home. 😉

October 16, 2013

Hadoop Tutorials – Hortonworks

Filed under: Hadoop,HCatalog,HDFS,Hive,Hortonworks,MapReduce,Pig — Patrick Durusau @ 4:49 pm

With the GA release of Hadoop 2, it seems appropriate to list a set of tutorials for the Hortonworks Sandbox.

Tutorial 1: Hello World – An Overview of Hadoop with HCatalog, Hive and Pig

Tutorial 2: How To Process Data with Apache Pig

Tutorial 3: How to Process Data with Apache Hive

Tutorial 4: How to Use HCatalog, Pig & Hive Commands

Tutorial 5: How to Use Basic Pig Commands

Tutorial 6: How to Load Data for Hadoop into the Hortonworks Sandbox

Tutorial 7: How to Install and Configure the Hortonworks ODBC driver on Windows 7

Tutorial 8: How to Use Excel 2013 to Access Hadoop Data

Tutorial 9: How to Use Excel 2013 to Analyze Hadoop Data

Tutorial 10: How to Visualize Website Clickstream Data

Tutorial 11: How to Install and Configure the Hortonworks ODBC driver on Mac OS X

Tutorial 12: How to Refine and Visualize Server Log Data

Tutorial 13: How To Refine and Visualize Sentiment Data

Tutorial 14: How To Analyze Machine and Sensor Data

By the time you finish these, I am sure there will be more tutorials or even proposed additions to the Hadoop stack!

(Updated December 3, 2013 to add #13 and #14.)

August 26, 2013

Apache Hadoop 2 (beta)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 10:24 am

Announcing Beta Release of Apache Hadoop 2 by Arun Murthy.

From the post:

It’s my great pleasure to announce that the Apache Hadoop community has declared Hadoop 2.x as Beta with the vote closing over the weekend for the hadoop-2.1.0-beta release.

As noted in the announcement to the mailing lists, this is a significant milestone across multiple dimensions: not only is the release chock-full of significant features (see below), it also represents a very stable set of APIs and protocols on which we can continue to build for the future. In particular, the Apache Hadoop community has spent an enormous amount of time paying attention to stability and long-term viability of our APIs and wire protocols for both HDFS and YARN. This is very important as we’ve already seen a huge interest in other frameworks (open-source and proprietary) move atop YARN to process data and run services *in* Hadoop.

It is always nice to start the week with something new.

Your next four steps:

  1. Download and install Hadoop 2.
  2. Experiment with and use Hadoop 2.
  3. Look for and report bugs (and fixes if possible) for Hadoop 2.
  4. Enjoy!

August 2, 2013

How-to: Use Eclipse with MapReduce in Cloudera’s QuickStart VM

Filed under: Cloudera,Eclipse,Hadoop,MapReduce — Patrick Durusau @ 6:26 pm

How-to: Use Eclipse with MapReduce in Cloudera’s QuickStart VM by Jesse Anderson.

From the post:

One of the common questions I get from students and developers in my classes relates to IDEs and MapReduce: How do you create a MapReduce project in Eclipse and then debug it?

To answer that question, I have created a screencast showing you how, using Cloudera’s QuickStart VM:

The QuickStart VM helps developers get started writing MapReduce code without having to worry about software installs and configuration. Everything is installed and ready to go. You can download the image type that corresponds to your preferred virtualization platform.

Eclipse is installed on the VM and there is a link on the desktop to start it.

Nice illustration of walking through the MapReduce process.

I continue to be impressed by the use of VMs.

Would be a nice way to distribute topic map tooling.

July 29, 2013

New Community Forums for Cloudera Customers and Users

Filed under: Cloudera,Hadoop,MapReduce,Taxonomy — Patrick Durusau @ 4:34 pm

New Community Forums for Cloudera Customers and Users by Justin Kestelyn.

From the post:

This is a great day for technical end-users – developers, admins, analysts, and data scientists alike. Starting now, Cloudera complements its traditional mailing lists with new, feature-rich community forums intended for users of Cloudera's Platform for Big Data! (Login using your existing credentials or click the link to register.)

Although mailing lists have long been a standard for user interaction, and will undoubtedly continue to be, they have flaws. For example, they lack structure or taxonomy, which makes consumption difficult. Search functionality is often less than stellar and users are unable to build reputations that span an appreciable period of time. For these reasons, although they’re easy to create and manage, mailing lists inherently limit access to knowledge and hence limit adoption.

The new service brings key additions to the conversation: functionality, search, structure and scalability. It is now considerably easier to ask questions, find answers (or questions to answer), follow and share threads, and create a visible and sustainable reputation in the community. And for Cloudera customers, there’s a bonus: your questions will be escalated as bonafide support cases under certain circumstances (see below).

Another way for you to participate in the Hadoop ecosystem!

BTW, the discussion taxonomy:

What is the reasoning behind your taxonomy?

We made a sincere effort to balance the requirements of simplicity and thoroughness. Of course, we’re always open to suggestions for improvements.

I don’t doubt the sincerity of the taxonomy authors. Not one bit.

But all taxonomies represent the “intuitive” view of some small group. There is no means to escape the narrow view of all taxonomies.

What we can do, at least with topic maps, is to allow groups to have their own taxonomies and to view data through those taxonomies.

Mapping between the taxonomies means that data added via any one of them appears, as appropriate, in the others.
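A toy sketch of what I mean (illustrative terms only, nothing to do with Cloudera's actual forum taxonomy): two group-specific vocabularies mapped onto shared subjects, so an item filed under either vocabulary is visible through both.

# Toy illustration: two taxonomies mapped to shared subjects, so an item
# filed under either vocabulary shows up when browsing with the other.
subject_map = {
    # group A's terms                    # group B's terms
    "HDFS questions": "storage",         "Filesystem": "storage",
    "MR jobs": "processing",             "Batch compute": "processing",
}

items = []   # (title, shared subject it was filed under)

def file_item(title, term):
    items.append((title, subject_map[term]))

def browse(term):
    """Browse with *your* taxonomy; see items filed under anyone's."""
    subject = subject_map[term]
    return [title for title, s in items if s == subject]

file_item("NameNode heap sizing", "HDFS questions")   # filed by group A
print(browse("Filesystem"))                           # group B still finds it
# ['NameNode heap sizing']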

Perhaps it was necessary to champion one taxonomy when information systems were fixed, printed representations of data and access systems.

But the need for a single taxonomy, if it ever existed, does not exist now. We are free to have any number of taxonomies for any data set, visible or invisible to other users/taxonomies.

More than thirty (30) years after the invention of the personal computer, we are still laboring under the traditions of printed information systems.

Isn’t it time to move on?

July 12, 2013

Rapid hadoop development with progressive testing

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:45 pm

Rapid hadoop development with progressive testing by Abe Gong.

From the post:

Debugging Hadoop jobs can be a huge pain. The cycle time is slow, and error messages are often uninformative — especially if you’re using Hadoop streaming, or working on EMR.

I once found myself trying to debug a job that took a full six hours to fail. It took more than a week — a whole week! — to find and fix the problem. Of course, I was doing other things at the same time, but the need to constantly check up on the status of the job was a huge drain on my energy and productivity. It was a Very Bad Week.

(Image: crushed by elephant.)

Painful experiences like this have taught me to follow a test-driven approach to hadoop development. Whenever I’m working on a new hadoop-based data pipe, my goal is to isolate six distinct kinds of problems that arise in hadoop development.

(…)

See Abe’s post for the six steps and suggestions for how to do them.

Reformatted a bit with local tool preferences, Abe’s list will make a nice quick reference for Hadoop development.

Hadoop Summit 2013

Filed under: Hadoop,MapReduce — Patrick Durusau @ 8:55 am

Hadoop Summit 2013

Videos and slides from Hadoop Summit 2013!

Forty-two (42) presentations on day one and forty-one (41) on day two.

Just this week I got news that ISO is hunting down “rogue” copies of ISO standards, written by volunteers, that aren’t behind its paywall.

While others, like the presenters at the Hadoop Summit 2013, are sharing their knowledge in hopes of creating more knowledge.

Which group do you think will be relevant in a technology driven future?

July 9, 2013

Friend Recommendations using MapReduce

Filed under: Hadoop,MapReduce,Recommendation — Patrick Durusau @ 3:26 pm

Friend Recommendations using MapReduce by John Berryman.

From the post:

So Jonathan, one of our interns this summer, asked an interesting question today about MapReduce. He said, “Let’s say you download the entire data set of who’s following who from Twitter. Can you use MapReduce to make recommendations about who any particular individual should follow?” And as Jonathan’s mentor this summer, and as one of the OpenSource Connections MapReduce experts I dutifully said, “uuuhhhhh…”

And then in a stroke of genius … I found a way to stall for time. “Well, young Padawan,” I said to Jonathan, “first you must more precisely define your problem… and only then will the answer be revealed to you.” And then darn it if he didn't ask me what I meant! Left with no viable alternatives, I squeezed my brain real hard, and this is what came out:

This is a post to work through carefully while waiting for the second post to drop!

Particularly the custom partitioning, grouping and sorting in MapReduce.
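While you wait, a hedged illustration of the friends-of-friends counting idea that underlies it, as a local Python simulation (my sketch, not John's solution). The real thing needs the custom partitioning, grouping and sorting he discusses; this hides all of that and just shows the counting.

# Recommend accounts followed by the accounts you follow, ranked by how
# many of your followees follow them. Toy data; illustrative only.
from collections import defaultdict

follows = {                       # who -> set of accounts they follow
    "jon":   {"alice", "bob"},
    "alice": {"bob", "carol"},
    "bob":   {"carol", "dave"},
}

def recommend(user, top_n=3):
    scores = defaultdict(int)
    for followee in follows.get(user, set()):
        # "map" step: each followee proposes the accounts it follows
        for candidate in follows.get(followee, set()):
            if candidate != user and candidate not in follows[user]:
                scores[candidate] += 1          # "reduce" step: sum the votes
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]

print(recommend("jon"))
# [('carol', 2), ('dave', 1)]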

June 27, 2013

Trying to get the coding Pig, er – monkey off your back?

Filed under: BigData,Hadoop,Mahout,MapReduce,Pig,Talend — Patrick Durusau @ 3:00 pm

Trying to get the coding Pig, er – monkey off your back?

From the webpage:

Are you struggling with the basic ‘WordCount’ demo, or which Mahout algorithm you should be using? Forget hand-coding and see what you can do with Talend Studio.

In this on-demand webinar we demonstrate how you could become MUCH more productive with Hadoop and NoSQL. Talend Big Data allows you to develop in Eclipse and run your data jobs 100% natively on Hadoop… and become a big data guru overnight. Rémy Dubois, big data specialist and Talend lead developer, shows you in real time:

  • How to visually create the ‘WordCount’ example in under 5 minutes
  • How to graphically build a big data job to perform sentiment analysis
  • How to archive NoSQL and optimize data warehouse usage

A content filled webinar! Who knew?

Be forewarned that the demos presume familiarity with the Talend interface and the demo presenter is difficult to understand.

From what I got out of the earlier parts of the webinar, it is very much a step in the right direction toward empowering users with big data.

Think of the distance between stacks of punch cards (Hadoop/MapReduce a few years ago) and the personal computer (Talend and others).

That was a big shift. This one is likely to be as well.

Looks like I need to spend some serious time with the latest Talend release!

June 26, 2013

Hadoop YARN

Filed under: Hadoop YARN,Hortonworks,MapReduce — Patrick Durusau @ 10:15 am

Hadoop YARN by Steve Loughran, Devaraj Das & Eric Baldeschwieler.

From the post:

A next-generation framework for Hadoop data processing.

Apache™ Hadoop® YARN is a sub-project of Hadoop at the Apache Software Foundation introduced in Hadoop 2.0 that separates the resource management and processing components. YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more general processing platform that is not constrained to MapReduce.

yarn

As part of Hadoop 2.0, YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines. This also streamlines MapReduce to do what it does best, process data. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management. Many organizations are already building applications on YARN in order to bring them IN to Hadoop.

yarn2

(…)

One of the more accessible explanations of the importance of Hadoop YARN.

Likely not anything new to you but may be helpful when talking to others.

June 22, 2013

AWS: Your Next Utility Bill?

Filed under: Amazon Web Services AWS,Hadoop,MapReduce — Patrick Durusau @ 3:08 pm

Netflix open sources its Hadoop manager for AWS by Derrick Harris.

From the post:

Netflix runs a lot of Hadoop jobs on the Amazon Web Services cloud computing platform, and on Friday the video-streaming leader open sourced its software to make running those jobs as easy as possible. Called Genie, it’s a RESTful API that makes it easy for developers to launch new MapReduce, Hive and Pig jobs and to monitor longer-running jobs on transient cloud resources.

In the blog post detailing Genie, Netflix’s Sriram Krishnan makes clear a lot more about what Genie is and is not. Essentially, Genie is a platform as a service running on top of Amazon’s Elastic MapReduce Hadoop service. It’s part of a larger suite of tools that handles everything from diagnostics to service registration.

It is not a cluster manager or workflow scheduler for building ETL processes (e.g., processing unstructured data from a web source, adding structure and loading into a relational database system). Netflix uses a product called UC4 for the latter, but it built the other components of the Genie system.

It’s not very futuristic to say that AWS (or something very close to it) will be your next utility bill.

Like paying for water, gas, cable, electricity, it will be an auto-pay setup on your bank account.

What will you say when clients ask if the service you are building for them is hosted on AWS?

Are you going to say your servers are more reliable? That you don’t “trust” Amazon?

Both of which may be true but how will you make that case?

Without sounding like you are selling something the client doesn’t need?

As the price of cloud computing drops, those questions are going to become common.

June 17, 2013

I Mapreduced a Neo store [Legacy of CPU Shortages?]

Filed under: Graphs,Hadoop,MapReduce,Neo4j — Patrick Durusau @ 8:40 am

I Mapreduced a Neo store by Kris Geusebroek.

From the post:

Lately I’ve been busy talking at conferences to tell people about our way to create large Neo4j databases. Large means some tens of millions of nodes and hundreds of millions of relationships and billions of properties.

Although the technical description is already on the Xebia blog part 1 and part 2, I would like to give a more functional view on what we did and why we started doing it in the first place.

Our use case consisted of exploring our data to find interesting patterns. The data we want to explore is about financial transactions between people, so the Neo4j graph model is a good fit for us. Because we don’t know upfront what we are looking for we need to create a Neo4j database with some parts of the data and explore that. When there is nothing interesting to find we go enhance our data to contain new information and possibly new connections and create a new Neo4j database with the extra information.

This means it's not about a one-time load of the current data, kept up to date by adding some more nodes and edges. It's really about building a new database from the ground up every time we think of some new way to look at the data.

Deeply interesting work, particularly for its investigation of the internal file structure of Neo4j.

Curious about the

…building a new database from the ground up every time we think of some new way to look at the data.

To what extent are static database structures a legacy of a shortage of CPU cycles?

With limited CPU cycles, it was necessary to create a static structure, against which query languages could be developed and optimized (again because of a shortage of CPU cycles), and the persisted data structure avoided the overhead of rebuilding the data structure for each user.

It may be that cellphones and tablets need the convenience of static data structures or at least representations of static data structures.

But what of server farms populated by TBs of 3D memory?

Isn’t it time to start thinking beyond the limitations imposed by decades of CPU cycle shortages?

June 15, 2013

Streaming IN Hadoop: Yahoo! release Storm-YARN

Filed under: Hadoop YARN,MapReduce,Storm,Yahoo! — Patrick Durusau @ 2:31 pm

Streaming IN Hadoop: Yahoo! release Storm-YARN by Jim Walker.

From the post:

Over the past year, customers have told us they want to store all their data in one place and interact with it in multiple ways… they want to use Hadoop, but in order to do so, it needs to extend beyond batch. It also needs to be interactive and real-time (among others).

This is the entire principle behind YARN, which together with others in the community, Arun Murthy and the team at Hortonworks have been working on for more than 5 years! The YARN based architecture of Hadoop 2.0 is hugely significant and we have been working closely with many partners to incorporate it into their applications.

Storm-YARN Released as Open Source

Yahoo! has been testing Hadoop 2 and its YARN-based architecture for quite some time. All the while they have worked on the convergence of the streaming framework Storm with Hadoop. This work has resulted in a YARN based version of Storm that will radically improve performance and resource management for streaming.

The release blog post from Yahoo.

Processing of data, even big data, is approaching “interactive and real-time,” although I suspect definitions of those terms vary. What is “interactive” for an automated trader might be too fast for a human trader.

What I haven’t seen is concurrent development on the handling of the semantics of big data.

After the initial hysteria over the scope of NSA snooping, it appeared that, except for cases where the NSA was given the identity of a suspect (and not always then), its data gathering was of little use.

In topic map terms, the semantic impedance between the data systems was too great for useful manipulation of the data sets as one.

Streaming in Hadoop is welcome news, but until we can robustly manage the semantics of data in streams, much gold is going to pass uncollected from streams.

Hortonworks Sandbox (1.3): Stinger, Visualizations and Virtualization

Filed under: BigData,Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 2:13 pm

Hortonworks Sandbox: Stinger, Visualizations and Virtualization by Cheryle Custer.

From the post:

A couple of weeks ago, we released several new Hadoop tutorials showcasing real-life use cases and you can read about them here. Today, we're delighted to bring to you the newest release of the Hortonworks Sandbox 1.3. The Hortonworks Sandbox allows you to go from Zero to Big Data in 15 Minutes through step-by-step hands-on Hadoop tutorials. The Sandbox is a fully functional single-node personal Hadoop environment, where you can add your own data sets, validate your Hadoop use cases and build a small proof-of-concept.

Update of your favorite way to explore Hadoop!

Get the sandbox here.

June 2, 2013

MapReduce with Python and mrjob on Amazon EMR

Filed under: Amazon EMR,MapReduce,Natural Language Processing,Python — Patrick Durusau @ 10:59 am

MapReduce with Python and mrjob on Amazon EMR by Sujit Pal.

From the post:

I’ve been doing the Introduction to Data Science course on Coursera, and one of the assignments involved writing and running some Pig scripts on Amazon Elastic Map Reduce (EMR). I’ve used EMR in the past, but have avoided it ever since I got burned pretty badly for leaving it on. Being required to use it was a good thing, since I got over the inertia and also saw how much nicer the user interface had become since I last saw it.

I was doing another (this time Python based) project for the same class, and figured it would be educational to figure out how to run Python code on EMR. From a quick search on the Internet, mrjob from Yelp appeared to be the one to use on EMR, so I wrote my code using mrjob.

The code reads an input file of sentences, and builds up trigram, bigram and unigram counts of the words in the sentences. It also normalizes the text, lowercasing, replacing numbers and stopwords with placeholder tokens, and Porter stemming the remaining words. Here's the code; as you can see, it's fairly straightforward:
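Sujit's actual script is linked from his post. As a rough sketch only, an mrjob job of that shape, with the normalization steps reduced to simple lowercasing, looks something like this:

# Rough mrjob sketch of the n-gram counting job described above. The real
# normalization (stopword/number tokens, Porter stemming) is omitted here;
# this is not Sujit Pal's actual code.
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[a-z']+")

class NGramCounts(MRJob):

    def mapper(self, _, line):
        words = WORD_RE.findall(line.lower())
        for n in (1, 2, 3):                      # unigrams, bigrams, trigrams
            for i in range(len(words) - n + 1):
                yield " ".join(words[i:i + n]), 1

    def reducer(self, ngram, counts):
        yield ngram, sum(counts)

if __name__ == "__main__":
    NGramCounts.run()

Run it locally with python ngram_counts.py input.txt, or against EMR with the -r emr runner once your credentials are configured.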

Knowing how to exit a cloud service, and how to confirm that you have exited, is the first thing to learn about any cloud system.

May 30, 2013

Hadoop Tutorials: Real Life Use Cases in the Sandbox

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 7:56 pm

Hadoop Tutorials: Real Life Use Cases in the Sandbox by Cheryle Custer.

Six (6) new tutorials from Hortonworks:

  • Tutorial 6 – Loading Data into the Hortonworks Sandbox
  • Tutorials 7 & 11 – Installing the ODBC Driver in the Hortonworks Sandbox (Windows and Mac)
  • Tutorials 8 & 9 – Accessing and Analyzing Data in Excel
  • Tutorial 10 – Visualizing Clickstream Data

You have done the first five (5).

Yes?

Hortonworks Data Platform 1.3 Release

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 7:51 pm

Hortonworks Data Platform 1.3 Release: The community continues to power innovation in Hadoop by Jeff Sposetti.

From the post:

HDP 1.3 release delivers on community-driven innovation in Hadoop with SQL-IN-Hadoop, and continued ease of enterprise integration and business continuity features.

Almost one year ago (50 weeks to be exact) we released Hortonworks Data Platform 1.0, the first 100% open source Hadoop platform into the marketplace. The past year has been dynamic to say the least! However, one thing has remained constant: the steady, predictable cadence of HDP releases. In September 2012 we released 1.1, this February gave us 1.2 and today we’re delighted to release HDP 1.3.

HDP 1.3 represents yet another significant step forward and allows customers to harness the latest innovation around Apache Hadoop and its related projects in the open source community. In addition to providing a tested, integrated distribution of these projects, HDP 1.3 includes a primary focus on enhancements to Apache Hive, the de-facto standard for SQL access in Hadoop as well as numerous improvements that simplify ease of use.

Whatever the magic dust is for a successful open source project, the Hadoop community has it in abundance.

May 28, 2013

Cascading and Scalding

Filed under: Cascading,MapReduce,Pig,Scalding — Patrick Durusau @ 4:17 pm

Cascading and Scalding by Danny Bickson.

Danny has posted some links for Cascading and Scalding, alternatives to Pig.

I continue to be curious about documentation of semantics for Pig scripts or any of its alternatives.

Or for that matter, in any medium to large-sized MapReduce shop, how do you index those semantics?
