Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 17, 2013

Hadoop SDK and Tutorials for Microsoft .NET Developers

Filed under: .Net,Hadoop,MapReduce,Microsoft — Patrick Durusau @ 3:39 pm

Hadoop SDK and Tutorials for Microsoft .NET Developers by Marc Holmes.

From the post:

Microsoft has begun to treat its developer community to a number of Hadoop-y releases related to its HDInsight (Hadoop in the cloud) service, and it’s worth rounding up the material. It’s all Alpha and Preview so YMMV but looks like fun:

  • Microsoft .NET SDK for Hadoop. This kit provides .NET API access to aspects of HDInsight including HDFS, HCatalog, Oozie and Ambari, and also some PowerShell scripts for cluster management. There are also libraries for MapReduce and LINQ to Hive. The latter is really interesting as it builds on established technology for .NET developers to access most data sources, delivering the capabilities of the de facto standard for Hadoop data query.
  • HDInsight Labs Preview. Up on Github, there is a series of 5 labs covering C#, JavaScript and F# coding for MapReduce jobs, using Hive, and then bringing that data into Excel. It also covers some Mahout use to build a recommendation engine.
  • Microsoft Hive ODBC Driver. The examples above use this preview driver to enable the connection from Hive to Excel.

If all of the above excites you, our Hadoop on Windows for Developers training course also covers similar content in a lot of depth.

Hadoop is coming to an office/data center near you.

Will you be ready?

Hadoop Toolbox: When to Use What

Filed under: Hadoop,MapReduce — Patrick Durusau @ 1:39 pm

Hadoop Toolbox: When to Use What by Mohammad Tariq.

From the post:

Eight years ago not even Doug Cutting would have thought that the tool he named after his kid’s soft toy would so soon become a rage and change the way people and organizations look at their data. Today Hadoop and Big Data have almost become synonymous. But Hadoop is not just Hadoop now. Over time it has evolved into one big herd of various tools, each meant to serve a different purpose. But glued together they give you a power-packed combo.

Having said that, one must be careful while choosing these tools for their specific use case, as one size doesn’t fit all. What is working for someone else might not be that productive for you. So, here I will show you which tool should be picked in which scenario. It’s not a big comparative study but a short intro to some very useful tools. And, this is based totally on my experience, so there is always scope for suggestions. Please feel free to comment or suggest if you have any. I would love to hear from you. Let’s get started:

Not shallow enough to be useful for the c-suite types, not deep enough for decision making.

Nice to use in a survey context, where users need an overview of the Hadoop ecosystem.

May 16, 2013

How-to: Configure Eclipse for Hadoop Contributions

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 12:34 pm

How-to: Configure Eclipse for Hadoop Contributions by Karthik Kambatla.

From the post:

Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.

This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)

A post to ease your way towards contributing to the Hadoop project!

Or if you simply want to know the code you are running cold.

Or something in between!

May 8, 2013

Natural Language Processing and Big Data…

Filed under: BigData,Hadoop,MapReduce,Natural Language Processing — Patrick Durusau @ 9:47 am

Natural Language Processing and Big Data: Using NLTK and Hadoop – Talk Overview by Benjamin Bengfort.

From the post:

My previous startup, Unbound Concepts, created a machine learning algorithm that determined the textual complexity (e.g. reading level) of children’s literature. Our approach started as a natural language processing problem — designed to pull out language features to train our algorithms, and then quickly became a big data problem when we realized how much literature we had to go through in order to come up with meaningful representations. We chose to combine NLTK and Hadoop to create our Big Data NLP architecture, and we learned some useful lessons along the way. This series of posts is based on a talk done at the April Data Science DC meetup.

Think of this post as the Cliff Notes of the talk and the upcoming series of posts so you don’t have to read every word … but trust me, it’s worth it.

If you can’t wait for the future posts, Benjamin’s presentation from April is here. Amusing but fairly sparse slides.

Looking forward to more posts in this series!


Big Data and Natural Language Processing – Part 1

The “Foo” of Big Data – Part 2

Python’s Natural Language Toolkit (NLTK) and Hadoop – Part 3

Hadoop for Preprocessing Language – Part 4

Beyond Preprocessing – Weakly Inferred Meanings – Part 5

May 7, 2013

Cloudera Development Kit (CDK)…

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:01 pm

Cloudera Development Kit (CDK): Hadoop Application Development Made Easier by Eric Sammer & Tom White.

From the post:

At Cloudera, we have the privilege of helping thousands of developers learn Apache Hadoop, as well as build and deploy systems and applications on top of Hadoop. While we (and many of you) believe that platform is fast becoming a staple system in the data center, we’re also acutely aware of its complexities. In fact, this is the entire motivation behind Cloudera Manager: to make the Hadoop platform easy for operations staff to deploy and manage.

So, we’ve made Hadoop much easier to “consume” for admins and other operators — but what about for developers, whether working for ISVs, SIs, or users? Until now, they’ve largely been on their own.

That’s why we’re really excited to announce the Cloudera Developer Kit (CDK), a new open source project designed to help developers get up and running to build applications on CDH, Cloudera’s open source distribution including Hadoop, faster and easier than before. The CDK is a collection of libraries, tools, examples, and documentation engineered to simplify the most common tasks when working with the platform. Just like CDH, the CDK is 100% free, open source, and licensed under the same permissive Apache License v2, so you can use the code any way you choose in your existing commercial code base or open source project.

The CDK lives on GitHub where users can freely browse, download, fork, and contribute back to the source. Community contributions are not only welcome but strongly encouraged. Since most Java developers use tools such as Maven (or tools that are compatible with Maven repositories), artifacts are also available from the Cloudera Maven Repository for easy project integration.


What’s In There Today

Our goal is to release a number of CDK modules over time. The first module that can be found in the current release is the CDK Data module; a set of APIs to drastically simplify working with datasets in Hadoop filesystems such as HDFS and the local filesystem. The Data module handles automatic serialization and deserialization of Java POJOs as well as Avro Records, automatic compression, file and directory layout and management, automatic partitioning based on configurable functions, and a metadata provider plugin interface to integrate with centralized metadata management systems (including HCatalog). All Data APIs are fully documented with javadoc. A reference guide is available to walk you through the important parts of the module, as well. Additionally, a set of examples is provided to help you see the APIs in action immediately.
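To give a rough feel for what that Data module promises, here is a minimal sketch of writing a dataset of POJOs to a repository. Treat the class and method names as my own approximation of the API described above, not a copy of the CDK javadoc; check the reference guide for the real thing.

```java
// Hypothetical sketch only: the repository/descriptor/writer names below
// approximate the kind of API the CDK Data module describes. Consult the
// CDK javadoc and reference guide for the actual classes and signatures.
public class CdkDataSketch {

    // A plain Java POJO the Data module would serialize for us.
    public static class User {
        public String name;
        public long createdAt;

        public User(String name, long createdAt) {
            this.name = name;
            this.createdAt = createdAt;
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed: open a dataset repository rooted in HDFS (or the local filesystem).
        DatasetRepository repo = DatasetRepositories.open("hdfs:/data");

        // Assumed: a descriptor whose schema is inferred from the POJO class;
        // partitioning and compression would also be configured here.
        Dataset users = repo.create("users",
                new DatasetDescriptor.Builder().schema(User.class).build());

        // Writing records: serialization, compression, and file/directory
        // layout are the Data module's job, not ours.
        DatasetWriter<User> writer = users.newWriter();
        try {
            writer.open();
            writer.write(new User("alice", System.currentTimeMillis()));
        } finally {
            writer.close();
        }
    }
}
```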

Here’s hoping that the vendor support shown for Hadoop, Lucene/Solr, R (who am I missing?) continues and spreads to other areas of software development.

Hadoop Webinars (WANdisco)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 2:28 pm

Hadoop Webinars (WANdisco)

From May to July, webinars on Hadoop:

A Hadoop Overview

Wednesday, May 15
10:00 a.m. PT
1:00 p.m. ET Register Now

In this webinar, we'll provide an overview of Hadoop’s history and architecture.

This session will highlight: 

  • Major components such as HDFS, MapReduce, and HBase – the NoSQL database management system used with Hadoop for real-time applications
  • A summary of Hadoop’s ecosystem
  • A review of public and private cloud deployment options
  • Common business use cases
  • And more…

Hadoop: A Deep Dive

Wednesday, June 5
10:00 a.m. PT
1:00 p.m. ET Register Now

This session will present: 

  • Various Hadoop misconceptions (not all clusters are comprised of thousands of machines)
  • Information about real world Hadoop deployments
  • A detailed review of Hadoop’s ecosystem (Sqoop, Flume, Nutch, Oozie, etc.)
  • An in-depth look at HDFS
  • An explanation of MapReduce in relation to latency and dependence on other Hadoop activities
  • An introduction to concepts attendees will need as a prerequisite for subsequent training webinars covering MapReduce, HBase and other major components at a deeper technical level

Hadoop: A MapReduce Tutorial

Wednesday, June 19
10:00 a.m. PT
1:00 p.m. ET Register Now

This session will cover: 

  • MapReduce at a deep technical level
  • The history of MapReduce
  • How a MapReduce job works, its logical flow, and the rules and types of MapReduce jobs
  • Writing, de-bugging and testing MapReduce jobs
  • Various available workflow tools
  • And more…

Hadoop: HBase In-Depth

Wednesday, July 10
10:00 a.m. PT
1:00 p.m. ET Register Now

This session is a deep technical review covering:

  • Flexibility
  • Scalability
  • Components (cells, rows, columns, qualifiers)
  • Schema samples
  • Hardware requirements
  • And more…

Hard to say how “deep” the webinars will be able to get in only one (1) hour.

I have registered for all four (4) and will be reporting back on my experience.

May 1, 2013

Have you used Lua for MapReduce?

Filed under: Hadoop,MapReduce,Semantic Diversity — Patrick Durusau @ 1:36 pm

Have you used Lua for MapReduce?

From the post:

Lua as a cross-platform programming language has been popularly used in games and embedded systems. However, due to its excellent fit for configuration, it has found wider acceptance in other use cases as well.

Lua was inspired by SOL (Simple Object Language) and DEL (Data-Entry Language) and created by Roberto Ierusalimschy, Waldemar Celes, and Luiz Henrique de Figueiredo at the Pontifical Catholic University of Rio de Janeiro, Brazil. Roughly translated as ‘Moon’ in Portuguese, it has found many big takers such as Adobe, Nginx, and Wikipedia.

Another scripting language to use with MapReduce and Hadoop.

Have you ever noticed the Tower of Babel seems to follow human activity around?

First, it was building a tower to heaven – confuse the workforce.

Then it was other community efforts.

And many, many thens later, it has arrived at MapReduce/Hadoop configuration languages.

Like a kaleidoscope, it just gets richer the more semantic diversity we add.

Do you wonder what the opposite of semantic diversity must look like?

Or if we are the cause, what would it mean to eliminate semantic diversity?

April 25, 2013

Hadoop Summit North America (June 26-27, 2013)

Filed under: Conferences,Hadoop,MapReduce — Patrick Durusau @ 1:44 pm

Hadoop Summit North America

From the webpage:

Hortonworks and Yahoo! are pleased to host the 6th Annual Hadoop Summit, the leading conference for the Apache Hadoop community. This two-day event will feature many of the Apache Hadoop thought leaders who will showcase successful Hadoop use cases, share development and administration tips and tricks, and educate organizations about how best to leverage Apache Hadoop as a key component in their enterprise data architecture. It will also be an excellent networking event for developers, architects, administrators, data analysts, data scientists and vendors interested in advancing, extending or implementing Apache Hadoop.

Community Choice Selectees:

  • Application and Data Science Track: Watching Pigs Fly with the Netflix Hadoop Toolkit (Netflix)
  • Deployment and Operations Track: Continuous Integration for the Applications on top of Hadoop (Yahoo!)
  • Enterprise Data Architecture Track: Next Generation Analytics: A Reference Architecture (Mu Sigma)
  • Future of Apache Hadoop Track: Jubatus: Real-time and Highly-scalable Machine Learning Platform (Preferred Infrastructure, Inc.)
  • Hadoop (Disruptive) Economics Track: Move to Hadoop, Go Fast and Save Millions: Mainframe Legacy Modernization (Sears Holding Corp.)
  • Hadoop-driven Business / BI Track: Big Data, Easy BI (Yahoo!)
  • Reference Architecture Track: Genie – Hadoop Platformed as a Service at Netflix (Netflix)

If you need another reason to attend, it’s located in San Jose, California.

2nd best US location for a conference. #1 being New Orleans.

Beginner Tips For Elastic MapReduce

Filed under: Cloud Computing,Elastic Map Reduce (EMR),Hadoop,MapReduce — Patrick Durusau @ 1:08 pm

Beginner Tips For Elastic MapReduce by John Berryman.

From the post:

By this point everyone is well acquainted with the power of Hadoop’s MapReduce. But what you’re also probably well acquainted with is the pain that must be suffered when setting up your own Hadoop cluster. Sure, there are some really good tutorials online if you know where to look:

However, I’m not much of a dev ops guy so I decided I’d take a look at Amazon’s Elastic MapReduce (EMR) and for the most part I’ve been very pleased. However, I did run into a couple of difficulties, and hopefully this short article will help you avoid my pitfalls.

I often dream of setting up a cluster that requires a newspaper hat because of the oil from cooling the coils, wait!, that was a replica of the early cyclotron, sorry, wrong experiment. 😉

I mean a cluster of computers humming and driving up my cooling bills.

But there are alternatives.

Amazon’s Elastic Map Reduce (EMR) is one.

You can learn Hadoop with the Hortonworks Sandbox, and when you need production power, EMR awaits.

From a cost effectiveness standpoint, that sounds like a good deal to me.

You?

PS: Someone told me today that Amazon isn’t a reliable cloud because they have downtime. It is true that Amazon does have downtime but that isn’t a deciding factor.

You have to consider the relationship between Amazon’s aggressive pricing and how much reliability you need.

If you are running flight control for a moon launch, you probably should not use a public cloud.

Or for a heart surgery theater. And a few other places like that.

If you mean the web services for your < 4,000 member NGO, 100% guaranteed uptime is a recipe for someone making money off of you.

April 18, 2013

Hadoop: The Lay of the Land

Filed under: Hadoop,MapReduce — Patrick Durusau @ 10:47 am

Hadoop: The Lay of the Land by Tom White.

From the post:

The core map-reduce framework for big data consists of several interlocking technologies. This first installment of our tutorial explains what Hadoop does and how the pieces fit together.

Big Data is in the news these days, and Apache Hadoop is one of the most popular platforms for working with Big Data. Hadoop itself is undergoing tremendous growth as new features and components are added, and for this reason alone, it can be difficult to know how to start working with it. In this three-part series, I explain what Hadoop is and how to use it, presenting simple, hands-on examples that you can try yourself. First, though, let’s look at the problem that Hadoop was designed to solve.

Much later:

Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. He is an engineer at Cloudera, a company set up to offer Hadoop tools, support, and training. He is the author of the best-selling O’Reilly book, Hadoop: The Definitive Guide.

If you are getting started with Hadoop or need a good explanation for others, start here.

I first saw this at: Learn How To Hadoop from Tom White in Dr. Dobb’s by Justin Kestelyn.

April 16, 2013

Iterative Map Reduce – Prior Art

Filed under: Hadoop,MapReduce — Patrick Durusau @ 11:59 am

Iterative Map Reduce – Prior Art

From the post:

There have been several attempts in the recent past at extending Hadoop to support efficient iterative data processing on clusters. To facilitate understanding this problem better, here is a collection of some prior art relating to this problem space.

Short summaries of:

Other proposals to add to this list?

April 3, 2013

MapR and Ubuntu

Filed under: Hadoop,MapR,MapReduce — Patrick Durusau @ 5:06 am

MapR has posted all of its Hadoop ecosystem source code to Github: MapR Technologies.

MapR has also partnered with Canonical to release the entire Hadoop stack for 12.04 LTS and 12.10 releases of Ubuntu on www.ubuntu.com starting April 25, 2013.

For details see: MapR Teams with Canonical to Deliver Hadoop on Ubuntu.

I first saw this at: MapR Turns to Ubuntu in Bid to Increase Footprint by Isaac Lopez.

March 25, 2013

5 Pitfalls To Avoid With Hadoop

Filed under: Data Integration,Hadoop,MapReduce — Patrick Durusau @ 3:41 pm

5 Pitfalls To Avoid With Hadoop by Syncsort, Inc.

From the registration page:

Hadoop is a great vehicle to extract value from Big Data. However, relying only on Hadoop and common scripting tools like Pig, Hive and Sqoop to achieve a complete ETL solution can hinder success.

Syncsort has worked with early adopter Hadoop customers to identify and solve the most common pitfalls organizations face when deploying ETL on Hadoop.

  1. Hadoop is not a data integration tool
  2. MapReduce programmers are hard to find
  3. Most data integration tools don’t run natively within Hadoop
  4. Hadoop may cost more than you think
  5. Elephants don’t thrive in isolation

Before you give up your email and phone number for the “free ebook,” be aware it is a promotional piece for Syncsort DMX-h.

Which isn’t a bad thing but if you are expecting something different, you will be disappointed.

The observations are trivially true and amount to Hadoop not having a user-facing interface, pre-written routines for data integration, and the tools that data integration users normally expect.

OK, but a hammer doesn’t come with blueprints, nails, wood, etc., either, and those aren’t “pitfalls.”

It’s the nature of a hammer that those “extras” need to be supplied.

You can either do that piecemeal or you can use a single source (the equivalent of Syncsort DMX-h).

Syncsort should be on your short list of data integration options to consider, but let’s avoid loose talk about Hadoop. There is enough of that in the uninformed mainstream media.

March 22, 2013

Apache Crunch (Top-Level)

Filed under: Apache Crunch,Hadoop,MapReduce — Patrick Durusau @ 12:34 pm

Apache Crunch (Top Level)

While reading Josh Wills’ post, Cloudera ML: New Open Source Libraries and Tools for Data Scientists, I saw that Apache Crunch became a top-level project at the Apache Software Foundation last month.

Congratulations to Josh and all the members of the Crunch community!

From the Apache Crunch homepage:

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines, and is based on Google’s FlumeJava library. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Running on top of Hadoop MapReduce, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
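To see how small that API is in practice, here is a word-count pipeline in the spirit of Crunch’s getting-started examples. It should be close to the documented API, but treat package names and method details as subject to change across releases.

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
    public static void main(String[] args) throws Exception {
        // A pipeline backed by Hadoop MapReduce.
        Pipeline pipeline = new MRPipeline(CrunchWordCount.class);

        // Read lines, split them into words, count the words.
        PCollection<String> lines = pipeline.readTextFile(args[0]);
        PCollection<String> words = lines.parallelDo(
            new DoFn<String, String>() {
                @Override
                public void process(String line, Emitter<String> emitter) {
                    for (String word : line.split("\\s+")) {
                        emitter.emit(word);
                    }
                }
            }, Writables.strings());
        PTable<String, Long> counts = words.count();

        // Write the results and run the underlying MapReduce job(s).
        pipeline.writeTextFile(counts, args[1]);
        pipeline.done();
    }
}
```

Nothing actually runs until done() (or run()) is called, which is what makes composing many small user-defined functions into one pipeline cheap.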

You may be interested in: Crunch-133 Add Aggregator support for combineValues ops on secondary keys via maps and collections. It is an “open” issue.

March 17, 2013

M3R: Increased Performance for In-Memory Hadoop Jobs

Filed under: Hadoop,Main Memory Map Reduce (M3R),MapReduce — Patrick Durusau @ 3:42 pm

M3R: Increased Performance for In-Memory Hadoop Jobs by Avraham Shinnar, David Cunningham, Benjamin Herta, Vijay Saraswat.

Abstract:

Main Memory Map Reduce (M3R) is a new implementation of the Hadoop Map Reduce (HMR) API targeted at online analytics on high mean-time-to-failure clusters. It does not support resilience, and supports only those workloads which can fit into cluster memory. In return, it can run HMR jobs unchanged – including jobs produced by compilers for higher-level languages such as Pig, Jaql, and SystemML and interactive front-ends like IBM BigSheets – while providing significantly better performance than the Hadoop engine on several workloads (e.g. 45x on some input sizes for sparse matrix vector multiply). M3R also supports extensions to the HMR API which can enable Map Reduce jobs to run faster on the M3R engine, while not affecting their performance under the Hadoop engine.

The authors start with the assumption of “clean” data that has already been reduced to terabytes in size and that can be stored in main memory for “scores” of nodes as opposed to thousands of nodes. (score = 20)

And they make the point that main memory is only going to increase in the coming years.

While phrased as “interactive analytics (e.g. interactive machine learning),” I wonder if the design point is avoiding non-random memory?

And what consequences will entirely random-access memory have on algorithm design? Or on the assumptions that drive algorithmic design?

One way to test the impact of large memory on design would be to award access to a cluster with several terabytes of data on a competitive basis, for some time period, with all the code, data, runs, etc., being streamed to a public forum.

One qualification being that the user not already have access to that level of computing power at work. 😉

I first saw this at Alex Popescu’s Paper: M3R – Increased Performance for In-Memory Hadoop Jobs.

March 16, 2013

Non-Word Count Hello World

Filed under: Hadoop,MapReduce — Patrick Durusau @ 4:11 pm

Finally! A Hadoop Hello World That Isn’t A Lame Word Count! by John Berryman.

From the post:

So I got bored of the old WordCount Hello World, and being a fairly mathy person, I decided to make my own Hello World in which I coaxed Hadoop into transposing a matrix!

What? What’s that you say? You think that a matrix transpose MapReduce is way more lame than a word count? Well I didn’t say that we were going to be saving the world with this MapReduce job, just flexing our mental muscles a little more. Typically, when you run the WordCount example, you don’t even look at the Java code. You just pat yourself on the back when the word “the” is invariably revealed to be the most popular word in the English language.

The goal of this exercise is to present a new challenge and a simple challenge so that we can practice thinking about solving BIG problems under the sometimes unintuitive constraints of MapReduce. Ultimately I intend to follow this post up with exceedingly more difficult MapReduce problems to challenge you and encourage you to tackle your own problems.

So, without further ado:

As John says, not much beyond the Word Count examples but it is a different problem.
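To make the shape of such a job concrete, here is my own rough sketch (not John’s code), assuming the sparse matrix arrives as “row col value” triples: the mapper swaps the indices and the reducer writes out the entries of the transpose.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixTranspose {

    // Input lines are assumed to be "row col value" triples.
    // The mapper swaps the indices: the column becomes the grouping key.
    public static class SwapMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().trim().split("\\s+");
            int row = Integer.parseInt(parts[0]);
            int col = Integer.parseInt(parts[1]);
            String value = parts[2];
            context.write(new IntWritable(col), new Text(row + " " + value));
        }
    }

    // The reducer sees all entries for one output row (the original column)
    // and writes them out as "newRow newCol <tab> value" lines of the transpose.
    public static class EmitReducer
            extends Reducer<IntWritable, Text, Text, Text> {
        @Override
        protected void reduce(IntWritable col, Iterable<Text> entries, Context context)
                throws IOException, InterruptedException {
            for (Text entry : entries) {
                String[] parts = entry.toString().split(" ");
                context.write(new Text(col + " " + parts[0]), new Text(parts[1]));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "matrix transpose");
        job.setJarByClass(MatrixTranspose.class);
        job.setMapperClass(SwapMapper.class);
        job.setReducerClass(EmitReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```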

The promise of more difficult MapReduce problems sounds intriguing.

Need to watch for the follow-up posts.

March 15, 2013

YSmart: Yet Another SQL-to-MapReduce Translator

Filed under: MapReduce,SQL — Patrick Durusau @ 4:30 pm

YSmart: Yet Another SQL-to-MapReduce Translator by Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, Xiaodong Zhang.

Abstract:

MapReduce has become an effective approach to big data analytics in large cluster systems, where SQL-like queries play important roles to interface between users and systems. However, based on our Facebook daily operation results, certain types of queries are executed at an unacceptably low speed by Hive (a production SQL-to-MapReduce translator). In this paper, we demonstrate that existing SQL-to-MapReduce translators that operate in a one-operation-to-one-job mode and do not consider query correlations cannot generate high-performance MapReduce programs for certain queries, due to the mismatch between complex SQL structures and the simple MapReduce framework. We propose and develop a system called YSmart, a correlation-aware SQL-to-MapReduce translator. YSmart applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query. YSmart can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators. We have implemented YSmart with intensive evaluation for complex queries on two Amazon EC2 clusters and one Facebook production cluster. The results show that YSmart can outperform Hive and Pig, two widely used SQL-to-MapReduce translators, by more than four times for query execution.

Just in case you aren’t picking the videos for this weekend.

Alex Popescu points this paper out at: Paper: YSmart – Yet Another SQL-to-MapReduce Translator.

March 9, 2013

The history of Hadoop:…

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:50 pm

The history of Hadoop: From 4 nodes to the future of data by Derrick Harris.

From the post:

Depending on how one defines its birth, Hadoop is now 10 years old. In that decade, Hadoop has gone from being the hopeful answer to Yahoo’s search-engine woes to a general-purpose computing platform that’s poised to be the foundation for the next generation of data-based applications.

Alone, Hadoop is a software market that IDC predicts will be worth $813 million in 2016 (although that number is likely very low), but it’s also driving a big data market the research firm predicts will hit more than $23 billion by 2016. Since Cloudera launched in 2008, Hadoop has spawned dozens of startups and spurred hundreds of millions of dollars in venture capital investment.

In this four-part series, we’ll explain everything anyone concerned with information technology needs to know about Hadoop. Part I is the history of Hadoop from the people who willed it into existence and took it mainstream. Part II is more graphic: a map of the now-large and complex ecosystem of companies selling Hadoop products. Part III is a look into the future of Hadoop that should serve as an opening salvo for much of the discussion at our Structure: Data conference March 20-21 in New York. Finally, Part IV will highlight some of the best Hadoop applications and seminal moments in Hadoop history, as reported by GigaOM over the years.

Whether you hope for insight into what makes a software paradigm successful or just to enrich your knowledge of Hadoop’s history, either way this is a great start on a history of Hadoop!

Enjoy!

March 8, 2013

hadoop illuminated (book)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 2:06 pm

hadoop illuminated by Mark Kerzner and Sujee Maniyam.

Largely a subjective judgment but I think the explanations of Hadoop are getting better.

Oh, the deep/hard stuff is still there, but the on ramp for getting to that point has become easier.

This book is a case in point.

I first saw this in a tweet by Computer Science.

March 7, 2013

Million Song Dataset in Minutes!

Filed under: Hadoop,MapReduce,Mortar,Pig,Python — Patrick Durusau @ 3:50 pm

Million Song Dataset in Minutes! (Video)

Actually 5:35 as per the video.

The summary of the video reads:

  • Created Web Project [zero install]
  • Loaded data from S3
  • Developed in Pig and Python [watch for the drop-down menus of Pig fragments]
  • ILLUSTRATE’d our work [perhaps the most impressive feature, tests code against a sample of data]
  • Ran on Hadoop [drop-downs to create a cluster]
  • Downloaded results [50 “densest songs”, see the video]

It’s not all “hands free” or without intellectual effort on your part.

But, a major step towards a generally accessible interface for Hadoop/MapReduce data processing.

MortarData2013

Filed under: Hadoop,MapReduce,Mortar,Pig — Patrick Durusau @ 3:36 pm

MortarData2013

Mortar has its own YouTube channel!

Unlike the History Channel, the MortarData2013 channel is educational and entertaining.

I leave it to you to guess whether those two adjectives apply to the History Channel. (Hint: Thirty (30) minutes of any Vikings episode should help you answer.)

Not a lot of content at the moment, but of what is there, I am going to cover one of the videos in a separate post.

March 6, 2013

Hadoop MapReduce: to Sort or Not to Sort

Filed under: Hadoop,MapReduce,Sorting — Patrick Durusau @ 7:22 pm

Hadoop MapReduce: to Sort or Not to Sort by Tendu Yogurtcu.

From the post:

What is the big deal about Sort? Sort is fundamental to the MapReduce framework; the data is sorted between the Map and Reduce phases (see below). Syncsort’s contribution allows the native Hadoop sort to be replaced by an alternative sort implementation, for both the Map and Reduce sides, i.e. it makes the Sort phase pluggable.

[Figure: MapReduce data flow, showing the sort between the Map and Reduce phases]

Opening up the Sort phase to alternative implementations will facilitate new use cases and data flows in the MapReduce framework. Let’s look at some of these use cases:

The use cases include:

  • Optimized sort implementations.
  • Hash-based aggregations.
  • Ability to run a job with a subset of data.
  • Optimized full joins.

See Tendu’s post for the details.
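For what it’s worth, wiring in an alternative implementation boils down to a couple of job configuration properties from the pluggable sort/shuffle work in Hadoop 2.x. A hedged sketch (the property names come from that line of work, but verify them against your Hadoop version; the plugin classes named here are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PluggableSortExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map side: swap in an alternative map output collector / sort
        // implementation (property name from the pluggable-sort work in
        // Hadoop 2.x; verify against your version's documentation).
        conf.set("mapreduce.job.map.output.collector.class",
                 "com.example.sort.MyMapOutputCollector"); // hypothetical class

        // Reduce side: swap in an alternative shuffle/merge consumer plugin.
        conf.set("mapreduce.job.reduce.shuffle.consumer.plugin.class",
                 "com.example.sort.MyShuffleConsumerPlugin"); // hypothetical class

        Job job = Job.getInstance(conf, "job with pluggable sort");
        // ... configure mapper, reducer, input and output paths as usual ...
    }
}
```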

I first saw this at Use Cases for Hadoop’s New Pluggable Sort by Alex Popescu.

PolyBase

Filed under: Hadoop,HDFS,MapReduce,PolyBase,SQL,SQL Server — Patrick Durusau @ 11:20 am

PolyBase

From the webpage:

PolyBase is a fundamental breakthrough in data processing used in SQL Server 2012 Parallel Data Warehouse to enable truly integrated query across Hadoop and relational data.

Complementing Microsoft’s overall Big Data strategy, PolyBase is a breakthrough new technology on the data processing engine in SQL Server 2012 Parallel Data Warehouse designed as the simplest way to combine non-relational data and traditional relational data in your analysis. While customers would normally burden IT to pre-populate the warehouse with Hadoop data or undergo an extensive training on MapReduce in order to query non-relational data, PolyBase does this all seamlessly giving you the benefits of “Big Data” without the complexities.

I must admit I had my hopes up for the videos labeled: “Watch informative videos to understand PolyBase.”

But the first one was only 2:52 in length and the second was about the Jim Gray Systems Lab (2:13).

So, fair to say it was short on details. 😉

The closest thing I found to a clue was in the PolyBase datasheet (under PolyBase Use Cases, if you are reading along), which says:

PolyBase introduces the concept of external tables to represent data residing in HDFS. An external table defines a schema (that is, columns and their types) for data residing in HDFS. The table’s metadata lives in the context of a SQL Server database and the actual table data resides in HDFS.

I assume that means there could be multiple external tables for the same data in HDFS? Depending upon the query?

Curious if the external tables and/or data types are going to have MapReduce capabilities built in? To take advantage of parallel processing of the data?

BTW, for topic map types, subject identities for the keys and data types would be the same as with more traditional “internal” tables. In case you want to merge data.

Just out of curiosity, any thoughts on possible IP on external schemas being applied to data?

I first saw this at Alex Popescu’s Microsoft PolyBase: Unifying Relational and Non-Relational Data.

March 4, 2013

GraphBuilder – A Scalable Graph Construction Library for Apache™ Hadoop™

Filed under: GraphBuilder,Graphs,Hadoop,MapReduce,Networks — Patrick Durusau @ 2:56 pm

GraphBuilder – A Scalable Graph Construction Library for Apache™ Hadoop™ by Theodore L. Willke, Nilesh Jain and Haijie Gu. (whitepaper)

Abstract:

The exponential growth in the pursuit of knowledge gleaned from data relationships that are expressed naturally as large and complex graphs is fueling new parallel machine learning algorithms. The nature of these computations is iterative and data-dependent. Recently, frameworks have emerged to perform these computations in a distributed manner at commercial scale. But feeding data to these frameworks is a huge challenge in itself. Since graph construction is a data-parallel problem, Hadoop is well-suited for this task but lacks some elements that would make things easier for data scientists that do not have domain expertise in distributed systems engineering. We developed GraphBuilder, a scalable graph construction software library for Apache Hadoop, to address this gap. GraphBuilder offloads many of the complexities of graph construction, including graph formation, tabulation, compression, transformation, partitioning, output formatting, and serialization. It is written in Java for ease of programming and scales using the MapReduce parallel programming model. We describe the motivation for GraphBuilder, its architecture, and present two case studies that provide a preliminary evaluation.

The “whitepaper” introduction to GraphBuilder.
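Setting GraphBuilder’s own APIs aside, the tabulation and graph-formation steps it automates follow a familiar data-parallel pattern. A stripped-down sketch of my own (not GraphBuilder code) that builds de-duplicated adjacency lists from raw “source target” edge records:

```java
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EdgeListBuilder {

    // Tabulation: parse raw "source target" records into (source, target) pairs.
    public static class EdgeMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String[] parts = record.toString().trim().split("\\s+");
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text(parts[1]));
            }
        }
    }

    // Graph formation: group by source vertex, drop duplicate edges,
    // and emit one adjacency list per vertex.
    public static class AdjacencyReducer
            extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text source, Iterable<Text> targets, Context context)
                throws IOException, InterruptedException {
            Set<String> unique = new LinkedHashSet<String>();
            for (Text target : targets) {
                unique.add(target.toString());
            }
            context.write(source, new Text(String.join(",", unique)));
        }
    }
}
```

The driver is the usual Job setup boilerplate and is omitted; GraphBuilder’s value is in layering partitioning, compression, transformation, and output formatting for graph frameworks on top of this basic pattern.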

March 3, 2013

Project Rhino

Filed under: Cybersecurity,Hadoop,MapReduce,Project Rhino,Security — Patrick Durusau @ 1:21 pm

Project Rhino

Is Wintel becoming Hintel? 😉

If history is a guide, that might not be a bad thing.

From the project page:

As Hadoop extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with all Hadoop projects and HBase must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.

The core of the Apache Hadoop ecosystem as it is commonly understood is:

  • Core: A set of shared libraries
  • HDFS: The Hadoop filesystem
  • MapReduce: Parallel computation framework
  • ZooKeeper: Configuration management and coordination
  • HBase: Column-oriented database on HDFS
  • Hive: Data warehouse on HDFS with SQL-like access
  • Pig: Higher-level programming language for Hadoop computations
  • Oozie: Orchestration and workflow management
  • Mahout: A library of machine learning and data mining algorithms
  • Flume: Collection and import of log and event data
  • Sqoop: Imports data from relational databases

These components are all separate projects, and therefore cross-cutting concerns like authN, authZ, a consistent security policy framework, a consistent authorization model and audit coverage are loosely coordinated. Some security features expected by our customers, such as encryption, are simply missing. Our aim is to take a full stack view and work with the individual projects toward consistent concepts and capabilities, filling gaps as we go.

Like I said, might not be a bad thing!

Different from recent government rantings. Focused on a particular stack with the intent to analyze that stack, not the world at large, and to make specific improvements (read: measurable results).

March 2, 2013

Hadoop++ and HAIL [and LIAH]

Filed under: Hadoop,HAIL,MapReduce — Patrick Durusau @ 3:33 pm

Hadoop++ and HAIL

From the webpage:

Hadoop++

Hadoop++: Nowadays, working over very large data sets (Petabytes of information) is a common reality for several enterprises. In this context, query processing is a big challenge and becomes crucial. The Apache Hadoop project has been adopted by many famous companies to query their Petabytes of information. Some examples of such enterprises are Yahoo! and Facebook. Recently, some researchers from the database community indicated that Hadoop may suffer from performance issues when running analytical queries. We believe this is not an inherent problem of the MapReduce paradigm but rather a result of some implementation choices made in Hadoop. Therefore, the overall goal of the Hadoop++ project is to improve Hadoop’s performance for analytical queries. Already, our preliminary results show an improvement of Hadoop++ over Hadoop by up to a factor of 20. In addition, we are currently investigating the impact of a number of other optimization techniques.

HAIL

[Image: the HAIL elephant logo]

HAIL (Hadoop Aggressive Indexing Library) is an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop and even outperforms Hadoop++.

Isn’t that a cool aggressive elephant?

But before you get too excited, consider:

Towards Zero-Overhead Adaptive Indexing in Hadoop by Stefan Richter, Jorge-Arnulfo Quiané-Ruiz, Stefan Schuh, Jens Dittrich.

Abstract:

Several research works have focused on supporting index access in MapReduce systems. These works have allowed users to significantly speed up selective MapReduce jobs by orders of magnitude. However, all these proposals require users to create indexes upfront, which might be a difficult task in certain applications (such as in scientific and social applications) where workloads are evolving or hard to predict. To overcome this problem, we propose LIAH (Lazy Indexing and Adaptivity in Hadoop), a parallel, adaptive approach for indexing at minimal costs for MapReduce systems. The main idea of LIAH is to automatically and incrementally adapt to users’ workloads by creating clustered indexes on HDFS data blocks as a byproduct of executing MapReduce jobs. Besides distributing indexing efforts over multiple computing nodes, LIAH also parallelises indexing with both map tasks computation and disk I/O. All this without any additional data copy in main memory and with minimal synchronisation. The beauty of LIAH is that it piggybacks index creation on map tasks, which read relevant data from disk to main memory anyways. Hence, LIAH does not introduce any additional read I/O-costs and exploit free CPU cycles. As a result and in contrast to existing adaptive indexing works, LIAH has a very low (or invisible) indexing overhead, usually for the very first job. Still, LIAH can quickly converge to a complete index, i.e. all HDFS data blocks are indexed. Especially, LIAH can trade early job runtime improvements with fast complete index convergence. We compare LIAH with HAIL, a state-of-the-art indexing technique, as well as with standard Hadoop with respect to indexing overhead and workload performance. In terms of indexing overhead, LIAH can completely index a dataset as a byproduct of only four MapReduce jobs while incurring a low overhead of 11% over HAIL for the very first MapReduce job only. In terms of workload performance, our results show that LIAH outperforms Hadoop by up to a factor of 52 and HAIL by up to a factor of 24.

The Information Systems Group at Saarland University, led by Prof. Dr. Jens Dittrich, is a place to watch.

March 1, 2013

Pig Eye for the SQL Guy

Filed under: Hadoop,MapReduce,Pig,SQL — Patrick Durusau @ 5:33 pm

Pig Eye for the SQL Guy by Cat Miller.

From the post:

For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.

As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, often the amounts of data are so large that working with them on a traditional SQL database is so non-performant as to be impractical.

Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)

This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.

Do you speak SQL?

Want to learn to speak Pig?

This is the right post for you!

WANdisco: Free Hadoop Training Webinars

Filed under: Hadoop,HBase,MapReduce — Patrick Durusau @ 5:31 pm

WANdisco: Free Hadoop Training Webinars

WANdisco has four Hadoop webinars to put on your calendar:

A Hadoop Overview

This webinar will include a review of major components including HDFS, MapReduce, and HBase – the NoSQL database management system used with Hadoop for real-time applications. An overview of Hadoop’s ecosystem will also be provided. Other topics covered will include a review of public and private cloud deployment options, and common business use cases.

Register now Weds, March 13, 10:00 a.m. PT/1:00 p.m. ET

A Hadoop Deep Dive

This webinar will cover Hadoop misconceptions (not all clusters are thousands of machines), information about real world Hadoop deployments, a detailed review of Hadoop’s ecosystem (Sqoop, Flume, Nutch, Oozie, etc.), an in-depth look at HDFS, and an explanation of MapReduce in relation to latency and dependence on other Hadoop activities.

This webinar will introduce attendees to concepts they will need as a prerequisite for subsequent training webinars covering MapReduce, HBase and other major components at a deeper technical level.

Register now Weds, March 27, 10:00 a.m. PT/1:00 p.m. ET

Hadoop: A MapReduce Tutorial

This webinar will cover MapReduce at a deep technical level.

This session will cover the history of MapReduce, how a MapReduce job works, its logical flow, the rules and types of MapReduce jobs, de-bugging and testing MapReduce jobs, writing foolproof MapReduce jobs, various workflow tools that are available, and more.

Register now Weds, April 10, 10:00 a.m. PT/1:00 p.m. ET

Hadoop: HBase In-Depth

This webinar will provide a deep technical review of HBase, and cover flexibility, scalability, components (cells, rows, columns, qualifiers), schema samples, hardware requirements and more.

Register now Weds, April 24, 10:00 a.m. PT/1:00 p.m. ET

I first saw this at: WANdisco Announces Free Hadoop Training Webinars.

A post with no link to WANdisco or to registration for any of the webinars.

If you would prefer that I put in fewer hyperlinks to resources, please let me know.

February 27, 2013

Big Data Central

Filed under: BigData,Hadoop,MapReduce — Patrick Durusau @ 5:34 pm

Big Data Central by LucidWorks™

From LucidWorks™ Launches Big Data Central:

The new website, Big Data Central, is meant to become the primary source of educational materials, case studies, trends, and insights that help companies navigate the changing data management landscape. At Big Data Central, visitors can find, and contribute to, a wide variety of information including:

  • Use cases and best practices that highlight lessons learned from peers
  • Industry and analyst reports that track trends and hot topics
  • Q&As that answer some of the most common questions plaguing firms today about Big Data implementations

Definitely one for the news feed!

Apache Pig: It goes to 0.11

Filed under: Hadoop,MapReduce,Pig — Patrick Durusau @ 5:33 pm

Apache Pig: It goes to 0.11

From the post:

After months of work, we are happy to announce the 0.11 release of Apache Pig. In this blog post, we highlight some of the major new features and performance improvements that were contributed to this release. A large chunk of the new features was created by Google Summer of Code (GSoC) students with supervision from the Apache Pig PMC, while the core Pig team focused on performance improvements, usability issues, and bug fixes. We encourage CS students to consider applying for GSOC in 2013 — it’s a great way to contribute to open source software.

This blog post hits some of the highlights of the release. Pig users may also find a presentation by Daniel Dai, which includes code and output samples for the new operators, helpful.

And from Hortonworks’ post on the release:

  • A DateTime datatype, documentation here.
  • A RANK function, documentation here.
  • A CUBE operator, documentation here.
  • Groovy UDFs, documentation here.

If you remember Robert Barta’s Cartesian expansion of tuples, you will find it in the CUBE operator.
