Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 7, 2013

A Quick Guide to Hadoop Map-Reduce Frameworks

Filed under: Hadoop,Hive,MapReduce,Pig,Python,Scalding,Scoobi,Scrunch,Spark — Patrick Durusau @ 10:45 am

A Quick Guide to Hadoop Map-Reduce Frameworks by Alex Popescu.

Alex has assembled links to guides for the major MapReduce frameworks.

Thanks Alex!

February 5, 2013

Understanding MapReduce via Boggle [Topic Map Game Suggestions?]

Filed under: Hadoop,MapReduce — Patrick Durusau @ 4:54 pm

Understanding MapReduce via Boggle by Jesse Anderson.

From the post:

Graph theory is a growing part of Big Data. Using graph theory, we can find relationships in networks.

MapReduce is a great platform for traversing graphs. Therefore, one can leverage the power of an Apache Hadoop cluster to efficiently run an algorithm on the graph.

One such graph problem is playing Boggle*. Boggle is played by rolling a group of 16 dice. Each player’s job is to find the most words spelled out by the dice. The dice are six-sided, with a single letter facing up.
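To make the MapReduce angle concrete, here is a rough Python sketch (mine, not Jesse’s) of a single expansion pass: each pass extends every candidate letter path by one adjacent, unused die, and a driver would repeat passes until no candidates remain. The board layout and the naive prefix check are my assumptions.

```python
def neighbors(pos, size=4):
    # dice adjacent to pos on a size x size Boggle board
    r, c = pos
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
            and 0 <= r + dr < size and 0 <= c + dc < size]

def expand(prefix, path, board, dictionary):
    # one map step: extend a partial path by each unused adjacent die;
    # board maps (row, col) -> letter, path is the list of dice used so far
    for nxt in neighbors(path[-1]):
        if nxt not in path:
            word = prefix + board[nxt]
            # naive pruning: keep only prefixes some dictionary word starts with
            if any(w.startswith(word) for w in dictionary):
                yield word, path + [nxt]
```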

Cool!

Any suggestions for a game that illustrates topic maps?

Perhaps a “discovery” game that leads to more points, etc., as merges occur?

I first saw this at Alex Popescu’s 3 MapReduce and Hadoop Links: Secondary Sorting, Hadoop-Based Letterpress, and Hadoop Vaidya.

Doing More with the Hortonworks Sandbox

Filed under: Data,Dataset,Hadoop,Hortonworks — Patrick Durusau @ 2:01 pm

Doing More with the Hortonworks Sandbox by Cheryle Custer.

From the post:

The Hortonworks Sandbox was recently introduced, garnering incredibly positive response and feedback. We are as excited as you, and gratified that our goal of providing the fastest onramp to Apache Hadoop has come to fruition. By providing a free, integrated learning environment along with a personal Hadoop environment, we are helping you gain those big data skills faster. Because of your feedback and demand for new tutorials, we are accelerating the release schedule for upcoming tutorials. We will continue to announce new tutorials via the Hortonworks blog, opt-in email and Twitter (@hortonworks).

While you wait for more tutorials, Cheryle points to some data sets to keep you busy:

For advice, see the Sandbox Forums.

BTW, while you are munging across different data sets, be sure to notice any semantic impedance when you try to merge them.

If you don’t want everyone in your office doing that merging as a one-off, you might want to consider topic maps.

Design and document a merge between data sets once, run many times.

Even if your merging requirements change. Just change that part of the map, don’t re-create the entire map.

What if mapping companies recreated their maps for every new street?

Or would it be better to add the new street to an existing map?

If that looks obvious, try the extra-bonus question:

Which model, new map or add new street, do you use for schema migration?

January 26, 2013

DataFu: The WD-40 of Big Data

Filed under: DataFu,Hadoop,MapReduce,Pig — Patrick Durusau @ 1:42 pm

DataFu: The WD-40 of Big Data by Sam Shah.

From the post:

If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something.

No, seriously, DataFu is a collection of Pig UDFs for data analysis on Hadoop. DataFu includes routines for common statistics tasks (e.g., median, variance), PageRank, set operations, and bag operations.

It’s helpful to understand the history of the library. Over the years, we developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.” The unfortunate part, and this is true of many such efforts, is that the UDFs were ill-documented, ill-organized, and easily got broken when someone made a change. Along came PigUnit, which allowed UDF testing, so we spent the time to clean up these routines by adding documentation and rigorous unit tests. From this “datafoo” package, we thought this would help the community at large, and there you have DataFu.

So what can this library do for you? Let’s look at one of the classical examples that showcase the power and flexibility of Pig: sessionizing a click stream.
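The post works that example in Pig with DataFu’s UDFs. As a quick orientation only, here is a toy Python rendering of the sessionizing logic (mine, not from the post; the 30-minute inactivity gap is an assumed threshold):

```python
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold

def sessionize(clicks):
    # split one user's (timestamp, url) clicks into sessions wherever the
    # gap between consecutive clicks exceeds SESSION_GAP
    sessions, current, last_ts = [], [], None
    for ts, url in sorted(clicks):
        if last_ts is not None and ts - last_ts > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append((ts, url))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

In the Pig version, each user's bag of clicks arrives pre-grouped and a UDF plays the role of this function, one bag at a time.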

DataFu

The UDF bag and set operations are likely to be of particular interest.

January 25, 2013

Hadoop – “State of the Union” – Notes

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 8:15 pm

I took notes on Shaun Connolly’s “Hortonworks State of the Union and Vision for Apache Hadoop in 2013.”

Unless you are an open source/Hadoop innocent, you aren’t going to gain much new information from the webinar.

But there is another reason for watching the webinar.

That is the strategy Hortonworks is pursuing in developing the Hadoop ecosystem.

Shaun refers to it in various ways (warning, some paraphrasing): “Investing in making Hadoop work with existing infrastructure,” add Hadoop to (not replace) traditional data architectures, “customers don’t want more data silos.”

Rather than a rip-and-replace technology, Hortonworks is building a Hadoop ecosystem that interacts with and complements existing data architectures.

Think about that for a moment.

Works with existing data architectures.

Which means everyone, from ordinary users to power users and sysadmins, can work from what they know and gain the benefits of a Hadoop ecosystem.

True enough, over time some (all?) of their traditional data architectures may become more Hadoop-based, but that will be a gradual process.

In the meantime, the benefits of Hadoop will be made manifest in the context of familiar tooling.

When a familiar tool runs faster or acquires new capabilities, users notice the change, along with the lack of a learning curve.

Watch the webinar for the strategy and think about how to apply it to your favorite semantic technology.

January 22, 2013

Hortonworks Sandbox — the Fastest On Ramp to Apache Hadoop

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 2:42 pm

Hortonworks Sandbox — the Fastest On Ramp to Apache Hadoop by Cheryle Custer.

From the post:

Today Hortonworks announced the availability of the Hortonworks Sandbox, an easy-to-use, flexible and comprehensive learning environment that will provide you with the fastest on-ramp to learning and exploring enterprise Apache Hadoop.

The Hortonworks Sandbox is:

  • A free download
  • A complete, self-contained virtual machine with Apache Hadoop pre-configured
  • A personal, portable and standalone Hadoop environment
  • A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop on your own

The Hortonworks Sandbox is designed to help close the gap between people wanting to learn and evaluate Hadoop, and the complexities of spinning up an evaluation cluster of Hadoop. The Hortonworks Sandbox provides a powerful combination of hands-on, step-by-step tutorials paired with an easy-to-use Web interface designed to lower the learning curve for people who just want to explore and evaluate Hadoop as quickly as possible.

BTW, the tutorials can be refreshed to load new tutorials as they are released.

A marketing/teaching strategy that merits imitation by others.

Mahout on Windows Azure…

Filed under: Azure Marketplace,Hadoop,Hortonworks,Machine Learning,Mahout — Patrick Durusau @ 2:42 pm

Mahout on Windows Azure – Machine Learning Using Microsoft HDInsight by Istvan Szegedi.

From the post:

Our last post was about Microsoft and Hortonworks’ joint effort to deliver Hadoop on Microsoft Windows Azure, dubbed HDInsight. One of the key Microsoft HDInsight components is Mahout, a scalable machine learning library that provides a number of algorithms relying on the Hadoop platform. Machine learning supports a wide range of use cases, from email spam filtering to fraud detection to recommending books or movies, similar to Amazon.com features. These algorithms can be divided into three main categories: recommenders/collaborative filtering, categorization and clustering. More details about these algorithms can be read on the Apache Mahout wiki.

Are you hearing Hadoop, Mahout, HBase, Hive, etc., as often as I am?

Does it make you wonder about Apache becoming the locus of transferable IT skills?

Something to think about as you are developing topic map ecosystems.

You can hand roll your own solutions.

Or build upon solutions that have widespread vendor support.

PS: Another great post from Istvan.

January 20, 2013

Hadoop Bingo [The More Things Change…, First Game 22nd Jan. 2013]

Filed under: BigData,Hadoop,Hortonworks — Patrick Durusau @ 8:03 pm

Don’t be Tardy for This Hadoop BINGO Party! by Kim Truong.

It had to happen. Virtual “door prizes” for attending webinars, now bingo games.

I’m considering a contest to guess which 1950s/’60s marketing tool will appear next. 😉

From the post:

I’m excited to kick-off our first webinar series for 2013: The True Value of Apache Hadoop.

Get all your friends, co-workers together and be prepared to geek out to Hadoop!

This 4-part series will have a mixture of amazing guest speakers covering topics such as Hortonworks’ 2013 vision and roadmaps for Apache Hadoop and Big Data, what’s new with Hortonworks Data Platform v1.2, how Luminar (an Entravision company) adopted Apache Hadoop, and a use case on Hadoop, R and googleVis. This series will provide organizations an opportunity to gain a better understanding of the Apache Hadoop and Big Data landscape and practical guidance on how to leverage Hadoop as part of your Big Data strategy.

How is that a party?

Don’t be confused. The True Value of Apache Hadoop is the series name and Hortonworks State of the Union and Vision for Apache Hadoop in 2013 is the first webinar title. My note on the “State of the Union.”

Don’t get me wrong. Entirely appropriate to recycle 1950’s/60’s techniques (or older).

We are people and people haven’t changed in terms of motivations, virtues or vices in recorded history.

If the past works, use it.

January 19, 2013

Hadoop “State of the Union” [Webinar]

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 7:07 pm

Hortonworks State of the Union and Vision for Apache Hadoop in 2013 by Kim Rose.

From the post:

Who: Shaun Connolly, Vice President of Corporate Strategy, Hortonworks

When: Tuesday, January 22, 2013 at 1:00 p.m. ET/10:00am PT

Where: http://info.hortonworks.com/Winterwebinarseries_TheTrueValueofHadoop.html

Click to Tweet: #Hortonworks hosting “State of the Union” webinar to discuss 2013 vision for #Hadoop, 1/22 at 1 pm ET. Register here: http://bit.ly/VYJxKX

The “State of the Union” webinar is the first in a four-part Hortonworks webinar series titled, “The True Value of Apache Hadoop,” designed to inform attendees of key trends, future roadmaps, best practices and the tools necessary for the successful enterprise adoption of Apache Hadoop.

During the “State of the Union,” Connolly will look at key company highlights from 2012, including the release of the Hortonworks Data Platform (HDP)—the industry’s only 100-percent open source platform powered by Apache Hadoop—and the further development of the Hadoop ecosystem through partnerships with leading software vendors, such as Microsoft and Teradata. Connolly will also provide insight into upcoming initiatives and projects that the Company plans to focus on this year as well as topical advances in the Apache Hadoop community.

Attendees will learn:

  • How Hortonworks’ focus contributes to innovation within the Apache open source community while addressing enterprise requirements and ecosystem interoperability;
  • About the latest releases in the Hortonworks product offering; and
  • About Hortonworks’ roadmap and major areas of investment across core platform, data and operational services for productive operations and management.

For more information, or to register for the “State of the Union” webinar, please visit: http://info.hortonworks.com/Winterwebinarseries_TheTrueValueofHadoop.html.

You will learn more from this “State of the Union” address than from any similarly titled presentations with Congressional responses and the sycophantic choruses that accompany them.

Hadoop in Perspective: Systems for Scientific Computing

Filed under: Hadoop,Scientific Computing — Patrick Durusau @ 7:05 pm

Hadoop in Perspective: Systems for Scientific Computing by Evert Lammerts.

From the post:

When the term scientific computing comes up in a conversation, it’s usually just the occasional science geek who shows signs of recognition. But although most people have little or no knowledge of the field’s existence, it has been around since the second half of the twentieth century and has played an increasingly important role in many technological and scientific developments. Internet search engines, DNA analysis, weather forecasting, seismic analysis, renewable energy, and aircraft modeling are just a small number of examples where scientific computing is nowadays indispensable.

Apache Hadoop is a newcomer in scientific computing, and is welcomed as a great new addition to already existing systems. In this post I mean to give an introduction to systems for scientific computing, and I make an attempt at giving Hadoop a place in this picture. I start by discussing arguably the most important concept in scientific computing: parallel computing; what is it, how does it work, and what tools are available? Then I give an overview of the systems that are available for scientific computing at SURFsara, the Dutch center for academic IT and home to some of the world’s most powerful computing systems. I end with a short discussion of the questions that arise when there are many different systems to choose from.

A good overview of the range of options for scientific computing, where, just as with more ordinary problems, no one solution is the best for all cases.

January 18, 2013

Hortonworks Data Platform 1.2 Available Now!

Filed under: Apache Ambari,Hadoop,HBase,Hortonworks,MapReduce — Patrick Durusau @ 7:18 pm

Hortonworks Data Platform 1.2 Available Now! by Kim Rose.

From the post:

Hortonworks Data Platform (HDP) 1.2, the industry’s only complete 100-percent open source platform powered by Apache Hadoop is available today. The enterprise-grade Hortonworks Data Platform includes the latest version of Apache Ambari for comprehensive management, monitoring and provisioning of Apache Hadoop clusters. By also introducing additional new capabilities for improving security and ease of use, HDP delivers an enterprise-class distribution of Apache Hadoop that is endorsed and adopted by some of the largest vendors in the IT ecosystem.

Hortonworks continues to drive innovation through a range of Hadoop-related projects, packaging the most enterprise-ready components, such as Ambari, into the Hortonworks Data Platform. Powered by an Apache open source community, Ambari represents the forefront of innovation in Apache Hadoop management. Built on Apache Hadoop 1.0, the most stable and reliable code available today, HDP 1.2 improves the ease of enterprise adoption for Apache Hadoop with comprehensive management and monitoring, enhanced connectivity to high-performance drivers, and increased enterprise-readiness of Apache HBase, Apache Hive and Apache HCatalog projects.

The Hortonworks Data Platform 1.2 features a number of new enhancements designed to improve the enterprise viability of Apache Hadoop, including:

  • Simplified Hadoop Operations—Using the latest release of Apache Ambari, HDP 1.2 now provides both cluster management and the ability to zoom into cluster usage and performance metrics for jobs and tasks to identify the root cause of performance bottlenecks or operations issues. This enables Hadoop users to identify issues and optimize future job processing.
  • Improved Security and Multi-threaded Query—HDP 1.2 provides an enhanced security architecture and pluggable authentication model that controls access to Hive tables and the metastore. In addition, HDP 1.2 improves scalability by supporting multiple concurrent query connections to Hive from business intelligence tools and Hive clients.
  • Integration with High-performance Drivers Built for Big Data—HDP 1.2 empowers organizations with a trusted and reliable ODBC connector that enables the integration of current systems with high-performance drivers built for big data. The ODBC driver enables integration with reporting or visualization components through a SQL engine built into the driver. Hortonworks has partnered with Simba to deliver a trusted, reliable high-performance ODBC connector that is enterprise ready and completely free.
  • HBase Enhancements—By including and testing HBase 0.94.2, HDP 1.2 delivers important performance and operational improvements for customers building and deploying highly scalable interactive applications using HBase.

There goes the weekend!

January 16, 2013

Apache Hive 0.10.0 is Now Available

Filed under: Hadoop,Hive,MapReduce — Patrick Durusau @ 7:57 pm

Apache Hive 0.10.0 is Now Available by Ashutosh Chauhan.

From the post:

We are pleased to announce the release of Apache Hive version 0.10.0. More than 350 JIRA issues have been fixed with this release. A few of the most important fixes include:

Cube and Rollup: Hive now has support for creating cubes with rollups. Thanks to Namit!

List Bucketing: This is an optimization that lets you better handle skew in your tables. Thanks to Gang!

Better Windows Support: Several Hive 0.10.0 fixes support running Hive natively on Windows. There is no more cygwin dependency. Thanks to Kanna!

‘Explain’ Adds More Info: Now you can do an explain dependency and the explain plan will contain all the tables and partitions touched upon by the query. Thanks to Sambavi!

Improved Authorization: The metastore can now optionally do authorization checks on the server side instead of on the client, providing you with a better security profile. Thanks to Sushanth!

Faster Simple Queries: Some simple queries that don’t require aggregations, and therefore MapReduce jobs, can now run faster. Thanks to Navis!

Better YARN Support: This release contains additional work aimed at making Hive work well with Hadoop YARN. While not all test cases are passing yet, there has been a lot of good progress made with this release. Thanks to Zhenxiao!

Union Optimization: Hive queries with unions will now result in a lower number of MapReduce jobs under certain conditions. Thanks to Namit!

Undo Your Drop Table: While not truly an ‘undo’, you can now reinstate your table after dropping it. Thanks to Andrew!

Show Create Table: This lets you see how you created your table. Thanks to Feng!

Support for Avro Data: Hive now has built-in support for reading/writing Avro data. Thanks to Jakob!

Skewed Joins: Hive’s support for joins involving skewed data is now improved. Thanks to Namit!

Robust Connection Handling at the Metastore Layer: Connection handling between a metastore client and server and also between a metastore server and the database layer has been improved. Thanks to Bhushan and Jean!

More Statistics: It’s now possible to collect and store scalar-valued statistics for your tables and partitions. This will enable better query planning in upcoming releases. Thanks to Shreepadma!

Better-Looking HWI: HWI now uses the Bootstrap JavaScript library. It looks really slick. Thanks to Hugo!

If you are excited about some of these new features, I recommend that you download hive-0.10 from: Hive 0.10 Release.

The full Release Notes are available here: Hive 0.10.0 Release Notes

This release saw contributions from many different people. We have numerous folks reporting bugs, writing patches for new features, fixing bugs, testing patches, helping users on mailing lists etc. We would like to give a big thank you to everyone who made hive-0.10 possible.

-Ashutosh Chauhan

A long quote but it helps to give credit where credit is due.
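The cube/rollup item at the top of the list is easy to try from Python over ODBC. A minimal sketch, assuming a configured Hive ODBC driver; the DSN, table and column names are mine, not from the announcement:

```python
import pyodbc

# assumes a Hive ODBC driver registered under a DSN named "Hive";
# sales_fact and its columns are hypothetical
conn = pyodbc.connect("DSN=Hive", autocommit=True)
cursor = conn.cursor()

# Hive 0.10.0 adds GROUP BY ... WITH ROLLUP for subtotal and grand-total rows
cursor.execute("""
    SELECT region, product, SUM(sales)
    FROM sales_fact
    GROUP BY region, product WITH ROLLUP
""")
for row in cursor.fetchall():
    print(row)  # NULL region/product values mark the rollup (subtotal) rows
```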

ANN: HBase 0.94.4 is available for download

Filed under: Hadoop,HBase — Patrick Durusau @ 7:55 pm

ANN: HBase 0.94.4 is available for download by Lars Hofhansl.

Bug fix release with 81 issues resolved plus performance enhancements!

First seen in a tweet by Stack.

January 14, 2013

Using R with Hadoop [Webinar]

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 8:39 pm

Using R with Hadoop by David Smith.

From the post:

In two weeks (on January 24), Think Big Analytics' Jeffrey Breen will present a new webinar on using R with Hadoop. Here's the webinar description:

R and Hadoop are changing the way organizations manage and utilize big data. Think Big Analytics and Revolution Analytics are helping clients plan, build, test and implement innovative solutions based on the two technologies that allow clients to analyze data in new ways, exposing new insights for the business. Join us as Jeffrey Breen explains the core technology concepts and illustrates how to utilize R and Revolution Analytics’ RevoR in Hadoop environments.

Topics include:

  • How to use R and Hadoop
  • Hadoop streaming
  • Various R packages and RHadoop
  • Hive via JDBC/ODBC
  • Using Revolution’s RHadoop
  • Big data warehousing with R and Hive

You can register for the webinar at the link below. If you do plan to attend the live session (where you can ask Jeffrey questions), be sure to sign in early — we're limited to 1000 participants and there are already more than 1000 registrants. If you can't join the live session (or it's just not at a convenient time for you), signing up will also get you a link to the recorded replay and a download link for the slides as soon as they're available after the webinar.

Definitely one for the calendar!

January 13, 2013

Apache Pig 0.10.1 Released

Filed under: Hadoop,Pig — Patrick Durusau @ 8:10 pm

Apache Pig 0.10.1 Released by Daniel Dai.

From the post:

We are pleased to announce that Apache Pig 0.10.1 was recently released. This is primarily a maintenance release focused on stability and bug fixes. In fact, Pig 0.10.1 includes 42 new JIRA fixes since the Pig 0.10.0 release.

Time to update your Pig installation!

January 10, 2013

Hadoop Summit North America 2013

Filed under: Conferences,Hadoop — Patrick Durusau @ 1:47 pm

Oldest and Largest Apache Hadoop Community Event in North America Opens Call for Papers by Kim Rose.

Dates:

Early Bird Registration ends February 1, 2013

Abstract Deadline: February 22, 2013

Conference: June 26-27, 2013 (San Jose, CA)

From the post:

Hadoop Summit North America 2013, the premier Apache Hadoop community event, will take place at the San Jose Convention Center, June 26-27, 2013. Hosted by Hortonworks, a leading contributor to Apache Hadoop, and Yahoo!, Hadoop Summit brings together the community of developers, architects, administrators, data analysts, data scientists and vendors interested in advancing, extending and implementing Apache Hadoop as the next-generation enterprise data platform.

This 6th Annual Hadoop Summit North America will feature seven tracks and more than 80 sessions focused on building, managing and operating Apache Hadoop from some of the most influential speakers in the industry. Growing 30 percent to more than 2,200 attendees last year, Hadoop Summit reached near sell-out crowds. This year, the Summit is expected to be even larger.

Apache Hadoop is the open source technology that enables organizations to more efficiently and cost-effectively store, process, manage and analyze the ever-increasing volume of data being created and collected every day. Yahoo! pioneered Apache Hadoop and is still a leading user of the big data platform. Hortonworks is a core contributor to the Apache Hadoop technology via the company’s key architects and engineers.

The Hadoop Summit tracks include the following:

  • Hadoop-Driven Business / Business Intelligence: Will focus on how Apache Hadoop is powering a new generation of business intelligence solutions, including tools, techniques and solutions for deriving business value and competitive advantage from the large volumes of data flowing through today’s enterprise.
  • Applications and Data Science: Will focus on the practice of data science using Apache Hadoop, including novel applications, tools and algorithms, as well as areas of advanced research and emerging applications that use and extend the Apache Hadoop platform.
  • Deployment and Operations: Will focus on the deployment, operation and administration of Apache Hadoop clusters at scale, with an emphasis on tips, tricks and best practices.
  • Enterprise Data Architecture: Will focus on Apache Hadoop as a data platform and how it fits within broader enterprise data architectures.
  • Future of Apache Hadoop: Will take a technical look at the key projects and research efforts driving innovation in and around the Apache Hadoop platform.
  • Apache Hadoop (Disruptive) Economics: Focusing on business innovation, this track will provide concrete examples of how Apache Hadoop enables businesses across a wide range of industries to become data-driven, deriving value from data in order to achieve competitive advantage and/or new levels of productivity.
  • Reference Architectures: Apache Hadoop impacts every level of the enterprise data architecture from storage and operating systems through end-user tools and applications. This track will focus on how the various components of the enterprise ecosystem integrate and interoperate with Apache Hadoop.

The Hadoop Summit North America 2013 call for papers is now open. The deadline to submit an abstract for consideration is February 22, 2013. Track sessions will be voted on by all members of the Apache Hadoop ecosystem using a free voting system called Community Choice. The top ranking sessions in each track will automatically be added to the Hadoop Summit agenda. Remaining sessions will be chosen by a committee of industry experts using their experience and feedback from the Community Choice.

Discounted early bird registration is available now through February 1, 2013. To register for the event or to submit a speaking abstract for consideration, please visit: www.hadoopsummit.org/san-jose/

Sponsorship packages are also now available. For more information on how to sponsor this year’s event please visit: www.hadoopsummit.org/san-jose/sponsors/

I am sure your Hadoop-based topic maps solution would be welcome at this conference.

And, it makes a nice warm up for the Balisage conference in August.

January 9, 2013

Cloudera Impala: A Modern SQL Engine for Hadoop [Webinar – 10 Jan 2013]

Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 12:04 pm

Cloudera Impala: A Modern SQL Engine for Hadoop

From the post:

Join us for this technical deep dive about Cloudera Impala, the project that makes scalable parallel database technology available to the Hadoop community for the first time. Impala is an open-sourced code base that allows users to issue low-latency queries to data stored in HDFS and Apache HBase using familiar SQL operators.

Presenter Marcel Kornacker, creator of Impala, will begin with an overview of Impala from the user’s perspective, followed by an overview of Impala’s architecture and implementation, and will conclude with a comparison of Impala with Apache Hive, commercial MapReduce alternatives and traditional data warehouse infrastructure.

Looking forward to the comparison part. Picking the right tool for a job is an important first step.

A Guide to Python Frameworks for Hadoop

Filed under: Hadoop,MapReduce,Python — Patrick Durusau @ 12:03 pm

A Guide to Python Frameworks for Hadoop by Uri Laserson.

From the post:

I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:

  • Hadoop Streaming
  • mrjob
  • dumbo
  • hadoopy
  • pydoop
  • and others

Ultimately, in my analysis, Hadoop Streaming is the fastest and most transparent option, and the best one for text processing. mrjob is best for rapidly working on Amazon EMR, but incurs a significant performance penalty. dumbo is convenient for more complex jobs (objects as keys; multistep MapReduce) without incurring as much overhead as mrjob, but it’s still slower than Streaming.

Read on for implementation details, performance comparisons, and feature comparisons.
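To give a flavor of what these frameworks look like, here is a minimal mrjob job (my own sketch, not from Uri’s post). By default mrjob runs the script locally; adding “-r hadoop” submits the same code to a cluster. The tab-separated input format is an assumption.

```python
from mrjob.job import MRJob

class MRMaxTempByStation(MRJob):
    # find the maximum temperature per station, given input lines of the
    # (assumed) form: station_id<TAB>temperature
    def mapper(self, _, line):
        station, temp = line.split("\t")
        yield station, float(temp)

    def reducer(self, station, temps):
        yield station, max(temps)

if __name__ == "__main__":
    MRMaxTempByStation.run()
```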

A non-word count Hadoop example? Who would have thought? 😉

Enjoy!

January 8, 2013

Designing algorithms for Map Reduce

Filed under: Algorithms,BigData,Hadoop,MapReduce — Patrick Durusau @ 11:48 am

Designing algorithms for Map Reduce by Ricky Ho.

From the post:

Since the emergence of the Hadoop implementation, I have been trying to morph existing algorithms from various areas into the map/reduce model. The result is pretty encouraging and I’ve found Map/Reduce is applicable in a wide spectrum of application scenarios.

So I want to write down my findings but then found the scope is too broad and also I haven’t spent enough time to explore different problem domains. Finally, I realize that there is no way for me to completely cover what Map/Reduce can do in all areas, so I just dump out what I know at this moment over the long weekend when I have an extra day.

Notice that Map/Reduce is good for “data parallelism”, which is different from “task parallelism”. Here is a description about their difference and a general parallel processing design methodology.

I’ll cover the abstract Map/Reduce processing model below. For a detailed description of the implementation of the Hadoop framework, please refer to my earlier blog here.
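Ricky’s abstract model boils down to three phases: map, shuffle, reduce. A toy in-memory Python rendering (mine, for orientation only) shows why the map phase is “data parallel”: every record is processed independently.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    # map: each record expands into (key, value) pairs independently,
    # which is exactly the data parallelism the post refers to
    pairs = [kv for record in records for kv in mapper(record)]
    # shuffle: bring all values for the same key together
    pairs.sort(key=itemgetter(0))
    # reduce: one call per key over its grouped values
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

# usage: word count over two tiny "documents"
counts = map_reduce(
    ["a rose is a rose", "is it"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: (word, sum(ones)),
)
# counts == [('a', 2), ('is', 2), ('it', 1), ('rose', 2)]
```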

A bit dated (2010) but still worth your time.

I missed its initial appearance so appreciated Ricky pointing back to it in MapReduce: Detecting Cycles in Network Graph.

You may also want to consult: Designing good MapReduce algorithms by Jeffrey Ullman.

December 21, 2012

Connecting Splunk and Hadoop

Filed under: Hadoop,Splunk — Patrick Durusau @ 6:23 am

Connecting Splunk and Hadoop by Ledion Bitincka.

From the post:

Finally I am getting some time to write about some cool features of one of the projects that I’ve been working on – Splunk Hadoop Connect. This app is our first step in integrating Splunk and Hadoop. In this post I will cover three tips on how this app can help you, all of them based on the new search command included in the app: hdfs. Before diving into the tips I would encourage you to download, install and configure the app first. I’ve also put together two screencast videos to walk you through the installation process:

Installation and Configuration for Hadoop Connect
Kerberos Configuration

You can also find the full documentation for the app here.

Cool!

Is it just me or is sharing data across applications becoming more common?

Thinking the greater the sharing, the greater the need for mapping data semantics for integration.

December 20, 2012

Apache Hadoop: Seven Predictions for 2013 [Topic Maps in 2013 Predictions?]

Filed under: Hadoop — Patrick Durusau @ 8:03 pm

Apache Hadoop: Seven Predictions for 2013 by Herb Cunitz.

From the post:

At Thanksgiving we took a moment to reflect on the past and give thanks for all that has happened to Hortonworks the past year. With the New Year approaching we now take time to look forward and provide our predictions for the Hadoop community in 2013. To compile this list, we queried and collected big data from our team of Hadoop committers and members of the community.

We asked a few luminaries as well and we surfaced many expert opinions, and while we had our hearts set on five predictions, we ended up with SEVEN. So, without further ado, here are the Top 7 Predictions for Hadoop in 2013.

These are just the first predictions I have seen for 2013. I am sure there have been others and there will be lots between now and year’s end.

Assuming we all make it past the 21st of December, 2012, ;-), any suggestions for topic maps in 2013?

December 17, 2012

Apache Ambari: Hadoop Operations, Innovation, and Enterprise Readiness

Filed under: Apache Ambari,Hadoop,MapReduce — Patrick Durusau @ 4:23 pm

Apache Ambari: Hadoop Operations, Innovation, and Enterprise Readiness by Shaun Connolly

From the post:

Over the course of 2012, through Hortonworks’ leadership within the Apache Ambari community we have seen the rapid creation of an enterprise-class management platform required for enabling Apache Hadoop to be an enterprise viable data platform. Hortonworks engineers and the broader Ambari community have been working hard on their latest release, and we’d like to highlight the exciting progress that’s been made to Ambari, a 100% open and free solution that delivers the features required from an enterprise-class management platform for Apache Hadoop.

Why is the open source Ambari management platform important?

For Apache Hadoop to be an enterprise viable platform it not only needs the Data Services that sit atop core Hadoop (such as Pig, Hive, and HBase), but it also needs the Management Platform to be developed in an open and free manner. Ambari is a key operational component within the Hortonworks Data Platform (HDP), which helps make Hadoop deployments for our customers and partners easier and more manageable.

Stability and ease of management are two key requirements for enterprise adoption of Hadoop and Ambari delivers on both of these. Moreover, the rate at which this project is innovating is very exciting. In under a year, the community has accomplished what has taken years to complete for other solutions. As expected the “ship early and often” philosophy demonstrates innovation and helps encourage a vibrant and widespread following.

A reminder that tools can’t just be cool or clever.

Tools must fit within enterprise contexts where “those who lead from behind” are neither cool nor clever. But they do pay the bills and so are entitled to predictable and manageable outcomes.

Maybe. 😉 But that is the usual trade-off and if Apache Ambari helps Hadoop meet their requirements, so much the better for Hadoop.

December 14, 2012

How-To: Run a MapReduce Job in CDH4

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 2:56 pm

How-To: Run a MapReduce Job in CDH4 by Sandy Ryza.

From the post:

This is the first post in a series that will get you going on how to write, compile, and run a simple MapReduce job on Apache Hadoop. The full code, along with tests, is available at http://github.com/cloudera/mapreduce-tutorial. The program will run on either MR1 or MR2.

We’ll assume that you have a running Hadoop installation, either locally or on a cluster, and your environment is set up correctly so that typing “hadoop” into your command line gives you some notes on usage. Detailed instructions for installing CDH, Cloudera’s open-source, enterprise-ready distro of Hadoop and related projects, are available here: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation. We’ll also assume you have Maven installed on your system, as this will make compiling your code easier. Note that Maven is not a strict dependency; we could also compile using Java on the command line or with an IDE like Eclipse.

The Use Case

There’s been a lot of brawling on our pirate ship recently. Not so rarely, one of the mates will punch another one in the mouth, knocking a tooth out onto the deck. Our poor sailors will wake up the next day with an empty bottle of rum, wondering who’s responsible for the gap between their teeth. All this violence has gotten out of hand, so as a deterrent, we’d like to provide everyone with a list of everyone that’s ever left them with a gap. Luckily, we’ve been able to set up a Flume source so that every time someone punches someone else, it gets written out as a line in a big log file in Hadoop. To turn this data into these lists, we need a MapReduce job that can 1) invert the mapping from attacker to their victim, 2) group by victims, and 3) eliminate duplicates.
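The tutorial implements those three steps in Java. For orientation, the same logic as a Python sketch (mine; the “attacker victim” line format is an assumption about the Flume log):

```python
import sys
from itertools import groupby

def mapper(lines):
    # step 1: invert attacker -> victim into (victim, attacker) pairs
    for line in lines:
        attacker, victim = line.split()
        yield victim, attacker

def reducer(pairs):
    # step 2: group by victim; step 3: eliminate duplicate attackers
    for victim, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield victim, sorted({attacker for _, attacker in group})

if __name__ == "__main__":
    for victim, attackers in reducer(mapper(sys.stdin)):
        print(victim, ",".join(attackers))
```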

Cool!

Imagine using the same technique while you watch the evening news!

On second thought, that would take too much data entry and be depressing.

Stick to the pirates!

December 13, 2012

Big Graph Data on Hortonworks Data Platform

Filed under: Aurelius Graph Cluster,Faunus,Gremlin,Hadoop,Hortonworks,Titan — Patrick Durusau @ 5:24 pm

Big Graph Data on Hortonworks Data Platform by Marko Rodriguez.

The Hortonworks Data Platform (HDP) conveniently integrates numerous Big Data tools in the Hadoop ecosystem. As such, it provides cluster-oriented storage, processing, monitoring, and data integration services. HDP simplifies the deployment and management of a production Hadoop-based system.

In Hadoop, data is represented as key/value pairs. In HBase, data is represented as a collection of wide rows. These atomic structures make global data processing (via MapReduce) and row-specific reading/writing (via HBase) simple. However, writing queries is nontrivial if the data has a complex, interconnected structure that needs to be analyzed (see Hadoop joins and HBase joins). Without an appropriate abstraction layer, processing highly structured data is cumbersome. Indeed, choosing the right data representation and associated tools opens up otherwise unimaginable possibilities. One such data representation that naturally captures complex relationships is a graph (or network). This post presents Aurelius’ Big Graph Data technology suite in concert with Hortonworks Data Platform. Moreover, for a real-world grounding, a GitHub clone is described in this context to help the reader understand how to use these technologies for building scalable, distributed, graph-based systems.

If you like graphs at all or have been looking at graph solutions, you are going to like this post.

December 11, 2012

Solving real world analytics problems with Apache Hadoop [Webinar]

Filed under: Cloudera,Hadoop — Patrick Durusau @ 7:14 pm

Solving real world analytics problems with Apache Hadoop

Thursday December 13, 2012 at 8:30 a.m. PST/11:30 a.m. EST

From the registration page:

Agenda:

  • Defining big data
  • What are the most critical components of a big data solution?
  • The business and technical challenges of delivering a solution
  • How Cloudera accelerates big data value?
  • Why partner with HP?
  • The HP AppSystem powered by Cloudera

Doesn’t look heavy on the technical side but on the other hand, attending means you will be entered in a raffle for an HP Mini Notebook.

December 10, 2012

Apache Gora

Filed under: BigData,Gora,Hadoop,HBase,MapReduce — Patrick Durusau @ 5:26 pm

Apache Gora

From the webpage:

What is Apache Gora?

The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Why Apache Gora?

Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differ profoundly from their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases, where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use in-memory data model and persistence for big data framework with data store specific mappings and built in Apache Hadoop support.

The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.

  • Data Persistence : Persisting objects to column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL and HSQLDB; and flat files in the local file system or Hadoop HDFS.
  • Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location.
  • Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
  • Analysis : Accessing the data and making analysis through adapters for Apache Pig, Apache Hive and Cascading
  • MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.

When writing about the Nutch 2.X development path, I discovered my omission of Gora from this blog. Apologies for having overlooked it until now.

December 7, 2012

Building graphs with Hadoop

Filed under: GraphBuilder,Graphs,Hadoop,Networks — Patrick Durusau @ 8:00 pm

Building graphs with Hadoop

From the post:

Faced with a mass of unstructured data, the first step of analysing it should be to organise it, and the first step of that process should be working out in what way it should be organised. But then that mass of data has to be fed into the graph, which can take a long time and may be inefficient. That’s why Intel has announced the release of the open source GraphBuilder library, a tool that is meant to help scientists and developers working with large amounts of data build applications that make sense of this data.

The library plugs into Apache Hadoop and is designed to create graphs from big data sets which can then be used in applications. GraphBuilder is written in Java using the MapReduce parallel programming model and takes care of many of the complexities of graph construction. According to the developers, this makes it easier for scientists and developers who do not necessarily have skills in distributed systems engineering to make use of large data sets in their Hadoop applications. They can focus on writing the code that breaks the data up into meaningful nodes and useful edge information which can be run across the distributed architecture where the library also performs a wide range of other useful processes to optimise the data for later analysis.
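GraphBuilder itself is Java on MapReduce, but the pipeline it automates is easy to picture. A toy Python sketch (mine, not Intel’s code): the “mapper” emits co-occurrence edges from each document, and the “reducer” collapses duplicates into weighted edges.

```python
from collections import defaultdict
from itertools import combinations

def edges_from_document(words):
    # map step: emit a unit-weight edge for each pair of distinct words
    # co-occurring in one document
    for a, b in combinations(sorted(set(words)), 2):
        yield (a, b), 1

def merge_edges(edge_stream):
    # reduce step: collapse duplicate edges into single weighted edges
    weights = defaultdict(int)
    for edge, weight in edge_stream:
        weights[edge] += weight
    return dict(weights)

# usage: build a tiny co-occurrence graph from two "documents"
docs = [["hadoop", "graph", "intel"], ["hadoop", "graph", "library"]]
graph = merge_edges(kv for doc in docs for kv in edges_from_document(doc))
```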

A nice way to re-use those Hadoop skills you have been busy acquiring!

Definitely on the weekend schedule!

December 6, 2012

Hadoop for Dummies

Filed under: Hadoop,MapReduce — Patrick Durusau @ 11:37 am

Hadoop for Dummies by Robert D. Schneider.

Courtesy of IBM, it’s what you think it is.

I am torn between thinking that educating c-suite executives is a good idea and wondering what sort of mis-impressions will follow from that education.

I suppose that could be an interesting sociology experiment. IT departments could forward the link to their c-suite executives and then keep track of the number and type of mis-impressions.

Collected at some common website by industry, it could create a baseline for c-suite explanations of technology. 😉

December 5, 2012

Impala Beta (0.3) + Cloudera Manager 4.1.2 [Get ’em While They’re Hot!]

Filed under: Cloudera,Hadoop,Impala,MapReduce — Patrick Durusau @ 5:46 am

Cloudera Impala Beta (version 0.3) and Cloudera Manager 4.1.2 Now Available by Vinithra Varadharajan.

If you are keeping your Hadoop ecosystem skills up to date, drop by Cloudera for the latest Impala beta and a new release of Cloudera Manager.

Vinithra reports that new releases of Impala are going to drop every two to four weeks.

You can either wait for the final release of Impala or read along and contribute to the final product with your testing and comments.

December 4, 2012

New to Hadoop

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 12:08 pm

New to Hadoop

Cloudera has organized a seven-step program for learning Hadoop!

  1. Read Up on Background
  2. Install Locally, Install a VM, or Spin Up on Cloud
  3. Explore Tutorials
  4. Get Trained Up
  5. Read Books
  6. Contribute!
  7. Participate!

It doesn’t list every possible resource, but all the ones listed are high quality.

Following this program will build a solid basis for exploring the Hadoop ecosystem on your own.
