Archive for the ‘RHadoop’ Category

Step by step to build my first R Hadoop System

Tuesday, August 20th, 2013

Step by step to build my first R Hadoop System by Yanchang Zhao.

From the post:

After reading documents and tutorials on MapReduce and Hadoop and playing with RHadoop for about 2 weeks, finally I have built my first R Hadoop system and successfully run some R examples on it. My experience and steps to achieve that are presented at Hopefully it will make it easier to try RHadoop for R users who are new to Hadoop. Note that I tried this on Mac only and some steps might be different for Windows.

Before going through the complex steps, you may want to have a look what you can get with R and Hadoop. There is a video showing Wordcount MapReduce in R at

Unfortunately, I can’t get the video sound to work.

On the other hand, the step by step instructions are quite helpful, even without the video.
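
For anyone else who can't play the video, the wordcount job it shows follows the usual map, shuffle, reduce pattern. Below is a minimal local sketch of that pattern in plain Python. It is not RHadoop or rmr code and runs without Hadoop; the function names (`map_words`, `shuffle`, `reduce_counts`, `wordcount`) are my own placeholders, and lower-casing the words is an assumption of mine rather than something taken from the post.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group values by key, the job Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def wordcount(lines):
    """Run the three steps in memory and return a {word: count} dict."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["R and Hadoop", "Hadoop and R and Hive"]
    print(wordcount(sample))
```

On a real cluster the map and reduce functions run on different machines and the shuffle is handled by Hadoop itself, which is the part the step by step instructions help you set up.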

R and Hadoop Data Analysis – RHadoop

Wednesday, February 27th, 2013

R and Hadoop Data Analysis – RHadoop by Istvan Szegedi.

From the post:

R is a programming language and a software suite used for data analysis, statistical computing and data visualization. It is highly extensible and has object oriented features and strong graphical capabilities. At its heart R is an interpreted language and comes with a command line interpreter – available for Linux, Windows and Mac machines – but there are IDEs as well to support development like RStudio or JGR.

R and Hadoop can complement each other very well, they are a natural match in big data analytics and visualization. One of the most well-known R packages to support Hadoop functionalities is RHadoop that was developed by RevolutionAnalytics.

Nice introduction that walks you through installation and illustrates the use of RHadoop for analysis.

The ability to analyze “big data” is becoming commonplace.

The more that becomes a reality, the greater the burden on the user to critically evaluate the analysis that produced the “answers.”

Yes, repeatable analysis yielded answer X, but that just means applying the same assumptions to the same data gave the same result.

The same could be said about division by zero, although no one would write home about it.

Using R with Hadoop [Webinar]

Monday, January 14th, 2013

Using R with Hadoop by David Smith.

From the post:

In two weeks (on January 24), Think Big Analytics' Jeffrey Breen will present a new webinar on using R with Hadoop. Here's the webinar description:

R and Hadoop are changing the way organizations manage and utilize big data. Think Big Analytics and Revolution Analytics are helping clients plan, build, test and implement innovative solutions based on the two technologies that allow clients to analyze data in new ways; exposing new insights for the business. Join us as Jeffrey Breen explains the core technology concepts and illustrates how to utilize R and Revolution Analytics’ RevoR in Hadoop environments.

Topics include:

  • How to use R and Hadoop
  • Hadoop streaming
  • Various R packages and RHadoop
  • Hive via JDBC/ODBC
  • Using Revolution’s RHadoop
  • Big data warehousing with R and Hive

You can register for the webinar at the link below. If you do plan to attend the live session (where you can ask Jeffrey questions), be sure to sign in early — we're limited to 1000 participants and there are already more than 1000 registrants. If you can't join the live session (or it's just not at a convenient time for you), signing up will also get you a link to the recorded replay and a download link for the slides as soon as they're available after the webinar.

Definitely one for the calendar!

Improving the integration between R and Hadoop: rmr 2.0 released

Wednesday, October 17th, 2012

Improving the integration between R and Hadoop: rmr 2.0 released

David Smith reports:

The RHadoop project, the open-source project supported by Revolution Analytics to integrate R and Hadoop, continues to evolve. Now available is version 2 of the rmr package, which makes it possible for R programmers to write map-reduce tasks in the R language, and have them run within the Hadoop cluster. This update is the "simplest and fastest rmr yet", according to lead developer Antonio Piccolboni. While previous releases added performance-improving vectorization capabilities to the interface, this release simplifies the API while still improving performance (for example, by using native serialization where appropriate). This release also adds some convenience functions, for example for taking random samples from Big Data stored in Hadoop. You can find further details of the changes here, and download RHadoop here.

RHadoop Project: Changelog

As you know, I’m not one to complain, ;-), but I read from above:

…this release simplifies the API while still improving performance [a good thing]

as contradicting the release notes that read in part:

…At the same time, we increased the complexity of the API. With this version we tried to define a synthesis between all the modes (record-at-a-time, vectorized and structured) present in 1.3, with the following goals:

  • bring the footprint of the API back to 1.2 levels.
  • make sure that no matter what the corner of the API one is exercising, he or she can rely on simple properties and invariants; writing an identity mapreduce should be trivial.
  • encourage writing the most efficient and idiomatic R code from the start, as opposed to writing against a simple API first and then developing a vectorized version for speed.

After reading the change notes, I’m going with the “simplifies the API” riff.

Take a close look and see what you think.
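
The release notes' goal that "writing an identity mapreduce should be trivial" is easier to judge with a concrete picture of what an identity job is. Here is a minimal sketch in plain Python, assuming a toy in-memory `mapreduce` helper of my own invention. It is not the rmr API, and `identity_map` and `identity_reduce` are hypothetical names.

```python
from collections import defaultdict


def mapreduce(records, map_fn, reduce_fn):
    """Toy in-memory runner: map each record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output


def identity_map(key, value):
    """Pass each key/value pair through unchanged."""
    return [(key, value)]


def identity_reduce(key, values):
    """Re-emit every value under its original key."""
    return [(key, value) for value in values]


records = [("a", 1), ("b", 2), ("a", 3)]
result = mapreduce(records, identity_map, identity_reduce)
```

The output holds the same pairs as the input, although grouping by key may reorder them. If an API makes even this job awkward to write, that is a fair sign its footprint has grown too large.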

RHadoop – rmr – 1.2 released!

Tuesday, February 28th, 2012

RHadoop – rmr – 1.2 released!

From the Changelog:

  • Binary formats
  • Simpler, more powerful I/O format API
  • Native binary format with support for all R data types
  • Worked around an R bug that made large reduces very slow.
  • Backend specific parameters to modify things like number of reducers at the hadoop level
  • Automatic library loading in mappers and reducers
  • Better data frame conversions
  • Adopted a uniform.naming.convention
  • New package options API

If you are using R with Hadoop, this is a project you need to watch.

R and Hadoop

Friday, September 9th, 2011

From Revolution Analytics:

White paper: Advanced ‘Big Data’ Analytics with R and Hadoop

Webinar: Revolution Webinar: Leveraging R in Hadoop Environments 21 September 2011 – 10AM – 10:30AM Pacific Time

RHadoop: RHadoop

From GitHub:

RHadoop is a collection of three R packages that allow users to manage and analyze data with Hadoop. The packages have been implemented and tested in Cloudera’s distribution of Hadoop (CDH3) with R 2.13.0. RHadoop consists of the following packages:

  • rmr – functions providing Hadoop MapReduce functionality in R
  • rhdfs – functions providing file management of the HDFS from within R
  • rhbase – functions providing database management for the HBase distributed database from within R