RHadoop « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 20, 2013

Step by step to build my first R Hadoop System

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 5:16 pm

Step by step to build my first R Hadoop System by Yanchang Zhao.

From the post:

After reading documents and tutorials on MapReduce and Hadoop and playing with RHadoop for about 2 weeks, finally I have built my first R Hadoop system and successfully run some R examples on it. My experience and steps to achieve that are presented at http://www.rdatamining.com/tutorials/rhadoop. Hopefully it will make it easier to try RHadoop for R users who are new to Hadoop. Note that I tried this on Mac only and some steps might be different for Windows.

Before going through the complex steps, you may want to have a look at what you can get with R and Hadoop. There is a video showing Wordcount MapReduce in R at http://www.youtube.com/watch?v=hSrW0Iwghtw.

Unfortunately, I can’t get the video sound to work.

On the other hand, the step by step instructions are quite helpful, even without the video.
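For readers who have not seen a word count expressed as MapReduce, here is a minimal sketch of the idea in plain Python. It is not the code from the tutorial or the video, and it does not use RHadoop; the function names (map_words, shuffle, reduce_counts, word_count) and the sample lines are made up for illustration. It only shows the three phases a framework like Hadoop runs for you: map, group by key, reduce.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in memory over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


# Hypothetical sample input, standing in for a file stored in HDFS.
sample = ["R and Hadoop", "Hadoop and R and Hive"]
counts = word_count(sample)
print(counts)
```

On a real cluster the shuffle is done by Hadoop across machines; the map and reduce functions are the only parts the analyst writes.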


February 27, 2013

R and Hadoop Data Analysis – RHadoop

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 5:34 pm

R and Hadoop Data Analysis – RHadoop by Istvan Szegedi.

From the post:

R is a programming language and a software suite used for data analysis, statistical computing and data visualization. It is highly extensible and has object oriented features and strong graphical capabilities. At its heart R is an interpreted language and comes with a command line interpreter – available for Linux, Windows and Mac machines – but there are IDEs as well to support development like RStudio or JGR.

R and Hadoop can complement each other very well, they are a natural match in big data analytics and visualization. One of the most well-known R packages to support Hadoop functionalities is RHadoop that was developed by RevolutionAnalytics.

Nice introduction that walks you through installation and illustrates the use of RHadoop for analysis.

The ability to analyze “big data” is becoming commonplace.

The more that becomes a reality, the greater the burden on the user to critically evaluate the analysis that produced the “answers.”

Yes, repeatable analysis yielded answer X, but that just means applying the same assumptions to the same data gave the same result.

The same could be said about division by zero, although no one would write home about it.
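A small illustration of the point, as a sketch with invented numbers: the function below averages readings under one of two assumptions about missing values. The name mean_reading, the missing_as_zero flag and the data are all hypothetical. Running it twice with the same assumption gives the same answer, which says nothing about whether the assumption was a good one.

```python
def mean_reading(readings, missing_as_zero):
    """Average the readings; None marks a missing value."""
    if missing_as_zero:
        # Assumption A: a missing reading counts as a real zero.
        values = [0.0 if r is None else r for r in readings]
    else:
        # Assumption B: a missing reading is dropped from the average.
        values = [r for r in readings if r is not None]
    return sum(values) / len(values)


# Invented data for illustration only.
readings = [10.0, None, 14.0, None]

first = mean_reading(readings, missing_as_zero=True)
second = mean_reading(readings, missing_as_zero=True)
alternative = mean_reading(readings, missing_as_zero=False)

print(first, second, alternative)
```

The first two results agree with each other because they share an assumption, not because either one is right.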


January 14, 2013

Using R with Hadoop [Webinar]

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 8:39 pm

Using R with Hadoop by David Smith.

From the post:

In two weeks (on January 24), Think Big Analytics' Jeffrey Breen will present a new webinar on using R with Hadoop. Here's the webinar description:

R and Hadoop are changing the way organizations manage and utilize big data. Think Big Analytics and Revolution Analytics are helping clients plan, build, test and implement innovative solutions based on the two technologies that allow clients to analyze data in new ways; exposing new insights for the business. Join us as Jeffrey Breen explains the core technology concepts and illustrates how to utilize R and Revolution Analytics’ RevoR in Hadoop environments.

Topics include:

How to use R and Hadoop

Hadoop streaming

Various R packages and RHadoop

Hive via JDBC/ODBC

Using Revolution’s RHadoop

Big data warehousing with R and Hive

You can register for the webinar at the link below. If you do plan to attend the live session (where you can ask Jeffrey questions), be sure to sign in early — we're limited to 1000 participants and there are already more than 1000 registrants. If you can't join the live session (or it's just not at a convenient time for you), signing up will also get you a link to the recorded replay and a download link for the slides as soon as they're available after the webinar.

Definitely one for the calendar!
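Hadoop streaming, one of the listed topics, lets any program act as mapper or reducer by reading lines on standard input and writing tab-separated key/value lines on standard output, with the framework sorting by key in between. The sketch below imitates that contract in Python on in-memory lines rather than real streams. It is not taken from the webinar; mapper, reducer and run_local are hypothetical names, and run_local is only a stand-in for what the cluster does.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, the text form streaming passes along."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum counts per word; relies on the input already being sorted by key."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical input lines, standing in for text arriving on stdin.
result = run_local(["big data", "Big Data warehousing"])
print("\n".join(result))
```

In an actual streaming job the mapper and reducer would be separate scripts fed from sys.stdin, and an R script can fill either role in the same way.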


October 17, 2012

Improving the integration between R and Hadoop: rmr 2.0 released

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 9:14 am

Improving the integration between R and Hadoop: rmr 2.0 released

David Smith reports:

The RHadoop project, the open-source project supported by Revolution Analytics to integrate R and Hadoop, continues to evolve. Now available is version 2 of the rmr package, which makes it possible for R programmers to write map-reduce tasks in the R language, and have them run within the Hadoop cluster. This update is the "simplest and fastest rmr yet", according to lead developer Antonio Piccolboni. While previous releases added performance-improving vectorization capabilities to the interface, this release simplifies the API while still improving performance (for example, by using native serialization where appropriate). This release also adds some convenience functions, for example for taking random samples from Big Data stored in Hadoop. You can find further details of the changes here, and download RHadoop here.

RHadoop Project: Changelog

As you know, I’m not one to complain, ;-), but I read from above:

…this release simplifies the API while still improving performance [a good thing]

as contradicting the release notes that read in part:

…At the same time, we increased the complexity of the API. With this version we tried to define a synthesis between all the modes (record-at-a-time, vectorized and structured) present in 1.3, with the following goals:

bring the footprint of the API back to 1.2 levels.

make sure that no matter what the corner of the API one is exercising, he or she can rely on simple properties and invariants; writing an identity mapreduce should be trivial.

encourage writing the most efficient and idiomatic R code from the start, as opposed to writing against a simple API first and then developing a vectorized version for speed.

After reading the change notes, I’m going with the “simplifies the API” riff.

Take a close look and see what you think.
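To make the "identity mapreduce should be trivial" goal concrete, here is a toy runner in Python with an identity map and an identity reduce. This is not the rmr API and says nothing about how rmr 2.0 is implemented; mapreduce, identity_map and identity_reduce are hypothetical names used only to show what "identity" means: every key/value pair that goes in comes out unchanged, apart from being grouped by key.

```python
from collections import defaultdict


def mapreduce(records, map_fn, reduce_fn):
    """Toy runner: map each (key, value), group by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output


def identity_map(key, value):
    """Pass the pair through untouched."""
    yield key, value


def identity_reduce(key, values):
    """Re-emit every value under its key, changing nothing."""
    for value in values:
        yield key, value


# Hypothetical records for illustration.
records = [("a", 1), ("b", 2), ("a", 3)]
result = mapreduce(records, identity_map, identity_reduce)
print(result)
```

If an API makes this case awkward, every real job built on it inherits the awkwardness, which is one way to judge the "simplifies the API" claim for yourself.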


February 28, 2012

RHadoop – rmr – 1.2 released!

Filed under: R,RHadoop — Patrick Durusau @ 8:41 pm

RHadoop – rmr – 1.2 released!

From the Changelog:

Binary formats

Simpler, more powerful I/O format API

Native binary format with support for all R data types

Worked around an R bug that made large reduces very slow.

Backend specific parameters to modify things like number of reducers at the hadoop level

Automatic library loading in mappers and reducers

Better data frame conversions

Adopted a uniform.naming.convention

New package options API

If you are using R with Hadoop, this is a project you need to watch.


September 9, 2011

R and Hadoop

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 7:06 pm

From Revolution Analytics:

White paper: Advanced ‘Big Data’ Analytics with R and Hadoop

Webinar: Revolution Webinar: Leveraging R in Hadoop Environments 21 September 2011 – 10AM – 10:30AM Pacific Time

RHadoop: RHadoop

From GitHub:

RHadoop is a collection of three R packages that allow users to manage and analyze data with Hadoop. The packages have been implemented and tested in Cloudera’s distribution of Hadoop (CDH3) with R 2.13.0. RHadoop consists of the following packages:

rmr – functions providing Hadoop MapReduce functionality in R
rhdfs – functions providing file management of the HDFS from within R
rhbase – functions providing database management for the HBase distributed database from within R
