Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 21, 2012

Getting Started with R and Hadoop

Filed under: BigData,Hadoop,R — Patrick Durusau @ 1:47 pm

Getting Started with R and Hadoop by David Smith.

From the post:

Last week's meeting of the Chicago area Hadoop User Group (a joint meeting with the Chicago R User Group, sponsored by Revolution Analytics) focused on crunching Big Data with R and Hadoop. Jeffrey Breen, president of Atmosphere Research Group, frequently deals with large data sets in his airline consulting work, and R is his "go-to tool for anything data-related". His presentation, "Getting Started with R and Hadoop", focuses on the RHadoop suite of packages, and especially the rmr package for interfacing R and Hadoop. He lists four advantages of using rmr for big-data analytics with R and Hadoop:

  • Well-designed API: code only needs to deal with basic R objects
  • Very flexible I/O subsystem: handles common formats like CSV, and also allows complex line-by-line parsing
  • Map-Reduce jobs can easily be daisy-chained to build complex workflows
  • Concise code compared to other ways of interfacing R and Hadoop (the chart below compares the number of lines of code required to implement a map-reduce analysis using different systems) 
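To make the list above concrete, here is a minimal sketch of what an rmr-style job looks like: a word count expressed entirely in terms of basic R objects. This is an illustrative sketch only, assuming the rmr2 package and a working Hadoop installation; it is not from Breen's slides.

```r
# Sketch of a map-reduce word count with rmr2 (assumes Hadoop is configured).
library(rmr2)

wc <- mapreduce(
  input = to.dfs(c("big data", "big analytics")),  # push sample lines to HDFS
  map = function(k, lines) {
    # emit (word, 1) pairs -- plain R vectors, no Hadoop types
    words <- unlist(strsplit(lines, " "))
    keyval(words, 1)
  },
  reduce = function(word, counts) {
    keyval(word, sum(counts))  # sum the 1s for each word
  }
)
from.dfs(wc)  # pull the results back into the R session
```

Because `mapreduce()` returns a reference to its output on HDFS, that value can be passed directly as the `input` of another `mapreduce()` call, which is the daisy-chaining the third bullet describes.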

Slides, detailed examples, presentation, pointers to other resources.

Other than processing your data set for you, it doesn’t look like it leaves much out. 😉

Ironic that we talk about “big data” sets when the Concept Annotation in the CRAFT corpus took two and one-half years (that’s 30 months for you mythic developer types) to tag ninety-seven (97) medical articles.

That’s an average of a little over three (3) articles per month.

And I am sure the project leads would concede that more could be done.

Maybe “big” data should include some notion of “complex” data?
