Getting Started with R and Hadoop by David Smith.
From the post:
Last week's meeting of the Chicago area Hadoop User Group (a joint meeting with the Chicago R User Group, sponsored by Revolution Analytics) focused on crunching Big Data with R and Hadoop. Jeffrey Breen, president of Atmosphere Research Group, frequently deals with large data sets in his airline consulting work, and R is his "go-to tool for anything data-related". His presentation, "Getting Started with R and Hadoop", focuses on the RHadoop suite of packages, and especially the rmr package for interfacing R and Hadoop. He lists four advantages of using rmr for big-data analytics with R and Hadoop:
- Well-designed API: code only needs to deal with basic R objects
- Very flexible I/O subsystem: handles common formats like CSV, and also allows complex line-by-line parsing
- Map-Reduce jobs can easily be daisy-chained to build complex workflows
- Concise code compared to other ways of interfacing R and Hadoop (a chart in the original post compares the number of lines of code required to implement a map-reduce analysis using different systems)
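To give a flavor of that first point ("code only needs to deal with basic R objects"), here is a minimal word-count sketch in the style of the rmr package. This is an illustrative assumption-laden sketch, not code from Breen's slides: it assumes a working Hadoop installation with the RHadoop rmr package installed, and uses rmr's `mapreduce()`, `keyval()`, and `from.dfs()` functions; it will not run without that environment.

```r
# Sketch only: assumes Hadoop plus the RHadoop rmr package are installed
# and configured. Function names follow the rmr API (mapreduce, keyval,
# from.dfs); exact signatures vary between rmr versions.
library(rmr)

wordcount <- function(input, output = NULL) {
  mapreduce(
    input  = input,
    output = output,
    map = function(k, line) {
      # split each input line into words, emit one (word, 1) pair per word
      words <- unlist(strsplit(line, "\\s+"))
      lapply(words, function(w) keyval(w, 1))
    },
    reduce = function(word, counts) {
      # sum the per-word counts emitted by the mappers
      keyval(word, sum(unlist(counts)))
    }
  )
}

# from.dfs() pulls the result back into the R session as key-value pairs:
# counts <- from.dfs(wordcount("/tmp/some-input"))
```

The daisy-chaining advantage mentioned above falls out of the same design: `mapreduce()` returns a reference to its output, so the result of one job can be passed directly as the `input` of the next.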
The post includes slides, detailed examples, the full presentation, and pointers to other resources.
Other than processing your data set, doesn’t look like it leaves much out. 😉
Ironic that we talk about "big data" sets when the Concept Annotation in the CRAFT corpus took two and one-half years (that's 30 months for you mythic developer types) to tag ninety-seven (97) medical articles.
That’s an average of a little over three (3) articles per month.
And I am sure the project leads would concede that more could be done.
Maybe “big” data should include some notion of “complex” data?