From the post:
I am very happy to introduce a new set of packages that has just hit CRAN. We are calling it the Programming with Big Data in R Project, or pbdR for short (or as I like to jokingly refer to it, ‘pretty bad for dyslexics’). You can find out more about the pbdR project at http://r-pbd.org/
The packages form a programming framework that is, from the user’s point of view, a very simple extension of R’s natural syntax, but one that runs in parallel over MPI and handles big data sets with ease. Much of the parallelism we offer is implicit, meaning that you can keep using the code you already have while achieving massive performance gains.
The packages are free as in beer, and free as in speech. You could call them “free and open source”, or libre software. The source code is free for everyone to look at, extend, re-use, whatever, forever.
At present, the project consists of four packages: pbdMPI, pbdSLAP, pbdBASE, and pbdDMAT. The pbdMPI package offers simplified hooks into MPI, making explicit parallel programming much simpler, and sometimes much faster, than with Rmpi. Next up the chain is pbdSLAP, which is a set of libraries pre-bundled for the R user to greatly simplify complicated installations. The last two packages, pbdBASE and pbdDMAT, offer high-level R syntax for computing with distributed matrix objects at low-level programming speed. The only system requirements are that you have R and an MPI installation.
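To give a feel for what "simplified hooks into MPI" might look like, here is a minimal sketch of an explicit SPMD script using pbdMPI. It is my own illustration, not code from the announcement: it assumes pbdMPI and a working MPI installation, and the script name in the launch command is hypothetical.

```r
# Launch with something like: mpirun -np 4 Rscript hello_pbd.R
# (hello_pbd.R is a hypothetical file name; every rank runs this same script.)
library(pbdMPI, quietly = TRUE)

init()                         # start the MPI communicator

rank <- comm.rank()            # this process's id, counted from 0
size <- comm.size()            # total number of processes

# Each rank prints its own greeting.
comm.cat(sprintf("Hello from rank %d of %d\n", rank, size),
         all.rank = TRUE, quiet = TRUE)

# Sum the rank ids across all processes; every rank receives the result.
total <- allreduce(rank, op = "sum")
comm.print(total)              # printed once, by rank 0

finalize()                     # shut MPI down cleanly
```

There is no master/worker split here: all processes execute the same code and cooperate through collective calls such as allreduce().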
We have attempted to document the project extensively in a collection of package vignettes; but really, if you are already using R, then much of the work is already familiar to you. Want to take the SVD of a matrix? Just use svd(x) or La.svd(x), only “x” is now a distributed matrix object.
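As a rough sketch of that claim, the script below builds a distributed matrix with pbdDMAT and calls La.svd() on it exactly as one would on an ordinary matrix. The matrix dimensions are arbitrary choices of mine, and the script assumes pbdMPI, pbdDMAT and an MPI installation are available; treat it as an illustration rather than the authors' reference example.

```r
# Launch under MPI, e.g.: mpirun -np 4 Rscript svd_pbd.R
# (svd_pbd.R is a hypothetical file name.)
library(pbdMPI, quietly = TRUE)
library(pbdDMAT, quietly = TRUE)

init.grid()                    # set up the process grid used by ddmatrix objects

# A 100 x 10 matrix of random normals, spread across the processes.
# The dimensions are illustrative only.
dx <- ddmatrix("rnorm", nrow = 100, ncol = 10)

# Same call as for an ordinary R matrix; dispatch goes to the distributed method.
s <- La.svd(dx)

comm.print(s$d)                # singular values, printed once by rank 0

finalize()
```

The only lines that differ from a serial script are the setup, the ddmatrix() constructor and the final cleanup; the SVD call itself is unchanged.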
One MPI source: Open MPI. I would be interested to hear of experiences with other MPI installations.
If you can’t run MPI or don’t want to, be sure to also check out the RHadoop project.
I first saw this at R-Bloggers.