Archive for the ‘MPI’ Category

Distributed LIBLINEAR:

Thursday, May 15th, 2014

Distributed LIBLINEAR: Libraries for Large-scale Linear Classification on Distributed Environments

From the webpage:

MPI LIBLINEAR is an extension of LIBLINEAR on distributed environments. The usage and the data format are the same as LIBLINEAR. Currently only two solvers are supported:

  • L2-regularized logistic regression (LR)
  • L2-regularized L2-loss linear SVM

NOTICE: This extension can only run on Unix-like systems. (We test it on Ubuntu 13.10.) Python and Matlab interfaces are not supported.

Spark LIBLINEAR is a Spark implementation based on LIBLINEAR and integrated with Hadoop distributed file system. This package is developed using Scala. Currently it supports the same two solvers as MPI LIBLINEAR.

If you are unfamiliar with LIBLINEAR:

LIBLINEAR is a linear classifier for data with millions of instances and features. It supports

  • L2-regularized classifiers
    L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)
  • L1-regularized classifiers (after version 1.4)
    L2-loss linear SVM and logistic regression (LR)
  • L2-regularized support vector regression (after version 1.9)
    L2-loss linear SVR and L1-loss linear SVR.

Main features of LIBLINEAR include

  • Same data format as LIBSVM, our general-purpose SVM solver, and also similar usage
  • Multi-class classification: 1) one-vs-the rest, 2) Crammer & Singer
  • Cross validation for model selection
  • Probability estimates (logistic regression only)
  • Weights for unbalanced data
  • MATLAB/Octave, Java, Python, Ruby interfaces

You will also find instructions for creating distributed environments using VirtualBox for both MPI LIBLINEAR and Spark LIBLINEAR. I am going to post on that separately to draw attention to it.

The phrase “standalone computer” is rapidly becoming a misnomer. Forward looking algorithm designers and power users will begin gaining experience with the new distributed “normal,” at every opportunity.

I first saw this in a tweet by Reynold Xin.

R at 12,000 Cores

Wednesday, October 17th, 2012

R at 12,000 Cores

From the post:

I am very happy to introduce a new set of packages that has just hit the CRAN. We are calling it the Programming with Big Data in R Project, or pbdR for short (or as I like to jokingly refer to it, ‘pretty bad for dyslexics’). You can find out more about the pbdR project at

The packages are a natural programming framework that are, from the user’s point of view, a very simple extension of R’s natural syntax, but running in parallel over MPI and handling big data sets with ease. Much of the parallelism we offer is implicit, meaning that you can use code you are already using while achieving massive performance gains.

The packages are free as in beer, and free as in speech. You could call them “free and open source”, or libre software. The source code is free for everyone to look at, extend, re-use, whatever, forever.

At present, the project consists of 4 packages: pbdMPI, pbdSLAP, pbdBASE, and pbdDMAT. The pbdMPI package offers simplified hooks into MPI, making explicit parallel programming over much simpler, and sometimes much faster than with Rmpi. Next up the chain is pbdSLAP, which is a set of libraries pre-bundled for the R user, to greatly simplify complicated installations. The last two packages, pbdBASE and pbdDMAT, offer high-level R syntax for computing with distributed matrix objects at low-level programming speed. The only system requirements are that you have R and an MPI installation.

We have attempted to extensively document the project in a collection of package vignettes; but really, if you are already using R, then much of the work is already familiar to you. Want to take the svd of a matrix? Just use svd(x) or La.svd(x), only “x” is now a distributed matrix object.

One MPI source: OpenMPI. Interested to hear of experiences with other MPI installations.

If you can’t run MPI or don’t want to, be sure to also check out the RHadoop project.

I first saw this at R-Bloggers.