Archive for the ‘MLBase’ Category

Analytics and Machine Learning at Scale [Aug. 29-30]

Tuesday, August 27th, 2013

AMP Camp Three – Analytics and Machine Learning at Scale

From the webpage:

AMP Camp Three – Analytics and Machine Learning at Scale will be held in Berkeley, California, August 29-30, 2013. AMP Camp 3 attendees and online viewers will learn to solve big data problems using components of the Berkeley Data Analytics Stack (BDAS) and cutting edge machine learning algorithms.

Live streaming!

Sessions will cover (among other things): Mesos, Spark, Shark, Spark Streaming, BlinkDB, MLbase, Tachyon and GraphX.

Talk about a jolt before the weekend!

Distributed Machine Learning with Spark using MLbase

Sunday, August 18th, 2013

Apache Spark: Distributed Machine Learning with Spark using MLbase by Ameet Talwalkar and Evan Sparks.

From the description:

In this talk we describe our efforts, as part of the MLbase project, to develop a distributed Machine Learning platform on top of Spark. In particular, we present the details of two core components of MLbase, namely MLlib and MLI, which are scheduled for open-source release this summer. MLlib provides a standard Spark library of scalable algorithms for common learning settings such as classification, regression, collaborative filtering and clustering. MLI is a machine learning API that facilitates the development of new ML algorithms and feature extraction methods. As part of our release, we include a library written against the MLI containing standard and experimental ML algorithms, optimization primitives and feature extraction methods.

Useful links:

Suggestion: When you make a video of a presentation, don’t include members of the audience eating (pizza in this case). It’s distracting.


  • MLlib: A distributed low-level ML library written directly against the Spark runtime that can be called from Scala and Java. The current library includes common algorithms for classification, regression, clustering and collaborative filtering, and will be included as part of the Spark v0.8 release.
  • MLI: An API / platform for feature extraction and algorithm development that introduces high-level ML programming abstractions. MLI is currently implemented against Spark, leveraging the kernels in MLlib when possible, though code written against MLI can be executed on any runtime engine supporting these abstractions. MLI includes more extensive functionality and has a faster development cycle than MLlib. It will be released in conjunction with MLlib as a separate project.
  • ML Optimizer: This layer aims to simplify ML problems for End Users by automating the task of model selection. The optimizer solves a search problem over feature extractors and ML algorithms included in MLI. This component is under active development.
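
To make the "search problem over feature extractors and ML algorithms" idea concrete, here is a toy sketch in plain Python. It is not MLbase, MLI, or the ML Optimizer, and it uses none of their APIs: the extractors, the trainers, the data, and the function name select_model are all hypothetical, made up for illustration. It tries every (extractor, algorithm) pair and keeps the one with the best validation accuracy, which is the simplest possible form of automated model selection.

```python
from itertools import product

# Hypothetical feature extractors: each maps a raw number to a feature vector.
def identity(x):
    return [x]

def square(x):
    return [x * x]

# Hypothetical "algorithms": each takes training data and returns a predictor.
def train_majority(features, labels):
    # Always predict the most common training label (ties go to 1).
    majority = 1 if 2 * sum(labels) >= len(labels) else 0
    return lambda f: majority

def train_threshold(features, labels):
    # Predict 1 when the first feature exceeds a threshold learned from the data.
    def correct(t):
        return sum((f[0] > t) == bool(y) for f, y in zip(features, labels))
    best_t = max(sorted({f[0] for f in features}), key=correct)
    return lambda f: 1 if f[0] > best_t else 0

def select_model(extractors, trainers, train, valid):
    """Exhaustive search over (extractor, algorithm) pairs, scored on validation data."""
    best = None
    for (e_name, extract), (t_name, fit) in product(extractors.items(), trainers.items()):
        model = fit([extract(x) for x, _ in train], [y for _, y in train])
        score = sum(model(extract(x)) == y for x, y in valid) / len(valid)
        if best is None or score > best[0]:
            best = (score, e_name, t_name)
    return best

# Made-up data: the label is 1 when |x| > 2, which a single threshold
# on x cannot capture but a single threshold on x*x can.
def label(x):
    return int(x * x > 4)

TRAIN = [(x, label(x)) for x in (-4, -3, -1, 0, 1, 3, 4)]
VALID = [(x, label(x)) for x in (-3.5, -0.5, 0.5, 2.5, 3.5)]
EXTRACTORS = {"identity": identity, "square": square}
TRAINERS = {"majority": train_majority, "threshold": train_threshold}

best = select_model(EXTRACTORS, TRAINERS, TRAIN, VALID)
print(best)
```

On this toy data the search settles on the squared feature with the threshold classifier. Note what the search does not tell you: why that pair won, or how it behaves on data unlike the validation set.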

The goal of this project, to make machine learning easier for developers and end users, is a laudable one.

And it is the natural progression of a technology from experimental to common use.

On the other hand, I am uneasy about the weight users will put on results without understanding the biases or uncertainties that are cooked into the data or algorithms.

I don’t think there is a solution to the bias/uncertainty problem other than to become more knowledgeable about machine learning.

Not that you will win an argument with an end user who keeps pointing to a result as though it were untouched by human biases.

But you may be able to better avoid such traps for yourself and your clients.


Saturday, February 23rd, 2013

MLBase by Danny Bickson.

From the post:

Here is an interesting post I got from Ben Lorica, O’Reilly about MLbase:

It is a proof of concept machine learning library on top of Spark, with a custom declarative language called MQL.

Slated for release in August, 2013.

Suggest you digest Lorica’s post and the links therein.