Apache Spark: Distributed Machine Learning with Spark using MLbase by Ameet Talwalkar and Evan Sparks.
From the description:
In this talk we describe our efforts, as part of the MLbase project, to develop a distributed Machine Learning platform on top of Spark. In particular, we present the details of two core components of MLbase, namely MLlib and MLI, which are scheduled for open-source release this summer. MLlib provides a standard Spark library of scalable algorithms for common learning settings such as classification, regression, collaborative filtering and clustering. MLI is a machine learning API that facilitates the development of new ML algorithms and feature extraction methods. As part of our release, we include a library written against the MLI containing standard and experimental ML algorithms, optimization primitives and feature extraction methods.
Suggestion: When you make a video of a presentation, don’t include members of the audience eating (pizza in this case). It’s distracting.
- MLlib: A distributed low-level ML library written directly against the Spark runtime that can be called from Scala and Java. The current library includes common algorithms for classification, regression, clustering and collaborative filtering, and will be included as part of the Spark v0.8 release.
- MLI: An API / platform for feature extraction and algorithm development that introduces high-level ML programming abstractions. MLI is currently implemented against Spark, leveraging the kernels in MLlib when possible, though code written against MLI can be executed on any runtime engine supporting these abstractions. MLI includes more extensive functionality and has a faster development cycle than MLlib. It will be released in conjunction with MLlib as a separate project.
- ML Optimizer: This layer aims to simplify ML problems for end users by automating the task of model selection. The optimizer solves a search problem over the feature extractors and ML algorithms included in MLI. This component is under active development.
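To make the layering concrete, here is a rough sketch of what calling one of MLlib's classification kernels from Scala might look like. This is an illustration only: it assumes a Spark 0.8-era API in which `LogisticRegressionWithSGD.train` takes an RDD of `LabeledPoint`s with plain `Array[Double]` features; the exact package paths, class names, and signatures in the actual release may differ.

```scala
// Sketch of training a classifier with MLlib (assumed Spark 0.8-era API).
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

object MLlibSketch {
  def main(args: Array[String]) {
    // Run locally for the example; a real job would point at a cluster.
    val sc = new SparkContext("local", "MLlibSketch")

    // Toy training set: each LabeledPoint is a label (0.0 or 1.0)
    // plus a feature vector.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Array(0.0, 1.1)),
      LabeledPoint(1.0, Array(2.0, 1.0)),
      LabeledPoint(1.0, Array(2.5, 0.5))))

    // Train a logistic regression model with 20 iterations of SGD.
    val model = LogisticRegressionWithSGD.train(training, 20)

    // Score a new point.
    println(model.predict(Array(2.2, 0.8)))

    sc.stop()
  }
}
```

MLI would sit one level above code like this, offering higher-level abstractions for feature extraction and algorithm development while delegating to MLlib kernels where possible.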
The goal of this project, to make machine learning easier for developers and end users, is a laudable one.
It is also the natural progression of a technology from experimental to common use.
On the other hand, I am uneasy about the weight users will put on results while not understanding the biases or uncertainties that are cooked into the data or algorithms.
I don’t think there is a solution to the bias/uncertainty problem other than to become more knowledgeable about machine learning.
Not that you will win an argument with an end user who keeps pointing to a result as though it were untouched by human biases.
But you may be able to better avoid such traps for yourself and your clients.