Pyrallel – Parallel Data Analytics in Python by Olivier Grisel.
From the webpage:
Overview: experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.
Scope:
- focus on small to medium datasets that fit in memory on a small (10+ nodes) to medium (100+ nodes) cluster.
- focus on small to medium data (with data locality when possible).
- focus on CPU-bound tasks (e.g. training Random Forests) while keeping disk / network access to a minimum.
- do not focus on HA / Fault Tolerance (yet).
- do not try to invent a new set of high-level programming abstractions (yet): use a low-level programming model (IPython.parallel) to finely control the cluster elements and the messages transferred, and to help identify the practical constraints of a distributed machine learning setting.
Disclaimer: the public API of this library will probably not be stable any time soon, as the current goal of this project is to experiment.
This project brought two things to mind:
- Experimentation can lead to new approaches, such as “Think like a vertex.” (GraphLab: A Distributed Abstraction…), and
- A conference anecdote about a Python application written as a prototype, with the expectation that the customer would later upgrade to a higher-performance version. The prototype performed so well that the customer never needed the fuller version. I thought that was a tribute to Python and to the programmer; opinions differed.