Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 20, 2013

Pyrallel – Parallel Data Analytics in Python

Filed under: Data Analysis,Parallel Programming,Programming,Python — Patrick Durusau @ 6:12 am

Pyrallel – Parallel Data Analytics in Python by Olivier Grisel.

From the webpage:

Overview: experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.

Scope:

  • focus on small to medium datasets that fit in memory on a small (10+ nodes) to medium (100+ nodes) cluster.
  • focus on small to medium data (with data locality when possible).
  • focus on CPU bound tasks (e.g. training Random Forests) while trying to limit disk / network access to a minimum.
  • do not focus on HA / Fault Tolerance (yet).
  • do not try to invent a new set of high level programming abstractions (yet): use a low level programming model (IPython.parallel) to finely control the cluster elements and messages transferred and help identify what are the practical underlying constraints in a distributed machine learning setting.

Disclaimer: the public API of this library will probably not be stable soon as the current goal of this project is to experiment.
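To make the scope above a little more concrete, here is a minimal sketch of the scatter/gather pattern for CPU-bound work on in-memory data. It is not Pyrallel's API and does not use IPython.parallel: it swaps in the standard library's concurrent.futures on a single machine, and the names partition, sum_of_squares, and scatter_gather are hypothetical, chosen only for illustration.

```python
# Stand-in for a cluster: local worker processes from the standard library.
# ThreadPoolExecutor is imported as a drop-in alternative for quick local checks.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks each take one additional element.
        stop = start + size + (1 if i < extra else 0)
        chunks.append(data[start:stop])
        start = stop
    return chunks


def sum_of_squares(chunk):
    """Hypothetical CPU-bound task, standing in for something like fitting one tree."""
    return sum(x * x for x in chunk)


def scatter_gather(task, data, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Scatter chunks to workers, run task on each chunk, gather partial results in order."""
    chunks = partition(data, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        return list(pool.map(task, chunks))
```

Called from under an `if __name__ == "__main__":` guard, `sum(scatter_gather(sum_of_squares, list(range(1000))))` combines the per-chunk results into a single total. Each worker touches only its own chunk, which is the data-locality idea in the scope list, scaled down to one machine.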

This project brought to mind two things:

  1. Experimentation can lead to new approaches, such as “Think like a vertex.” (GraphLab: A Distributed Abstraction…), and
  2. A conference anecdote about a Python application written so the customer would need to upgrade for higher performance. The prototype performed so well that the customer didn’t need the fuller version. I thought that was a tribute to Python and the programmer. Opinions differed.
