Archive for the ‘Dryad’ Category

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Sunday, November 20th, 2011

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Slides from Jeff’s presentation and numerous references, including to a live blogging summary by Jeff Dalton.

In terms of the new analytical platform, I would strongly suggest that you take Cloudera’s substrate:

Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HFDS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), PigLatin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.

Rather than asking the usual questions, how to make this faster, more storage, etc., all of which are important, ask the more difficult questions:

  1. In or between which of these elements, would human analysis/judgment have the greatest impact?
  2. Would human analysis/judgment be best made by experts or crowds?
  3. What sort of interface would elicit the best human analysis/judgment? (visual/aural; contest/game/virtual)
  4. Performance with feedback or homeostasis mechanisms?

That is a very crude and uninformed starter set of questions.

Putting higher speed access to more data with better tools at our fingertips expands the questions we can ask of interfaces and our interaction with the data. (Before we ever ask questions of the data.)

Microsoft drops Dryad; bets on Hadoop

Saturday, November 19th, 2011

Microsoft drops Dryad; bets on Hadoop

In a November 11 post on the Windows HPC Team Blog, officials said that Microsoft had provided a minor update to the latest test build of the Dryad code as part of Windows High Performance Computing (HPC) Pack 2008 R2 Service Pack (SP) 3. But they also noted that “this will be the final (Dryad) preview and we do not plan to move forward with a production release.”


But it now appears Microsoft is putting all its big-data eggs in the Hadoop framework basket. Microsoft officials said a month ago that Microsoft was working with Hortonworks to develop both a Windows Azure and a Windows Server distribution of Hadoop. A Community Technology Preview (CTP) of the Windows Azure version is due out before the end of this calendar year; the Windows Server test build of Hadoop is due some time in 2012.

It might be a good time for the Hadoop community, which now includes MS, to talk about studying the syntax and semantics of the Hadoop eco-system that can be standardized.

It would be nice to see competition between Hadoop products on the basis of performance and features, not learning the oddities of particular implementations. The public versions could set a baseline and commercial versions would be pressed to better that.

After all, there those who contend that commercial code is measurably better than other types of code. Perhaps it is time to put their faith to the test.

DataCaml – a first look at distributed dataflow programming in OCaml

Sunday, June 26th, 2011

DataCaml – a first look at distributed dataflow programming in OCaml

From the post:

Distributed programming frameworks like Hadoop and Dryad are popular for performing computation over large amounts of data. The reason is programmer convenience: they accept a query expressed in a simple form such as MapReduce, and automatically take care of distributing computation to multiple hosts, ensuring the data is available at all nodes that need it, and dealing with host failures and stragglers.

A major limitation of Hadoop and Dryad is that they are not well-suited to expressing iterative algorithms or dynamic programming problems. These are very commonly found patterns in many algorithms, such as k-means clustering, binomial options pricing or Smith Waterman for sequence alignment.

Over in the SRG in Cambridge, we developed a Turing-powerful distributed execution engine called CIEL that addresses this. The NSDI 2011 paper describes the system in detail, but here’s a shorter introduction.

The post gives an introduction to the OCaml API.

The CIEL Execution Engine description begins with:

CIEL consists of a master coordination server and workers installed on every host. The engine is job-oriented: a job consists of a graph of tasks which results in a deterministic output. CIEL tasks can run in any language and are started by the worker processes as needed. Data flows around the cluster in the form of references that are fed to tasks as dependencies. Tasks can publish their outputs either as concrete references if they can finish the work immediately or as a future reference. Additionally, tasks can dynamically spawn more tasks and delegate references to them, which makes the system Turing-powerful and suitable for iterative and dynamic programming problems where the task graph cannot be computed statically.

BTW, you can also have opaque references, which progress for a while, then stop.

Deeply interesting work.