Jeff Hammerbacher on Experiences Evolving a New Analytical Platform
Slides from Jeff’s presentation and numerous references, including a live-blogging summary by Jeff Dalton.
In terms of the new analytical platform, I would strongly suggest that you take a close look at Cloudera’s substrate:
Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open source. They use HDFS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data-intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad/DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), Pig Latin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.
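To make the layering concrete, here is a minimal sketch of how a few of those pieces compose; it is not from Jeff’s slides or the High Scalability summary, it assumes a later-vintage PySpark with Hive support standing in for the high-level interface, and the HDFS path, table name, and query are hypothetical.

```python
# Sketch of the stack's layering: distributed storage (HDFS), a processing
# framework (Spark on YARN or Mesos), and a HiveQL-style high-level query
# interface. Paths, names, and schema below are illustrative only.

from pyspark.sql import SparkSession

# Spark runs under the cluster resource manager and, with Hive support
# enabled, can share table metadata through the Hive metastore.
spark = (SparkSession.builder
         .appName("clickstream-rollup")
         .enableHiveSupport()
         .getOrCreate())

# Raw events would land in HDFS via an ingest pipe such as Flume or Sqoop;
# here we simply point Spark at a hypothetical landing directory.
raw = spark.read.json("hdfs:///data/raw/clickstream/2012-06-01/")

# Expose the batch as a session-scoped view so it can be queried with SQL;
# a production pipeline would instead write a managed table via the metastore.
raw.createOrReplaceTempView("clicks")

# The high-level interface: a HiveQL-style query rather than hand-written
# MapReduce jobs.
daily = spark.sql("""
    SELECT url, COUNT(*) AS hits
    FROM clicks
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 20
""")

daily.show()
```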
Rather than asking the usual questions (how to make this faster, how to add more storage, and so on), all of which are important, ask the more difficult questions:
- In or between which of these elements would human analysis/judgment have the greatest impact?
- Would human analysis/judgment be best made by experts or crowds?
- What sort of interface would best elicit human analysis/judgment (visual or aural; contest, game, or virtual environment)?
- Would performance be better governed by explicit feedback or by homeostasis mechanisms? (See the sketch below.)
That is a very crude and uninformed starter set of questions.
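As a loose illustration of the last question, and nothing more authoritative than that, here is what a "homeostasis mechanism" inside such a platform might look like: a control loop that nudges a tunable (worker concurrency, in this made-up example) to hold an observed metric near a set point, rather than waiting for a human to react. All names and numbers are hypothetical.

```python
# Toy homeostasis loop: a proportional controller that adds workers when
# observed latency runs above a set point and sheds them when it runs below.

import random
import time


def observe_latency_seconds() -> float:
    """Stand-in for a real measurement, e.g. scheduler queue wait time."""
    return random.uniform(5.0, 60.0)


def homeostatic_controller(target: float = 30.0,
                           workers: int = 10,
                           gain: float = 0.2,
                           steps: int = 5) -> int:
    """Adjust concurrency in proportion to how far latency is from target."""
    for _ in range(steps):
        latency = observe_latency_seconds()
        error = latency - target
        workers = max(1, round(workers + gain * error))
        print(f"latency={latency:5.1f}s  error={error:+6.1f}  workers={workers}")
        time.sleep(0.1)  # in reality, one scheduling interval
    return workers


if __name__ == "__main__":
    homeostatic_controller()
```

A feedback-driven alternative would route the same measurements to an analyst (expert or crowd) and apply their judgment instead of the controller's, which is exactly the trade-off the question is asking about.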
Putting higher-speed access to more data, with better tools, at our fingertips expands the questions we can ask about interfaces and about our interaction with the data, before we ever ask questions of the data itself.