Archive for the ‘Dremel’ Category

Drilling into Big Data with Apache Drill

Wednesday, August 7th, 2013

Drilling into Big Data with Apache Drill by Steven J Vaughan-Nichols.

From the post:

Apache’s Drill goal is striving to do nothing less than answer queries from petabytes of data and trillions of records in less than a second.

You can’t claim that the Apache Drill programmers think small. Their design goal is for Drill to scale to 10,000 servers or more and to process petabyes of data and trillions of records in less than a second.

If this sounds impossible, or at least very improbable, consider that the NSA already seems to be doing exactly the same kind of thing. If they can do it, open-source software can do it.

In at interview at OSCon, the major open source convention in Portland, OR, Ted Dunning, the chief application architect for MapR, a big data company, and a Drill mentor and committer, explained the reason for the project. “There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data in such formats as Avro; Apache Hadoop data serialization system; JSON (JavaScript Object Notation); and Protocol Buffers Google’s data interchange format.”

As Dunning explained, big business wants fast access to big data and none of the traditional solutions, such as a relational database management system (RDBMS), MapReduce, or Hive, can deliver those speeds.

Dunning continued, “This need was identified by Google and addressed internally with a system called Dremel.” Dremel was the inspiration for Drill, which also is meant to complement such open-source big data systems as Apache Hadoop. The difference between Hadoop and Drill is that while Hadoop is designed to achieve very high throughput, it’s not designed to achieve the sub-second latency needed for interactive data analysis and exploration.

(…)

At this point, Drill is very much a work in progress. “It’s not quite production quality at this point, but by third or fourth quarter of 2013 it will become quite usable.” Specifically, Drill should be in beta by the third quarter.

So, if Drill sounds interesting to you, you can start contributing as soon as you get up to speed. To do that, there’s a weekly Google Hangout on Tuesdays at 9am Pacific time and a Twitter feed at @ApacheDrill. And, of course, there’s an Apache Drill Wiki and users’ and developers’ mailing lists.

NSA claims, actually any claims by any government officials, have to be judged by President Obama announcing yesterday: “There is No Spying on Americans.”

It has been creeping along for a long time but the age of Newspeak is here.

But leaving doubtful comments by members of the government to one side, Apache Drill does sound like an exciting project!

How Google’s Dremel Makes Quick Work of Massive Data

Saturday, October 20th, 2012

How Google’s Dremel Makes Quick Work of Massive Data by Ian Armas Foster.

From the post:

The ability to process more data and the ability to process data faster are usually mutually exclusive. According to Armando Fox, professor of computer science at University of California at Berkeley, “the more you do one, the more you have to give up on the other.”

Hadoop, an open-source, batch processing platform that runs on MapReduce, is one of the main vehicles organizations are driving in the big data race.

However, Mike Olson, CEO of Cloudera, an important Hadoop-based vendor, is looking past Hadoop and toward today’s research projects. That includes one named Dremel, possibly Google’s next big innovation that combines the scale of Hadoop with the ever-increasing speed demands of the business intelligence world.

“People have done Big Data systems before,” Fox said “but before Dremel, no one had really done a system that was that big and that fast.”

On Dremel, see: Dremel: Interactive Analysis of Web-Scale Datasets, as well.

Are you looking (or considering looking) beyond Hadoop?

Accuracy and timeliness beyond the average daily intelligence briefing will drive demand for your information product.

Your edge is agility. Use it.

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Sunday, November 20th, 2011

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Slides from Jeff’s presentation and numerous references, including to a live blogging summary by Jeff Dalton.

In terms of the new analytical platform, I would strongly suggest that you take Cloudera’s substrate:

Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HFDS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), PigLatin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.

Rather than asking the usual questions, how to make this faster, more storage, etc., all of which are important, ask the more difficult questions:

  1. In or between which of these elements, would human analysis/judgment have the greatest impact?
  2. Would human analysis/judgment be best made by experts or crowds?
  3. What sort of interface would elicit the best human analysis/judgment? (visual/aural; contest/game/virtual)
  4. Performance with feedback or homeostasis mechanisms?

That is a very crude and uninformed starter set of questions.

Putting higher speed access to more data with better tools at our fingertips expands the questions we can ask of interfaces and our interaction with the data. (Before we ever ask questions of the data.)

Dremel: Interactive Analysis of Web-Scale
Datasets

Sunday, June 12th, 2011

Google, along with Bing and Yahoo! have been attracting a lot of discussion for venturing into web semantics without asking permission.

However that turns out, please don’t miss:

Dremel: interactive analysis of web-scale datasets

Abstract:

Dremel is a scalable, interactive ad hoc query system for analysis of read-only nested data. By combining multilevel execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.

I am still working through the article but “…aggregation queries over trillion-row tables in seconds,” is obviously of interest for a certain class of topic map.