Archive for the ‘Drill’ Category

Evolving Parquet as self-describing data format –

Monday, April 6th, 2015

Evolving Parquet as self-describing data format – New paradigms for consumerization of Hadoop data by Neeraja Rentachintala.

From the post:

With Hadoop becoming more prominent in customer environments, one of the frequent questions we hear from users is what storage format they should use to persist data in Hadoop. The data format selection is a critical decision, especially as Hadoop evolves from being about cheap storage to a pervasive query and analytics platform. In this blog, I want to briefly describe self-describing data formats, how they are gaining a lot of interest as a new management paradigm to consumerize Hadoop data in organizations, and the work we have been doing as part of the Parquet community to evolve Parquet as a fully self-describing format.

About Parquet

Apache Parquet is a columnar storage format for the Hadoop ecosystem. Since its inception about two years ago, Parquet has seen very good adoption due to its highly efficient compression and encoding schemes, which demonstrate significant performance benefits. Its ground-up design allows it to be used regardless of the data processing framework, data model, or programming language used in the Hadoop ecosystem. A variety of tools and frameworks, including MapReduce, Hive, Impala, and Pig, provide the ability to work with Parquet data, and a number of data models such as Avro, Protobuf, and Thrift have been expanded to use Parquet as a storage format. Parquet is widely adopted by a number of major companies, including tech giants such as Twitter and Netflix.

Self-describing data formats and their growing role in analytics on Hadoop/NoSQL

Self-describing data is data in which the schema or structure is embedded in the data itself. The schema comprises metadata such as element names, data types, the compression/encoding scheme used (if any), statistics, and a lot more. A variety of data formats, including Parquet, XML, and JSON, as well as NoSQL databases such as HBase, belong to the spectrum of self-describing data; they vary in the level of metadata they expose about themselves.
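As a rough illustration (plain Python, not any particular format's API), a self-describing record carries enough structure that a reader can recover the element names and types from the data alone, with no external schema registry:

```python
import json

# A self-describing record: structure travels with the data itself.
record = json.loads("""
{
  "trade_id": 42,
  "symbol": "XYZ",
  "amount": 1250.75,
  "tags": ["metals", "spot"]
}
""")

def infer_schema(value):
    """Recursively derive a schema description from the data itself."""
    if isinstance(value, dict):
        return {k: infer_schema(v) for k, v in value.items()}
    if isinstance(value, list):
        return [infer_schema(value[0])] if value else []
    return type(value).__name__

schema = infer_schema(record)
print(schema)
# {'trade_id': 'int', 'symbol': 'str', 'amount': 'float', 'tags': ['str']}
```

Formats like Parquet go much further, embedding encodings and per-column statistics in the file footer, but the principle is the same: the reader learns the structure from the data, not from a separate model.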

While self-describing data has been on the rise with NoSQL databases (e.g., the MongoDB BSON model) for a while now, empowering developers to be agile and iterative in the application development cycle, its prominence has been growing in analytics as well when it comes to Hadoop. So what is driving this? The answer is simple: the same requirement to be agile and iterative, this time in BI/analytics.

More and more organizations are now using Hadoop as a data hub to store all their data assets. These data assets often contain existing datasets offloaded from traditional DBMS/DWH systems, but also new types of data from new data sources (such as IoT sensors, logs, and clickstream), including external data (such as social data and 3rd-party domain-specific datasets). The Hadoop clusters in these organizations are often multi-tenant and shared by multiple groups in the organization. The traditional data management paradigms of creating centralized models/metadata definitions upfront, before the data can be used for analytics, are quickly becoming bottlenecks in Hadoop environments. The new complex and schema-less data models are hard to map to relational models; modeling data upfront for unknown ad hoc business questions and data discovery needs is challenging; and keeping up with schema changes as the data models evolve is practically impossible.

By pushing metadata to data, and then using tools that can understand this metadata available in self-describing formats to expose it directly for end user consumption, the analysis life cycle can become drastically more agile and iterative. For example, using Apache Drill, the world’s first schema-free SQL query engine, you can query self-describing data (in files or NoSQL databases such as HBase/MongoDB) immediately, without having to define and manage schema overlay definitions in centralized metastores. Another benefit of this is business self-service: users don’t need to rely constantly on IT departments/DBAs to add or change attributes in centralized models, but can instead focus on getting answers to business questions by performing queries directly on raw data.
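To make the idea concrete, here is a minimal sketch (plain Python, not Drill's actual engine) of querying raw JSON records with no schema registered anywhere; Drill's real SQL interface works directly against files like this:

```python
import json

# Raw newline-delimited JSON, queried with no schema defined upfront.
raw = """\
{"user": "alice", "clicks": 3, "region": "us-east"}
{"user": "bob", "clicks": 7, "region": "eu-west"}
{"user": "carol", "clicks": 2, "region": "us-east"}
"""

rows = [json.loads(line) for line in raw.splitlines()]

# Equivalent in spirit to a Drill query such as:
#   SELECT region, SUM(clicks) FROM dfs.`clicks.json` GROUP BY region
totals = {}
for row in rows:
    totals[row["region"]] = totals.get(row["region"], 0) + row["clicks"]

print(totals)  # {'us-east': 5, 'eu-west': 7}
```

The point is that the field names and types come straight out of the records at read time; nobody defined `user`, `clicks`, or `region` in a metastore first.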

Think of it this way: Hadoop scaled processing by pushing processing to the nodes that hold the data. Analytics on Hadoop/NoSQL systems can be scaled to the entire organization by pushing more and more metadata to the data and using tools that leverage that metadata automatically to expose it for analytics. The more self-describing the data formats are (i.e., the more metadata they contain about the data), the smarter the tools that leverage the metadata can be.

The post walks through example cases and points to additional resources.

To become truly self-describing, Parquet will need to move beyond assigning data types to tokens. In the example given, “amount” has the datatype “double,” but that doesn’t tell me whether we are discussing grams, troy ounces (for precious metals), carats, or pounds.
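A hypothetical sketch of the distinction (plain Python; this is not part of Parquet's actual metadata model, and the field names are invented for illustration): a physical type alone says how to decode the bytes, while a unit annotation would say what the value means.

```python
# Hypothetical per-field metadata: a physical type alone does not say
# what "amount" measures, so carry a unit alongside the type.
schema = {
    "amount": {"type": "double", "unit": "troy_ounce"},
    "price":  {"type": "double", "unit": "USD"},
}

def describe(field):
    """Render a field with both its physical type and its meaning."""
    meta = schema[field]
    return f"{field}: {meta['type']} ({meta['unit']})"

print(describe("amount"))  # amount: double (troy_ounce)
print(describe("price"))   # price: double (USD)
```

Two fields with the identical physical type carry entirely different meanings; only the extra annotation distinguishes them.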

We all need to start following the work on self-describing data formats more closely.

MapR Sandbox Fastest On-Ramp to Hadoop

Monday, March 23rd, 2015

MapR Sandbox Fastest On-Ramp to Hadoop

From the webpage:

The MapR Sandbox for Hadoop provides tutorials, demo applications, and browser-based user interfaces to let developers and administrators get started quickly with Hadoop. It is a fully functional Hadoop cluster running in a virtual machine. You can try our Sandbox now – it is completely free and available as a VMware or VirtualBox VM.

If you are a business intelligence analyst or a developer interested in self-service data exploration on Hadoop using SQL and BI Tools, the MapR Sandbox including Apache Drill will get you started quickly. You can download the Drill Sandbox here.

You of course know about the Hortonworks and Cloudera (at the very bottom of the page) sandboxes as well.

Don’t expect a detailed comparison of all three because the features and distributions change too quickly for that to be useful. And my interest is more in capturing the style or approach that may make a difference to a starting user.


I first saw this in a tweet by Kirk Borne.

Applying the Big Data Lambda Architecture

Sunday, October 27th, 2013

Applying the Big Data Lambda Architecture by Michael Hausenblas.

From the article:

Based on his experience working on distributed data processing systems at Twitter, Nathan Marz recently designed a generic architecture addressing common requirements, which he called the Lambda Architecture. Marz is well-known in Big Data: He’s the driving force behind Storm and at Twitter he led the streaming compute team, which provides and develops shared infrastructure to support critical real-time applications.

Marz and his team described the underlying motivation for building systems with the lambda architecture as:

  • The need for a robust system that is fault-tolerant, both against hardware failures and human mistakes.
  • To serve a wide range of workloads and use cases, in which low-latency reads and updates are required. Related to this point, the system should support ad-hoc queries.
  • The system should be linearly scalable, and it should scale out rather than up, meaning that throwing more machines at the problem will do the job.
  • The system should be extensible so that features can be added easily, and it should be easily debuggable and require minimal maintenance.

From a bird’s eye view, the lambda architecture has three major components that interact with new data coming in and respond to queries, which in this article are driven from the command line.
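The three layers usually named in lambda-architecture write-ups (batch, speed, and serving) can be sketched roughly as follows; this is plain Python, purely illustrative, and not code from the article:

```python
# Minimal lambda-architecture sketch: a batch layer precomputes views
# from the master dataset, a speed layer absorbs data that arrived since
# the last batch run, and a serving layer merges both at query time.
batch_view = {"alice": 100, "bob": 40}  # precomputed from the master dataset
speed_view = {}                          # incremental updates since last batch run

def ingest(user, count):
    """Speed layer: apply new data immediately, without waiting for a batch."""
    speed_view[user] = speed_view.get(user, 0) + count

def query(user):
    """Serving layer: answer from the merged batch and real-time views."""
    return batch_view.get(user, 0) + speed_view.get(user, 0)

ingest("alice", 5)       # new data arrives after the last batch run
print(query("alice"))    # 105  (100 from batch + 5 from speed layer)
print(query("bob"))      # 40   (batch view only)
```

Fault tolerance against human error falls out of this split: the speed layer's state is transient, and the batch layer can always be recomputed from the immutable master dataset.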

The goal of the article:

In this article, I employ the lambda architecture to implement what I call UberSocialNet (USN). This open-source project enables users to store and query acquaintanceship data. That is, I want to be able to capture whether I happen to know someone from multiple social networks, such as Twitter or LinkedIn, or from real-life circumstances. The aim is to scale out to several billions of users while providing low-latency access to the stored information. To keep the system simple and comprehensible, I limit myself to bulk import of the data (no capabilities to live-stream data from social networks) and provide only a very simple command-line user interface. The guts, however, use the lambda architecture.

Something a bit challenging for the start of the week. 😉

Drilling into Big Data with Apache Drill

Wednesday, August 7th, 2013

Drilling into Big Data with Apache Drill by Steven J Vaughan-Nichols.

From the post:

Apache Drill’s goal is nothing less than answering queries over petabytes of data and trillions of records in less than a second.

You can’t claim that the Apache Drill programmers think small. Their design goal is for Drill to scale to 10,000 servers or more and to process petabytes of data and trillions of records in less than a second.

If this sounds impossible, or at least very improbable, consider that the NSA already seems to be doing exactly the same kind of thing. If they can do it, open-source software can do it.

In an interview at OSCon, the major open source convention in Portland, OR, Ted Dunning, the chief application architect for MapR, a big data company, and a Drill mentor and committer, explained the reason for the project. “There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data in such formats as Avro, the Apache Hadoop data serialization system; JSON (JavaScript Object Notation); and Protocol Buffers, Google’s data interchange format.”

As Dunning explained, big business wants fast access to big data and none of the traditional solutions, such as a relational database management system (RDBMS), MapReduce, or Hive, can deliver those speeds.

Dunning continued, “This need was identified by Google and addressed internally with a system called Dremel.” Dremel was the inspiration for Drill, which also is meant to complement such open-source big data systems as Apache Hadoop. The difference between Hadoop and Drill is that while Hadoop is designed to achieve very high throughput, it’s not designed to achieve the sub-second latency needed for interactive data analysis and exploration.


At this point, Drill is very much a work in progress. “It’s not quite production quality at this point, but by third or fourth quarter of 2013 it will become quite usable.” Specifically, Drill should be in beta by the third quarter.

So, if Drill sounds interesting to you, you can start contributing as soon as you get up to speed. To do that, there’s a weekly Google Hangout on Tuesdays at 9am Pacific time and a Twitter feed at @ApacheDrill. And, of course, there’s an Apache Drill Wiki and users’ and developers’ mailing lists.

NSA claims, actually any claims by any government officials, have to be judged by President Obama announcing yesterday: “There is No Spying on Americans.”

It has been creeping along for a long time but the age of Newspeak is here.

But leaving doubtful comments by members of the government to one side, Apache Drill does sound like an exciting project!

Apache Drill

Sunday, May 19th, 2013

Michael Hausenblas at NoSQL Matters 2013 does a great lecture on Apache Drill.


Google’s Dremel Paper

He projects a “beta” for Apache Drill by the second quarter and GA by the end of the year.

Apache Drill User.

From the rationale:

There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel.

How do you handle ad hoc exploration of data sets as part of planning a topic map?

Being able to “test” merging against data prior to implementation sounds like a good idea.

Apache to Drill for big data in Hadoop

Tuesday, August 21st, 2012

Apache to Drill for big data in Hadoop

From the post:

A new Apache Incubator proposal should see the Drill project offering a new open source way to interactively analyse large scale datasets on distributed systems. Drill is inspired by Google’s Dremel but is designed to be more flexible in terms of supported query languages. Dremel has been in use by Google since 2006 and is now the engine that powers Google’s BigQuery analytics.

The project is being led at Apache by developers from MapR, where the early Drill development was done. Also contributing are Drawn to Scale and Concurrent. Requirements and design documentation will be contributed to the project by MapR. Hadoop is good for batch queries, but by allowing quicker queries of huge data sets, those data sets can be better explored. The Drill technology, like the Google Dremel technology, does not replace MapReduce or Hadoop systems. It works alongside them, offering a system which can analyse the output of the batch processing system and its pipelines, or be used to rapidly prototype larger-scale computations.

Drill comprises a query language layer with a parser and execution planner, a low-latency execution engine for executing the plan, nested data formats for data storage, and a scalable data source layer. The query language layer will focus on Drill’s own query language, DrQL, and the data source layer will initially use Hadoop as its source. The project overall will closely integrate with Hadoop, storing its data in Hadoop, supporting the Hadoop FileSystem and HBase, and supporting Hadoop data formats. Apache’s Hive project is also being considered as the basis for DrQL.
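To see how those layers relate, here is a schematic sketch (plain Python, purely illustrative of the layering; the toy grammar, function names, and source names are invented and bear no resemblance to Drill's implementation or to DrQL):

```python
# Illustrative layering only: parse -> plan -> execute over a pluggable source.

def parse(query):
    """Query language layer: a toy parser for 'SELECT <field>' queries."""
    _, field = query.split()
    return {"op": "project", "field": field}

def plan(ast, source_name):
    """Execution planner: bind the logical plan to a named data source."""
    return {"ast": ast, "source": source_name}

SOURCES = {  # data source layer: named, pluggable scanners
    "hdfs_events": lambda: [{"user": "alice", "ts": 1}, {"user": "bob", "ts": 2}],
}

def execute(physical_plan):
    """Execution engine: scan the bound source and apply the projection."""
    rows = SOURCES[physical_plan["source"]]()
    field = physical_plan["ast"]["field"]
    return [row[field] for row in rows]

result = execute(plan(parse("SELECT user"), "hdfs_events"))
print(result)  # ['alice', 'bob']
```

The separation matters for the goals described above: new sources slot in behind the planner without touching the query language, and the engine can be optimized for latency independently of either.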

The developers hope that by developing in the open at Apache, they will be able to create and establish Drill’s own APIs and ensure a robust, flexible architecture which will support a broad range of data sources, formats and query languages. The project has been accepted into the incubator and so far has an empty subversion repository.

Q: Is anyone working on/maintaining a map between the various Hadoop related query languages?