AsterixDB: Better than Hadoop? Interview with Mike Carey by Roberto V. Zicari.
The first two questions should be enough incentive to read the full interview and get your blood pumping in the middle of the week:
Q1. Why build a new Big Data Management System?
Mike Carey: When we started this project in 2009, we were looking at a “split universe” – there were your traditional parallel data warehouses, based on expensive proprietary relational DBMSs, and then there was the emerging Hadoop platform, which was free but low-function in comparison and wasn’t based on the many lessons known to the database community about how to build platforms to efficiently query large volumes of data. We wanted to bridge those worlds, and handle “modern data” while we were at it, by taking into account the key lessons from both sides.
To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”).
We felt that the Big Data world, once the initial Hadoop furor started to fade a little, would benefit from having a platform that could offer things like:
- a flexible data model that could handle data scenarios ranging from “schema first” to “schema never”;
- a full query language with at least the expressive power of SQL;
- support for data storage, data management, and automatic indexing;
- support for a wide range of query sizes, with query processing cost being proportional to the given query;
- support for continuous data ingestion, hence the accumulation of Big Data;
- the ability to scale up gracefully to manage and query very large volumes of data using commodity clusters; and,
- built-in support for today’s common “Big Data data types”, such as textual, temporal, and simple spatial data.
So that’s what we set out to do.
Q2. What was wrong with the current Open Source Big Data Stack?
Mike Carey: First, we should mention that some reviewers back in 2009 thought we were crazy or stupid (or both) to not just be jumping on the Hadoop bandwagon – but we felt it was important, as academic researchers, to look beyond Hadoop and be asking the question “okay, but after Hadoop, then what?”
We recognized that MapReduce was great for enabling developers to write massively parallel jobs against large volumes of data without having to “think parallel” – just focusing on one piece of data (map) or one key-sharing group of data (reduce) at a time. As a platform for “parallel programming for dummies”, it was (and still is) very enabling! It also made sense, for expedience, that people were starting to offer declarative languages like Pig and Hive, compiling them down into Hadoop MapReduce jobs to improve programmer productivity – raising the level much like what the database community did in moving to the relational model and query languages like SQL in the 70’s and 80’s.
One thing that we felt was wrong for sure in 2009 was that higher-level languages were being compiled into an assembly language with just two instructions, map and reduce. We knew from Ted Codd and relational history that more instructions – like the relational algebra’s operators – were important, and recognized that the data sorting that Hadoop always does between map and reduce wasn’t always needed.
Trying to simulate everything with just map and reduce on Hadoop made “get something better working fast” sense, but not longer-term technical sense. As for HDFS, what seemed “wrong” about it under Pig and Hive was its being based on giant byte stream files and not on “data objects”, which basically meant file scans for all queries and lack of indexing. We decided to ask “okay, suppose we’d known that Big Data analysts were going to mostly want higher-level languages – what would a Big Data platform look like if it were built ‘on purpose’ for such use, instead of having incrementally evolved from HDFS and Hadoop?”
Again, our idea was to try and bring together the best ideas from both the database world and the distributed systems world. (I guess you could say that we wanted to build a Big Data Reese’s Cup… :-) )
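An aside from me, not from the interview: the point about the sort between map and reduce not always being needed can be illustrated with a toy word count. The sketch below is plain Python and has nothing to do with Hadoop's or AsterixDB's actual code; the function names (`map_words`, `reduce_count`, `run_with_sort`, `run_with_hash_grouping`) are made up for illustration. It runs the same map and reduce functions twice, once with a sort-based grouping step in the MapReduce style and once with hash-based grouping, and both give the same counts.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map step: looks at one record (a line) and emits (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_count(key, values):
    """Reduce step: sees all the values that share one key."""
    return key, sum(values)


def run_with_sort(lines):
    """MapReduce-style grouping: sort every mapped pair by key, then reduce."""
    pairs = [pair for line in lines for pair in map_words(line)]
    pairs.sort(key=itemgetter(0))  # the sort that always happens in this style
    return dict(
        reduce_count(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    )


def run_with_hash_grouping(lines):
    """Alternative grouping: collect values per key in a hash table, no sort."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            groups[key].append(value)
    return dict(reduce_count(key, values) for key, values in groups.items())


lines = ["big data big ideas", "data management"]
sorted_result = run_with_sort(lines)
hashed_result = run_with_hash_grouping(lines)
```

For an aggregate like this the order of the keys does not matter, so the sort is extra work. That is the kind of choice a richer set of operators lets a query compiler make.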
I knew words would fail me if I tried to describe the AsterixDB logo so I simply reproduce the logo:
Read the interview in full and then grab a copy of AsterixDB.
The latest beta release is 0.8.6. The software is released under the Apache License 2.0.