Archive for the ‘Columnar Database’ Category

Parquet 1.0: Columnar Storage for Hadoop

Tuesday, July 30th, 2013

Announcing Parquet 1.0: Columnar Storage for Hadoop by Justin Kestelyn.

From the post:

In March we announced the Parquet project, the result of a collaboration between Twitter and Cloudera intended to create an open-source columnar storage format library for Apache Hadoop.

Today, we’re happy to tell you about a significant Parquet milestone: a 1.0 release, which includes major features and improvements made since the initial announcement. But first, we’ll revisit why columnar storage is so important for the Hadoop ecosystem.

What is Parquet and Columnar Storage?

Parquet is an open-source columnar storage format for Hadoop. Its goal is to provide a state of the art columnar storage layer that can be taken advantage of by existing Hadoop frameworks, and can enable a new generation of Hadoop data processing architectures such as Impala, Drill, and parts of the Hive ‘Stinger’ initiative. Parquet does not tie its users to any existing processing framework or serialization library.

The idea behind columnar storage is simple: instead of storing millions of records row by row (employee name, employee age, employee address, employee salary…) store the records column by column (all the names, all the ages, all the addresses, all the salaries). This reorganization provides significant benefits for analytical processing:

  • Since all the values in a given column have the same type, generic compression tends to work better and type-specific compression can be applied.
  • Since column values are stored consecutively, a query engine can skip loading columns whose values it doesn’t need to answer a query, and use vectorized operators on the values it does load.

These effects combine to make columnar storage a very attractive option for analytical processing.

A little over four (4) months from announcement to a 1.0 release!

Now that’s performance!

The Hadoop ecosystem just keeps getting better.

A Trillion Dollar Math Trick

Saturday, June 1st, 2013

A Trillion Dollar Math Trick by Dick Lipton.

Dick reviews a presentation by Mike Stonebraker at TTI Vanguard meeting on “Ginormous Systems” in DC.

In part:

In Mike’s wonderful talk he made seven points about the past, present, and the future of database technology. He has a great track record, so likely he is mostly right on his guesses. One of his predictions was about a way of re-organizing databases that has several remarkable properties:

  • It speeds up database operations 50x. That is to say, on typical queries—ones that companies actually do—it is fifty times faster than classical database implementations. As a theorist we like speedups, especially asymptotic ones. But 50x is pretty cool. That is enough to change a query from an hour to a minute.
  • It is not a new idea. But the time is finally right, and Mike predicts that future databases will use this method.
  • It is an idea that no one seems to know who invented it. I asked Mike, I asked other experts at the conference, and all shrugged and said effectively: “I have no idea.” Curious.

Let’s look quickly at the way databases work, and then consider the trick.

I won’t spoil the surprise for you, see Dick’s post for the details.

BTW, read the comments on historical uses of the same idea.

Then think about how to apply to topic maps.

I first saw this in Christophe Lalanne’s A bag of tweets / May 2013.

Best Practices…Columnar Databases

Wednesday, September 21st, 2011

Best Practices in the Use of Columnar Databases: How to select the workloads for columnar databases based on the benefits” provided by William Mcknight. (pdf)

Focuses on Calpont’s InfiniDB.

It is a nice summary of the principles of columnar databases.

Also has amusing observations such as:

MapReduce is a method of parallel reduction of tasks; a 25 year old idea that came out of the Lisp programming language. There are popular implementations of the framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.

It does make me curious about the use of columnar store databases for particular situations.

Read the whitepaper and see what you think. Comments welcome!