Archive for the ‘Column-Oriented’ Category

A Trillion Dollar Math Trick

Saturday, June 1st, 2013

A Trillion Dollar Math Trick by Dick Lipton.

Dick reviews a presentation by Mike Stonebraker at TTI Vanguard meeting on “Ginormous Systems” in DC.

In part:

In Mike’s wonderful talk he made seven points about the past, present, and the future of database technology. He has a great track record, so likely he is mostly right on his guesses. One of his predictions was about a way of re-organizing databases that has several remarkable properties:

  • It speeds up database operations 50x. That is to say, on typical queries—ones that companies actually do—it is fifty times faster than classical database implementations. As a theorist we like speedups, especially asymptotic ones. But 50x is pretty cool. That is enough to change a query from an hour to a minute.
  • It is not a new idea. But the time is finally right, and Mike predicts that future databases will use this method.
  • It is an idea that no one seems to know who invented it. I asked Mike, I asked other experts at the conference, and all shrugged and said effectively: “I have no idea.” Curious.

Let’s look quickly at the way databases work, and then consider the trick.

I won’t spoil the surprise for you, see Dick’s post for the details.

BTW, read the comments on historical uses of the same idea.

Then think about how to apply to topic maps.

I first saw this in Christophe Lalanne’s A bag of tweets / May 2013.

Google open sources Supersonic query engine

Wednesday, October 17th, 2012

Google open sources Supersonic query engine

From the post:

Google has released Supersonic, a query engine designed to work efficiently with column-oriented databases. The announcement suggests that Supersonic would be “extremely useful for creating a column oriented database back-end”, and that it aims to offer “second-to-none execution times”. As part of achieving that design goal, the C++ library uses many low-level, cache-aware optimisations, SIMD instructions and vectorised execution so that it can make the best use of modern pipelined CPUs, while still working as a single process.

Supersonic can perform “Operations” on columnar data such as Compute, Filter, Sort, HashJoin, and more; on views these operations can be chained together to produce a final result. Data for these operations is currently held in memory; there is no current built-in data storage format, but the developers say that there is “a strong intention of developing one”. Other work in progress includes the provision of wide test coverage for the library. A tarball archive of the code is available to download, while the source can be git cloned from the Google Code project pages.

Do you ever wonder what “secret” software must be like to have packages like this in open source?

Best Practices…Columnar Databases

Wednesday, September 21st, 2011

Best Practices in the Use of Columnar Databases: How to select the workloads for columnar databases based on the benefits” provided by William Mcknight. (pdf)

Focuses on Calpont’s InfiniDB.

It is a nice summary of the principles of columnar databases.

Also has amusing observations such as:

MapReduce is a method of parallel reduction of tasks; a 25 year old idea that came out of the Lisp programming language. There are popular implementations of the framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.

It does make me curious about the use of columnar store databases for particular situations.

Read the whitepaper and see what you think. Comments welcome!

Groonga

Friday, September 2nd, 2011

Groonga

From the webpage:

Groonga is an open-source fulltext search engine and column store. It lets you write high-performance applications that requires fulltext search.

The latest release is 1.2.5, released 2011-08-29.

Most of the documentation is in Japanese so I can’t comment on it.

Think of this as an opportunity to (hopefully) learn some Japanese. Given the rate of computer science research in Japan it will not be wasted effort.

PS: If you already read Japanese, feel free to contribute some comments on Groonga.

Query Execution in Column-Oriented Database Systems

Tuesday, August 23rd, 2011

Query Execution in Column-Oriented Database Systems by Daniel Abadi (Ph.D. thesis).

Apologies for the length of the quote but this is an early dissertation on column-oriented data systems and I want to entice you into reading it. Not so much for the techniques, which are now common but the analysis.

Abstract:

There are two obvious ways to map a two-dimension relational database table onto a one-dimensional storage interface: store the table row-by-row, or store the table column-by-column. Historically, database system implementations and research have focused on the row-by row data layout, since it performs best on the most common application for database systems: business transactional data processing. However, there are a set of emerging applications for database systems for which the row-by-row layout performs poorly. These applications are more analytical in nature, whose goal is to read through the data to gain new insight and use it to drive decision making and planning.

In this dissertation, we study the problem of poor performance of row-by-row data layout for these emerging applications, and evaluate the column-by-column data layout opportunity as a solution to this problem. There have been a variety of proposals in the literature for how to build a database system on top of column-by-column layout. These proposals have different levels of implementation effort, and have different performance characteristics. If one wanted to build a new database system that utilizes the column-by-column data layout, it is unclear which proposal to follow. This dissertation provides (to the best of our knowledge) the only detailed study of multiple implementation approaches of such systems, categorizing the different approaches into three broad categories, and evaluating the tradeoffs between approaches. We conclude that building a query executer specifically designed for the column-by-column query layout is essential to achieve good performance.

Consequently, we describe the implementation of C-Store, a new database system with a storage layer and query executer built for column-by-column data layout. We introduce three new query execution techniques that significantly improve performance. First, we look at the problem of integrating compression and execution so that the query executer is capable of directly operating on compressed data. This improves performance by improving I/O (less data needs to be read off disk), and CPU (the data need not be decompressed). We describe our solution to the problem of executer extensibility – how can new compression techniques be added to the system without having to rewrite the operator code? Second, we analyze the problem of tuple construction (stitching together attributes from multiple columns into a row-oriented ”tuple”). Tuple construction is required when operators need to access multiple attributes from the same tuple; however, if done at the wrong point in a query plan, a significant performance penalty is paid. We introduce an analytical model and some heuristics to use that help decide when in a query plan tuple construction should occur. Third, we introduce a new join technique, the “invisible join” that improves performance of a specific type of join that is common in the applications for which column-by-column data layout is a good idea.

Finally, we benchmark performance of the complete C-Store database system against other column-oriented database system implementation approaches, and against row-oriented databases. We benchmark two applications. The first application is a typical analytical application for which column-by-column data layout is known to outperform row-by-row data layout. The second application is another emerging application, the Semantic Web, for which column-oriented database systems are not currently used. We find that on the first application, the complete C-Store system performed 10 to 18 times faster than alternative column-store implementation approaches, and 6 to 12 times faster than a commercial database system that uses a row-by-row data layout. On the Semantic Web application, we find that C-Store outperforms other state-of-the-art data management techniques by an order of magnitude, and outperforms other common data management techniques by almost two orders of magnitude. Benchmark queries, which used to take multiple minutes to execute, can now be answered in several seconds.