Announcing Parquet 1.0: Columnar Storage for Hadoop by Justin Kestelyn.
From the post:
In March we announced the Parquet project, a collaboration between Twitter and Cloudera to create an open-source columnar storage format library for Apache Hadoop.
Today, we’re happy to tell you about a significant Parquet milestone: a 1.0 release, which includes major features and improvements made since the initial announcement. But first, we’ll revisit why columnar storage is so important for the Hadoop ecosystem.
What Are Parquet and Columnar Storage?
Parquet is an open-source columnar storage format for Hadoop. Its goal is to provide a state-of-the-art columnar storage layer that existing Hadoop frameworks can take advantage of, and that can enable a new generation of Hadoop data processing architectures such as Impala, Drill, and parts of the Hive ‘Stinger’ initiative. Parquet does not tie its users to any existing processing framework or serialization library.
The idea behind columnar storage is simple: instead of storing millions of records row by row (employee name, employee age, employee address, employee salary…), store the records column by column (all the names, all the ages, all the addresses, all the salaries). This reorganization provides significant benefits for analytical processing:
- Since all the values in a given column have the same type, generic compression tends to work better and type-specific compression can be applied.
- Since column values are stored consecutively, a query engine can skip loading columns whose values it doesn’t need to answer a query, and use vectorized operators on the values it does load.
These effects combine to make columnar storage a very attractive option for analytical processing.
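To make the row-versus-column contrast concrete, here is a minimal, self-contained Python sketch. It is illustrative only, not Parquet's actual on-disk format, and the employee data is fabricated for the demo. It stores the same records both ways, compresses each layout with generic zlib compression, and then answers a single-column query by decompressing only the salary column:

```python
import json
import random
import zlib

random.seed(0)

# Toy "employee" records matching the post's example fields.
records = [
    {
        "name": f"employee_{i}",
        "age": random.randint(21, 65),
        "address": f"{i} Main St",
        "salary": 50_000 + (i % 1_000),
    }
    for i in range(100_000)
]

# Row layout: each record's fields are stored together.
row_blob = zlib.compress(json.dumps(records).encode())

# Column layout: all values of each column are stored together,
# and each column is compressed independently.
columns = {key: [rec[key] for rec in records] for key in records[0]}
column_blobs = {
    key: zlib.compress(json.dumps(values).encode())
    for key, values in columns.items()
}

print(f"row layout, compressed:    {len(row_blob):>9,} bytes")
col_total = sum(len(b) for b in column_blobs.values())
print(f"column layout, compressed: {col_total:>9,} bytes")

# A query over one column only has to read and decompress that
# column; the name, age, and address blobs are never touched.
salaries = json.loads(zlib.decompress(column_blobs["salary"]))
print(f"average salary: {sum(salaries) / len(salaries):,.2f}")
```

On homogeneous columns like these, the column layout typically compresses better, and the salary query never reads the other three columns. Parquet builds on this same basic idea with type-specific encodings and per-column metadata.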
A little over four months from announcement to a 1.0 release!
Now that’s performance!
The Hadoop ecosystem just keeps getting better.