Archive for the ‘Parquet’ Category

Evolving Parquet as self-describing data format –

Monday, April 6th, 2015

Evolving Parquet as self-describing data format – New paradigms for consumerization of Hadoop data by Neeraja Rentachintala.

From the post:

With Hadoop becoming more prominent in customer environments, one of the frequent questions we hear from users is what should be the storage format to persist data in Hadoop. The data format selection is a critical decision especially as Hadoop evolves from being about cheap storage to a pervasive query and analytics platform. In this blog, I want to briefly describe self-describing data formats, how they are gaining a lot of interest as a new management paradigm to consumerize Hadoop data in organizations and the work we have been doing as part of the Parquet community to evolve Parquet as fully self-describing format.

About Parquet

Apache Parquet is a columnar storage format for the Hadoop ecosystem. Since its inception about 2 years ago, Parquet has gotten very good adoption due to the highly efficient compression and encoding schemes used that demonstrate significant performance benefits. Its ground-up design allows it to be used regardless of any data processing framework, data model, and programming language used in Hadoop ecosystem. A variety of tools and frameworks including MapReduce, Hive, Impala, and Pig provided the ability to work with Parquet data and a number of data models such as AVRO, Protobuf, and Thrift have been expanded to be used with Parquet as storage format. Parquet is widely adopted by a number of major companies including tech giants such as Twitter and Netflix.

Self-describing data formats and their growing role in analytics on Hadoop/NoSQL

Self-describing data is where schema or structure is embedded in the data itself. The schema is comprised of metadata such as element names, data types, compression/encoding scheme used (if any), statistics, and a lot more. There are a variety of data formats including Parquet, XML, JSON, and NoSQL databases such as HBase that belong to the spectrum of self-describing data and typically vary in the level of metadata they expose about themselves.

While the self-describing data has been in rise with NoSQL databases (e.g., the Mongo BSON model) for a while now empowering developers to be agile and iterative in application development cycle, the prominence of these has been growing in analytics as well when it comes to Hadoop. So what is driving this? The answer is simple – it’s the same reason – i.e., the requirement to be agile and iterative in BI/analytics.

More and more organizations are now using Hadoop as a data hub to store all their data assets. These data assets often contain existing datasets offloaded from the traditional DBMS/DWH systems, but also new types of data from new data sources (such as IOT sensors, logs, clickstream) including external data (such as social data, 3rd party domain specific datasets). The Hadoop clusters in these organizations are often multi-tenant and shared by multiple groups in the organizations. The traditional data management paradigms of creating centralized models/metadata definitions upfront before the data can be used for analytics are quickly becoming bottlenecks in Hadoop environments. The new complex and schema-less data models are hard to map to relational models and modeling data upfront for unknown ad hoc business questions and data discovery needs is challenging and keeping up with the schema changes as the data models evolve is practically impossible.

By pushing metadata to data and then using tools that can understand this metadata available in self-describing formats to expose it directly for end user consumption, the analysis life cycles can become drastically more agile and iterative. For example, using Apache Drill, the world’s first schema-free SQL query engine, you can query self-describing data (in files or NoSQL databases such as HBase/MongoDB) immediately without having to define and manage schema overlay definitions in centralize metastores. Another benefit of this is business self-service where the users don’t need to rely on IT departments/DBAs constantly for adding/changing attributes to centralized models, but rather focus on getting answers to the business questions by performing queries directly on raw data.

Think of it this way, Hadoop scaled processing by pushing processing to the nodes that have data. Analytics on Hadoop/NoSQL systems can be scaled to the entire organization by pushing more and more metadata to the data and using tools that leverage that metadata automatically to expose it for analytics. The more self-describing the data formats are (i.e., the more metadata they contain about data), the smarter the tools that leverage the metadata can be.

The post walks through example cases and points to additional resources.

To become self-describing, Parquet will need to move beyond assigning data types to tokens. In the example given, “amount” has the datatype “double,” but that doesn’t tell me if we are discussing grams, Troy ounces (for precious metals), carats or pounds.

We all need to start following the work on self-describing data formats more closely.

Open-sourcing tools for Hadoop

Saturday, November 22nd, 2014

Open-sourcing tools for Hadoop by Colin Marc.

From the post:

Stripe’s batch data infrastructure is built largely on top of Apache Hadoop. We use these systems for everything from fraud modeling to business analytics, and we’re open-sourcing a few pieces today:

Timberlake

Timberlake is a dashboard that gives you insight into the Hadoop jobs running on your cluster. Jeff built it as a replacement for YARN’s ResourceManager and MRv2’s JobHistory server, and it has some features we’ve found useful:

  • Map and reduce task waterfalls and timing plots
  • Scalding and Cascading awareness
  • Error tracebacks for failed jobs

Brushfire

Avi wrote a Scala framework for distributed learning of ensemble decision tree models called Brushfire. It’s inspired by Google’s PLANET, but built on Hadoop and Scalding. Designed to be highly generic, Brushfire can build and validate random forests and similar models from very large amounts of training data.

Sequins

Sequins is a static database for serving data in Hadoop’s SequenceFile format. I wrote it to provide low-latency access to key/value aggregates generated by Hadoop. For example, we use it to give our API access to historical fraud modeling features, without adding an online dependency on HDFS.

Herringbone

At Stripe, we use Parquet extensively, especially in tandem with Cloudera Impala. Danielle, Jeff, and Avi wrote Herringbone (a collection of small command-line utilities) to make working with Parquet and Impala easier.

More open source tools for your Hadoop installation!

I am considering creating a list of closed source tools for Hadoop. It would be shorter and easier to maintain than a list of open source tools for Hadoop. 😉

Convert Existing Data into Parquet

Friday, May 23rd, 2014

Convert Existing Data into Parquet by Uri Laserson.

From the post:

Learn how to convert your data to the Parquet columnar format to get big performance gains.

Using a columnar storage format for your data offers significant performance advantages for a large subset of real-world queries. (Click here for a great introduction.)

Last year, Cloudera, in collaboration with Twitter and others, released a new Apache Hadoop-friendly, binary, columnar file format called Parquet. (Parquet was recently proposed for the ASF Incubator.) In this post, you will get an introduction to converting your existing data into Parquet format, both with and without Hadoop.

Actually, between Uri’s post and my pointing to it, Parquet has been accepted into the ASF Incubator!

All the more reason to start following this project.

Enjoy!

Hadoop Weekly – October 28, 2013

Tuesday, October 29th, 2013

Hadoop Weekly – October 28, 2013 by Joe Crobak.

A weekly blog post that tracks all things in the Hadoop ecosystem.

I will keep posting on Hadoop things of particular interest for topic maps but will also be pointing to this blog for those who want/need more Hadoop coverage.

Parquet 1.0: Columnar Storage for Hadoop

Tuesday, July 30th, 2013

Announcing Parquet 1.0: Columnar Storage for Hadoop by Justin Kestelyn.

From the post:

In March we announced the Parquet project, the result of a collaboration between Twitter and Cloudera intended to create an open-source columnar storage format library for Apache Hadoop.

Today, we’re happy to tell you about a significant Parquet milestone: a 1.0 release, which includes major features and improvements made since the initial announcement. But first, we’ll revisit why columnar storage is so important for the Hadoop ecosystem.

What is Parquet and Columnar Storage?

Parquet is an open-source columnar storage format for Hadoop. Its goal is to provide a state of the art columnar storage layer that can be taken advantage of by existing Hadoop frameworks, and can enable a new generation of Hadoop data processing architectures such as Impala, Drill, and parts of the Hive ‘Stinger’ initiative. Parquet does not tie its users to any existing processing framework or serialization library.

The idea behind columnar storage is simple: instead of storing millions of records row by row (employee name, employee age, employee address, employee salary…) store the records column by column (all the names, all the ages, all the addresses, all the salaries). This reorganization provides significant benefits for analytical processing:

  • Since all the values in a given column have the same type, generic compression tends to work better and type-specific compression can be applied.
  • Since column values are stored consecutively, a query engine can skip loading columns whose values it doesn’t need to answer a query, and use vectorized operators on the values it does load.

These effects combine to make columnar storage a very attractive option for analytical processing.

A little over four (4) months from announcement to a 1.0 release!

Now that’s performance!

The Hadoop ecosystem just keeps getting better.

Introducing Parquet: Efficient Columnar Storage for Apache Hadoop

Thursday, March 14th, 2013

Introducing Parquet: Efficient Columnar Storage for Apache Hadoop by Justin Kestelyn.

From the post:

We’d like to introduce a new columnar storage format for Hadoop called Parquet, which started as a joint project between Twitter and Cloudera engineers.

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

Parquet is built from the ground up with complex nested data structures in mind. We adopted the repetition/definition level approach to encoding such data structures, as described in Google’s Dremel paper; we have found this to be a very efficient method of encoding data in non-trivial object schemas.

Parquet is built to support very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. We separate the concepts of encoding and compression, allowing Parquet consumers to implement operators that work directly on encoded data without paying decompression and decoding penalty when possible.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.

Under heavy development so watch closely!