Evolving Parquet as self-describing data format – New paradigms for consumerization of Hadoop data by Neeraja Rentachintala.
From the post:
With Hadoop becoming more prominent in customer environments, one of the frequent questions we hear from users is which storage format they should use to persist data in Hadoop. The choice of data format is a critical decision, especially as Hadoop evolves from being about cheap storage to a pervasive query and analytics platform. In this blog, I want to briefly describe self-describing data formats, how they are gaining interest as a new management paradigm to consumerize Hadoop data in organizations, and the work we have been doing as part of the Parquet community to evolve Parquet into a fully self-describing format.
About Parquet
Apache Parquet is a columnar storage format for the Hadoop ecosystem. Since its inception about two years ago, Parquet has seen strong adoption thanks to its highly efficient compression and encoding schemes, which deliver significant performance benefits. Its ground-up design allows it to be used regardless of the data processing framework, data model, or programming language in use in the Hadoop ecosystem. A variety of tools and frameworks, including MapReduce, Hive, Impala, and Pig, provide the ability to work with Parquet data, and a number of data models such as Avro, Protobuf, and Thrift have been extended to use Parquet as their storage format. Parquet is widely adopted by major companies, including tech giants such as Twitter and Netflix.
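The post itself has no code, but a minimal sketch makes the "schema travels with the data" point concrete. The example below assumes the pyarrow library (my choice, not mentioned in the post) and a hypothetical file name; it shows that a Parquet file can be read back with no external schema definition at all.

```python
# Minimal sketch (assumes pyarrow; file name is hypothetical): write a small
# table to Parquet, then recover its schema from the file alone.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "amount": pa.array([9.99, 14.50, 3.25], type=pa.float64()),
})

pq.write_table(table, "example.parquet", compression="snappy")

# The reader discovers column names and types from the file itself.
print(pq.read_schema("example.parquet"))
```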
Self-describing data formats and their growing role in analytics on Hadoop/NoSQL
Self-describing data is data in which the schema or structure is embedded in the data itself. The schema comprises metadata such as element names, data types, the compression/encoding scheme used (if any), statistics, and a lot more. A variety of data formats, including Parquet, XML, and JSON, as well as NoSQL databases such as HBase, belong to the spectrum of self-describing data; they typically vary in the level of metadata they expose about themselves.
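To make that list of metadata concrete, here is a hedged sketch, again assuming pyarrow and the example.parquet file from the previous sketch, that inspects the names, physical types, encodings, compression codec, and per-column statistics a Parquet file carries in its own footer.

```python
# Inspect the metadata Parquet embeds alongside the data (assumes pyarrow and
# the example.parquet file written in the earlier sketch).
import pyarrow.parquet as pq

meta = pq.ParquetFile("example.parquet").metadata
col = meta.row_group(0).column(0)

print(meta.num_rows, meta.num_columns)          # row/column counts
print(col.path_in_schema, col.physical_type)    # element name and data type
print(col.compression, col.encodings)           # compression/encoding scheme
print(col.statistics.min, col.statistics.max)   # per-column statistics
```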
While self-describing data has been on the rise with NoSQL databases (e.g., the MongoDB BSON model) for a while now, empowering developers to be agile and iterative in the application development cycle, its prominence has been growing in Hadoop analytics as well. So what is driving this? The answer is simple: the same requirement to be agile and iterative, now in BI/analytics.
More and more organizations are now using Hadoop as a data hub to store all their data assets. These data assets often contain existing datasets offloaded from traditional DBMS/DWH systems, but also new types of data from new data sources (such as IoT sensors, logs, and clickstreams), including external data (such as social data and third-party domain-specific datasets). The Hadoop clusters in these organizations are often multi-tenant and shared by multiple groups. The traditional data management paradigm of creating centralized models/metadata definitions upfront, before the data can be used for analytics, is quickly becoming a bottleneck in Hadoop environments. The new complex and schema-less data models are hard to map to relational models; modeling data upfront for unknown ad hoc business questions and data discovery needs is challenging; and keeping up with schema changes as the data models evolve is practically impossible.
By pushing metadata into the data itself, and then using tools that can understand the metadata available in self-describing formats and expose it directly for end-user consumption, analysis life cycles can become drastically more agile and iterative. For example, using Apache Drill, the world’s first schema-free SQL query engine, you can query self-describing data (in files or in NoSQL databases such as HBase/MongoDB) immediately, without having to define and manage schema overlay definitions in centralized metastores. Another benefit is business self-service: users don’t need to rely constantly on IT departments/DBAs to add or change attributes in centralized models, but can instead focus on getting answers to business questions by querying the raw data directly.
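Drill itself is not demonstrated in the excerpt. As a stand-in for the same idea, querying a self-describing file directly with no schema registered in a central metastore, here is a sketch that again assumes pyarrow and the hypothetical example.parquet file, with the roughly equivalent Drill SQL shown only as a comment.

```python
# Illustration only: project and filter columns whose names and types are
# discovered from the Parquet file at read time, with no metastore involved.
# In Drill the query would be roughly:
#   SELECT user_id, amount FROM dfs.`example.parquet` WHERE amount > 5.0;
import pyarrow.parquet as pq
import pyarrow.compute as pc

table = pq.read_table("example.parquet")   # schema comes from the file itself
mask = pc.greater(table["amount"], 5.0)    # boolean filter over a column
print(table.filter(mask).select(["user_id", "amount"]))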
Think of it this way: Hadoop scaled processing by pushing computation to the nodes that hold the data. Analytics on Hadoop/NoSQL systems can be scaled to the entire organization by pushing more and more metadata into the data and using tools that automatically leverage that metadata to expose it for analytics. The more self-describing the data formats are (i.e., the more metadata they contain about the data), the smarter the tools that leverage that metadata can be.
…
The post walks through example cases and points to additional resources.
To become fully self-describing, Parquet will need to move beyond assigning data types to tokens. In the example given, “amount” has the data type “double,” but that doesn’t tell me whether we are discussing grams, troy ounces (for precious metals), carats, or pounds.
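Parquet files do already carry arbitrary key/value metadata in their footer, and Arrow-based tooling can round-trip per-field annotations through it, so there is at least a place to hang this kind of information today. A hedged sketch, assuming pyarrow and a hypothetical “unit” key that is not part of any Parquet or Arrow standard:

```python
# Hedged sketch (assumes pyarrow; the "unit" key is hypothetical, not part of
# any Parquet standard): attach a unit annotation to a field and recover it
# from the file, right next to the data type.
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field("metal", pa.string()),
    pa.field("amount", pa.float64(), metadata={"unit": "troy_ounce"}),
])
table = pa.table({"metal": ["gold"], "amount": [1.5]}, schema=schema)
pq.write_table(table, "holdings.parquet")

# The annotation travels with the file: no external data dictionary needed.
field = pq.read_schema("holdings.parquet").field("amount")
print(field.type, field.metadata)   # double {b'unit': b'troy_ounce'}
```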
We all need to start following the work on self-describing data formats more closely.