HAIL – Only Aggressive Elephants are Fast Elephants
From the post:
Typically we store data based on any one of the different physical layouts (such as row, column, vertical, PAX, etc.). This choice determines its suitability for a certain kind of workload while making it less optimal for other kinds of workloads. Can we store data under different layouts at the same time, especially within an HDFS environment where each block is replicated a few times? This is the big idea that HAIL (Hadoop Aggressive Indexing Library) pursues.
At a very high level, understanding how HAIL works means looking at the three distinct workflows the system is organized around:
- The data/file upload pipeline
- The indexing pipeline
- The query pipeline
Every unit of information makes its journey through these three pipelines.
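To make the three pipelines concrete, here is a minimal in-memory sketch in Python. It is not HAIL's code: the function names (`upload`, `index_block`, `query`) are hypothetical, rows are plain dictionaries rather than HDFS blocks, and "a different layout per replica" is modeled simply as a different sort order per replica. It only illustrates why keeping each replica in a different order can help more than one kind of query.

```python
from bisect import bisect_left, bisect_right


def upload(rows, block_size):
    """Upload pipeline (sketch): split incoming rows into fixed-size blocks."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def index_block(block, sort_keys):
    """Indexing pipeline (sketch): one replica per key, each sorted differently.

    Returns {key: (sorted_key_values, rows_sorted_by_key)}.
    Assumption: one replica per sort key, standing in for HDFS replication.
    """
    replicas = {}
    for key in sort_keys:
        rows = sorted(block, key=lambda row: row[key])
        replicas[key] = ([row[key] for row in rows], rows)
    return replicas


def query(replicated_blocks, key, low, high):
    """Query pipeline (sketch): return rows with low <= row[key] <= high."""
    hits = []
    for replicas in replicated_blocks:
        if key in replicas:
            # A replica sorted on this key exists: binary search the range.
            keys, rows = replicas[key]
            hits.extend(rows[bisect_left(keys, low):bisect_right(keys, high)])
        else:
            # No suitable replica: fall back to a full scan of any replica.
            _, rows = next(iter(replicas.values()))
            hits.extend(row for row in rows if low <= row[key] <= high)
    return hits
```

In this sketch a range query on an attribute that some replica is sorted on costs a binary search per block, while a query on any other attribute degrades to a scan, which is the trade-off a single fixed layout forces on every query.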
Be sure to see the original paper.
How much of what we “know” about modeling is driven by the needs of ancestral storage layouts?
Given the performance of modern chips, are those “needs” still valid considerations?
Or perhaps better, at what size of data store or processing requirement do the needs of the physical storage model reassert themselves?
Not just a performance question but also one of uniformity of identification.
What was once a “performance” requirement, that data have some common system of identification, may no longer be the case.