Apache Hadoop and Data Agility by Ofer Mendelevitch.
From the post:
In a recent blog post I mentioned the 4 reasons for using Hadoop for data science. In this blog post I would like to dive deeper into the last of these reasons: data agility.
In most existing data architectures, based on relational database systems, the data schema is of central importance, and needs to be designed and maintained carefully over the lifetime of the project. Furthermore, whatever data fits into the schema will be stored, and everything else typically gets ignored and lost. Changing the schema is a significant undertaking, one that most IT organizations don’t take lightly. In fact, it is not uncommon for a schema change in an operational RDBMS system to take 6-12 months if not more.
Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.
If a schema is supplied “on read,” how is data validation accomplished?
I don’t mean in terms of datatypes such as string, integer, double, etc. Those are trivial forms of data validation.
How do we validate the semantics of data when a schema is supplied on read?
Mistakes do happen in RDBMS systems, but with a schema, which defines data semantics, applications can at least attempt to police those semantics.
I don’t doubt that schema “on read” supplies a lot of useful flexibility, but how do we limit the damage that flexibility can cause?
For example, many years ago, area codes (for telephones) in the USA were tied to geographic exchanges. That is no longer true in many cases, but data from that era still exists in the bowels of some data stores.
Let’s assume I have older data that has area codes tied to geographic areas and newer data that has area codes that are not. Without a schema to define the area code data in both cases, how would I know to treat the area code data differently?
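To make the question concrete, here is a minimal Python sketch of one possible answer: each record carries a tag stating what its area code means, and the reader refuses to guess when the tag is missing. Everything in it is my own assumption for illustration, including the `PhoneRecord` type, the `semantics` tag values, the `region_for` function and the toy lookup table; none of it comes from the quoted post or from any Hadoop API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical semantic tags, invented for this sketch.
GEOGRAPHIC = "area_code_is_geographic"
NON_GEOGRAPHIC = "area_code_not_geographic"

# Toy lookup table, for illustration only.
REGION_BY_AREA_CODE = {"212": "New York, NY", "404": "Atlanta, GA"}


@dataclass
class PhoneRecord:
    area_code: str
    number: str
    # None models data that was written with no statement of its semantics.
    semantics: Optional[str] = None


def region_for(record: PhoneRecord) -> Optional[str]:
    """Return a region only if the record declares geographic semantics."""
    if record.semantics is None:
        # Without recorded semantics the reader can only guess, so refuse.
        raise ValueError("no semantics recorded; cannot interpret area code")
    if record.semantics == GEOGRAPHIC:
        return REGION_BY_AREA_CODE.get(record.area_code)
    # Newer data: the area code says nothing about location.
    return None


old = PhoneRecord("404", "5550100", GEOGRAPHIC)
new = PhoneRecord("404", "5550100", NON_GEOGRAPHIC)
untagged = PhoneRecord("404", "5550100")
```

Here `region_for(old)` yields a region, `region_for(new)` yields nothing, and `region_for(untagged)` raises an error. The catch is that the tag has to be recorded when the data is written, which is a small piece of schema “on write” after all.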
I concede that schema “on read” can be quite flexible.
On the other hand, let’s not discount the value of schema “on write” as well.