Using Apache Avro by Boris Lublinsky.
From the post:
Avro[1] is a recent addition to Apache’s Hadoop family of projects. Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages.
Avro provides functionality that is similar to the other marshalling systems such as Thrift, Protocol Buffers, etc. The main differentiators of Avro include[2]:
- “Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
- Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
- No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.”
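To make that last point concrete, here is a minimal sketch of how Avro resolves a writer's schema against a reader's schema by field name, using Avro's Java GenericDatum API. The record, schemas, and field names are my own illustration, not taken from the post:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroResolutionSketch {
  public static void main(String[] args) throws Exception {
    // "Old" schema the data was written with.
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    // "New" schema the reader expects: 'age' is gone and 'email' has been
    // added with a default value. Differences are resolved by field name.
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"unknown\"}]}");

    // Serialize a record with the writer's schema (no tags, no field IDs).
    GenericRecord record = new GenericData.Record(writerSchema);
    record.put("name", "Ada");
    record.put("age", 36);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(record, encoder);
    encoder.flush();

    // Read it back with both schemas present: Avro matches 'name' by name,
    // drops 'age', and fills 'email' from the reader schema's default.
    BinaryDecoder decoder =
        DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord resolved =
        new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
            .read(null, decoder);

    // Prints the resolved record, e.g. {"name": "Ada", "email": "unknown"}
    System.out.println(resolved);
  }
}
```

The resolution happens entirely at read time, driven by the pair of schemas: which fields are kept, dropped, or defaulted falls out of matching field names. That is the mechanism the rest of this post questions.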
I wonder about that symbolic resolution of differences using field names. At the very least, the resolution should be repeatable and shareable.
By repeatable I mean that six months, or even six weeks, from now one can still understand the resolution. It is not much use if the transformation is opaque even to its author.
And shareable should mean that I can hand the resolution to someone else, who can then decide to follow it, reject it, or modify it.
In another lifetime I was a sysadmin. I can count on less than one finger the number of times I would have followed a symbolic resolution that was not transparent. Simply not done.
Wait until the data folks, who must be incredibly trusting (anyone have some candy?), encounter someone who cares about critical systems and data.
Topic maps can help with that encounter.