RProtoBuf: Efficient Cross-Language Data Serialization in R by Dirk Eddelbuettel, Murray Stokely, and Jeroen Ooms.
Abstract:
Modern data collection and analysis pipelines often involve a sophisticated mix of applications written in general purpose and specialized programming languages. Many formats commonly used to import and export data between different programs or systems, such as CSV or JSON, are verbose, inefficient, not type-safe, or tied to a specific programming language. Protocol Buffers are a popular method of serializing structured data between applications while remaining independent of programming languages or operating systems. They offer a unique combination of features, performance, and maturity that seems particularly well suited for data-driven applications and numerical computing. The RProtoBuf package provides a complete interface to Protocol Buffers from the R environment for statistical computing. This paper outlines the general class of data serialization requirements for statistical computing, describes the implementation of the RProtoBuf package, and illustrates its use with example applications in large-scale data collection pipelines and web services.
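To make that concrete, here is a minimal round-trip sketch in R. It leans only on the tutorial.Person demo message that ships with the package (defined in its bundled addressbook.proto); the field values are made up for illustration:

    library(RProtoBuf)  # loading the package also imports its demo addressbook.proto

    # Build a message of the demo type tutorial.Person
    p <- new(tutorial.Person, name = "Ada", id = 1)
    p$email <- "ada@example.org"

    # Serialize to a compact, language-neutral raw vector ...
    bytes <- serialize(p, NULL)

    # ... and parse the bytes back, which requires knowing the schema
    q <- read(tutorial.Person, bytes)
    cat(as.character(q))  # human-readable text format of the message

The raw bytes in the middle are opaque without tutorial.Person in hand: the wire format does not embed the schema, which is exactly the point that follows.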
Anyone using RProtoBuf, or any other encoding where a “schema” is separated from data, needs to assign someone the task of reuniting the data with its schema.
Sandra Blakeslee reported on the consequences of failing to document data in “Lost on Earth: Wealth of Data Found in Space,” published some fourteen years ago this coming March 20th.
In attempting to recover data from a Viking mission, one NASA staffer observed:
After tracking down the data, Mr. Eliason looked up the NASA documents that described how they were entered. “It was written in technical jargon,” he said. “Maybe it was clear to the person who wrote it but it was not clear to me 20 years later.” (emphasis added)
You may say, “…but that’s history, we know better now…,” but can you name the person responsible for documenting your data and/or the steps for processing it? Is that documentation current?
I have no problem with binary formats for data interchange inside processing pipelines. But data going into a pipeline should be converted from a documented format, and data coming out of a pipeline should be serialized into a documented format.
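As a sketch of that discipline, again leaning on the package’s demo tutorial.Person type (the CSV columns here are hypothetical, chosen to match its fields):

    library(RProtoBuf)  # imports the demo addressbook.proto defining tutorial.Person

    # Inbound: start from a documented, human-readable format (CSV)
    df <- read.csv(text = "name,id\nAlice,1\nBob,2", stringsAsFactors = FALSE)
    msgs <- lapply(seq_len(nrow(df)), function(i)
        new(tutorial.Person, name = df$name[i], id = df$id[i]))

    # Inside the pipeline: compact binary payloads
    payloads <- lapply(msgs, function(m) serialize(m, NULL))

    # Outbound: render back into a documented form; the Protocol Buffers
    # text format reads naturally alongside the .proto file that defines it
    cat(as.character(read(tutorial.Person, payloads[[1]])))

Here the .proto file itself doubles as the documentation, but only for someone who knows which .proto file goes with which bytes. That pairing is the task someone on your team has to own.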