Provenance for Database Transformations by Val Tannen. (video)
Description:
Database transformations (queries, views, mappings) take apart, filter,and recombine source data in order to populate warehouses, materialize views,and provide inputs to analysis tools. As they do so, applications often need to track the relationship between parts and pieces of the sources and parts and pieces of the transformations’ output. This relationship is what we call database provenance.
This talk presents an approach to database provenance that relies on two observations. First, provenance is a kind of annotation, and we can develop a general approach to annotation propagation that also covers other applications, for example to uncertainty and access control. In fact, provenance turns out to be the most general kind of such annotation,in a precise and practically useful sense. Second, the propagation of annotation through a broad class of transformations relies on just two operations: one when annotations are jointly used and one when they are used alternatively.This leads to annotations forming a specific algebraic structure, a commutative semiring.
The semiring approach works for annotating tuples, field values and attributes in standard relations, in nested relations (complex values), and for annotating nodes in (unordered) XML. It works for transformations expressed in the positive fragment of relational algebra, nested relational calculus, unordered XQuery, as well as for Datalog, GLAV schema mappings, and tgd constraints. Finally, when properly extended to semimodules it works for queries with aggregates. Specific semirings correspond to earlier approaches to provenance, while others correspond to forms of uncertainty, trust, cost, and access control.
What does happen when you subtract from a merge? (Referenced here as an “aggregation.”)
Although possible to paw through logs to puzzle out a result, Val suggests there are more robust methods at our disposal.
I watched this over the weekend and be forewarned, heavy sledding ahead!
This is an active area of research and I have only begun to scratch the surface for references.
I may discover differently, but the “aggregation” I have seen thus far relies on opaque strings.
Not that all uses of opaque strings are inappropriate, but imagine the power of treating a token as an opaque string for one use case and exploding that same token into key/value pairs for another.
Enjoy!