*If You Have Too Much Data, then “Good Enough” Is Good Enough* by Pat Helland.
This is a must-read article, in which the author concludes:
The database industry has benefited immensely from the seminal work on data theory started in the 1970s. This work changed the world and continues to be very relevant, but it is apparent now that it captures only part of the problem.
We need a new theory and taxonomy of data that must include:
- Identity and versions. Unlocked data comes with identity and optional versions.
- Derivation. Which versions of which objects contributed to this knowledge? How is their schema interpreted? Changes to the source would drive a recalculation, just as in Excel. If a legal reason means the source data may not be used, the knowledge derived from it must be forgotten as well (a toy sketch of this lineage tracking follows the list).
- Lossiness of the derivation. Can we invent a bound that describes the inaccuracies introduced by derived data? Is this a multidimensional inaccuracy? Can we differentiate loss from the inaccuracies caused by sheer size? (See the second sketch below.)
- Attribution by pattern. Just like a Mulligan stew, patterns can be derived from attributes that are derived from patterns (and so on). How can we bound taint from knowledge that we are not legally or ethically supposed to have?
- Classic locked database data. Let’s not forget that any new theory and taxonomy of data should include the classic database as a piece of the larger puzzle.
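To make the identity, derivation, and taint points concrete, here is a minimal sketch. This is my own illustration, not Helland's design, and all names in it are invented: each piece of data carries a stable identity and a version, and derived knowledge records exactly which source versions it came from, so it can be recalculated when a source changes or forgotten when a source becomes off-limits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    identity: str   # stable identity of the object
    version: int    # which version of that object
    value: int      # the payload at this version

@dataclass
class Derived:
    value: int
    sources: list   # exact (identity, version) pairs this knowledge came from

def derive_sum(sources):
    """Derive knowledge from specific source versions, recording lineage."""
    return Derived(value=sum(s.value for s in sources),
                   sources=[(s.identity, s.version) for s in sources])

def must_forget(derived, banned_identity):
    """If a source may no longer be used (e.g. for legal reasons),
    anything derived from it should be forgotten too."""
    return any(ident == banned_identity for ident, _ in derived.sources)

a1 = Version("orders/A", 1, 10)
b3 = Version("orders/B", 3, 32)
total = derive_sum([a1, b3])
print(total.value, total.sources)      # 42 [('orders/A', 1), ('orders/B', 3)]
print(must_forget(total, "orders/B"))  # True: the derived knowledge is tainted
```

Because the lineage is explicit, recalculation works the Excel way: when a new version of `orders/B` appears, anything whose `sources` mention `orders/B` is stale and can be rederived.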
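In the same spirit, a lossy derivation could carry an explicit bound on its inaccuracy alongside its value. A toy sketch, assuming a sampled sum with a rough normal-approximation confidence interval (one possible bounding, not the taxonomy Helland is asking for):

```python
import math
import random

def estimate_total(population, sample_frac=0.1):
    """Estimate sum(population) from a sample and return the estimate
    together with a rough 95% error bound (normal approximation)."""
    n = max(2, int(len(population) * sample_frac))
    sample = random.sample(population, n)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    estimate = len(population) * mean              # scale the sample mean up
    stderr = len(population) * math.sqrt(var / n)  # SE of the scaled total
    return estimate, 1.96 * stderr                 # value plus its inaccuracy

data = list(range(10_000))
est, bound = estimate_total(data)
print(f"{est:.0f} +/- {bound:.0f} (true total: {sum(data)})")
```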
The example of data relativity was particularly good: each data system has its own local “now,” which may not be consistent with the state at some other location.
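A toy model of that relativity (again my own illustration): two replicas each answer reads from their local “now,” and both answers are locally valid even while they disagree.

```python
class Replica:
    """Each replica answers reads from its own local 'now'."""
    def __init__(self, name):
        self.name = name
        self.versions = {}  # identity -> latest version seen here

    def apply(self, identity, version):
        self.versions[identity] = max(self.versions.get(identity, 0), version)

    def read(self, identity):
        return self.versions.get(identity, 0)

east, west = Replica("east"), Replica("west")
east.apply("cart/42", 7)     # an update lands in the east first
print(east.read("cart/42"))  # 7 -- east's "now"
print(west.read("cart/42"))  # 0 -- west's "now" lags, yet is locally valid
west.apply("cart/42", 7)     # replication eventually brings west up to date
print(west.read("cart/42"))  # 7 -- the two "nows" converge
```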