Field Notes: Building Data Dictionaries by Caitlin Hudon.
From the post:
The scariest ghost stories I know take place when the history of data — how it’s collected, how it’s used, and what it’s meant to represent — becomes an oral history, passed down as campfire stories from one generation of analysts to another like a spooky game of telephone.
These stories include eerie phrases like “I’m not sure where that comes from”, “I think that broke a few years ago and I’m not sure if it was fixed”, and the ever-ominous “the guy who did that left”. When hearing these stories, one can imagine that a written history of the data has never existed — or if it has, it’s overgrown with ivy and tech-debt in an isolated statuary, never to be used again.
The best defense I’ve found against relying on an oral history is creating a written one.
Enter the data dictionary. A data dictionary is a “centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format”, and provides us with a framework to store and share all of the institutional knowledge we have about our data.
…
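To make that definition concrete, here is a minimal sketch of a single data dictionary entry organized around the elements the quoted definition names (meaning, relationships to other data, origin, usage, and format). The field names and example values are mine, not Hudon's.

```python
# A hypothetical data dictionary entry; the attribute names mirror the
# elements in the definition quoted above, the example values are invented.
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    name: str      # column name as it appears in the data
    meaning: str   # what the value is meant to represent
    origin: str    # where the data comes from (system, table, job)
    usage: str     # how analysts actually use it
    format: str    # type, units, or allowed values
    related_to: list[str] = field(default_factory=list)  # relationships to other fields


entry = DictionaryEntry(
    name="signup_date",
    meaning="Date the customer account was created",
    origin="production users table, nightly export",
    usage="Cohort analysis and retention reporting",
    format="ISO 8601 date (YYYY-MM-DD)",
    related_to=["customer_id", "first_order_date"],
)
```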
Unless you have taken over the administration of an undocumented network, you cannot really appreciate Hudon’s statement:
…
As part of my role as a lead data scientist at a start-up, building a data dictionary was one of the first tasks I took on (started during my first week on the job).
…
I have taken over undocumented Novell networks and custom-written membership systems. They didn't remain that way, but moving to fully documented systems was perilous and time-consuming.
The first task in any such position is to verify the existing data dictionary, or to build one if it doesn't exist. No other task, except maybe the paperwork for HR so you can get paid, is more important.
Hudon’s outline of her data dictionary process is as good as any, but it doesn’t allow for variant and/or possibly conflicting data dictionaries, or for detecting when “variants” are only apparent and not real.
Where Hudon has Field notes, consider recording structured properties that you can then query for “merging” purposes (there’s a sketch of what I mean at the end of this post).
It’s not necessary to work out how to merge all the other fields automatically, especially if you are exploring data or data dictionaries.
Or to put it differently, not every topic map results in a final, publishable, editorial product. Sometimes you only want enough subject identity to improve your data set or results. That’s not a crime.
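Here is what I have in mind by structured properties, as a rough sketch in Python. The choice of name and origin as the identity key, and the merge rule itself, are my assumptions rather than anything in Hudon's process; the point is only that properties you can query let you ask, from the dictionaries themselves, whether two entries describe the same field and where their descriptions genuinely conflict.

```python
# Sketch: merge two variant data dictionaries on a queryable identity key
# and report conflicts. Property names and the key are assumptions.

def identity_key(entry: dict) -> tuple:
    """Properties treated as establishing the identity of a field."""
    return (entry.get("origin"), entry.get("name"))


def merge_dictionaries(variant_a: list[dict], variant_b: list[dict]):
    """Merge variant dictionaries; return (merged entries, conflicts).

    A conflict records a field the variants both describe but describe
    differently -- a "variant" that is real, not merely apparent.
    """
    merged = {identity_key(e): dict(e) for e in variant_a}
    conflicts = []

    for entry in variant_b:
        key = identity_key(entry)
        if key not in merged:
            merged[key] = dict(entry)          # field only variant B knows about
            continue
        existing = merged[key]
        for prop, value in entry.items():
            if prop in existing and existing[prop] != value:
                conflicts.append((key, prop, existing[prop], value))
            else:
                existing.setdefault(prop, value)

    return list(merged.values()), conflicts


# Example: same field, same origin, but the two dictionaries disagree on format.
a = [{"name": "signup_date", "origin": "users table", "format": "YYYY-MM-DD"}]
b = [{"name": "signup_date", "origin": "users table", "format": "Unix timestamp"}]
merged, conflicts = merge_dictionaries(a, b)
# conflicts == [(("users table", "signup_date"), "format", "YYYY-MM-DD", "Unix timestamp")]
```

Whether two entries with the same key really describe the same subject is still a judgment call. The structured properties just make the question queryable instead of buried in free-text notes.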