From the post:
Open data repositories are fantastic for many reasons, including: (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets; (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses; (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data; and (5) they enable numerous “data for social good” activities (hackathons, citizen-focused innovations, public development efforts, and more).
The following seven V’s represent characteristics and challenges of open data:
- Validity: data quality, proper documentation, and data usefulness are always an imperative, but it is even more critical to pay attention to these data validity concerns when your organization’s data are exposed to scrutiny and inspection by others.
- Value: new ideas, new businesses, and innovations can arise from the insights and trends that are found in open data, thereby creating new value both internal and external to the organization.
- Variety: the number of data types, formats, and schema are as varied as the number of organizations who collect data. Exposing this enormous variety to the world is a scary proposition for any data scientist.
- Voice: your open data becomes the voice of your organization to your stakeholders (including customers, clients, employees, sponsors, and the public).
- Vocabulary: the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use. Search, discovery, and proper reuse of data all require good metadata, descriptions, and data modeling.
- Vulnerability: the frequency of data theft and hacking incidents has increased dramatically in recent years — and this is for data that are well protected. The likelihood that your data will be compromised is even greater when the data are released “into the wild”. Open data are therefore much more vulnerable to misuse, abuse, manipulation, or alteration.
- proVenance (okay, this is a “V” in the middle, but provenance is absolutely central to data curation and validity, especially for Open Data): maintaining a formal permanent record of the lineage of open data is essential for its proper use and understanding. Provenance includes ownership, origin, chain of custody, transformations that been made to it, processing that has been applied to it (including which versions of processing software were used), the data’s uses and their context, and more.
Open Data has many benefits when the 7 V’s are answered!
Kirk doesn’t address who pay the cost of the 7 V’s being answered.
The most obvious one for topic maps:
#5 Vocabulary: the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use….
Yes, “…when you provide the data for others to use.” If I can use my data without documenting the semantics and schema (data models), who covers the cost of my creating that documentation and schemas?
In any sufficiently large enterprise, when you ask for assistance, the response will ask for the contract number to which the assistance should be billed.
If you know your Heinlein, then you know the acronym TANSTaaFL (“There ain’t no such thing as a free lunch”) and its application here is obvious.
Or should I say its application is obvious from the repeated calls for better documentation and models and the continued absence of the same?
Who do you think should be paying for better documentation and data models?