What’s all the fuss about Dark Data? Big Data’s New Best Friend by Martyn Jones.
From the post:
Dark data, what is it and why all the fuss?
First, I’ll give you the short answer. The right dark data, just like its brother right Big Data, can be monetised – honest, guv! There’s loadsa money to be made from dark data by ‘them that want to’, and as value propositions go, seriously, what could be more attractive?
Let’s take a look at the market.
Gartner defines dark data as "the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes" (IT Glossary – Gartner)
Techopedia describes dark data as being data that is "found in log files and data archives stored within large enterprise class data storage locations. It includes all data objects and types that have yet to be analyzed for any business or competitive intelligence or aid in business decision making." (Techopedia – Cory Jannsen)
Cory also wrote that "IDC, a research firm, stated that up to 90 percent of big data is dark data."
In an interesting whitepaper from C2C Systems it was noted that "PST files and ZIP files account for nearly 90% of dark data by IDC Estimates." and that dark data is "Very simply, all those bits and pieces of data floating around in your environment that aren’t fully accounted for:" (Dark Data, Dark Email – C2C Systems)
Elsewhere, Charles Fiori defined dark data as "data whose existence is either unknown to a firm, known but inaccessible, too costly to access or inaccessible because of compliance concerns." (Shedding Light on Dark Data – Michael Shashoua)
Not quite the last insight, but in a piece published by Datameer, John Nicholson wrote that "Research firm IDC estimates that 90 percent of digital data is dark." And went on to state that "This dark data may come in the form of machine or sensor logs" (Shine Light on Dark Data – Joe Nicholson via Datameer)
Finally, Lug Bergman of NGDATA wrote this in a sponsored piece in Wired: "It" – dark data – "is different for each organization, but it is essentially data that is not being used to get a 360 degree view of a customer."
Well, I would say that 90% of 2.7 zettabytes of data (as of last October) being dark is reason enough to be concerned.
But, like the Wizard of Oz, Martyn knows what you are lacking, and that is a data inventory:
…
You don’t need a Chief Data Officer in order to be able to catalogue all your data assets. However, it is still a good idea to have a reliable inventory of all your business data, including the euphemistically termed Big Data and dark data.
If you have such an inventory, you will know:
What you have, where it is, where it came from, what it is used in, what qualitative or quantitative value it may have, and how it relates to other data (including metadata) and the business.
…
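For illustration only, here is a minimal sketch of what one entry in such an inventory might look like. Every field name below is my own invention, not Martyn’s, and a real catalogue would carry far more detail:

```python
# Sketch of a single data-inventory entry covering the points in the quote:
# what you have, where it is, where it came from, what it is used in, what
# value it may have, and how it relates to other data. All names are invented.

from dataclasses import dataclass, field

@dataclass
class InventoryEntry:
    name: str                        # what you have
    location: str                    # where it is
    source: str                      # where it came from
    used_in: list[str] = field(default_factory=list)     # what it is used in
    value_notes: str = ""            # qualitative or quantitative value, if known
    related_to: list[str] = field(default_factory=list)  # links to other data and metadata

orders_archive = InventoryEntry(
    name="orders_2019.zip",
    location="nas01:/archive/sales/",
    source="legacy order-entry system export",
    used_in=["annual revenue restatement"],
    value_notes="unknown until someone opens it",
    related_to=["customers.csv", "order_schema.xsd"],
)
print(orders_archive)
```

Filling in even this toy record presupposes that someone already understands the data well enough to describe it, which is precisely the hard part.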
Really? A data inventory? What a relief to know the MDM (master data management) folks have been struggling for the past two decades for no reason. All they needed was a data inventory!
You might want to recall AnHai Doan’s observation about a single enterprise mapping project:
…the manual creation of semantic mappings has long been known to be extremely laborious and error-prone. For example, a recent project at the GTE telecommunications company sought to integrate 40 databases that have a total of 27,000 elements (i.e., attributes of relational tables) [LC00]. The project planners estimated that, without the database creators, just finding and documenting the semantic mappings among the elements would take more than 12 person years.
That’s right. One enterprise, 40 databases, 12 person-years.
How does that scale to 2.7 zettabytes? Person-years per element × the number of elements in 2.7 zettabytes = ???; no one knows.
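Purely as a back-of-envelope illustration of why “no one knows,” here is the arithmetic. The elements-per-megabyte conversion is an assumption I invented for the sake of the sketch, and that invented factor is exactly the missing piece:

```python
# Back-of-envelope scaling of the GTE mapping effort using AnHai Doan's figures.
# The bytes-to-elements conversion below is a pure assumption, invented for
# illustration -- which is the point: nobody knows the real factor.

GTE_ELEMENTS = 27_000       # attributes across 40 databases
GTE_PERSON_YEARS = 12       # estimated effort just to document the mappings

person_years_per_element = GTE_PERSON_YEARS / GTE_ELEMENTS
print(f"~{person_years_per_element:.6f} person-years per element")
# roughly 2,250 elements mapped per person-year

# Extrapolating to 2.7 zettabytes requires knowing how many "elements" a
# zettabyte contains. Assume, arbitrarily, one element per megabyte of data:
MEGABYTES_PER_ZETTABYTE = 10**15                 # 1 ZB = 10^21 bytes = 10^15 MB
elements_guess = 2.7 * MEGABYTES_PER_ZETTABYTE   # wildly speculative
print(f"~{elements_guess * person_years_per_element:.2e} person-years (guess)")
```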
Oh, why did I drop the 90% “dark data” figure here? Simple enough: the data AnHai was mapping wasn’t entirely “dark.” At least it had headers that were meaningful to someone. Unstructured data has no headers at all.
What is Martyn missing?
The measure of data’s darkness is how much is known about it, not whether it is used.
But supplying opaque terms (all terms are opaque to someone) for data only puts you into the situation AnHai describes. Either you enlist people who know the meanings of the terms, or you create new meanings for them from scratch, hoping in the latter case to approximate the original meanings assigned to the terms.
If you want to improve on opaque terms, you need to supply alternative opaque terms that some future user may recognize in place of the primary opaque term you would otherwise use.
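Here is a minimal sketch of that idea, with field and vocabulary names I have invented: attach alternative labels, each tied to the vocabulary it comes from, so a future user who recognizes any one of them can still find the data.

```python
# Sketch: a data element carrying alternative names alongside its primary name.
# Field names and vocabulary names are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class DataElement:
    primary_term: str                                         # the term you would otherwise use alone
    alt_terms: dict[str, str] = field(default_factory=dict)   # vocabulary -> alternative term

    def matches(self, query: str) -> bool:
        """True if the query names this element by its primary or any alternative term."""
        q = query.lower()
        return q == self.primary_term.lower() or any(
            q == term.lower() for term in self.alt_terms.values()
        )

cust_id = DataElement(
    primary_term="CUST_NO",
    alt_terms={
        "warehouse_schema": "customer_key",
        "crm_export": "AccountNumber",
        "plain_english": "customer identifier",
    },
)

print(cust_id.matches("AccountNumber"))   # True: a future user's term still resolves
print(cust_id.matches("client_id"))       # False: no recorded alternative covers it
```

How many alternatives are enough for any given future user is the open question raised next.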
Make no mistake: it isn’t possible to escape opacity, but you can improve the odds that your data will be useful at some future point in time. How many alternatives it takes to reach some degree of future usefulness isn’t known.
So far as I know, the question hasn’t been researched. Every new set of opaque terms (read ontology, classification, controlled vocabulary) presents itself as possessing semantics for the ages. Given the number of such efforts, I find their confidence misplaced.