Shedding Light on the Dark Data in the Long Tail of Science by P. Bryan Heidorn. (P. Bryan Heidorn. “Shedding Light on the Dark Data in the Long Tail of Science.” Library Trends 57.2 (2008): 280-299. Project MUSE. Web. 28 Feb. 2013.)
Abstract:
One of the primary outputs of the scientific enterprise is data, but many institutions such as libraries that are charged with preserving and disseminating scholarly output have largely ignored this form of documentation of scholarly activity. This paper focuses on a particularly troublesome class of data, termed dark data. “Dark data” is not carefully indexed and stored so it becomes nearly invisible to scientists and other potential users and therefore is more likely to remain underutilized and eventually lost. The article discusses how the concepts from long-tail economics can be used to understand potential solutions for better curation of this data. The paper describes why this data is critical to scientific progress, some of the properties of this data, as well as some social and technical barriers to proper management of this class of data. Many potentially useful institutional, social, and technical solutions are under development and are introduced in the last sections of the paper, but these solutions are largely unproven and require additional research and development.
From the article:
In this paper we will use the term dark data to refer to any data that is not easily found by potential users. Dark data may be positive or negative research findings or from either “large” or “small” science. Like dark matter, this dark data on the basis of volume may be more important than that which can be easily seen. The challenge for science policy is to develop institutions and practices such as institutional repositories, which make this data useful for society.
Dark Data = Any data that is not easily found by potential users.
A number of causes are discussed, not the least of which is our old friend, the Tower of Babel.
A final barrier that cannot be overlooked is the Digital Tower of Babel that we have created with seemingly countless proprietary as well as open data formats. This can include versions of the same software products that are incompatible. Some of these formats are very efficient for the individual applications for which they were designed including word processing, databases, spreadsheets, and others, but they are ineffective to support interoperability and preservation.
As you know already, I don’t think the long-term answer to data curation lies in uniform formats.
Uniform formats are very useful, but they are domain-, project-, and time-bound.
The questions always are:
“What do we do when we change data formats?”
“Do we dump data in old formats that we spent $$$ developing?”
“Do we migrate data in old formats, assuming anyone remembers the old format?”
“Do we document and map across old and new formats, preparing for the next ‘new’ format?” (A sketch of this option appears at the end of this post.)
None of the answers are automatic or free.
But it is better to make an informed choice than the default one of letting potentially valuable data rot.
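To make the “document and map” option concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the file names, the legacy column codes, and the assumption that the legacy export is tab-delimited in Latin-1. The point is the explicit mapping table, which doubles as human-readable documentation that can survive the next format change.

import csv

# Hypothetical crosswalk from legacy column codes to documented,
# open-format field names. These codes are invented for illustration.
FIELD_MAP = {
    "SPEC_NO": "specimen_number",
    "COLL_DT": "collection_date",
    "LAT_DD": "latitude_decimal_degrees",
    "LON_DD": "longitude_decimal_degrees",
}

def migrate(legacy_path, open_path):
    """Read a tab-delimited legacy export and write a documented CSV.

    Unmapped legacy columns are carried over under their original names,
    so nothing is silently dropped during migration.
    """
    with open(legacy_path, newline="", encoding="latin-1") as src:
        reader = csv.DictReader(src, delimiter="\t")
        out_fields = [FIELD_MAP.get(f, f) for f in reader.fieldnames]
        with open(open_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=out_fields)
            writer.writeheader()
            for row in reader:
                writer.writerow({FIELD_MAP.get(k, k): v for k, v in row.items()})

if __name__ == "__main__":
    migrate("legacy_specimens.tsv", "specimens_open.csv")

None of this makes the choice free, but a mapping like FIELD_MAP is cheap to write while someone still remembers what SPEC_NO meant, and that is exactly the informed choice I mean.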