Metadata Management in Scientific Computing by Eric L. Seidel.
Abstract:
Complex scientific codes and the datasets they generate are in need of a sophisticated categorization environment that allows the community to store, search, and enhance metadata in an open, dynamic system. Currently, data is often presented in a read-only format, distilled and curated by a select group of researchers. We envision a more open and dynamic system, where authors can publish their data in a writeable format, allowing users to annotate the datasets with their own comments and data. This would enable the scientific community to collaborate on a higher level than before, where researchers could for example annotate a published dataset with their citations.
Such a system would require a complete set of permissions to ensure that any individual’s data cannot be altered by others unless they specifically allow it. For this reason datasets and codes are generally presented read-only, to protect the author’s data; however, this also prevents the type of social revolutions that the private sector has seen with Facebook and Twitter.
In this paper, we present an alternative method of publishing codes and datasets, based on Fluidinfo, which is an openly writeable and social metadata engine. We will use the specific example of the Einstein Toolkit, a shared scientific code built using the Cactus Framework, to illustrate how the code’s metadata may be published in writeable form via Fluidinfo.
The proposal has several interesting aspects, such as nodes that collect tags but carry no ownership or semantics of their own, and metadata that others can write as well as read. I would not allow recorded metadata to change once it has been written, but that is a debatable point.
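To make the ownership idea concrete, here is a minimal in-memory sketch in Python of nodes that have no owner while each tag has one. It is not the Fluidinfo API or the paper's implementation. The class names (`Tag`, `Store`), the `writers` set, and the example user, thorn, and tag names are all hypothetical, chosen only to illustrate the permission model described in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    """A tag such as 'alice/cited-in', owned by the user who created it (hypothetical model)."""
    owner: str
    name: str
    writers: set = field(default_factory=set)  # other users the owner allows to write

    @property
    def path(self):
        return f"{self.owner}/{self.name}"

    def can_write(self, user):
        # Only the owner, or users the owner has explicitly allowed, may write.
        return user == self.owner or user in self.writers


class Store:
    """Nodes have no owner: each is just a bag of tag values keyed by an 'about' string."""

    def __init__(self):
        self.objects = {}  # about string -> {tag path -> value}

    def set_tag(self, user, about, tag, value):
        if not tag.can_write(user):
            raise PermissionError(f"{user} may not write {tag.path}")
        self.objects.setdefault(about, {})[tag.path] = value

    def query(self, tag_path):
        """Return the 'about' strings of every node carrying the given tag."""
        return sorted(a for a, tags in self.objects.items() if tag_path in tags)


store = Store()
author_tag = Tag(owner="einsteintoolkit", name="thorn-description")  # hypothetical tag
reader_tag = Tag(owner="alice", name="cited-in")                     # hypothetical tag

# The author and a reader both annotate the same unowned node.
store.set_tag("einsteintoolkit", "thorn:ExampleThorn", author_tag, "hypothetical description")
store.set_tag("alice", "thorn:ExampleThorn", reader_tag, "hypothetical citation")

# The reader cannot overwrite the author's tag unless the author allows it.
try:
    store.set_tag("alice", "thorn:ExampleThorn", author_tag, "overwritten")
    blocked = False
except PermissionError:
    blocked = True
```

In this sketch anyone can attach their own tags to any node, which is what lets a reader add a citation to a published dataset, while the author's values stay protected because permissions sit on the tag rather than on the node.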
This is a rare paper that concludes:
Scientific research is increasingly dependent on the simulation of complex processes and, by extension, on the ability to organize, search, and refer to the datasets generated by simulations. We propose using writable metadata to distribute and maintain scientific metadata, and have shown one possible method of implementing such a system. More work will be required to investigate alternative systems, schemas, and interfaces, as well as to determine what would be an optimal solution. We hope that the scientific community will take this opportunity to start a conversation about how to manage the large amounts of data currently being generated by our research on a daily basis.
A little spirit of continuing investigation goes a long way.