Data Capture for the Real World by Cameron Neylon.
From the post:
Many efforts at building data infrastructures for the “average researcher” have been funded, designed and in some cases even built. Most of them have limited success. Part of the problem has always been building systems that solve problems that the “average researcher” doesn’t know that they have. Issues of curation and metadata are so far beyond the day to day issues that an experimental researcher is focussed on as to be incomprehensible. We clearly need better tools, but they need to be built to deal with the problems that researchers face. This post is my current thinking on a proposal to create a solution that directly faces the researcher, but offers the opportunity to address the broader needs of the community. What is more it is designed to allow that average researcher to gradually realise the potential of better practice and to create interfaces that will allow technical systems to build out better systems.
Solve the immediate problem – better backups
The average experimental lab consists of lab benches where “wet work” is done and instruments that are run off computers. Sometimes the instruments are in different rooms, sometimes they are shared. Sometimes they are connected to networks and backed up, often they are not. There is a general pattern of work – samples are created through some form of physical manipulation and then placed into instruments which generate digital data. That data is generally stored on a local hard disk. This is by no means comprehensive but it captures a large proportion of the work.
The problem a data manager or curator sees here is one of cataloguing the data created, creating a schema that represents where it came from and what it is. We build ontologies and data models and repositories to support them to solve the problem of how all these digital objects relate to each other.
The problem a researcher sees is that the data isn’t backed up. More than that, it’s hard to back up because institutional systems and charges make it hard to use the central provision (“it doesn’t fit our unique workflows/datatypes”) and block what appears to be the easiest solution (“why won’t central IT just let me buy a bunch of hard drives and keep them in my office?”). An additional problem is data transfer – the researcher wants the data in the right place, a problem generally solved with a USB drive. Networks are often flaky, or not under the control of the researcher, so they use what is to hand to transfer data from instrument to their working computer.
The challenge therefore is to build systems under group/researcher control that meet the needs for backup and easy file transfer. At the same time they should at least start to solve the metadata capture problem and satisfy the requirements of institutional IT providers.
…
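The proposal above – solve the researcher's backup problem first, and capture metadata as a free by-product – can be sketched in a few lines. The code below is a minimal illustration, not anything from Cameron's post: a hypothetical `backup_file` helper that copies an instrument's output file to a backup area and writes a JSON sidecar recording the provenance metadata that can be captured without the researcher typing anything.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def backup_file(src, dest_dir, instrument):
    """Copy one instrument output file to the backup area and write a
    JSON sidecar with the metadata we can capture for free."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    dest = dest_dir / src.name
    shutil.copy2(src, dest)  # copy2 preserves the file's timestamps

    # Metadata the researcher never has to enter by hand: where the
    # file came from, which instrument produced it, when it was backed
    # up, and a checksum for verifying the copy later.
    sidecar = {
        "original_path": str(src.resolve()),
        "instrument": instrument,
        "backed_up_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(src.read_bytes()).hexdigest(),
    }
    sidecar_path = dest.parent / (dest.name + ".json")
    sidecar_path.write_text(json.dumps(sidecar, indent=2))
    return dest
```

A real system would run something like this automatically (e.g. watching the instrument PC's output folder), so the backup – and the catalogue of what came from where – accumulates with zero extra work at the bench.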
Cameron goes on to make a great plea for approaching data collection from labs starting with the most basic need: backups. Sure, data needs metadata, standard formats, etc., but those are secondary concerns (if that) to the researchers generating the data.
Only backed-up data is likely to persist long enough for us to be concerned about metadata and standard formats. Even there Cameron argues that researchers need to see the pay-off from metadata before we can expect them to enter it. Formats are more a matter of data interchange and not a problem for local data.
Cameron’s payoff argument alludes to something that isn’t often discussed. From the perspective of a metadata person, metadata for data is extremely important, but they are not the person being asked to capture the metadata. From the perspective of a format person, an interchangeable format for data is extremely important, but they are not the person being asked to use the “correct” format.
The point is that we are all quite free with the time of others. That is, we have all manner of suggestions that increase the workload of others, and we not only expect them to adopt those suggestions but to be grateful we pointed out the error of their ways. That’s expecting a bit much.
As you know, metadata and formats are only two of many data issues that are very near and dear to me. But focusing on the failure of scientists to pay attention to such matters isn’t going to be as effective as creating tools that help scientists with their day-to-day work and return benefits to them. That is a much easier sell for issues that are chiefly of interest to others.
I first saw this in Nat Torkington’s Four short links: 19 November 2014.