Fifteen ideas about data validation (and peer review)
From the post:
Many open issues drift around data publication, but validation is both the biggest and the haziest. Some form of validation at some stage in a data publication process is essential; data users need to know that they can trust the data they want to use, data creators need a stamp of approval to get credit for their work, and the publication process must avoid getting clogged with unusable junk. However, the scientific literature’s validation mechanisms don’t translate as directly to data as its mechanism for, say, citation.
This post is in part a very late response to a data publication workshop I attended last February at the International Digital Curation Conference (IDCC). In a breakout discussion of models for data peer review, there were far more ideas about data review than time to discuss them. Here, for reference purposes, is a longish list of non-parallel, sometimes-overlapping ideas about how data review, validation, or quality assessment could or should work. I’ve tried to stay away from deeper consideration of what data quality means (which I’ll discuss in a future post) and from the broader issues of peer review associated with the literature, but they inevitably pop up anyway.
A good starting point for discussion of data validation concerns.
Perfect data would be preferred but let’s accept that perfect data is possible only for trivial or edge cases.
If you start by talking about non-perfect data, it may be easier to see what happens when non-perfect data makes a system fail. What are the consequences of that failure, for the data owner as well as for others? Are those consequences acceptable?
Make those decisions up front and document them as part of planning data validation.
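One way to picture "decided up front and documented" is a validation plan where every check carries a written consequence and a prior decision about whether that consequence is acceptable. What follows is only a rough sketch under my own assumptions: the names `Check`, `validate`, and `PLAN`, the two example checks, and the sample records are all hypothetical and do not come from the post.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    """One planned validation check (hypothetical structure)."""
    name: str
    passes: Callable[[dict], bool]
    consequence: str  # documented up front: what goes wrong if the check fails
    blocking: bool    # decided up front: True means the consequence is unacceptable


def validate(records, checks):
    """Split records into accepted and rejected, and collect warnings.

    A record is rejected if it fails any blocking check. Failures of
    non-blocking checks are tolerated and reported as warnings.
    """
    accepted, rejected, warnings = [], [], []
    for record in records:
        failed = [c for c in checks if not c.passes(record)]
        if any(c.blocking for c in failed):
            rejected.append(record)
        else:
            accepted.append(record)
        for c in failed:
            if not c.blocking:
                warnings.append((record, c.name, c.consequence))
    return accepted, rejected, warnings


# The plan itself is the documentation: each imperfection has a stated cost.
PLAN = [
    Check("has_id", lambda r: bool(r.get("id")),
          "record cannot be cited or linked", blocking=True),
    Check("has_units", lambda r: bool(r.get("units")),
          "values may be misread by later users", blocking=False),
]

# Illustrative records only, not real data.
records = [
    {"id": "a1", "value": 3.2, "units": "mm"},
    {"id": "a2", "value": 4.1},
    {"value": 5.0, "units": "mm"},
]

accepted, rejected, warnings = validate(records, PLAN)
for record, name, consequence in warnings:
    print(f"accepted with warning: {name} failed ({consequence})")
```

The point of the sketch is that non-perfect data is expected: the record missing units is still accepted, with its known cost written down, while the record without an identifier is rejected because that failure was judged unacceptable in advance.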