Looking after Datasets by Antony Unwin.
Some examples that Antony uses to illustrate the problems with datasets in R:
…
You might think that supplying a dataset in an R package would be a simple matter: You include the file, you write a short general description mentioning the background and giving the source, you define the variables. Perhaps you provide some sample analyses and discuss the results briefly. Kevin Wright's agridat package is exemplary in these respects.
As it happens, there are a couple of other issues that turn out to be important. Is the dataset or a version of it already in R and is the name you want to use for the dataset already taken? At this point the experienced R user will correctly guess that some datasets have the same name but are quite different (e.g., movies, melanoma) and that some datasets appear in many different versions under many different names. The best example I know is the Titanic dataset, which is available in the datasets package. You will also find titanic (COUNT, prLogistic, msme), titanic.dat (exactLoglinTest), titan.Dat (elrm), titgrp (COUNT), etitanic (earth), ptitanic (rpart.plot), Lifeboats (vcd), TitanicMat (RelativeRisk), Titanicp (vcdExtra), TitanicSurvival (effects), Whitestar (alr4), and one package, plotrix, includes a manually entered version of the dataset in one of its help examples. The datasets differ on whether the crew is included or not, on the number of cases, on information provided, on formatting, and on discussion, if any, of analyses. Versions with the same names in different packages are not identical. There may be others I have missed.
The issue came up because I was looking for a dataset of the month for the website of my book "Graphical Data Analysis with R". The plan is to choose a dataset from one of the recently released or revised R packages and publish a brief graphical analysis to illustrate and reinforce the ideas presented in the book while showing some interesting information about the data. The dataset finch in dynRB looked rather nice: five species of finch with nine continuous variables and just under 150 cases. It looked promising and what’s more it is related to Darwin’s work and there was what looked like an original reference from 1904.
…
As if Antony’s list of issues wasn’t enough, how do you capture your understanding of a problem with a dataset?
That is, you have discovered the meaning of a variable that isn’t recorded with the dataset. Where are you going to put that information?
You could modify the original dataset to capture that new information, but then people would have to discover your version of it. You would also need to avoid stepping on anything else in the original dataset.
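One alternative, offered only as a rough sketch, is to leave the original dataset untouched and keep what you have learned in a separate annotation table keyed by dataset and variable name. Everything in the example is hypothetical: the birds data frame is a made-up stand-in rather than any packaged dataset, and the notes table and notes_for() helper are names invented for illustration.

```r
# Hypothetical stand-in for a packaged dataset (made-up values).
birds <- data.frame(
  species = c("A", "B", "A"),
  wing    = c(62, 70, 64)
)

# Sidecar annotation table: one row per note, keyed by dataset and variable.
# The note text is an invented example of context discovered by a reader.
notes <- data.frame(
  dataset  = "birds",
  variable = "wing",
  note     = "Assumed to be wing length in mm; unit not stated with the data.",
  source   = "reader's inference",
  stringsAsFactors = FALSE
)

# Return the notes recorded for one variable of one dataset.
notes_for <- function(notes, dataset, variable) {
  notes[notes$dataset == dataset & notes$variable == variable, , drop = FALSE]
}

wing_notes <- notes_for(notes, "birds", "wing")
print(wing_notes$note)
```

Because the notes live beside the data rather than inside it, the original stays byte-for-byte what the package shipped, and several people can contribute notes without producing yet another version of the dataset.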
Antony concludes:
…returning to Moore’s definition of data, wouldn’t it be a help to distinguish proper datasets from mere sets of numbers in R?
Most people have an intersecting idea of a “proper dataset” but I would spend less time trying to define that and more on capturing the context of whatever appears to me to be a “proper dataset.”
More data is never a bad thing.