10 Simple rules for design, provision, and reuse of persistent identifiers for life science data by Julie A. McMurry, et al. (35 others).
From the introduction:
When we interact, we use names to identify things. Usually this works well, but there are many familiar pitfalls. For example, the “morning star” and “evening star” are both names for the planet Venus. “The luminiferous ether” is a name for an entity which no one still thinks exists. There are many women named “Margaret”, some of whom go by “Maggie” and some of whom have changed their surnames. We use everyday conversational mechanisms to work around these problems successfully. Naming problems have plagued the life sciences since Linnaeus pondered the Norway spruce; in the much larger conversation that underlies the life sciences, problems with identifiers (Box 1) impede the flow and integrity of information. This is especially challenging within “synthesis research” disciplines such as systems biology, translational medicine, and ecology. Implementation-driven initiatives such as ELIXIR, BD2K, and others (Text S1) have therefore been actively working to understand and address underlying problems with identifiers.
Good, global-scale, persistent identifier design is harder than it appears, and is essential for data to be Findable, Accessible, Interoperable, and Reusable (Data FAIRport principles [1]). Digital entities (e.g., files), physical entities (e.g., biosamples), and descriptive entities (e.g., ‘mitosis’) have different requirements for identifiers. Identifiers are further complicated by imprecise terminology and different forms (Box 1).
Of the identifier forms, Local Resource Identifiers (LRI) and their corresponding full Uniform Resource Identifiers (URIs) are still among the most commonly used and most problematic identifiers in the bio-data ecosystem. Other forms of identifiers such as Uniform Resource Name (URNs) are less impactful because of their current lack of uptake. Here, we build on emerging conventions and existing general recommendations [2,3] and summarise the identifier characteristics most important to optimising the flow and integrity of life-science data (Table 1). We propose actions to take in the identifier ‘green field’ and offer guidance for using real-world identifiers from diverse sources.
…
Truth be told, global, persistent identifier design is overreaching.
First, some identifiers are more widely used than others, but there are no globally accepted identifiers of any sort.
Second, “persistent” is undefined. Present identifiers (CURIEs or URIs) have not preserved pre-Web identifiers. On what basis would you claim that future generations will preserve ours?
However, systems expect to make references by single, opaque identifiers, and so the hunt goes on for a single identifier.
The more robust and in fact persistent approach is to have a bag of identifiers for any subject, where each identifier itself has a bag of properties associated with it.
That avoids excluding old identifiers, and hence historical records, and avoids pre-excluding future identifiers, which will come into use long after our identifier is no longer the most popular one.
Systems can continue to use a single identifier locally, as it were, but software where semantic integration is important should use sets of identifiers to facilitate integration across data sources.
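To make the "bag of identifiers" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `Subject` class, its method names, and the `scheme` and `source` property keys are all hypothetical, and "two records denote the same subject if they share any identifier" is a deliberately naive matching rule, not a recommendation.

```python
from dataclasses import dataclass, field


@dataclass
class Subject:
    """A subject known by a bag of identifiers, each carrying its own bag of properties."""

    # Maps identifier string -> properties of that identifier (hypothetical keys).
    identifiers: dict = field(default_factory=dict)

    def add_identifier(self, value: str, **properties: str) -> None:
        # Existing identifiers are kept; new properties are merged into their bag.
        self.identifiers.setdefault(value, {}).update(properties)

    def same_subject(self, other: "Subject") -> bool:
        # Naive assumption: sharing any one identifier means the same subject.
        return not self.identifiers.keys().isdisjoint(other.identifiers.keys())

    def merge(self, other: "Subject") -> "Subject":
        # Union of both bags, so neither old nor newer identifiers are excluded.
        merged = Subject()
        for source in (self, other):
            for value, props in source.identifiers.items():
                merged.add_identifier(value, **props)
        return merged


# Two data sources describe the same planet under different names.
record_a = Subject()
record_a.add_identifier("Venus", scheme="planet name")
record_a.add_identifier("morning star", scheme="common name", source="A")

record_b = Subject()
record_b.add_identifier("Venus", scheme="planet name")
record_b.add_identifier("evening star", scheme="common name", source="B")

if record_a.same_subject(record_b):
    combined = record_a.merge(record_b)
    print(sorted(combined.identifiers))
```

A system that needs a single local identifier can still pick one key from the bag, while integration code compares whole sets.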