Identifiers as Shorthand for Identifications

I closed Identifiers vs. Identifications? saying:

Many questions remain, such as how to provide for collections of sets “of properties which provide clues for establishing identity?,” how to make those collections extensible?, how to provide for constraints on such sets?, where to record “matching” (read “merging”) rules?, what other advantages can be offered?

In answering those questions, I think we need to keep in mind that identifiers and identifications lie along a continuum that runs from where we “know” what is meant by an identifier to where we ourselves need a full identification to know what is being discussed. A useful answer won’t be one or the other, but a pairing that suits a particular circumstance and use case.

You can also think of identifiers as a form of shorthand for an identification. If we were working together in a fairly small office, you would probably ask, “Is Patrick in?” rather than listing all the properties that would serve as an identification for me. So all the properties that make up an identification are unspoken but invoked by the use of the identifier.

That works quite well in a small office because, to varying degrees, we all share the identifications that are represented by the identifiers we use in everyday conversation.

That sharing of identifications behind identifiers doesn’t happen in information systems, unless we have explicitly added identifications behind those identifiers.

One problem we need to solve is how to associate an identification with an identifier or identifiers. Looking only slightly ahead, we could use an explicit mechanism like a TMDM association, if we wanted to be able to talk about the subject of the relationship between an identifier and the identification that lies behind it.

But we are not compelled to talk about such a subject and could declare by rule that within a container, an identifier is a shorthand for properties of an identification in the same container. That assumes the identifier is distinguished from the properties that make up the identification. I don’t think we need to reinvent the notions of essential vs. accidental properties but merging rules should call out what properties are required for merging.
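To make the "declare by rule" idea concrete, here is a minimal sketch, not a definitive design. The `Container` class, the `should_merge` function, and the property names (`email`, `desk`) are all hypothetical, invented for illustration. The container keeps identifiers apart from the properties of the identification, and the merging rule names the properties it requires.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Hypothetical container: identifiers kept apart from the identification."""
    identifiers: set[str]
    identification: dict[str, str] = field(default_factory=dict)


def should_merge(a: Container, b: Container, required: list[str]) -> bool:
    """A merging rule calls out which properties must be present and equal."""
    return all(
        key in a.identification
        and key in b.identification
        and a.identification[key] == b.identification[key]
        for key in required
    )


# Illustrative data only; the property names and values are assumptions.
a = Container({"Patrick"}, {"email": "patrick@example.org", "desk": "3"})
b = Container({"P."}, {"email": "patrick@example.org", "desk": "7"})

merge_on_email = should_merge(a, b, ["email"])
merge_on_email_and_desk = should_merge(a, b, ["email", "desk"])
```

The same two containers merge under one rule and not under the other, which is the point: the rule, not some notion of essential properties, decides what is required.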

The wary reader will have suspected before now that many (if not all) of the terms in such a container could be considered as identifiers in and of themselves. Suddenly we are struggling uphill from a swamp of subject recursion. It is “elephants all the way down.”

Have no fear! Just as we can avoid using TMDM associations to mark the relationship between an identifier and the properties making up an identification, we need use containers for identifiers and identifications only when and where we choose.

In some circumstances we may use bare identifiers, sans any identifications, and yet add identifications when circumstances warrant.

No level (bare identifiers, an identification, an identification that explodes into other identifiers, etc.) is right for every purpose. Each may be appropriate for some particular purpose.

We need to allow for downward expansion in the form of additional containers alongside the containers we author, as well as extension of containers to add sub-containers for identifiers and identifications we did not or chose not to author.

I do have an underlying assumption that may reassure you about the notion of downward expansion of identifier/identification containers:

Processing of one or more containers of identifiers and identifications can choose the level of identifiers + identifications to be processed.

For some purposes I may only want to choose “top level” identifiers and identifications or even just parts of identifications. For example, think of the simple mapping of identifiers that happens in some search systems. You may examine the identifications for identifiers and then produce a bare mapping of identifiers for processing purposes. Or you may have rules for identifications that produce a mapping of identifiers.
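Choosing a processing level might look something like the following sketch. It assumes containers are plain dictionaries with hypothetical keys `identifiers`, `identification`, and `sub_containers`; none of these names come from any specification.

```python
def identifiers_at(container: dict, max_depth: int) -> list[str]:
    """Collect identifiers from a container, descending at most max_depth levels."""
    found = list(container.get("identifiers", []))
    if max_depth > 0:
        for sub in container.get("sub_containers", []):
            found.extend(identifiers_at(sub, max_depth - 1))
    return found


# Hypothetical nesting: a term inside the identification gets its own container.
pentane = {
    "identifiers": ["Pentane"],
    "identification": {"molecular_formula": "C5H12"},
    "sub_containers": [
        {
            "identifiers": ["C5H12"],
            "identification": {"notation": "molecular formula"},
        },
    ],
}

top_level_only = identifiers_at(pentane, 0)
one_level_down = identifiers_at(pentane, 1)
```

With a depth of 0 the processor sees only the "top level" identifier. With a depth of 1 it also sees the identifier from the sub-container. The containers are unchanged either way; only the processing choice differs.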

Let’s assume that I want to create a set of the identifiers for Pentane and so I query for the identifiers that have the molecular property C5H12. Some of the identifiers (with their scopes) returned will be: Beilstein Reference 969132, CAS Registry Number 109-66-0, ChEBI CHEBI:37830, ChEMBL ChEMBL16102, ChemSpider 7712, DrugBank DB03119.

Each one of those identifiers may have other properties in their associated identifications, but there is no requirement that I produce them.
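The Pentane query can be sketched as below. The record layout and the `identifiers_for` function are assumptions for illustration, and the last record is a made-up non-match included only to show the filter doing something. The query returns bare (scope, identifier) pairs and leaves the rest of each identification behind.

```python
# Each record pairs a scoped identifier with one property of its identification.
RECORDS = [
    {"scope": "Beilstein Reference", "value": "969132", "molecular_formula": "C5H12"},
    {"scope": "CAS Registry Number", "value": "109-66-0", "molecular_formula": "C5H12"},
    {"scope": "ChEBI", "value": "CHEBI:37830", "molecular_formula": "C5H12"},
    {"scope": "ChEMBL", "value": "ChEMBL16102", "molecular_formula": "C5H12"},
    {"scope": "ChemSpider", "value": "7712", "molecular_formula": "C5H12"},
    {"scope": "DrugBank", "value": "DB03119", "molecular_formula": "C5H12"},
    # Hypothetical record, invented so the query has something to exclude.
    {"scope": "Example Registry", "value": "X-0001", "molecular_formula": "C6H14"},
]


def identifiers_for(records: list[dict], prop: str, wanted: str) -> list[tuple[str, str]]:
    """Return (scope, identifier) pairs whose identification has prop == wanted."""
    return [(r["scope"], r["value"]) for r in records if r.get(prop) == wanted]


pentane_ids = identifiers_for(RECORDS, "molecular_formula", "C5H12")
```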

I mentioned that identifiers have scope. If you perform a search on “109-66-0” (CAS Registry Number) or 7712 (ChemSpider) you will quickly find garbage. Some identifiers are useful only with particular data sources or in circumstances where the data source is identified. (The idea of “universal” identifiers is a recurrent human fiction. See Umberto Eco, The Search for the Perfect Language.)

Which means, of course, we will need to capture the scope of identifiers.
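One way to capture scope, offered only as a sketch, is to treat the scope as part of the lookup key. The `ScopedIdentifier` type, the `INDEX` table, and the "Example Parts Catalog" entry are hypothetical; the catalog entry is there to show what a bare "7712" collides with.

```python
from typing import NamedTuple, Optional


class ScopedIdentifier(NamedTuple):
    """An identifier is only meaningful together with its scope."""
    scope: str
    value: str


INDEX = {
    ScopedIdentifier("CAS Registry Number", "109-66-0"): "Pentane",
    ScopedIdentifier("ChemSpider", "7712"): "Pentane",
    # Hypothetical collision: the same string in an unrelated scope.
    ScopedIdentifier("Example Parts Catalog", "7712"): "hypothetical unrelated part",
}


def resolve(scope: str, value: str) -> Optional[str]:
    """Look up a subject by scoped identifier; None if unknown in that scope."""
    return INDEX.get(ScopedIdentifier(scope, value))


def resolve_bare(value: str) -> list[str]:
    """Ignore scope, as a naive search would; may return several subjects."""
    return sorted({subj for ident, subj in INDEX.items() if ident.value == value})


scoped_hit = resolve("ChemSpider", "7712")
bare_hits = resolve_bare("7712")
```

The scoped lookup returns a single subject, while the bare lookup returns every subject that happens to share the string.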
