Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 10, 2012

Author Identifiers (arXiv.org) [> one (1) identifier per subject]

Filed under: Identification,Identifiers,Subject Identifiers — Patrick Durusau @ 10:25 am

I happened upon an author who used an arXiv.org author identifier at their webpage.

From the arXiv.org page:

It is a long-term goal of arXiv to accurately identify and disambiguate all authors of all articles in arXiv. Such identification would provide accurate results for queries such as "show me all the other papers by the particular John Smith that wrote this paper", something that can be done only approximately with text-based searches. It would also permit construction of an author-article graph which is useful for relevance assessment and bibliometric analysis.

Since 2005 arXiv has used authority records that associate user accounts with articles authored by that user. These records support the endorsement system. The use of public author identifiers as a way to build services upon this data is new in 2009. Initially, users must opt-in to have a public author identifier and to expose the record of their articles on arXiv for use in other services. At some later date we hope to be able to improve our authority records to the point where we can create public author identifiers for all authors of arXiv articles without needing to enlist the help of each author to check their record before opting in.

The services we offer based on author identifiers are:

Significant enough in its own right but note the plans for the future:

The following enhancements and interoperability features are planned:

  • arXiv will permit authors to record other identifiers they have in other schemes and include these in the data feeds. This will allow agents and systems to link together the same author in different databases.
  • arXiv will support mechanisms for handling name changes, combination of accidentally created duplicates and separation of accidentally combined identifiers.

Recording other identifiers? What? Acknowledge that there can be more than one identifier (yours) per subject? Blasphemy! 😉
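A minimal sketch of what "more than one identifier per subject" could look like in practice, including the planned merging of accidental duplicates. The `AuthorRecord` class, the scheme names and every identifier value are made up for illustration; none of this reflects how arXiv stores its authority records.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """One subject (an author) carrying any number of identifiers."""
    name: str
    # Each identifier is a (scheme, value) pair; the schemes are illustrative.
    identifiers: set = field(default_factory=set)


def same_author(a: AuthorRecord, b: AuthorRecord) -> bool:
    """Treat two records as the same subject if they share any identifier."""
    return bool(a.identifiers & b.identifiers)


def merge(a: AuthorRecord, b: AuthorRecord) -> AuthorRecord:
    """Combine accidentally created duplicates, keeping every identifier."""
    return AuthorRecord(name=a.name, identifiers=a.identifiers | b.identifiers)


# Hypothetical values, invented for this example.
r1 = AuthorRecord("J. Smith", {("arxiv", "smith_j_1"), ("dblp", "Smith:John")})
r2 = AuthorRecord("John Smith", {("dblp", "Smith:John"), ("local", "emp-0042")})

if same_author(r1, r2):
    merged = merge(r1, r2)
    print(sorted(merged.identifiers))
```

Matching on any shared identifier is the simplest possible rule. Separating accidentally combined records, the other planned feature, would need provenance for each identifier, which this sketch does not keep.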

Go arXiv.org!

I am sure suggestions, support, contributions, etc., would be most welcome.

September 4, 2012

Author Identifiers (At Least for CS)

Filed under: Bibliography,Identifiers — Patrick Durusau @ 1:33 pm

I enhanced the VLDB 2012 program with author queries to the DBLP Computer Science Bibliography for my own purposes.

After using that listing myself for a few days, it occurred to me that I should be using DBLP entries as author identifiers throughout my posts, at least when such entries exist.

For several reasons, but mostly:

  • DBLP maintains the publication listings (not by me!)
  • DBLP maintains pointers to other databases and resources (also not by me!)
  • DBLP maintains advanced search capabilities beyond authors (again, not by me!)

If you noticed “not by me” forming a pattern, you would be correct. There is a pattern.

The pattern?

Using DBLP author pages as identifiers, I leverage (not duplicate) the work of the DBLP project.

To the benefit of my readers. (Not to mention myself.)

The DBLP link brings an author’s publication history, their co-authors, and additional bibliographic resources. (That’s a triple I like.)

It takes a moment to insert the link but the payoff is substantial.

When you cite a CS author in your blog, include their DBLP link. We will all thank you for it.

(I did that once upon a time but lapsed. Will be cleaning up older entries and trying to do better in the future.)

PS: Similar sources of identifiers for other disciplines?

June 29, 2012

Bruce: How Well Does Current Legislative Identifier Practice Measure Up?

Filed under: Identifiers,Law,Law - Sources,Legal Informatics — Patrick Durusau @ 3:15 pm

Bruce: How Well Does Current Legislative Identifier Practice Measure Up?

From Legal Informatics:

Tom Bruce of the Legal Information Institute at Cornell University Law School (LII) has posted Identifiers, Part 3: How Well Does Current Practice Measure Up?, on LII’s new legislative metadata blog, Making Metasausage.

In this post, Tom surveys legislative identifier systems currently in use. He recommends the use of URIs for legislative identifiers, rather than URLs or URNs.

He cites favorably the URI-based identifier system that John Sheridan and Dr. Jeni Tennison developed for the Legislation.gov.uk system. Tom praises Sheridan’s (here) and Tennison’s (here and here) writings on legislative URIs and Linked Data.

Tom also praises the URI system implemented by Dr. Rinke Hoekstra in the Leibniz Center for Law’s Metalex Document Server for facilitating point-in-time as well as point-in-process identification of legislation.

Tom concludes by making a series of recommendations for a legislative identifier system:

See the post for his recommendations (in case you are working on such a system) and for other links.

I would point out that existing legislation has identifiers from before it receives the “better” identifiers specified here.

And those “old” identifiers will have been incorporated into other texts, legal decisions and the like.

Oh.

We can’t re-write existing identifiers so it’s a good thing topic maps accept subjects having identifiers, plural.

June 10, 2012

Deconstructing the Google Knowledge Graph

Filed under: Google Knowledge Graph,Identifiers — Patrick Durusau @ 8:17 pm

Deconstructing the Google Knowledge Graph

Mike Bergman has some interesting observations on the Google Knowledge Graph, first on its coverage and then on how it is constructing URLs for nodes in its graph.

I have to second his call for Google to release its identifiers via an API. That would be a real boon for common entities.

I say common entities because having “millions” of identifiers is fairly trivial when you consider the number of objects captured every night by optical astronomers alone. Or sequencing genomes.

Not to discount the value of a common identifier for Lady Gaga but uncommon entities need identifiers too.

Gabriel Hopmans pointed me to this post. (Morpheus)

May 26, 2012

Outlier detection in two review articles (Part 2) (TM use case on Identifiers)

Filed under: Identifiers,Outlier Detection,Topic Maps — Patrick Durusau @ 5:58 pm

Outlier detection in two review articles (Part 2) by Sandro Saitta.

From the post:

Here we go with the second review article about outlier detection (this post is the continuation of Part I).

A Survey of Outlier Detection Methodologies

This paper, from Hodge and Austin, is also an excellent review of the field. Authors give a list of keywords in the field: outlier detection, novelty detection, anomaly detection, noise detection, deviation detection and exception mining. For the authors, “An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs (Grubbs, 1969)”. Before listing several applications in the field, authors mention that an outlier can be “surprising veridical data”. It may only be situated in the wrong class.

An interesting list of possible reasons for outliers is given: human error, instrument error, natural deviations in population, fraudulent behavior, changes in behavior of system and faults in system. Like in the first article, Hodge and Austin define three types of approaches to outlier detection (unsupervised, supervised and semi-supervised). In the last one, they mention that some algorithms can allow a confidence in the fact that the observation is an outlier. Main drawback of the supervised approach is its inability to discover new types of outliers.

While you are examining the techniques, do note the alternative ways to identify the problem.

Can you say topic map? 😉

Simple query expansion, assuming that any single term returns hundreds of papers, isn’t all that helpful. Instead of several hundred papers you get several thousand. Gee, thanks.

But that isn’t an indictment of alternative identifications of subjects, that is a problem of granularity.

Returning documents forces users to wade through large amounts of potentially irrelevant content.

The question is how to retain alternative identifications of subjects while returning a manageable (or configurable) amount of content?
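One possible direction, offered as a sketch rather than an answer: resolve every alternative term to a single subject first, then return a capped number of passages instead of whole documents. The term list comes from the quoted post; the subject key, the `PASSAGES` index and the paper names are invented for illustration.

```python
# Alternative identifications of one subject, taken from the keyword list above.
SUBJECT_OF = {term: "outlier-detection" for term in (
    "outlier detection", "novelty detection", "anomaly detection",
    "noise detection", "deviation detection", "exception mining",
)}

# Hypothetical index, keyed by subject rather than by search term, and
# pointing at passages instead of whole documents.
PASSAGES = {
    "outlier-detection": [
        ("paper-A", "section 2"),
        ("paper-B", "section 4.1"),
        ("paper-C", "abstract"),
    ],
}


def search(term: str, limit: int = 2) -> list:
    """Resolve a term to its subject, then return at most `limit` passages."""
    subject = SUBJECT_OF.get(term.lower())
    if subject is None:
        return []
    return PASSAGES.get(subject, [])[:limit]


print(search("novelty detection"))
print(search("Exception Mining", limit=1))
```

Because every alternative term lands on the same subject, adding a synonym does not multiply the result set, and `limit` keeps the amount of content configurable.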

Suggestions?

May 25, 2012

Bruce on Legislative Identifier Granularity

Filed under: Identifiers,Law,Law - Sources,Legal Informatics — Patrick Durusau @ 10:23 am

Bruce on Legislative Identifier Granularity

From the post:

In this post, Tom [Bruce] explores legislative identifier granularity, or the level of specificity at which such an identifier functions. The post discusses related issues such as the incorporation of semantics in identifiers; the use of “pure” (semantics-free) legislative identifiers; and how government agency authority and procedural rules influence the use, “persistence, and uniqueness” of identifiers. The latter discussion leads Tom to conclude that

a “gold standard” system of identifiers, specified and assigned by a relatively independent body, is needed at the core. That gold standard can then be extended via known, stable relationships with existing identifier systems, and designed for extensible use by others outside the immediate legislative community.

Interesting and useful reading.

Even though a “gold standard” of identifiers for something as dynamic as legislation isn’t likely.

Or rather, isn’t going to happen.

There are too many stakeholders in present systems for any proposal to carry the day.

Not to mention decades, if not centuries, of references in other systems.

May 9, 2012

Bruce on the Functions of Legislative Identifiers

Filed under: Identifiers,Law,Law - Sources,Legal Informatics — Patrick Durusau @ 12:06 pm

Bruce on the Functions of Legislative Identifiers

From Legal Informatics:

In this post, Tom [Bruce] discusses the multiple functions that legislative document identifiers serve. These include “unique naming,” “navigational reference,” “retrieval hook / container label,” “thread tag / associative marker,” “process milestone,” and several more.

A promised second post will examine issues of identifier design.

Enjoy and pass along!

May 3, 2012

20 More Reasons You Need Topic Maps

Filed under: Identification,Identifiers,Identity,Marketing,Topic Maps — Patrick Durusau @ 6:23 pm

Well, Ed Lindsey did call his column 20 Common Data Errors and Variation but when you see the PNG of the 20 errors, here, you will agree my title works better (for topic maps anyway).

Not only that, but Ed’s opening paragraphs work for identifying a subject by more than one attribute (although this is “subject” in the police sense of the word):

A good friend of mine’s husband is a sergeant on the Chicago police force. Recently a crime was committed and a witness insisted that the perpetrator was a woman with blond hair about five nine weighing 160 pounds. She was wearing a gray pinstriped business suit with an Armani scarf and carrying a Gucci handbag.

So what does this sergeant have to do? Start looking at the women of Chicago. He only needs the women. Actually, he would start with women with blond hair (but judging from my daughter’s constant change of hair color he might skip that attribute). So he might start with women in a certain height range and in a certain weight group. He would bring those women in to the station for questioning.

As it turns out, when they finally arrested the woman at her son’s soccer game, she had brown hair, was 5’5″ tall and weighed 120 pounds. She was wearing an Oklahoma University sweatshirt, jeans and sneakers. When the original witness saw her she said yes that’s the same woman. It turns out she was wearing four inch heels and the pantsuit made her look bigger.

So what can we learn from this episode that has to do with matching? Well the first thing we need to understand is that each of the attributes of the witness can be used in matching the suspect and then immediately we must also recognize that not all the attributes that the witness gave the sergeant were extremely accurate. So later on when we start talking about matching, we will use the term fuzzy matching. This means that when you look at an address, there could be a number of different types of errors in the address from one system that are not identical to an address in another system. Figure 1 shows a number of the common errors that can happen.

So, there you have it: 20 more reasons to use topic maps, a lesson on identifying a subject and proof that yes, a pinstriped pantsuit can make you look bigger.
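For readers who have not met fuzzy matching before, here is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library. It is not Ed's method. The normalization step, the 0.85 threshold and the addresses are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Accept a pair as 'the same address' once similarity reaches the threshold."""
    return similarity(a, b) >= threshold


# Made-up addresses showing punctuation, case and spelling variation.
print(fuzzy_match("123 Main Street, Apt. 4", "123 main street apt 4"))
print(fuzzy_match("123 Main Stret", "123 Main Street"))
print(fuzzy_match("123 Main Street", "987 Oak Avenue"))
```

A single string ratio is the crudest form of this. Matching on several attributes at once, as the sergeant had to, means combining scores per attribute and deciding how much weight an unreliable one (hair color, say) deserves.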

April 29, 2012

Legal Entity Identifier – Preparing for the Inevitable

Filed under: Identifiers,Law,Legal Entity Identifier (LEI),Legal Informatics — Patrick Durusau @ 2:04 pm

Legal Entity Identifier – Preparing for the Inevitable by Peter Ku.

From the post:

Most of the buzz around the water cooler for those responsible for enterprise reference data in financial services has been around the recent G20 meeting in Switzerland on the details of the proposed Legal Entity Identifier (LEI). The LEI is designed to help regulators manage and monitor systemic risk in the financial markets by creating a unique ID to recognize legal entities/counterparties shared by the global financial companies and government regulators. Agreement to adoption is expected to be decided at the G20 leaders’ summit coming up in June in Mexico as regulators decide the details as to the administration, implementation and enforcement of the standard. Will the new LEI solve the issues that led to the recent financial crisis?

Looking back at history, this is not the first time the financial industry has attempted to create a unique ID system for legal entities, remember the Data Universal Numbering System (DUNS) identifier as an example? What is different from the past is that the new LEI standard is set at a global vs. regional level which had caused past attempts to fail. Unfortunately, the LEI standard will not replace existing IDs that firms deal with every day. Instead, it creates further challenges requiring companies to map existing IDs to the new LEI, reconciling naming differences, maintain legal hierarchy relationships between parent and subsidiary entities from ongoing corporate actions, and also link it to the securities and loans to the legal entities.

….

While many within the industry are waiting to see what the regulators decide in June, existing issues related to the quality, consistency, and delivery of counterparty reference data and the downstream impact on managing risk needs to be dealt with regardless if LEI is passed. In the same report, I shared the challenges firms will face incorporating the LEI including:

  • Accessing, reconciling, and relating existing counterparty information and IDs to the new LEI
  • Effectively identifying and resolving data quality issues from external and internal systems
  • Accurately identifying legal hierarchy relationships which LEI will not maintain in its first instantiation.
  • Cross referencing legal entities with financial and securities instruments
  • Extending both counterparty and securities instruments to downstream front, mid, and back office systems.

As a topic map person, do any of these issues sound familiar to you?

In particular creating a new identifier to solve problems with resolving multiple “old” ones?
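The first challenge on that list, relating existing IDs to the new LEI, can be sketched as a cross-reference table that also surfaces disagreements between sources. Every identifier below is a placeholder (these are not real DUNS numbers or LEIs), and the record layout is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical records: (scheme, existing identifier, LEI claimed by some source).
RECORDS = [
    ("duns", "00-000-0001", "LEI-EXAMPLE-A"),
    ("internal", "CP-17", "LEI-EXAMPLE-A"),
    ("internal", "CP-98", "LEI-EXAMPLE-B"),
    ("duns", "00-000-0001", "LEI-EXAMPLE-B"),  # disagrees with the first row
]


def build_crosswalk(records):
    """Map each existing identifier to the set of LEIs claimed for it."""
    crosswalk = defaultdict(set)
    for scheme, value, lei in records:
        crosswalk[(scheme, value)].add(lei)
    return crosswalk


def conflicts(crosswalk):
    """Existing identifiers mapped to more than one LEI need review."""
    return {key: leis for key, leis in crosswalk.items() if len(leis) > 1}


crosswalk = build_crosswalk(RECORDS)
for key, leis in conflicts(crosswalk).items():
    print(key, sorted(leis))
```

Keeping a set of claimed LEIs per old identifier, rather than overwriting with the latest value, is what lets the errors show up instead of disappearing silently.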

Being mindful that all data systems are capable of and/or contain errors, intentional (dishonest) and otherwise.

Presuming perfect records, and perfect data in those records, not only guarantees failure, but avenues for abuse.

Peter cites resources you will need to read.

April 20, 2012

Past, Present and Future – The Quest to be Understood

Filed under: Identification,Identifiers,Identity — Patrick Durusau @ 6:27 pm

Without restricting it to being machine readable, I think we would all agree there are three ages of data:

  1. Past data
  2. Present data
  3. Future data

And we have common goals for data (or parts of it):

  1. Past data – To understand past data.
  2. Present data – To be understood by others.
  3. Future data – For our present data to persist and be understood by the users of that time.

Common to those ages and goals is the need for management of identifiers for our data. (Where identifiers may be data as well.)

I say “management of identifiers” because we cannot control identifiers used in the past, identifiers used by others in the present, or identifiers that may be used in the future.

You would think in an obviously multi-lingual world that multiple identifier identification would be the default position.

Just a personal observation but hardly a day passes without someone or some group saying the equivalent of:

I know! I will create a list of identifiers that everyone must use! That’s the answer to the confusion (Babel) of identifiers.

Such efforts are always defeated by past identifiers, other identifiers in the present and future identifiers.

Managing tides of identifiers is a partial solution but more workable than trying to stop the tide.

What do you think?

April 18, 2012

Bad Names, Renaming, …?

Filed under: Identifiers,Names — Patrick Durusau @ 6:06 pm

David Loshin has a series of posts going at the Data Roundtable:

The Perils of Bad Names

and

The Impact of Data Element Renaming…

In “Bad Names,” David cites this example:

An example of this might be a column named “STREET_ADDRESS,” but that instead of that field holding a street number and name, it contains a set of flags indicating the types of customer correspondences that are to be sent to a home address instead of an email address. From one perspective, our assumption about what was stored in that field were mistaken, but on the other hand, conventional wisdom might have suggested otherwise.

I would agree, that at least looks like a bad name. Moreover, it’s one that is likely to trip up successors who have to deal with the data set.

David goes on to argue in “Renaming,” that finding and replacing all the uses of this name may lead to worse problems.

Ah, after thinking about it for a bit, I can see he has a point.

How about you?

April 6, 2012

“Give me your tired, your poor, your huddled identifiers yearning to be used.”

Filed under: Identifiers,RDF,Semantic Web — Patrick Durusau @ 6:52 pm

I was reminded of the title quote when I read Richard Wallis’s: A Fundamental Linked Data Debate.

Contrary to Richard’s imaginings, the vast majority of people on and off the Web are not waiting for the debates on the W3C’s Technical Architecture (TAG) or Linked Open Data (public-lod) mailing lists to be resolved.

Why?

They had identifiers for subjects long before the WWW, Semantic Web, Linked Data or whatever and will have identifiers for subjects long after those efforts and their successors are long forgotten.

Some of those identifiers are still in use today and will survive well into the future. Others are historical curiosities.

Moreover, when it was necessary to distinguish between identifiers and the things identified, that need was met.

Enter the WWW and its poster child, Tim Berners-Lee.

It was Tim Berners-Lee who created the problem Richard frames as: “the difference between a thing and a description of that thing.”

Amazing how much fog of discussion there has been to cover up that amateurish mistake.

The problem isn’t one of conflicting world views (a la Jeni Tennison) but rather how, given a bare URI, to interpret it, given the bad choices made in the Garden of the Web, as it were.

That we simply abandon bare URIs as a solution has never darkened their counsel. They would rather impose the 303/TBL burden on everyone rather than admit to fundamental error.

I have a better solution.

The rest of us should carry on with the identifiers that we want to use, whether they be URIs or not. Whether they are prior identifiers or new ones. And we should put forth statements/standards/documents to establish how in our contexts, those identifiers should be used.

If IBM, Oracle, Microsoft and a few other adventurers decide that IT can benefit from some standard terminology, I am sure they can influence others to use it. Whether composed of URIs or not. And the same can be said for many other domains, most of whom will do far better than the W3C at fashioning identifiers for themselves.

Take heart TAG and LOD advocates.

As the poem says: “Give me your tired, your poor, your huddled identifiers yearning to be used.”

Someday your identifiers will be preserved as well.

URN:LEX: New Version 06 Available

Filed under: Identifiers,Law,Law - Sources,Legal Informatics — Patrick Durusau @ 6:47 pm

URN:LEX: New Version 06 Available

From the purpose of the namespace “lex:”

The purpose of the “lex” namespace is to assign an unequivocal identifier, in standard format, to documents that are sources of law. To the extent of this namespace, “sources of law” include any legal document within the domain of legislation, case law and administrative acts or regulations; moreover potential “sources of law” (acts under the process of law formation, as bills) are included as well. Therefore “legal doctrine” is explicitly not covered.

The identifier is conceived so that its construction depends only on the characteristics of the document itself and is, therefore, independent from the document’s on-line availability, its physical location, and access mode.

This identifier will be used as a way to represent the references (and more generally, any type of relation) among the various sources of law. In an on-line environment with resources distributed among different Web publishers, uniform resource names allow simplified global interconnection of legal documents by means of automated hypertext linking.

If creating names just for law “sources” sounds like low-lying fruit to you, take some time to become familiar with the latest draft.
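The point about a name being independent of location can be sketched with nothing more than the generic URN shape, `urn:<NID>:<NSS>`. The example name below is a placeholder and does not follow the draft's actual syntax for the part after `urn:lex:`. The `LOCATIONS` table and its `example.org` URLs are likewise invented.

```python
def split_urn(urn: str) -> tuple:
    """Split 'urn:<NID>:<NSS>' into (namespace identifier, namespace-specific string)."""
    prefix, nid, nss = urn.split(":", 2)  # raises ValueError if parts are missing
    if prefix.lower() != "urn" or not nid or not nss:
        raise ValueError(f"not a URN: {urn!r}")
    # The "urn" prefix and the NID are case-insensitive; the NSS is left untouched.
    return nid.lower(), nss


# Placeholder name; the portion after "urn:lex:" is NOT the draft's real syntax.
NAME = "urn:lex:example-jurisdiction:example-act:2012"

# Hypothetical resolver table: one stable name, any number of current locations.
LOCATIONS = {
    NAME: [
        "https://example.org/publisher-a/acts/2012",
        "https://example.org/publisher-b/law?id=2012",
    ],
}

nid, nss = split_urn(NAME)
print(nid, nss)
print(LOCATIONS[NAME])
```

Publishers can come and go in `LOCATIONS` without any citation of `NAME` having to change, which is the property the namespace description is after.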

March 15, 2012

Data and Reality

Data and Reality: A Timeless Perspective on Data Management by Steve Hoberman.

I remember William Kent, the original author of “Data and Reality” from a presentation he made in 2003, entitled: “The unsolvable identity problem.”

His abstract there read:

The identity problem is intractable. To shed light on the problem, which currently is a swirl of interlocking problems that tend to get tumbled together in any discussion, we separate out the various issues so they can be rationally addressed one at a time as much as possible. We explore various aspects of the problem, pick one aspect to focus on, pose an idealized theoretical solution, and then explore the factors rendering this solution impractical. The success of this endeavor depends on our agreement that the selected aspect is a good one to focus on, and that the idealized solution represents a desirable target to try to approximate as well as we can. If we achieve consensus here, then we at least have a unifying framework for coordinating the various partial solutions to fragments of the problem.

I haven’t read the “new” version of “Data and Reality” (just ordered a copy) but I don’t recall the original needing much in the way of changes.

The original carried much the same message, that all of our solutions are partial even within a domain, temporary, chronologically speaking, and at best “useful” for some particular purpose. I rather doubt you will find that degree of uncertainty being confessed by the purveyors of any current semantic solution.

I did pull my second edition off the shelf and with free shipping (5-8 days), I should have time to go over my notes and highlights before the “new” version appears.

More to follow.

March 13, 2012

Then BI and Data Science Thinking Are Flawed, Too

Filed under: Identification,Identifiers,Marketing,Subject Identifiers,Subject Identity — Patrick Durusau @ 8:15 pm

Then BI and Data Science Thinking Are Flawed, Too

Steve Miller writes:

I just finished an informative read entitled “Everything is Obvious: *Once You Know the Answer – How Common Sense Fails Us,” by social scientist Duncan Watts.

Regular readers of Open Thoughts on Analytics won’t be surprised I found a book with a title like this noteworthy. I’ve written quite a bit over the years on challenges we face trying to be the rational, objective, non-biased actors and decision-makers we think we are.

So why is a book outlining the weaknesses of day-to-day, common sense thinking important for business intelligence and data science? Because both BI and DS are driven from a science of business framework that formulates and tests hypotheses on the causes and effects of business operations. If the thinking that produces that testable understanding is flawed, then so will be the resulting BI and DS.

According to Watts, common sense is “exquisitely adapted to handling the kind of complexity that arises in everyday situations … But ‘situations’ involving corporations, cultures, markets, nation-states, and global institutions exhibit a very different kind of complexity from everyday situations. And under these circumstances, common sense turns out to suffer from a number of errors that systematically mislead us. Yet because of the way we learn from experience … the failings of commonsense reasoning are rarely apparent to us … The paradox of common sense, therefore, is that even as it helps us make sense of the world, it can actively undermine our ability to understand it.”

The author argues that common sense explanations to complex behavior fail in three ways. The first error is that the mental model of individual behavior is systematically flawed. The second centers on explanations for collective behavior that are even worse, often missing the “emergence” – one plus one equals three – of social behavior. And finally, “we learn less from history than we think we do, and that misperception skews our perception of the future.”

Reminds me of Thinking, Fast and Slow by Daniel Kahneman.

Not that two books with a similar “take” proves anything but you should put them on your reading list.

I wonder when/where our perceptions of CS practices have been skewed?

Or where that has played a role in our decision making about information systems?

February 1, 2012

Multiple Recognitions: Reconsidered

Filed under: Context,Identification,Identifiers,Semantics — Patrick Durusau @ 4:39 pm

Yesterday I closed with these lines:

Requirement: A system of identification must support the same identifiers resolving to different identifications.

The consequences of deciding otherwise on such a requirement, I will try to take up tomorrow. (Multiple Recognitions)

Rereading that for today’s post, I don’t agree with myself.

The requirement isn’t a requirement at all but an observation that the same identifier may have multiple resolutions.

Better to say that the designer of systems of identification should be aware of that observation, to avoid situations like the one I posed yesterday with the “I will call you a cab” example.
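As a rough sketch of what being aware of that observation might mean for a designer: key resolutions by identifier and context together, and let a lookup return all of them when no context is given. The two contexts and the wording of the resolutions are one reading of the old joke, not a quotation from yesterday's post.

```python
# (identifier, context) -> resolution. Contexts here are simply who is interpreting.
RESOLUTIONS = {
    ("call you a cab", "speaker"): "summon a taxi for you",
    ("call you a cab", "listener"): "address you as 'a cab'",
}


def resolve(identifier: str, context=None) -> list:
    """Return every known resolution, narrowed to one context if one is given."""
    return [
        meaning
        for (ident, ctx), meaning in RESOLUTIONS.items()
        if ident == identifier and (context is None or ctx == context)
    ]


print(resolve("call you a cab"))
print(resolve("call you a cab", context="speaker"))
```

A system that insisted on one resolution per identifier would have to drop one of those two rows, and with it the joke.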

A fortuitous mistake because it leads to the next issue that I wanted to address: Do identifiers have contexts in which they have only a single resolution?

Yesterday’s mistake has made me more wary of sweeping pronouncements so I am posing the context issue as a question. 😉

Can you think of any counter-examples?

The easiest place to look would be in comedy, where mistaken identity (such as in Shakespeare), double meanings, etc., are the bread and butter of the art. Two or more people hear or see the same identifier and reach different resolutions.

In those cases, if we had a rule that identifiers could only have a single resolution, we would have to simply skip over those cases. That seems like an inelegant solution.

Or would you shrink the context down to the individuals who had the different resolutions of an identifier?

Perhaps, perhaps but then what is your solution when later in the play one or more individuals discover their mistake and now hold a common resolution but still remember the one that was in error? Or perhaps more than one that was in error? How do we describe the context(s) there?

There is a long history of such situations in comedy. You may be tempted to say that recreational literature can be excluded. That “fictional” work isn’t the first place we want semantic technologies to work.

Perhaps but remember that comedy and “fiction” have their origin in our day to day affairs. The misunderstandings they parody are our misunderstandings.

The saying: “what did X know and when did they know it?” takes on new meaning when we talk about the interpretation of identifiers. Perhaps “freedom fighter” is a more sympathetic term until you “know” those forces are operating death squads. And may have different legal consequences.

How do you think boundaries for contexts should be set/designated? Seems like that would be an important issue to take up.

January 18, 2012

Flake: A Decentralized, K-Ordered Unique ID Generator in Erlang

Filed under: Erlang,Identifiers — Patrick Durusau @ 7:54 pm

Flake: A Decentralized, K-Ordered Unique ID Generator in Erlang

From the post:

At Boundary we have developed a system for unique id generation. This started with two basic goals:

  • Id generation at a node should not require coordination with other nodes.
  • Ids should be roughly time-ordered when sorted lexicographically. In other words they should be k-ordered.

All that is required to construct such an id is a monotonically increasing clock and a location. K-ordering dictates that the most-significant bits of the id be the timestamp. UUID-1 contains this information, but arranges the pieces in such a way that k-ordering is lost. Still other schemes offer k-ordering with either a questionable representation of ‘location’ or one that requires coordination among nodes.

Just in case you are looking for a decentralized source of K-ordered unique IDs. 😉
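The idea in the quoted passage (timestamp in the most-significant bits, then a location, no coordination between nodes) can be sketched in a few lines. This is not Boundary's Erlang code. The 64/48/16-bit split, the `Generator` class and the use of the wall clock are assumptions for illustration, and a real implementation would want a properly monotonic clock.

```python
import time


def make_id(timestamp_ms: int, worker_id: int, sequence: int) -> str:
    """Pack timestamp (most significant), worker and sequence into 128 bits.

    The 64/48/16-bit layout is an assumption made for this sketch.
    """
    if not (0 <= timestamp_ms < 1 << 64
            and 0 <= worker_id < 1 << 48
            and 0 <= sequence < 1 << 16):
        raise ValueError("field out of range")
    value = (timestamp_ms << 64) | (worker_id << 16) | sequence
    # Fixed-width hex, so lexicographic order equals numeric order.
    return f"{value:032x}"


class Generator:
    """Per-node generator; it never talks to other nodes."""

    def __init__(self, worker_id: int, clock=lambda: int(time.time() * 1000)):
        self.worker_id = worker_id
        self.clock = clock
        self.last_ms = -1
        self.sequence = 0

    def next_id(self) -> str:
        now = self.clock()
        if now < self.last_ms:
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            self.sequence += 1  # same millisecond: bump the sequence
        else:
            self.last_ms, self.sequence = now, 0
        return make_id(now, self.worker_id, self.sequence)


node = Generator(worker_id=0x0A0B0C0D0E0F)  # hypothetical 48-bit location
print(node.next_id())
print(node.next_id())
```

Ids from different nodes interleave roughly by time when sorted as strings, and two nodes can only collide if they share a `worker_id`, which is why the choice of "location" matters so much in the original post.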

First seen at: myNoSQL as: Flake: A Decentralized, K-Ordered Unique ID Generator in Erlang.

December 27, 2011

Thinking, Fast and Slow

Thinking, Fast and Slow by Daniel Kahneman, Farrar, Straus and Giroux, New York, 2011.

I got a copy of “Thinking, Fast and Slow” for Christmas and it has already proven to be an enjoyable read.

Kahneman says early on (page 28):

The premise of this book is that it is easier to recognize other people’s mistakes than our own.

I thought about that line when I read a note from a friend that topic maps needed more than my:

tagging everything with “Topic Maps….”

Which means I haven’t been clear about the reasons for the breadth of materials I have and will be covering in this blog.

One premise of this blog is that the use and recognition of identifiers is essential for communication.

Another premise of this blog is that it is easier for us to study the use and recognition of identifiers by others, much for the same reasons we can recognize the mistakes of others more easily.

The use and recognition of identifiers by others aren’t mistakes but they may be different from those we would make. In cases where they differ from ours, we have a unique opportunity to study the choices made and the impacts of those choices. And we may learn patterns in those choices that we can eventually see in our own choices.

Understanding the use and recognition of identifiers in a particular circumstance, and the requirements for that use and recognition, is the first step towards deciding whether topic maps would be useful in that circumstance and, if so, in what way.

For example, when processing social security records in the United States, anything other than “bare” identifiers like a social security number may be unnecessary and add load with no corresponding benefit. Aligning social security records with bank records might require reconsidering the judgement to use only social security numbers. (Some information sharing is “against the law.” But as the Sheriff in “O Brother, Where Art Thou?” says: “The law is a man made thing.” Laws change, or you can commission absurdist interpretations of them.)

Topic maps aren’t everywhere but identifiers and recognition of identifiers are.

Understanding identifiers and their recognition will help you choose the most appropriate solution to a problem.

October 24, 2011

OCLC Developer Network

Filed under: Identification,Identifiers,Library Associations,OCLC Number — Patrick Durusau @ 6:42 pm

OCLC Developer Network

From the webpage:

The OCLC Developer Network is a community of developers collaborating to propose, discuss and test OCLC Web Services. This open source, code-sharing infrastructure improves the value of OCLC data for all users by encouraging new OCLC Web Service uses.

Thought while I was looking at OCLC resources I might as well give a shout out to the OCLC Developer Network. A community that has an interest in identifiers and identification for the purpose of furthering access to information. Who could be more sympathetic to topic maps?

WorldCat Identities Network

Filed under: Associations,Identification,Identifiers — Patrick Durusau @ 6:41 pm

WorldCat Identities Network

A project of OCLC Research, the WorldCat Identities Network is described as:

The WorldCat Identity Network uses the WorldCat Identities Web Service and the WorldCat Search API to create an interactive Related Identity Network Map for each Identity in the WorldCat Identities database. The Identity Maps can be used to explore the interconnectivity between WorldCat Identities.

A WorldCat Identity can be a person, a thing (e.g., the Titanic), a fictitious character (e.g., Harry Potter), or a corporation (e.g., IBM).

I can’t claim to be a fan of jumpy network node displays but that isn’t a criticism, more a matter of personal taste. Some people find that sort of display quite useful.

The information conveyed, leaving display to one side, is quite interesting. It has just enough fuzziness (to me at any rate) to approach the experience of serendipitous discovery using more traditional library tools. I suspect that will vary from topic to topic but that was my experience with briefly using the interface.

Despite my misgivings about the interface, I will be returning to explore this service fairly often.

BTW, the service is obviously mis-named. What is being delivered is what we used to call “see also” or related references, thus: WorldCat “See Also” Network would be a more accurate title.

For class:

  1. Spend at least an hour with the service and write a 2 page summary of what you liked/disliked about it. (no citations)
  2. What subject/relationship did you choose to follow? Discover anything you did not expect? 1 page (no citations)

September 8, 2011

Summing up Properties with subjectIdentifiers/URLs?

Filed under: Identification,Identifiers,Intelligence,Subject Identifiers,Subject Identity — Patrick Durusau @ 6:06 pm

I was picking tomatoes in the garden when I thought about telling Carol (my wife) the plants are about to stop producing.

Those plants are at a particular address, in the backyard, middle garden bed of three, are of three different varieties, but I am going to sum up those properties by saying: “The tomatoes are about to stop producing.”

It occurred to me that a subjectIdentifier could be assigned to a topic element on the basis of summing up properties of the topic.* That would have the advantage of enabling merging on the basis of subjectIdentifiers as opposed to more complex tests upon properties of a topic.
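As a rough sketch of the idea (in Python, with the rule, the property names and the example.org identifier all invented for illustration rather than taken from any topic map standard or tool), a “summing up” rule can test a topic’s properties and, when they match, assign a subjectIdentifier. Merging then needs only a comparison of identifiers:

```python
# Hypothetical illustration: the rule, the property names, and the
# example.org identifier are assumptions, not part of any standard.
TOMATOES = "http://example.org/subject/backyard-tomatoes"

def sum_up(properties):
    """Return a subjectIdentifier if the properties satisfy the rule, else None."""
    if (properties.get("kind") == "tomato plant"
            and properties.get("bed") == "middle"
            and properties.get("location") == "backyard"):
        return TOMATOES
    return None

def merge_by_identifier(topics):
    """Merge topics that share a subjectIdentifier; keep the rest unmerged."""
    merged, unidentified = {}, []
    for topic in topics:
        identifier = sum_up(topic)
        if identifier is None:
            unidentified.append(dict(topic))
            continue
        # Union the property values of every topic with the same identifier.
        target = merged.setdefault(identifier, {"subjectIdentifier": identifier})
        for key, value in topic.items():
            target.setdefault(key, set()).add(value)
    return list(merged.values()) + unidentified

topics = [
    {"kind": "tomato plant", "bed": "middle", "location": "backyard", "variety": "Roma"},
    {"kind": "tomato plant", "bed": "middle", "location": "backyard", "variety": "Cherokee Purple"},
    {"kind": "rose", "bed": "front", "location": "front yard"},
]
result = merge_by_identifier(topics)
```

The two tomato topics collapse into one merged topic carrying both varieties, while the rose is left alone. Nothing in the merge step needs to know how the identifier was arrived at, which is the point: the complex tests live in `sum_up` and can stay private.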

Disclosure of the basis for assignment of a subjectIdentifier is an interesting question.

It could be that a service wishes to produce subjectIdentifiers and index information based upon complex property measures, offering for consumption the subjectIdentifiers and merge-capable indexes on one or more information sets. The basis for merging would be the competitive edge offered by the service.

If promoting merging with a vendor’s process or format, which is seeking to become the TCP/IP of some area, the basis for merging and tools to assist with it will be supplied.

Or if you are an intelligence agency and you want an inward and outward facing interface that promotes merging of information but does not disclose your internal basis for identification, variants of this technique may be of interest.

*The notion of summing up imposes no prior constraints on the tests used or the location of the information subjected to those tests.

August 13, 2011

CAS Registry Number & The Semantic Web

Filed under: Cheminformatics,Identifiers,Indexing — Patrick Durusau @ 3:47 pm

CAS Registry Number

Another approach to the problem of identification: assign an arbitrary identifier for which you hold the key.

If you start early enough in a particular era, you can gain enough of an advantage to deter most competitors. Particularly if you curate the professional literature so that you can provide effective searching based on your (and other) identifiers.

The similarity to the Semantic Web’s assignment of a URL to every subject is not accidental.

The main differences with the Semantic Web:

  1. Economically important activity was focus of the project.
  2. Professional literature base with obvious value-add potential for research and production.
  3. Single source curators of the identifiers (did not whine at others to create them).
  4. Identification where there was user demand to support the effort.

The Wiki page reports (in part):

CAS Registry Numbers are unique numerical identifiers assigned by the “Chemical Abstracts Service” to every chemical described in the open scientific literature (currently including those described from at least 1957 through the present) and including elements, isotopes, organic and inorganic compounds, organometallics, metals, alloys, coordination compounds, minerals, and salts; as well as standard mixtures, compounds, polymers; biological sequences including proteins & nucleic acids; nuclear particles, and nonstructurable materials (aka ‘UVCB’s- i.e., materials of Unknown, Variable Composition, or Biological origin). They are also referred to as CAS RNs, CAS Numbers, etc.

The Registry maintained by CAS is an authoritative collection of disclosed chemical substance information. Currently the CAS Registry identifies more than 56 million organic and inorganic substances and 62 million sequences, plus additional information about each substance; and the Registry is updated with an approximate 12,000 additional new substances daily.

Historically, chemicals have been identified by a wide variety of synonyms. Frequently these are arcane and constructed according to regional naming conventions relating to chemical formulae, structures or origins. Well-known chemicals may additionally be known via multiple generic, historical, commercial, and/or black-market names.

PS: The index is now at 61+ million substances.

InChI – IUPAC International Chemical Identifier

Filed under: Cheminformatics,Identifiers — Patrick Durusau @ 3:47 pm

The Semantic Chemical Entity Specification was useful in pointing me towards InChI – IUPAC International Chemical Identifiers (Wiki page).

From the Wiki page:

The identifiers describe chemical substances in terms of layers of information — the atoms and their bond connectivity, tautomeric information, isotope information, stereochemistry, and electronic charge information. Not all layers have to be provided; for instance, the tautomer layer can be omitted if that type of information is not relevant to the particular application.

InChIs differ from the widely used CAS registry numbers in three respects:

  • they are freely usable and non-proprietary;
  • they can be computed from structural information and do not have to be assigned by some organization;
  • most of the information in an InChI is human readable (with practice).

I like the compute from structural information aspect. Reminds me of Eric Freese and his topic map example that calculated extended family relationships based on parent/child, sibling relationships.

What other areas would benefit from computable identifications and how would you go about constructing them? Such that the same set of inputs results in the same identifier?
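One hedged way to picture “same inputs, same identifier” is to put the inputs into a canonical form and hash the result. The Python sketch below is not how InChI works (InChI encodes structure in readable layers); the function name and the `urn:example:` prefix are assumptions made up for this illustration:

```python
import hashlib
import json

def computed_identifier(inputs):
    """Derive an identifier from a dict of inputs, independent of key order."""
    # Canonical form: sorted keys and fixed separators, so equal inputs
    # always serialize to the same bytes before hashing.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # "urn:example:" is a placeholder prefix, not a registered scheme for this use.
    return "urn:example:" + digest[:16]

a = computed_identifier({"parent": "Alice", "child": "Bob"})
b = computed_identifier({"child": "Bob", "parent": "Alice"})
c = computed_identifier({"parent": "Alice", "child": "Carol"})
```

Here `a` and `b` are equal because key order is normalized away, and `c` differs because an input differs. The hashing is the easy part. Deciding what counts as “the same input” (spelling, units, which properties to include) is where the real work would be in any domain.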

The Wiki page cites a number of other resources on chemical identification that will be useful if you are straying into work with chemical databases.

June 29, 2011

Providing and discovering definitions of URIs

Filed under: Identifiers,Linked Data,LOD,OWL,RDF,Semantic Web — Patrick Durusau @ 9:10 am

Providing and discovering definitions of URIs by Jonathan A. Rees.

Abstract:

The specification governing Uniform Resource Identifiers (URIs) [rfc3986] allows URIs to mean anything at all, and this unbounded flexibility is exploited in a variety of contexts, notably the Semantic Web and Linked Data. To use a URI to mean something, an agent (a) selects a URI, (b) provides a definition of the URI in a manner that permits discovery by agents who encounter the URI, and (c) uses the URI. Subsequently other agents may not only understand the URI (by discovering and consulting the definition) but may also use the URI themselves.

A few widely known methods are in use to help agents provide and discover URI definitions, including RDF fragment identifier resolution and the HTTP 303 redirect. Difficulties in using these methods have led to a search for new methods that are easier to deploy, and perform better, than the established ones. However, some of the proposed methods introduce new problems, such as incompatible changes to the way metadata is written. This report brings together in one place information on current and proposed practices, with analysis of benefits and shortcomings of each.

The purpose of this report is not to make recommendations but rather to initiate a discussion that might lead to consensus on the use of current and/or new methods.
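For readers who have not met the two established methods named in the abstract, here is a minimal Python sketch of how a client might decide where a definition lives. It is my own illustration, not code from the report: the function name is hypothetical, and the HTTP status and Location header are passed in as plain arguments so that nothing touches the network.

```python
from urllib.parse import urldefrag, urljoin

def definition_document(uri, status=None, location=None):
    """Guess which document defines a URI, or return None if unknown.

    status and location stand in for the HTTP response a client would
    receive when dereferencing the URI (assumed inputs, no request is made).
    """
    base, fragment = urldefrag(uri)
    if fragment:
        # Fragment ("hash") URI: the definition is sought in the document
        # named by the URI with its fragment removed.
        return base
    if status == 303 and location:
        # 303 See Other: the server points to a separate describing document.
        return urljoin(uri, location)
    return None

hash_case = definition_document("http://example.org/vocab#Person")
redirect_case = definition_document("http://example.org/id/alice", 303, "/doc/alice")
no_definition = definition_document("http://example.org/id/alice", 200)
```

The redirect case costs a second round trip, which is one reason the efficiency criterion in the list that follows is hard to meet with the 303 method.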

The criteria for success:

  1. Simple. Having too many options or too many things to remember makes discovery fragile and impedes uptake.
  2. Easy to deploy on Web hosting services. Uptake of linked data depends on the technology being accessible to as many Web publishers as possible, so should not require control over Web server behavior that is not provided by typical hosting services.
  3. Easy to deploy using existing Web client stacks. Discovery should employ a widely deployed network protocol in order to avoid the need to deploy new protocol stacks.
  4. Efficient. Accessing a definition should require at most one network round trip, and definitions should be cacheable.
  5. Browser-friendly. It should be possible to configure a URI that has a discoverable definition so that ‘browsing’ to it yields information useful to a human.
  6. Compatible with Web architecture. A URI should have a single agreed meaning globally, whether it’s used as a protocol element, hyperlink, or name.


I had to look it up to get the page number but I remembered Karl Wiegers in Software Requirements saying:

Feasible

It must be possible to implement each requirement within the known capabilities and limitations of the system and its environment.

The single agreed meaning globally, whether it’s used as a protocol element, hyperlink, or name requirement is not feasible. It will stymie this project, despite the array of talent on hand, until it is no longer a requirement.

Need proof? Name one URI with a single agreed meaning globally, whether it’s used as a protocol element, hyperlink, or name.

Not one that the W3C TAG, or TBL or anyone else thinks/wants/prays has a single agreed meaning globally, … but one that in fact has such a global meaning.

It’s been more than ten years. Let’s drop the last requirement and let the rather talented group working on this come up with a solution that meets the other five (5) requirements.

It won’t be a universal solution but then neither is the WWW.

May 24, 2011

Persistent Identifiers?

Filed under: Identifiers — Patrick Durusau @ 10:27 am

Lutz Maicher tweeted about Identifier Persistence: Fundamentals yesterday.

It claims two foundations for identifier persistence:

  1. Identifier persistence requires an organizational commitment. Persistence cannot be ensured by a few renegades in the skunk-works, nor can it be mandated from on high without the support of those who manage the identifiers or produce web resources. All individuals involved in the life-cycle of web resources must be committed to persistence in perpetuity if true persistence of identifiers is to be achieved.
  2. No technology, no standard, no identifier scheme, no information architecture will get you persistence. Whether you choose native URIs, Handles, DOIs, PURLs, ARKs, UUIDs, or XRIs, you will never achieve identifier persistence without active management of your identifiers and web resources. This requires the aforementioned organizational commitment since such management cannot occur without sufficient resources. Management of web resources and identifiers requires time and due diligence and those don’t come for free.

(emphasis in original)

So, identifier persistence requires active management of identifiers and web resources?

But when I think of persistent identifiers, I have something more like:

Cleopatra Cartouch, by Trevor Lowe

in mind. (Creative Commons License Trevor Lowe)

It has been, what?, over 2,000 years without active management of identifiers and web resources and it still persists as an identifier.

And that is a fairly recent identifier in the great scheme of identifiers. There are those that are far older.

I don’t deny the convenience or utility of web identifiers. But in terms of persistence, where should we look for a digital Rosetta stone when the maintenance of opaque identifiers and 303 redirects has fallen into disuse? I have heard it mentioned that fifteen or twenty years is persistence for a web identifier. Perhaps so but realize that the persistence of the identifier for Cleopatra that appears above is more than two orders of magnitude greater.

How would your business be different today if there were a cone of information darkness only fifteen or twenty years (optimistic estimate) behind you? And with each passing year, another year drops into a digital abyss. Some things persist, others don’t. Usually the ones you want/need don’t. Or so it always seems.

My suggestion isn’t yet-another-persistence-proposal (YAPP), the kind that involves multi-century funding/staffing and proposals to bind future generations to our present notions of persistence syntax.

Let’s write web identifiers using (in part) identifiers that are already meaningful in our professions, occupations and hobbies. Identifiers that are not dependent on particular resolution mechanisms or technologies. Identifiers that will persist long after their maintenance has failed. That is a step towards persistence.

March 1, 2011

Indexing by Properties

Filed under: Identifiers,Names,Properties — Patrick Durusau @ 10:09 am

When I was researching the …grain of salt post I happened across the entry for sodium chloride at Wikipedia.

I don’t know how many times I have looked at Wikipedia pages but that day I noticed the headings in the sidebar that read:

IUPAC name (International Union of Pure and Applied Chemistry nomenclature)
Other names
Identifiers
Properties
Structure
Hazards
Related Compounds
Supplementary data page

Think about it for a minute.

Substances don’t arrive in labs (say, for example, the fictional labs seen on CSI) with IUPAC names, other names, or even identifiers.

How are they identified? Can you say by their properties?

Now there is an odd disconnect between indexing and identification.

That is, indexing is by names and identifiers, both of which are known to be weak, rather than by properties.

Now there is an idea, an indexer that marshals properties for any index entry and can report why a particular entry was made.

We would not accept any less from a lab analysis, I wonder why we accept it from our indexers?
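To make that concrete, here is a toy Python sketch of an index keyed on properties that can also say why a subject matched. The class and method names are made up for this illustration, and the handful of properties recorded is nowhere near what a real lab or indexer would use:

```python
from collections import defaultdict

class PropertyIndex:
    """Toy index keyed by (property, value) pairs instead of names."""

    def __init__(self):
        self._entries = defaultdict(set)   # (property, value) -> subject labels
        self._subjects = {}                # subject label -> recorded properties

    def add(self, label, properties):
        self._subjects[label] = dict(properties)
        for pair in properties.items():
            self._entries[pair].add(label)

    def identify(self, observed):
        """Return subjects whose recorded properties include every observed pair."""
        candidates = None
        for pair in observed.items():
            matches = self._entries.get(pair, set())
            candidates = matches if candidates is None else candidates & matches
        return sorted(candidates or [])

    def why(self, label, observed):
        """Report which observed properties justify matching this subject."""
        recorded = self._subjects[label]
        return {k: v for k, v in observed.items() if recorded.get(k) == v}

index = PropertyIndex()
index.add("sodium chloride", {"crystal_system": "cubic", "flame_test": "yellow"})
index.add("potassium chloride", {"crystal_system": "cubic", "flame_test": "violet"})

sample = {"crystal_system": "cubic", "flame_test": "yellow"}
found = index.identify(sample)
reason = index.why("sodium chloride", sample)
```

One observed property (`crystal_system`) leaves both salts as candidates; adding the flame test narrows it to one, and `why` returns the evidence for that entry. Names only enter at the very end, as labels on a result that was reached by properties.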

Subjects, other than substances, also have properties, including relationships to other subjects.

Identifiers and locators in topic maps are quick and convenient ways to navigate topic maps and the subjects represented therein.

We should not allow that convenience to blind us to the deeper complexity of reliable identification of subjects by their properties.

Indexing based upon more than names and identifiers looks like a largely unexplored landscape and one where topic maps could make an original contribution to the art of indexing.

Well, to be honest, topic maps would be making explicit what indexers have been doing for years. Which would make it even more valuable.

Indexing by Properties. Has a nice ring to it, doesn’t it?

Has a number of implications for semantic web technologies, but more on that anon.
