Archive for the ‘Identification’ Category

Finding Subject Identifiers

Monday, April 1st, 2013

A recent comment made it clear that tooling, or the lack thereof, is a real issue for topic maps.

Here is my first suggestion of a tool you can use while authoring a topic map:

Wikipedia.

Seriously, think about it. You want a URL that identifies subject X.

Granting that Wikipedia is a fairly limited set of subjects, it is at least a starting point.

Example: I want a subject identifier for “Donald Duck,” a cartoon character.

I can use the search box at Wikipedia or I can type in a browser:

http://en.wikipedia.org/wiki/Donald%20Duck

Go ahead, try it.

If I don’t know the full name:

http://en.wikipedia.org/wiki/Donald

What do you think?

Allows you to disambiguate Donalds, at least the ones that Wikipedia knows about.

Not to mention giving you access to other subjects and relationships that may be of interest for your topic map.

To include foreign language materials (outside of English only non-thinking zones in the U.S.), try a different language Wikipedia:

http://de.wikipedia.org/wiki/Donald%20Duck

Finding subject identifiers won’t write your topic map for you but can make the job easier.

There are other sources of subject identifiers so send in your suggestions and any syntax short-cuts for accessing them.


You have no doubt read that URIs used as identifiers are supposed to be semi-permanent, “cool,” etc.

But identifiers change over time. It’s one of the reasons for historical semantic diversity.

URIs as identifiers will change as well.

Good thing topic maps enable you to have multiple identifiers for any subject.

Means old references to old identifiers still work.

Glad we dodged having to redo and reproof all those old connections.

Aren’t you?

New DataCorps Project: Refugees United

Sunday, January 27th, 2013

New DataCorps Project: Refugees United

From the post:

We are thrilled to announce the kick-off of a new DataKind project with Refugees United! Refugees United is a fantastic organization that uses mobile and web technologies to help refugees find their missing loved ones. Currently, RU’s system allows people to post descriptions of their family and friends as well as to search for them on the site. As you might imagine, lots of data flows through this system – data that could be used to greatly improve the way people find each other. Lead by the ever-brilliant Max Shron, the DataKind team is collaborating with Refugees United to explore what their data can tell them about how people are using the site, how they’re connecting to one another and, ultimately, how it can be used to help people find each other more effectively.

We are incredibly excited to work on this project and will be posting updates for you all as things unfoled. In the meantime, learn a bit more about Max and Refugees United.

I can’t comment on the identity practices because:

Q: 1.08 Why isn’t Refugees United open source yet?

Refugees United was born as an “offline” open source project. When we started, we were two guys (now six guys and a girl in Copenhagen, joined by a much larger team worldwide) with a great idea that had the potential to positively impact thousands, if not millions, of lives. The open source approach came from the fact that we wanted to build the world’s smallest refugee agency with the largest outreach, and to have the highest impact at the lowest cost.

One way to reach our objectives is to work with corporations around that world, including Ericsson, SAP, FedEx and others. The invaluable advice and expertise provided by these successful businesses – both the largest corporations and the smallest companies – have helped us to apply the structure and strategy of business to the passion and vision of an NGO.

Now the time has come for us to apply same structure to our software, and we have begun to collaborate with some of the wonderfully brilliant minds out there who wish to contribute and help us make a difference in the development of our technologies.

I am not sure what ‘”offline” open source’ means? The rest of the quoted prose doesn’t help.

Perhaps the software will become available online. At some point.

Would be a interesting data point to see how they are managing personal subject identity.

The Adams Workflow

Saturday, January 26th, 2013

The Adams Workflow

From the webpage:

The Advanced Data mining And Machine learning System (ADAMS) is a novel, flexible workflow engine aimed at quickly building and maintaining real-world, complex knowledge workflows.

Same source as WEKA.

What if we think about identification as workflow?

Whatever stability we attribute to an identification is the absence of additional data that would create a change.

Looking backwards over prior identifications, we fit them into the schema of our present identification and that eliminates any movement from the past. The past is fixed and terminates in our present identification.

That view fails to appreciate the world isn’t going to end with any of us individually. The world and its information systems will continue, as will the workflow that defines identifications.

Replacing our identifications with newer ones.

The question we face is whether our actions will support or impede re-use of our identifications in the future.

I first saw Adams Workflow at Nat Torkington’s Four short links: 24 January 2013.

Chemical datuments as scientific enablers

Friday, January 25th, 2013

Chemical datuments as scientific enablers by Henry S Rzepa. (Journal of Cheminformatics 2013, 5:6 doi:10.1186/1758-2946-5-6)

Abstract:

This article is an attempt to construct a chemical datument as a means of presenting insights into chemical phenomena in a scientific journal. An exploration of the interactions present in a small fragment of duplex Z-DNA and the nature of the catalytic centre of a carbon-dioxide/alkene epoxide alternating co-polymerisation is presented in this datument, with examples of the use of three software tools, one based on Java, the other two using Javascript and HTML5 technologies. The implications for the evolution of scientific journals are discussed.

From the background:

Chemical sciences are often considered to stand at the crossroads of paths to many disciplines, including molecular and life sciences, materials and polymer sciences, physics, mathematical and computer sciences. As a research discipline, chemistry has itself evolved over the last few decades to focus its metaphorical microscope on both far larger and more complex molecular systems than previously attempted, as well as uncovering a far more subtle understanding of the quantum mechanical underpinnings of even the smallest of molecules. Both these extremes, and everything in between, rely heavily on data. Data in turn is often presented in the form of visual or temporal models that are constructed to illustrate molecular behaviour and the scientific semantics. In the present article, I argue that the mechanisms for sharing both the underlying data, and the (semantic) models between scientists need to evolve in parallel with the increasing complexity of these models. Put simply, the main exchange mechanism, the scientific journal, is accepted [1] as seriously lagging behind in its fitness for purpose. It is in urgent need of reinvention; one experiment in such was presented as a data-rich chemical exploratorium [2]. My case here in this article will be based on my recent research experiences in two specific areas. The first involves a detailed analysis of the inner kernel of the Z-DNA duplex using modern techniques for interpreting the electronic properties of a molecule. The second recounts the experiences learnt from modelling the catalysed alternating co-polymerisation of an alkene epoxide and carbon dioxide.

Effective sharing of data, in scientific journals or no, requires either a common semantic (we know that’s uncommon) or a mapping between semantics (how may times must we repeat the same mappings, separately?).

Embedding notions of subject identity and mapping between identifications in chemical datuments could increase the reuse of data, as well as its longevity.

Open Source Natural Language Spell-Checker [Disambiguation at the point of origin.]

Friday, October 26th, 2012

Automattic Open Sources Natural Language Spell-Checker After the Deadline by Jolie O’Dell.

I am sure the original headline made sense to its author, but I wonder how a natural language processor would react to it?

My reaction, being innocent of any prior knowledge of the actors or the software was: What deadline? Reading it as a report of a missed deadline.

It is almost a “who’s on first” type headline. The software’s name is “After the Deadline.”

That confusion resolved, I read:

Matt Mullenweg has just annouced on his blog that WordPress parent company Automattic is open sourcing After the Deadline, a natural-language spell-checking plugin for WordPress and TinyMCE that was only recently ushered into the Automattic fold.

Scarcely seven weeks after its acquisition was announced, After the Deadline’s core technology is being released under the GPL. Moreover, writes Mullenweg, “There’s also a new jQuery API that makes it easy to integrate with any text area.”

Interested parties can check out this demo or read the tech overview and grab the source code here.

I can use spelling/grammar suggestions. Particularly since I make the same mistakes over and over again.

Does that also mean I talk about the same subjects/entities over and over again? Or at least a limited range of subjects/entities?

Imagine a user configurable subject/entity “checker” that annotated recognized subjects/entities with an <a> element. Enabling the user to accept/reject the annotation.

Disambiguation at the point of origin.

The title of the original article could become:

“<a href=”http://automattic.com/”>Automattic</a> Open Sources Natural Language Spell-Checker <a href=”http://www.afterthedeadline.com/”>After the Deadline</a>”

Seems less ambiguous to me.

Certainly less ambiguous to a search engine.

You?

Visual Clues: A Brain “feature,” not a “bug”

Saturday, September 29th, 2012

You will read in When Your Eyes Tell Your Hands What to Think: You’re Far Less in Control of Your Brain Than You Think that:

You’ve probably never given much thought to the fact that picking up your cup of morning coffee presents your brain with a set of complex decisions. You need to decide how to aim your hand, grasp the handle and raise the cup to your mouth, all without spilling the contents on your lap.

A new Northwestern University study shows that, not only does your brain handle such complex decisions for you, it also hides information from you about how those decisions are made.

“Our study gives a salient example,” said Yangqing ‘Lucie’ Xu, lead author of the study and a doctoral candidate in psychology at Northwestern. “When you pick up an object, your brain automatically decides how to control your muscles based on what your eyes provide about the object’s shape. When you pick up a mug by the handle with your right hand, you need to add a clockwise twist to your grip to compensate for the extra weight that you see on the left side of the mug.

“We showed that the use of this visual information is so powerful and automatic that we cannot turn it off. When people see an object weighted in one direction, they actually can’t help but ‘feel’ the weight in that direction, even when they know that we’re tricking them,” Xu said. (emphasis added)

I never quite trusted my brain and now I have proof that it is untrustworthy. Hiding stuff indeed! ;-)

But that’s the trick of subject identification/identity isn’t it?

That our brains “recognize” all manner of subjects without any effort on our part.

Another part of the effortless features of our brains. But it hides the information we need to integrate information stores from ourselves and others.

Or rather, making it more work than we are usually willing to devote to digging it out.

When called upon to be “explicit” about subject identification, or even worse, to imagine how other people identify subjects, we prefer to stay at home consuming passive entertainment.

Two quick points:

First, need to think about how to incorporate this “feature” into delivery interfaces for users.

Second, what subjects would users pay others to mine/collate/identify for them? (Delivery being a separate issue.)

Identities and Identifications: Politicized Uses of Collective Identities

Monday, September 17th, 2012

Identities and Identifications: Politicized Uses of Collective Identities

Deadline for Panels 15 January 2013
Deadline for Papers 1 March 2013
Conference 18-20 April 2013, Zagreb, Croatia

From the call for panels and papers:

Identity is one of the crown jewelleries in the kingdom of ‘contested concepts’. The idea of identity is conceived to provide some unity and recognition while it also exists by separation and differentiation. Few concepts were used as much as identity for contradictory purposes. From the fragile individual identities as self-solidifying frameworks to layered in-group identifications in families, orders, organizations, religions, ethnic groups, regions, nation-states, supra-national entities or any other social entities, the idea of identity always shows up in the core of debates and makes everything either too dangerously simple or too complicated. Constructivist and de-constructivist strategies have led to the same result: the eternal return of the topic. Some say we should drop the concept, some say we should keep it and refine it, some say we should look at it in a dynamic fashion while some say it’s the reason for resistance to change.

If identities are socially constructed and not genuine formations, they still hold some responsibility for inclusion/exclusion – self/other nexuses. Looking at identities in a research oriented manner provides explanatory tolls for a wide variety of events and social dynamics. Identities reflect the complex nature of human societies and generate reasonable comprehension for processes that cannot be explained by tracing pure rational driven pursuit of interests. The feelings of attachment, belonging, recognition, the processes of values’ formation and norms integration, the logics of appropriateness generated in social organizations are all factors relying on a certain type of identity or identification. Multiple identifications overlap, interact, include or exclude, conflict or enhance cooperation. Identities create boundaries and borders; define the in-group and the out-group, the similar and the excluded, the friend and the threatening, the insider and the ‘other’.

Beyond their dynamic fuzzy nature that escapes exhaustive explanations, identities are effective instruments of politicization of social life. The construction of social forms of organization and of specific social practices together with their imaginary significations requires all the time an essentialist or non-essentialist legitimating act of belonging; a social glue that extracts its cohesive function from the identification of the in-group and the power of naming the other. Identities are political. Multicultural slogans populate extensively the twenty-first century yet the distance between the ideal and the real multiculturalism persists while the virtues of inclusion coexist with the adversity of exclusion. Dealing with the identities means to integrate contestation into contestation until potentially a n degree of contestation. Due to the confusion between identities and identifications some scholars demanded that the concept of identity shall be abandoned. Identitarian issues turned out to be efficient tools for politicization of a ‘constraining dissensus’ while universalizing terms included in the making of the identities usually tend or intend to obscure the localized origins of any identitarian project. Identities are often conceptually used as rather intentional concepts: they don’t say anything about their sphere but rather defining the sphere makes explicit the aim of their usage. It is not ‘identity of’ but ‘identity to’.

Quick! Someone get them a URL! ;-) Just teasing.

Enjoy the conference!

Author Identifiers (arXiv.org) [> one (1) identifier per subject]

Monday, September 10th, 2012

I happened upon an author who used an arXiv.org author identifier at their webpage.

From the arXiv.org page:

It is a long-term goal of arXiv to accurately identify and disambiguate all authors of all articles in arXiv. Such identification would provide accurate results for queries such as "show me all the other papers by the particular John Smith that wrote this paper", something that can be done only approximately with text-based searches. It would also permit construction of an author-article graph which is useful for relevance
assessment and bibliometric analysis.

Since 2005 arXiv has used authority records that associate user accounts with articles authored by that user. These records support the endorsement system. The use of public author identifiers as a way to build services upon this data is new in 2009. Initially, users must opt-in to have a public author identifier and to expose the record of their articles on arXiv for use in other services. At some later date we hope to be able to improve our authority records to the point where we can create public author identifiers for all authors of arXiv articles without needing to enlist the help of each author to check their record before opting in.

The services we offer based on author identifiers are:

Significant enough in its own right but note the plans for the future:

The following enhancements and interoperability features are planned:

  • arXiv will permit authors to record other identifiers they have in other schemes and include these in the data feeds. This will allow agents and systems to link together the same author in different databases.
  • arXiv will support mechanisms for handling name changes, combination of accidentally created duplicates and separation of accidentally combined identifiers.

Recoding other identifiers? What? Acknowledge that there can be more than one identifier (yours) per subject? Blasphemy! ;-)

Go arXiv.org!

I am sure suggestions, support, contributions, etc., would be most welcome.

‘The Algorithm That Runs the World’ [Optimization, Identity and Polytopes]

Tuesday, August 28th, 2012

“The Algorithm That Runs the World” by Erwin Gianchandani.

From the post:

New Scientist published a great story last week describing the history and evolution of the simplex algorithm — complete with a table capturing “2000 years of algorithms”:

The simplex algorithm directs wares to their destinations the world over [image courtesy PlainPicture/Gozooma via New Scientist].Its services are called upon thousands of times a second to ensure the world’s business runs smoothly — but are its mathematics as dependable as we thought?

YOU MIGHT not have heard of the algorithm that runs the world. Few people have, though it can determine much that goes on in our day-to-day lives: the food we have to eat, our schedule at work, when the train will come to take us there. Somewhere, in some server basement right now, it is probably working on some aspect of your life tomorrow, next week, in a year’s time.

Perhaps ignorance of the algorithm’s workings is bliss. The door to Plato’s Academy in ancient Athens is said to have borne the legend “let no one ignorant of geometry enter”. That was easy enough to say back then, when geometry was firmly grounded in the three dimensions of space our brains were built to cope with. But the algorithm operates in altogether higher planes. Four, five, thousands or even many millions of dimensions: these are the unimaginable spaces the algorithm’s series of mathematical instructions was devised to probe.

Perhaps, though, we should try a little harder to get our heads round it. Because powerful though it undoubtedly is, the algorithm is running into a spot of bother. Its mathematical underpinnings, though not yet structurally unsound, are beginning to crumble at the edges. With so much resting on it, the algorithm may not be quite as dependable as it once seemed [more following the link].

A fund manager might similarly want to arrange a portfolio optimally to balance risk and expected return over a range of stocks; a railway timetabler to decide how best to roster staff and trains; or a factory or hospital manager to work out how to juggle finite machine resources or ward space. Each such problem can be depicted as a geometrical shape whose number of dimensions is the number of variables in the problem, and whose boundaries are delineated by whatever constraints there are (see diagram). In each case, we need to box our way through this polytope towards its optimal point.

This is the job of the algorithm.

Its full name is the simplex algorithm, and it emerged in the late 1940s from the work of the US mathematician George Dantzig, who had spent the second world war investigating ways to increase the logistical efficiency of the U.S. air force. Dantzig was a pioneer in the field of what he called linear programming, which uses the mathematics of multidimensional polytopes to solve optimisation problems. One of the first insights he arrived at was that the optimum value of the “target function” — the thing we want to maximise or minimise, be that profit, travelling time or whatever — is guaranteed to lie at one of the corners of the polytope. This instantly makes things much more tractable: there are infinitely many points within any polytope, but only ever a finite number of corners.

If we have just a few dimensions and constraints to play with, this fact is all we need. We can feel our way along the edges of the polytope, testing the value of the target function at every corner until we find its sweet spot. But things rapidly escalate. Even just a 10-dimensional problem with 50 constraints — perhaps trying to assign a schedule of work to 10 people with different expertise and time constraints — may already land us with several billion corners to try out.

Apologies but I saw this article too late to post within the “free” days allowed by New Scientist.

But, I think from Erwin’s post and long quote from the original article, you can see how the simplex algorithm may be very useful where identity is defined in multidimensional space.

The literature in this area is vast and it may not offer an appropriate test for all questions of subject identity.

For example, the possessor of a credit card is presumed to be the owner of the card. Other assumptions are possible, but fraud costs are recouped from fees paid by customers. Creating a lack of interest in more stringent identity tests.

On the other hand, if your situation requires multidimensional identity measures, this may be a useful approach.


PS: Be aware that naming confusion, the sort that can be managed (not solved) by topic maps abounds even in mathematics:

The elements of a polytope are its vertices, edges, faces, cells and so on. The terminology for these is not entirely consistent across different authors. To give just a few examples: Some authors use face to refer to an (n−1)-dimensional element while others use face to denote a 2-face specifically, and others use j-face or k-face to indicate an element of j or k dimensions. Some sources use edge to refer to a ridge, while H. S. M. Coxeter uses cell to denote an (n−1)-dimensional element. (Polytope)

UCR Insect Classification Contest [Classification by Ear]

Friday, July 6th, 2012

UCR Insect Classification Contest Ends November 16, 2012

As I have said before, subject identity is everywhere! ;-)

From the details PDF file:

Phase I: July to November 16th 2012 (this contest)

  • The task is to produce the best distance (similarity) measure for insect flight sounds.
  • The contest will be scored by 1-nearest neighbor classification.
  • The prizes include $500 cash and engraved trophies.

I was amused to read in the FAQ:

Note that the “sound” is measured with an optical sensor, rather than an acoustic one. This is done for various pragmatic reasons, however we don’t believe it makes any difference to the task at hand. The sampling rate is 16000 Hz

If you have a bee keeper nearby, can you do an empirical comparison of optical versus acoustic sensors for the capturing the “sound” of insects?

That seems like a first step in establishing computational entomology. BTW, use a range of frequencies, from sub to super sonic. (You are aware they have discovered sub-sonic sounds from whales can travel thousands of miles? Unlikely with insects but just because our ears can’t hear something doesn’t mean other ears cannot as well.)

I first saw this at KDNuggets.

Who Do You Say You Are?

Friday, May 11th, 2012

In Data Governance in Context, Jim Ericson outlines several paths of data governance, or as I put it: Who Do You Say You Are?:

On one path, more enterprises are dead serious about creating and using data they can trust and verify. It’s a simple equation. Data that isn’t properly owned and operated can’t be used for regulatory work, won’t be trusted to make significant business decisions and will never have the value organizations keep wanting to ascribe it on the balance sheet. We now know instinctively that with correct and thorough information, we can jump on opportunities, unite our understanding and steer the business better than before.

On a similar path, we embrace tested data in the marketplace (see Experian, D&B, etc.) that is trusted for a use case even if it does not conform to internal standards. Nothing wrong with that either.

And on yet another path (and areas between) it’s exploration and discovery of data that might engage huge general samples of data with imprecise value.

It’s clear that we cannot and won’t have the same governance standards for all the different data now facing an enterprise.

For starters, crowd sourced and third party data bring a new dimension, because “fitness for purpose” is by definition a relative term. You don’t need or want the same standard for how many thousands or millions of visitors used a website feature or clicked on a bundle in the way you maintain your customer or financial info.

Do mortgage-backed securities fall into the “…huge general samples of data with imprecise value?” I ask because I don’t work in the financial industry. Or do they not practice data governance, except to generate numbers for the auditors?

I mention this because I suspect that subject identity governance would be equally useful for topic map authoring.

For some topic maps, say on drug trials, need to have a high degree of reliability and auditability. As well as precise identification (even if double-blind) of the test subjects.

Or there may be different tests for subject identity, some of which appear to be less precise than others.

For example, merging all the topics entered by a particular operator in a day to look for patterns that may indicate they are not following data entry protocols. (It is hard to be as random as real data.)

As with most issues, there isn’t any hard and fast rule that works for all cases. You do need to document the rules you are following and for how long. It will help you test old rules and to formulate new ones. (“Document” meaning to write down. The vagaries of memory are insufficient.)

20 More Reasons You Need Topic Maps

Thursday, May 3rd, 2012

Well, Ed Lindsey did call his column 20 Commom Data Errors and Variation but when you see the PNG of the 20 errors, here, you will agree my title works better (for topic maps anyway).

Not only that, but Ed’s opening paragraphs work for identifying a subject by more than one attribute (although this is “subject” in the police sense of the word):

A good friend of mine’s husband is a sergeant on the Chicago police force. Recenlty a crime was committed and a witness insisted that the perpetrator was a woman with blond hair about five nine weighing 160 pounds. She was wearing a gray pinstriped business suit with an Armani scarf and carrying a Gucci handbag.

So what does this sergeant have to do? Start looking at the women of Chicago. He only needs the women. Actually, he would start with women with blond hair (but judging from my daughter’s constant change of hair color he might skip that attribute). So he might start with women in a certain height range and in a certain weight group. He would bring those women in to the station for questioning.

As it turns out, when they finally arrested the woman at her son’s soccer game, she had brown hair, was 5’5″ tall and weighed 120 pounds. She was wearing an Oklahoma University sweatshirt, jeans and sneakers. When the original witness saw her she said yes that’s the same woman. Iit turns out she was wearing four inch heels and the pantsuit made her look bigger.

So what can we learn from this episode that has to do with matching? Well the first thing we need to understand is that each of the attributes of the witness can be used in matching the suspect and then immediately we must also recognize that not all the attributes that the witness gave the sergeant were extremely accurate. So later on when we start talking about matching, will use the term fuzzy matching. This means that when you look at an address, there could be a number of different types of errors in the address from one system that are not identical to an address in another system. Figure 1 shows a number of the common errors that can happen.

So, there you have it: 20 more reasons to use topic maps, a lesson on identifying a subject and proof that yes, a pinstripped pantsuit can make you look bigger.

Past, Present and Future – The Quest to be Understood

Friday, April 20th, 2012

Without restricting it to being machine readable, I think we would all agree there are three ages of data:

  1. Past data
  2. Present data
  3. Future data

And we have common goals for data (or parts of it):

  1. Past data – To understand past data.
  2. Present data – To be understood by others.
  3. Future data – For our present data to persist and be understood by then users.

Common to those ages and goals is the need for management of identifiers for our data. (Where identifiers may be data as well.)

I say “management of identifiers” because we cannot control identifiers used in the past, identifiers used by others in the present, or identifiers that may be used in the future.

You would think in an obviously multi-lingual world that multiple identifier identification would be the default position.

Just a personal observation but hardly a day passes without someone or some group saying the equivalent of:

I know! I will create a list of identifiers that everyone must use! That’s the answer to the confusion (Babel) of identifiers.

Such efforts are always defeated by past identifiers, other identifiers in the present and future identifiers.

Managing tides of identifiers is a partial solution but more workable than trying to stop the tide.

What do you think?

Then BI and Data Science Thinking Are Flawed, Too

Tuesday, March 13th, 2012

Then BI and Data Science Thinking Are Flawed, Too

Steve Miller writes:

I just finished an informative read entitled “Everything is Obvious: *Once You Know the Answer – How Common Sense Fails Us,” by social scientist Duncan Watts.

Regular readers of Open Thoughts on Analytics won’t be surprised I found a book with a title like this noteworthy. I’ve written quite a bit over the years on challenges we face trying to be the rational, objective, non-biased actors and decision-makers we think we are.

So why is a book outlining the weaknesses of day-to-day, common sense thinking important for business intelligence and data science? Because both BI and DS are driven from a science of business framework that formulates and tests hypotheses on the causes and effects of business operations. If the thinking that produces that testable understanding is flawed, then so will be the resulting BI and DS.

According to Watts, common sense is “exquisitely adapted to handling the kind of complexity that arises in everyday situations … But ‘situations’ involving corporations, cultures, markets, nation-states, and global institutions exhibit a very different kind of complexity from everyday situations. And under these circumstances, common sense turns out to suffer from a number of errors that systematically mislead us. Yet because of the way we learn from experience … the failings of commonsense reasoning are rarely apparent to us … The paradox of common sense, therefore, is that even as it helps us make sense of the world, it can actively undermine our ability to understand it.”

The author argues that common sense explanations to complex behavior fail in three ways. The first error is that the mental model of individual behavior is systematically flawed. The second centers on explanations for collective behavior that are even worse, often missing the “emergence” – one plus one equals three – of social behavior. And finally, “we learn less from history than we think we do, and that misperception skews our perception of the future.”

Reminds me of Thinking, Fast and Slow by Daniel Kahneman.

Not that two books with a similar “take” proves anything but you should put them on your reading list.

I wonder when/where our perceptions of CS practices have been skewed?

Or where that has played a role in our decision making about information systems?

Identification – A Step Towards Semantics

Monday, February 6th, 2012

Entity resolution, also known as name resolution, refers to resolving a reference. The usual language continues with something to the effect “…real world object, person, etc.” I omit the usual language because a reference can be to anything.

Do you disagree? Just curious.

I have more comments on “resolution” but omit them here to reach the necessity for identification.

I say the “necessity for identification” as a prerequisite before semantics can be assigned to any identifier. Of whatever nature. Could be text, photograph, digital image, URI, etc.

I say that because identification (or recognition, a closely related task) may have different requirements (processing and otherwise) than the assignment of semantics, or the assignment of other properties.

For example, with legacy text, written on read-only media, if my only concern at the first step in the process is identification, I can create a list of terms that represent the terms I wish to have recognized. (Think of the TLG for example.) The semantics that I or anyone else wishes to associate with those terms becomes an entirely separate matter. Whatever means or system of semantics is used.

That only becomes possible if the notion of assigning semantics is separated from the task of identification. And separated from the organization of what has been identified and assigned semantics into a system (think ontology).

(You might want to read what was possible twenty-two (22) years ago with transient nodes and edges before responding to this post. I would extend that type of mechanisms to recognition, assignments of semantics and other properties. Having a fixed identification, assignment of semantics and other properties being only one choice among many.)

Multiple Recognitions: Reconsidered

Wednesday, February 1st, 2012

Yesterday I closed with these lines:

Requirement: A system of identification must support the same identifiers resolving to different identifications.

The consequences of deciding otherwise on such a requirement, I will try to take up tomorrow. (Multiple Recognitions)

Rereading that for today’s post, I don’t agree with myself.

The requirement isn’t a requirement at all but an observation that the same identifier may have multiple resolutions.

Better to say that the designer of systems of identification should be aware of that observation. To avoid situations like I posed yesterday with “I will call you a cab” example.

A fortuitous mistake because it leads to the next issue that I wanted to address: Do identifiers have contexts in which they have only a single resolution?

Yesterday’s mistake has made me more wary of sweeping pronouncements so I am posing the context issue as a question. ;-)

Can you think of any counter-examples?

The easiest place to look would be in comedy, where mistaken identity (such as in Shakespeare), double meanings, etc., are bread and butter of the art. Two or more people hear or see the same identifier and reach different resolutions.

In those cases, if we had a rule that identifiers could only have a single resolution, we would have to simply skip over those cases. That seems like an inelegant solution.

Or would you shrink the context down to the individuals who had the different resolutions of an identifier?

Perhaps, perhaps but then what is your solution when later in the play one or more individuals discover their mistake and now hold a common resolution but still remember the one that was in error? Or perhaps more than one that was in error? How do we describe the context(s) there?

There is a long history of such situations in comedy. You may be tempted to say that recreational literature can be excluded. That “fictional” work isn’t the first place we want semantic technologies to work.

Perhaps but remember that comedy and “fiction” have their origin in our day to day affairs. The misunderstandings they parody are our misunderstandings.

The saying: “what did X know and when did they know it?” takes on new meaning when we take about the interpretation of identifiers. Perhaps “freedom fighter” is a more sympathetic term until you “know” those forces are operating death squads. And may have different legal consequences.

How do you think boundaries for contexts should be set/designated? Seems like that would be an important issue to take up.

Thinking, Fast and Slow

Tuesday, December 27th, 2011

Thinking, Fast and Slow by Daniel Kahneman, Farrar, Straus and Giroux, New York, 2011.

I got a copy of “Thinking, Fast and Slow” for Christmas and it has already proven to be an enjoyable read.

Kahneman says early on (page 28):

The premise of this book is that it is easier to recognize other people’s mistakes than our own.

I thought about that line when I read a note from a friend that topic maps needed more than my:

tagging everything with “Topic Maps….”

Which means I haven’t been clear about the reasons for the breath of materials I have and will be covering in this blog.

One premise of this blog is that the use and recognition of identifiers is essential for communication.

Another premise of this blog is that it is easier for us to study the use and recognition of identifiers by others, much for the same reasons we can recognize the mistakes of others more easily.

The use and recognition of identifiers by others aren’t mistakes but they may be different from those we would make. In cases where they differ from ours, we have a unique opportunity to study the choices made and the impacts of those choices. And we may learn patterns in those choices that we can eventually see in our own choices.

Understanding the use and recognition of identifiers in a particular circumstance and the requirements for the use and recognition of identifiers, is the first step towards deciding whether topic maps would be useful in some circumstance and in what way?

For example, processing social security records in the United States, anything other than “bare” identifiers like a social security number may be unnecessary and add load with no corresponding benefit. Aligning social security records with bank records, might need to reconsider the judgement to use only social security numbers. (Some information sharing is “against the law.” But as the Sheriff in “Oh Brother where art thou?” says: “The law is a man made thing.” Laws change, or you can commission absurdist interpretations of it.)

Topic maps aren’t everywhere but identifiers and recognition of identifiers are.

Understanding identifiers and their recognition will help you choose the most appropriate solution to a problem

Ontology Matching 2011

Tuesday, December 13th, 2011

Ontology Matching 2011

Proceedings of the 6th International Workshop on Ontology Matching (OM-2011)

From the conference website:

Ontology matching is a key interoperability enabler for the Semantic Web, as well as a useful tactic in some classical data integration tasks dealing with the semantic heterogeneity problem. It takes the ontologies as input and determines as output an alignment, that is, a set of correspondences between the semantically related entities of those ontologies. These correspondences can be used for various tasks, such as ontology merging, data translation, query answering or navigation on the web of data. Thus, matching ontologies enables the knowledge and data expressed in the matched ontologies to interoperate.


The workshop has three goals:

  • To bring together leaders from academia, industry and user institutions to assess how academic advances are addressing real-world requirements. The workshop will strive to improve academic awareness of industrial and final user needs, and therefore direct research towards those needs. Simultaneously, the workshop will serve to inform industry and user representatives about existing research efforts that may meet their requirements. The workshop will also investigate how the ontology matching technology is going to evolve.
  • To conduct an extensive and rigorous evaluation of ontology matching approaches through the OAEI (Ontology Alignment Evaluation Initiative) 2011 campaign. The particular focus of this year’s OAEI campaign is on real-world specific matching tasks involving, e.g., open linked data and biomedical ontologies. Therefore, the ontology matching evaluation initiative itself will provide a solid ground for discussion of how well the current approaches are meeting business needs.
  • To examine similarities and differences from database schema matching, which has received decades of attention but is just beginning to transition to mainstream tools.

An excellent set of papers and posters.

While I was writing this post, I realized that had the papers been described as matching subject identifications by similarity measures, I would have felt completely different about the papers.

Isn’t that odd?

Question: Do you agree/disagree that mapping ontologies is different from mapping subject identifications? Why/why not?

OCLC Developer Network

Monday, October 24th, 2011

OCLC Developer Network

From the webpage:

The OCLC Developer Network is a community of developers collaborating to propose, discuss and test OCLC Web Services. This open source, code-sharing infrastructure improves the value of OCLC data for all users by encouraging new OCLC Web Service uses.

Thought while I was looking at OCLC resources I might as well give a shout out to the OCLC Developer Network. A community that has an interest in identifiers and identification for the purpose of furthering access to information. Who could be more sympathetic to topic maps?

WorldCat Identities Network

Monday, October 24th, 2011

WorldCat Identities Network

A project of OCLC Research, the WorldCat Identities Network is described as:

The WorldCat Identity Network uses the WorldCat Identities Web Service and the WorldCat Search API to create an interactive Related Identity Network Map for each Identity in the WorldCat Identities database. The Identity Maps can be used to explore the interconnectivity between WorldCat Identities.

A WorldCat Identity can be a person, a thing (e.g., the Titanic), a fictitious character (e.g., Harry Potter), or a corporation (e.g., IBM).

I can’t claim to be a fan of jumpy network node displays but that isn’t a criticism, more a matter of personal taste. Some people find that sort of display quite useful.

The information conveyed, leaving display to one side, is quite interesting. It has just enough fuzziness (to me at any rate) to approach the experience of serendipitous discovery using more traditional library tools. I suspect that will vary from topic to topic but that was my experience with briefly using the interface.

Despite my misgivings about the interface, I will be returning to explore this service fairly often.

BTW, the service is obviously mis-named. What is being delivered is what we used to call “see also” or related references, thus: WorldCat “See Also” Network would be a more accurate title.

For class:

  1. Spend at least an hour or more with the service and write a 2 page summary of what you liked/disliked about it. (no citations)
  2. What subject/relationship did you choose to follow? Discover anything you did not expect? 1 page (no citations)

Summing up Properties with subjectIdentifiers/URLs?

Thursday, September 8th, 2011

I was picking tomatoes in the garden when I thought about telling Carol (my wife) the plants are about to stop producing.

Those plants are at a particular address, in the backyard, middle garden bed of three, are of three different varieties, but I am going to sum up those properties by saying: “The tomatoes are about to stop producing.”

It occurred to me that a subjectIdentifier could be assigned to a topic element on the basis of summing up properties of the topic.* That would have the advantage of enabling merging on the basis of subjectIdentifiers as opposed to more complex tests upon properties of a topic.

Disclosure of the basis for assignment of a subjectIdentifier is an interesting question.

It could be that a service wishes to produce subjectIdentifiers and index information based upon complex property measures, producing for consumption, the subjectIdentifiers and merge-capable indexes on one or more information sets. The basis for merging being the competitive edge offered by the service.

If promoting merging with a vendor’s process or format, which is seeking to become the TCP/IP of some area, the basis for merging and tools to assist with it will be supplied.

Or if you are an intelligence agency and you want an inward and outward facing interface that promotes merging of information but does not disclose your internal basis for identification, variants of this technique may be of interest.

*The notion of summing up imposes no prior constraints on the tests used or the location of the information subjected to those tests.

When Should Identifications Be Immutable?

Thursday, September 8th, 2011

After watching a presentation on Clojure and its immutable data structures, I began to wonder when should identifications be immutable?

Note that I said when should identifications… which means I am not advocating a universal position for all identifiers but rather a choice that may vary from situation to situation.

We may change our minds about an identification, the fact remains that at some point (dare I say state?) a particular identification was made.

For example, you make a intimate gesture at a party only to discover your spouse wasn’t the recipient of the gesture. But at the time you made the gesture, at least I am willing to believe, you thought it was your spouse. New facts are now apparent. But it is also a new identification. As your spouse will remind you, you did make a prior, incorrect identification.

As I recall, topics (and other information items) are immutable for purposes of merging. (TMDM, 6.2 and following.) That is merging results in a new topic or other new information item. On the other hand, merging also results in updating information items other than the one subject to merging. So those information items are not being treated as immutable.

But since the references are being updates, I don’t think it would be inconsistent with the TMDM to create new information items to be the carriers of the new identifiers and thus treating the information items as immutable.

Would be application/requirement specific but say for accounting/banking/securities and similar applications, it may be important for identifications to be immutable. Such that we can “unroll” a topic map as it were to any prior arbitrary identification or state.