Archive for the ‘Subject Identifiers’ Category

Where Do We Write Down Subject Identifications?

Wednesday, December 27th, 2017

Modern Data Integration Paradigms by Matthew D. Sarrel, The Bloor Group.


Businesses of all sizes and industries are rapidly transforming to make smarter, data-driven decisions. To accomplish this transformation to digital business, organizations are capturing, storing, and analyzing massive amounts of structured, semi-structured, and unstructured data from a large variety of sources. The rapid explosion in data types and data volume has left many IT and data science/business analyst leaders reeling.

Digital transformation requires a radical shift in how a business marries technology and processes. This isn’t merely improving existing processes, but
rather redesigning them from the ground up and tightly integrating technology. The end result can be a powerful combination of greater efficiency, insight and scale that may even lead to disrupting existing markets. The shift towards reliance on data-driven decisions requires coupling digital information with powerful analytics and business intelligence tools in order to yield well-informed reasoning and business decisions. The greatest value of this data can be realized when it is analyzed rapidly to provide timely business insights. Any process can only be as timely as the underlying technology allows it to be.

Even data produced on a daily basis can exceed the capacity and capabilities of many pre-existing database management systems. This data can be structured or unstructured, static or streaming, and can undergo rapid, often unanticipated, change. It may require real-time or near-real-time transformation to be read into business intelligence (BI) systems. For these reasons, data integration platforms must be flexible and extensible to accommodate business’s types and usage patterns of the data.

There’s the usual homage to the benefits of data integration:

IT leaders should therefore try to integrate data across systems in a way that exposes them using standard and commonly implemented technologies such as SQL and REST. Integrating data, exposing it to applications, analytics and reporting improves productivity, simplifies maintenance, and decreases the amount of time and effort required to make data-driven decisions.

The paper covers, lightly, Operational Data Store (ODS) / Enterprise Data Hub (EDH), Enterprise Data Warehouse (EDW), Logical Data Warehouse (LDW), and Data Lake as data integration options.

Having found existing systems deficient in one or more ways, the report goes on to recommend replacement with Voracity.

To be fair, as described, all four systems plus Voracity are deficient in the same way. The hard part of data integration, the rub that lies at the heart of the task, is passed over as ETL.

Efficient and correct ETL requires knowledge of what column headers identify. From the Enron spreadsheets, for instance, can you specify the transformation of the data in the following columns? “A, B, C, D, E, F…” from andrea_ring_15_IFERCnov.xlsx, or “A, B, C, D, E,…” from andy_zipper__129__Success-TradeLog.xlsx?

With enough effort, no doubt you could go through the spreadsheets of interest and create a mapping sufficient to transform their data, but where are you going to write down the facts you established for each column, the facts that underlie your transformation?

In topic maps, we made the mistake of mystifying the facts for each column by claiming to talk about subject identity, which has heavy ontological overtones.

What we should have said was that we wanted to talk about a simpler question: where do we write down subject identifications?


  1. What do you want to talk about?
  2. Data in column F in andrea_ring_15_IFERCnov.xlsx
  3. Do you want to talk about each entry separately?
  4. What subject is each entry? (date written month/day (no year))
  5. What calendar system was used for the date?
  6. Who created that date entry? (If want to talk about them as well, create a separate topic and an association to the spreadsheet.)
  7. The date is the date of … ?
  8. Conversion rules for dates in column F, such as supplying year.
  9. Merging rules for #2? (date comparison)
  10. Do you want relationship between #2 and the other data in each row? (more associations)

With simple questions, we have documented column F of a particular spreadsheet for any present or future ETL operation. No magic, no logical conundrums, no special query language, just asking what an author or ETL specialist knew but didn’t write down.
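Answers to those questions can be written down in any structured, searchable form. A minimal sketch in Python (the field names and values below are illustrative, not recovered from the Enron spreadsheets):

```python
import json

# A hypothetical identification record for column F of one spreadsheet.
# Each field answers one of the questions above; the values are examples only.
column_f = {
    "subject-locator": "andrea_ring_15_IFERCnov.xlsx#column=F",
    "subject": "date, written month/day with no year",
    "calendar": "Gregorian",
    "author": "andrea_ring",
    "conversion-rule": "supply the year from the workbook's file date",
    "merging-rule": "compare as full dates after the year is supplied",
}

# Persist the record next to the data so any future ETL job can consult it
# instead of re-discovering what the author knew but didn't write down.
serialized = json.dumps(column_f, indent=2)
print(serialized)
```

Whether the record lands in JSON, a database, or a topic map matters less than that the facts are recorded at all.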

There are subtleties, such as distinguishing between subject identifiers (identifies a subject, like a wiki page) and subject locators (points to the subject we want to talk about, like a particular spreadsheet), but identifying what you want to talk about (subject identifications and where to write them down) is more familiar than our prior obscurities.

Once those identifications are written down, you can search those identifications to discover the same subjects identified differently or with properties in one identification and not another. Think of it as capturing the human knowledge that resides in the brains of your staff and ETL experts.
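Such a search can be sketched as a property-by-property comparison of identification records (the records and helper names here are mine, purely for illustration):

```python
# Two identification records for spreadsheet columns, perhaps written
# by different staff members (the records themselves are invented).
a = {"subject": "trade date", "format": "month/day", "calendar": "Gregorian"}
b = {"subject": "date trade was recorded", "format": "month/day",
     "calendar": "Gregorian", "timezone": "CST"}

def shared(x, y):
    """Properties recorded with identical values in both identifications."""
    return {k: x[k] for k in x.keys() & y.keys() if x[k] == y[k]}

def only_in(x, y):
    """Properties recorded in x but absent from y."""
    return {k: x[k] for k in x.keys() - y.keys()}

# 'format' and 'calendar' agree: the columns are candidates for the same
# subject, and b contributes a 'timezone' property that a lacks.
print(shared(a, b))
print(only_in(b, a))
```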

The ETL assumed by the Bloor Group should be written ETLD – Extract, Transform, Load, Dump (knowledge). That seems remarkably inefficient and costly to me. You?

Shape Searching Dictionaries?

Thursday, November 16th, 2017

Facebook, despite its spying, censorship, and being a shill for the U.S. government, isn’t entirely without value.

For example, this post by Simon St. Laurent:

Drew this response from Peter Cooper:

If you follow the link, Shapecatcher: Unicode Character Recognition, you find:

Draw something in the box!

And let shapecatcher help you to find the most similar unicode characters!

Currently, there are 11817 unicode character glyphs in the database. Japanese, Korean and Chinese characters are currently not supported.
(emphasis in original)

I take “Japanese, Korean and Chinese characters are currently not supported.” to mean that Anatolian Hieroglyphs; Cuneiform, Cuneiform Numbers and Punctuation, Early Dynastic Cuneiform, Old Persian, Ugaritic; Egyptian Hieroglyphs; Meroitic Cursive, and Meroitic Hieroglyphs are not supported as well.

But my first thought wasn’t discovery of glyphs in Unicode Code Charts, although useful, but shape searching dictionaries, such as Faulkner’s A Concise Dictionary of Middle Egyptian.

A sample from Faulkner’s (1991 edition):

Or, The Student’s English-Sanskrit Dictionary by Vaman Shivram Apte (1893):

Imagine being able to search by shape in either dictionary! Not just for a glyph but for a set of glyphs, within any entry!

I suspect that’s doable based on Benjamin Milde‘s explanation of Shapecatcher:

Under the hood, Shapecatcher uses so called “shape contexts” to find similarities between two shapes. Shape contexts, a robust mathematical way of describing the concept of similarity between shapes, is a feature descriptor first proposed by Serge Belongie and Jitendra Malik.

You can find an in-depth explanation of the shape context matching framework that I used in my Bachelor thesis (“On the Security of reCAPTCHA”). In the end, it is quite a bit different from the matching framework that Belongie and Malik proposed in 2000, but still based on the idea of shape contexts.

The engine that runs this site is a rewrite of what I developed during my bachelor thesis. To make things faster, I used CUDA to accelerate some portions of the framework. This is a fairly new technology that enables me to use my NVIDIA graphics card for general purpose computing. Newer cards are quite powerful devices!

That was written in 2011 and no doubt shape matching has progressed since then.
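The core of a shape context is easy to sketch: for one reference point on a contour, histogram where all the other sample points fall in log-polar coordinates. A rough illustration (the bin counts, radius edges, and normalization are my arbitrary choices, not Shapecatcher's):

```python
import numpy as np

def shape_context(points, idx, n_r=5, n_theta=12):
    """Log-polar histogram of where the other points lie relative to points[idx]."""
    d = np.delete(points, idx, axis=0) - points[idx]
    r = np.linalg.norm(d, axis=1)
    r = r / r.mean()                                   # crude scale invariance
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)
    r_bin = np.clip(np.digitize(r, r_edges) - 1, 0, n_r - 1)
    t_bin = (theta / (2 * np.pi) * n_theta).astype(int) % n_theta
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1)                 # count points per bin
    return hist / hist.sum()

# Two descriptors can then be compared with, e.g., a chi-squared distance.
def chi2(h1, h2, eps=1e-12):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

circle = np.array([[np.cos(t), np.sin(t)]
                   for t in np.linspace(0, 2 * np.pi, 32, endpoint=False)])
print(shape_context(circle, 0).shape)  # (5, 12)
```

Matching two glyphs then reduces to pairing up sample points so that the total descriptor distance is small, the part Belongie and Malik handle with bipartite matching.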

No technique will be 100% accurate, but even less-than-perfect accuracy will unlock generations of scholarly dictionaries, in ways not imagined by their creators.

If you are interested, I’m sure Benjamin Milde would love to hear from you.

Dimensions of Subject Identification

Thursday, July 27th, 2017

This isn’t a new idea, but it occurred to me that introducing readers to “dimensions of subject identification” might be an easier on-ramp for topic maps. It enables us to dodge the sticky issues of “identity,” in favor of asking: what do you want to talk about? and how many dimensions do you want/need to identify it?

To start with a classic example, if we only have one dimension and the string “Paris,” ambiguity is destined to follow.

If we add a country dimension, now having two dimensions, “Paris” + “France” can be distinguished from all other uses of “Paris” with the string + country dimension.

The string + country dimension fares less well for “Paris” + country = “United States:”

For the United States you need “Paris” + country + state dimensions, at a minimum, but that leaves you with two instances of Paris in Ohio.

One advantage of speaking of “dimensions of subject identification” is that we can order systems of subject identification by the number of dimensions they offer. Not to mention examining the consequences of the choices of dimensions.

One-dimensional systems, that is, a solitary string, "Paris," as we said above, leave users with no means to distinguish one use from another. They are useful and common in CSV files or database tables, but risk ambiguity and are difficult to communicate accurately to others.

Two-dimensional systems, that is, city = "Paris," enable users to distinguish usages other than for city, but as you can see from the Paris example in the U.S., that may not be sufficient.

Moreover, city itself may be a subject identified by multiple dimensions, as different governmental bodies define “city” differently.

Just as some information systems use only one-dimensional strings for headers, other information systems may use one-dimensional strings for the subject city in city = "Paris." But all systems can capture multiple dimensions of identification for any subject, separately from those systems.

Perhaps the most useful aspect of dimensions of identification is enabling users to ask their information architects which dimensions, and which values for them, serve to identify subjects in information systems.

Such as the headers in database tables or spreadsheets. 😉
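The progression above can be sketched as dimension → value maps (the county names are my assumption for illustration, not from the post):

```python
# Subject identifications as dimension -> value maps.
records = [
    {"name": "Paris", "country": "France"},
    {"name": "Paris", "country": "United States", "state": "Texas"},
    {"name": "Paris", "country": "United States", "state": "Ohio", "county": "Stark"},
    {"name": "Paris", "country": "United States", "state": "Ohio", "county": "Portage"},
]

def lookup(**dims):
    """All records whose identification matches every supplied dimension."""
    return [r for r in records if all(r.get(k) == v for k, v in dims.items())]

print(len(lookup(name="Paris")))                     # 4 -- one dimension, ambiguous
print(len(lookup(name="Paris", country="France")))   # 1 -- two dimensions suffice
print(len(lookup(name="Paris", country="United States", state="Ohio")))  # 2 -- still ambiguous
```

Each added dimension narrows the match; counting how many dimensions a system can record is one way to order systems of subject identification.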

Digital Data Repositories in Chemistry…

Wednesday, July 1st, 2015

Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks by Matthew J. Harvey, Nicholas J. Mason, Henry S. Rzepa.


We discuss the concept of recasting the data-rich scientific journal article into two components, a narrative and separate data components, each of which is assigned a persistent digital object identifier. Doing so allows each of these components to exist in an environment optimized for purpose. We make use of a poorly-known feature of the handle system for assigning persistent identifiers that allows an individual data file from a larger file set to be retrieved according to its file name or its MIME type. The data objects allow facile visualization and retrieval for reuse of the data and facilitates other operations such as data mining. Examples from five recently published articles illustrate these concepts.

A very promising effort to integrate published content and electronic notebooks in chemistry. It is encouraging that, in addition to the technical and identity issues, the authors also point out the lack of incentives for the extra work required to achieve useful integration.

Everyone agrees that deeper integration of resources in the sciences will be a game-changer, but renewing the realization that there is no such thing as a free lunch is an important step towards that goal.

This article easily repays a close read with interesting subject identity issues and the potential that topic maps would offer to such an effort.

Fifty Words for Databases

Saturday, March 7th, 2015

Fifty Words for Databases by Phil Factor

From the post:

Almost every human endeavour seems simple from a distance: even database deployment. Reality always comes as a shock, because the closer you get to any real task, the more you come to appreciate the skills that are necessary to accomplish it.

One of the big surprises I have when I attend developer conferences is to be told by experts how easy it is to take a database from development and turn it into a production system, and then implement the processes that allow it to be upgraded safely. Occasionally, I’ve been so puzzled that I’ve drawn the speakers to one side after the presentation to ask them for the details of how to do it so effortlessly, mentioning a few of the tricky aspects I’ve hit. Invariably, it soon becomes apparent from their answers that their experience, from which they’ve extrapolated, is of databases the size of a spreadsheet with no complicated interdependencies, compliance issues, security complications, high-availability mechanisms, agent tasks, alerting systems, complex partitioning, queuing, replication, downstream analysis dependencies and so on about which you, the readers, know more than I. At the vast international enterprise where I once worked in IT, we had a coded insult for such people: ‘They’ve catalogued their CD collection in a database’. Unfair, unkind, but even a huge well-used ‘Big Data’ database dealing in social media is a tame and docile creature compared with a heavily-used OLTP trading system where any downtime or bug means figures for losses where you have to count the trailing zeros. The former has unique problems, of course, but the two types of database are so different.

I wonder if the problem is one of language. Just as the English have fifty ways of describing rainfall, and the Inuit have many ways of describing pack ice, it is about time that we created the language for a variety of databases from a mild drizzle (‘It is a soft morning to be sure’) to a cloud-burst. Until anyone pontificating about the database lifecycle can give their audience an indication of the type of database they’re referring to, we will continue to suffer the sort of misunderstandings that so frustrate the development process. Though I’m totally convinced that the development culture should cross-pollinate far more with the science of IT operations, it will need more than a DevOps group-hug; it will require a change in the technical language so that it can accurately describe the rich variety of databases in operational use and their widely-varying requirements. The current friction is surely due more to misunderstandings on both sides, because it is so difficult to communicate these requirements. Any suggestions for suitable descriptive words for types of database? (emphasis added)

If you have “descriptive words” to suggest to Phil, comment on his post.

With the realization that your “descriptive words” may be different from my “descriptive words” for the same database or mean a different database altogether or have nothing to do with databases at all (when viewed by others).

Yes, I have been thinking about identifiers, again, and will start off the coming week with a new series of posts on subject identification. I hope to include a proposal for a metric of subject identification.

Data Modelling: The Thin Model [Entities with only identifiers]

Monday, October 27th, 2014

Data Modelling: The Thin Model by Mark Needham.

From the post:

About a third of the way through Mastering Data Modeling the authors describe common data modelling mistakes and one in particular resonated with me – ‘Thin LDS, Lost Users‘.

LDS stands for ‘Logical Data Structure’ which is a diagram depicting what kinds of data some person or group wants to remember. In other words, a tool to help derive the conceptual model for our domain.

They describe the problem that a thin model can cause as follows:

[…] within 30 minutes [of the modelling session] the users were lost…we determined that the model was too thin. That is, many entities had just identifying descriptors.

While this is syntactically okay, when we revisited those entities asking, What else is memorable here? the users had lots to say.

When there was flesh on the bones, the uncertainty abated and the session took a positive course.

I found myself making the same mistake a couple of weeks ago during a graph modelling session. I tend to spend the majority of the time focused on the relationships between the bits of data and treat the metadata or attributes almost as an afterthought.

A good example of why subjects need multiple attributes, even multiple identifying attributes.

When sketching just a bare data model, the author, having prepared in advance, is conversant with the scant identifiers. The audience, on the other hand, is not. Additional attributes for each entity quickly remind the audience of the entity in question.

Take this as anecdotal evidence that multiple attributes assist users in recognition of entities (aka subjects).

Will that impact how you identify subjects for your users?

Search Gets Smarter with Identifiers

Wednesday, March 19th, 2014

Search Gets Smarter with Identifiers

From the post:

The future of computing is based on Big Data. The vast collections of information available on the web and in the cloud could help prevent the next financial crisis, or even tell you exactly when your bus is due. The key lies in giving everything – whether it’s a person, business or product – a unique identifier.

Imagine if everything you owned or used had a unique code that you could scan, and that would bring you a wealth of information. Creating a database of billions of unique identifiers could revolutionise the way we think about objects. For example, if every product that you buy can be traced through every step in the supply chain you can check whether your food has really come from an organic farm or whether your car is subject to an emergency recall.


The difficulty with using big data is that the person or business named in one database might have a completely different name somewhere else. For example, news reports talk about Barack Obama, The US President, and The White House interchangeably. For a human being, it’s easy to know that these names all refer to the same person, but computers don’t know how to make these connections. To address the problem, Okkam has created a Global Open Naming System: essentially an index of unique entities like people, organisations and products, that lets people share data.

“We provide a very fast and effective way of discovering data about the same entities across a variety of sources. We do it very quickly,” says Paolo Bouquet. “And we do it in a way that it is incremental so you never waste the work you’ve done. Okkam’s entity naming system allows you to share the same identifiers across different projects, different companies, different data sets. You can always build on top of what you have done in the past.”

The benefits of a unique name for everything

The community website reports 8.5+ million entities.

When the EU/CORDIS show up late for a party, it’s really late.

A multi-lingual organization like the EU (kudos on their efforts in that direction) should know that uniformity of language or identifiers is found only in dystopian fiction.

I prefer the language and cultural richness of Europe over the sterile uniformity of American fast food chains. Same issue.


I first saw this in a tweet by Stefano Bertolo.

Finding Subject Identifiers

Monday, April 1st, 2013

A recent comment made it clear that tooling, or the lack thereof, is a real issue for topic maps.

Here is my first suggestion of a tool you can use while authoring a topic map:


Seriously, think about it. You want a URL that identifies subject X.

Granting that Wikipedia is a fairly limited set of subjects, it is at least a starting point.

Example: I want a subject identifier for “Donald Duck,” a cartoon character.

I can use the search box at Wikipedia or I can type in a browser: https://en.wikipedia.org/wiki/Donald_Duck

Go ahead, try it.

If I don’t know the full name:

What do you think?

Allows you to disambiguate Donalds, at least the ones that Wikipedia knows about.

Not to mention giving you access to other subjects and relationships that may be of interest for your topic map.

To include foreign language materials (outside of English only non-thinking zones in the U.S.), try a different language Wikipedia:

Finding subject identifiers won’t write your topic map for you but can make the job easier.

There are other sources of subject identifiers so send in your suggestions and any syntax short-cuts for accessing them.
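The browser short-cut above can be captured as a one-line function; a sketch (the helper name is mine, and it relies on Wikipedia's convention of replacing spaces in titles with underscores):

```python
from urllib.parse import quote

def wikipedia_identifier(title, lang="en"):
    """Build a Wikipedia URL usable as a subject identifier.

    Wikipedia replaces spaces with underscores in article titles;
    any remaining unsafe characters are percent-encoded.
    """
    return f"https://{lang}.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))

print(wikipedia_identifier("Donald Duck"))
# https://en.wikipedia.org/wiki/Donald_Duck
print(wikipedia_identifier("Donald Duck", lang="de"))
# https://de.wikipedia.org/wiki/Donald_Duck
```

Swapping the `lang` parameter gives you subject identifiers drawn from the different language Wikipedias mentioned above.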

You have no doubt read that URIs used as identifiers are supposed to be semi-permanent, “cool,” etc.

But identifiers change over time. It’s one of the reasons for historical semantic diversity.

URIs as identifiers will change as well.

Good thing topic maps enable you to have multiple identifiers for any subject.

Means old references to old identifiers still work.

Glad we dodged having to redo and reproof all those old connections.

Aren’t you?

Visual Clues: A Brain “feature,” not a “bug”

Saturday, September 29th, 2012

You will read in When Your Eyes Tell Your Hands What to Think: You’re Far Less in Control of Your Brain Than You Think that:

You’ve probably never given much thought to the fact that picking up your cup of morning coffee presents your brain with a set of complex decisions. You need to decide how to aim your hand, grasp the handle and raise the cup to your mouth, all without spilling the contents on your lap.

A new Northwestern University study shows that, not only does your brain handle such complex decisions for you, it also hides information from you about how those decisions are made.

“Our study gives a salient example,” said Yangqing ‘Lucie’ Xu, lead author of the study and a doctoral candidate in psychology at Northwestern. “When you pick up an object, your brain automatically decides how to control your muscles based on what your eyes provide about the object’s shape. When you pick up a mug by the handle with your right hand, you need to add a clockwise twist to your grip to compensate for the extra weight that you see on the left side of the mug.

“We showed that the use of this visual information is so powerful and automatic that we cannot turn it off. When people see an object weighted in one direction, they actually can’t help but ‘feel’ the weight in that direction, even when they know that we’re tricking them,” Xu said. (emphasis added)

I never quite trusted my brain and now I have proof that it is untrustworthy. Hiding stuff indeed! 😉

But that’s the trick of subject identification/identity isn’t it?

That our brains “recognize” all manner of subjects without any effort on our part.

Another part of the effortless features of our brains. But it hides the information we need to integrate information stores from ourselves and others.

Or rather, making it more work than we are usually willing to devote to digging it out.

When called upon to be “explicit” about subject identification, or even worse, to imagine how other people identify subjects, we prefer to stay at home consuming passive entertainment.

Two quick points:

First, we need to think about how to incorporate this “feature” into delivery interfaces for users.

Second, what subjects would users pay others to mine/collate/identify for them? (Delivery being a separate issue.)

Author Identifiers [> one (1) identifier per subject]

Monday, September 10th, 2012

I happened upon an author who used an author identifier at their webpage.

From the page:

It is a long-term goal of arXiv to accurately identify and disambiguate all authors of all articles in arXiv. Such identification would provide accurate results for queries such as "show me all the other papers by the particular John Smith that wrote this paper", something that can be done only approximately with text-based searches. It would also permit construction of an author-article graph which is useful for relevance assessment and bibliometric analysis.

Since 2005 arXiv has used authority records that associate user accounts with articles authored by that user. These records support the endorsement system. The use of public author identifiers as a way to build services upon this data is new in 2009. Initially, users must opt-in to have a public author identifier and to expose the record of their articles on arXiv for use in other services. At some later date we hope to be able to improve our authority records to the point where we can create public author identifiers for all authors of arXiv articles without needing to enlist the help of each author to check their record before opting in.

The services we offer based on author identifiers are:

Significant enough in its own right but note the plans for the future:

The following enhancements and interoperability features are planned:

  • arXiv will permit authors to record other identifiers they have in other schemes and include these in the data feeds. This will allow agents and systems to link together the same author in different databases.
  • arXiv will support mechanisms for handling name changes, combination of accidentally created duplicates and separation of accidentally combined identifiers.

Recording other identifiers? What? Acknowledge that there can be more than one identifier (yours) per subject? Blasphemy! 😉


I am sure suggestions, support, contributions, etc., would be most welcome.
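A sketch of what linking the same author across schemes might look like (the identifier values are made up; only the idea of matching on any shared scheme comes from the arXiv plan):

```python
# Hypothetical author records carrying identifiers from several schemes,
# in the spirit of arXiv's planned "record other identifiers" feature.
authors = [
    {"name": "J. Smith", "ids": {"arxiv": "smith_j_1", "orcid": "0000-0000-0000-0001"}},
    {"name": "John Smith", "ids": {"orcid": "0000-0000-0000-0001", "dblp": "42/1234"}},
    {"name": "J. Smith", "ids": {"arxiv": "smith_j_2"}},
]

def same_author(a, b):
    """Two records denote the same author if any scheme's identifier matches."""
    return any(a["ids"].get(scheme) == ident for scheme, ident in b["ids"].items())

assert same_author(authors[0], authors[1])      # linked through the shared ORCID
assert not same_author(authors[0], authors[2])  # no identifier in common
```

Notice that the two linked records carry different name strings; it is the multiple identifiers per subject, not the names, that do the work.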

Data and Reality

Thursday, March 15th, 2012

Data and Reality: A Timeless Perspective on Data Management by Steve Hoberman.

I remember William Kent, the original author of “Data and Reality,” from a presentation he made in 2003, entitled: “The unsolvable identity problem.”

His abstract there read:

The identity problem is intractable. To shed light on the problem, which currently is a swirl of interlocking problems that tend to get tumbled together in any discussion, we separate out the various issues so they can be rationally addressed one at a time as much as possible. We explore various aspects of the problem, pick one aspect to focus on, pose an idealized theoretical solution, and then explore the factors rendering this solution impractical. The success of this endeavor depends on our agreement that the selected aspect is a good one to focus on, and that the idealized solution represents a desirable target to try to approximate as well as we can. If we achieve consensus here, then we at least have a unifying framework for coordinating the various partial solutions to fragments of the problem.

I haven’t read the “new” version of “Data and Reality” (just ordered a copy) but I don’t recall the original needing much in the way of changes.

The original carried much the same message, that all of our solutions are partial even within a domain, temporary, chronologically speaking, and at best “useful” for some particular purpose. I rather doubt you will find that degree of uncertainty being confessed by the purveyors of any current semantic solution.

I did pull my second edition off the shelf and with free shipping (5-8 days), I should have time to go over my notes and highlights before the “new” version appears.

More to follow.

Then BI and Data Science Thinking Are Flawed, Too

Tuesday, March 13th, 2012

Then BI and Data Science Thinking Are Flawed, Too

Steve Miller writes:

I just finished an informative read entitled “Everything is Obvious: *Once You Know the Answer – How Common Sense Fails Us,” by social scientist Duncan Watts.

Regular readers of Open Thoughts on Analytics won’t be surprised I found a book with a title like this noteworthy. I’ve written quite a bit over the years on challenges we face trying to be the rational, objective, non-biased actors and decision-makers we think we are.

So why is a book outlining the weaknesses of day-to-day, common sense thinking important for business intelligence and data science? Because both BI and DS are driven from a science of business framework that formulates and tests hypotheses on the causes and effects of business operations. If the thinking that produces that testable understanding is flawed, then so will be the resulting BI and DS.

According to Watts, common sense is “exquisitely adapted to handling the kind of complexity that arises in everyday situations … But ‘situations’ involving corporations, cultures, markets, nation-states, and global institutions exhibit a very different kind of complexity from everyday situations. And under these circumstances, common sense turns out to suffer from a number of errors that systematically mislead us. Yet because of the way we learn from experience … the failings of commonsense reasoning are rarely apparent to us … The paradox of common sense, therefore, is that even as it helps us make sense of the world, it can actively undermine our ability to understand it.”

The author argues that common sense explanations to complex behavior fail in three ways. The first error is that the mental model of individual behavior is systematically flawed. The second centers on explanations for collective behavior that are even worse, often missing the “emergence” – one plus one equals three – of social behavior. And finally, “we learn less from history than we think we do, and that misperception skews our perception of the future.”

Reminds me of Thinking, Fast and Slow by Daniel Kahneman.

Not that two books with a similar “take” proves anything but you should put them on your reading list.

I wonder when/where our perceptions of CS practices have been skewed?

Or where that has played a role in our decision making about information systems?

Identity – The Philosophical Challenge For the Web

Sunday, February 19th, 2012

Identity – The Philosophical Challenge For the Web by Matthew Hurst.

From the post:

I work in local search at Microsoft which means, like all those working in this space, I have to deal with an identity crisis on a daily basis. Currently, most local search products – like Bing’s and Google’s – leverage multiple data sets to derive a digital model of the world that users can then interact with. In creating this digital model, multiple statements have to be conflated to form a unified representation. This can be extremely challenging for two reasons. Firstly, the system has to decided when two records are intended to denote the same real world entity. Secondly, the designers of the system have to determine what real world entities are and how to describe them.

For example, if a business moves, is that the same business or the closure of one and the opening of another? What does it mean to categorize a business? The cafe in Barnes and Noble is branded Starbucks but isn’t actually part of the Starbucks chain – should it surface as a separate entity or is it ‘hidden’ within the bookshop as an attribute (‘has cafe’)?

Thinking through these hard representational problems is as much part of the transformative trends going on in the tech industry as are those characterized by terms like ‘big data’ and ‘data scientist’.

Questions of identity and how to resolve multiple references to the same entity have been debated at least since the time of the Greek philosophers. See Identity (Wikipedia page, and the references on the various related pages).

This “philosophical challenge” has been going on for a very long time and so far I haven’t seen any demonstrations that the Web raises new questions.

You need to read Matthew’s identity example in his post.

The songs in question could be said to be instances of the same subject and a reference to that subject would be satisfied with any of those instances. From another point of view, the origin of the instances could be said to distinguish them into different subjects, say for proof of licensing purposes. Other view points are possible. Depends upon the purpose of your criteria of identification.

SACO: Subject Authority Cooperative Program of the PCC

Friday, February 10th, 2012

SACO: Subject Authority Cooperative Program of the PCC

SACO was established to allow libraries to contribute proposed subject headings to the Library of Congress.

Of particular interest is: Web Resources for SACO Proposals by Adam L. Schiff.

It is a very rich source of reference materials that you may find useful in developing subject heading proposals or subject classifications for other uses (such as topic maps).

But don’t neglect the materials you find on the SACO homepage.

Thinking, Fast and Slow

Tuesday, December 27th, 2011

Thinking, Fast and Slow by Daniel Kahneman, Farrar, Straus and Giroux, New York, 2011.

I got a copy of “Thinking, Fast and Slow” for Christmas and it has already proven to be an enjoyable read.

Kahneman says early on (page 28):

The premise of this book is that it is easier to recognize other people’s mistakes than our own.

I thought about that line when I read a note from a friend that topic maps needed more than my:

tagging everything with “Topic Maps….”

Which means I haven’t been clear about the reasons for the breadth of materials I have covered and will be covering in this blog.

One premise of this blog is that the use and recognition of identifiers is essential for communication.

Another premise of this blog is that it is easier for us to study the use and recognition of identifiers by others, much for the same reasons we can recognize the mistakes of others more easily.

The use and recognition of identifiers by others aren’t mistakes but they may be different from those we would make. In cases where they differ from ours, we have a unique opportunity to study the choices made and the impacts of those choices. And we may learn patterns in those choices that we can eventually see in our own choices.

Understanding the use and recognition of identifiers in a particular circumstance, and the requirements for that use and recognition, is the first step towards deciding whether topic maps would be useful in that circumstance and, if so, in what way.

For example, when processing social security records in the United States, anything other than “bare” identifiers like a social security number may be unnecessary, adding load with no corresponding benefit. Aligning social security records with bank records, on the other hand, might require reconsidering the judgment to use only social security numbers. (Some information sharing is “against the law.” But as the Sheriff in “O Brother, Where Art Thou?” says: “The law is a man made thing.” Laws change, or you can commission absurdist interpretations of them.)

Topic maps aren’t everywhere but identifiers and recognition of identifiers are.

Understanding identifiers and their recognition will help you choose the most appropriate solution to a problem.

a speed gun for spam

Tuesday, October 25th, 2011

a speed gun for spam

From the post:

Apart from the content there are various features from metadata (like IP etc) which can help tell a spammer and regular user apart. Following are results of some data analysis (done on roughly 8000+ comments) which speak of another feature which proves to be a good discriminator. Hopefully this will aid others fighting spam/abuse (if not already using a similar feature).

(graph omitted)

The discriminator referred above is typing speed. The graph above plots the content length of a comment posted by a user against the (approximate) time he took to write it. If a user posts more than one comment in window of 5-10 minutes, we can consider those comments as consecutive posts. …

An illustration that subject identity tests are limited only by your imagination. From what I understand, very few spammers self-identify using OWL and URLs. So, as in this case, you need other tests to separate them out.

A follow-up on this would be to see if particular spammers have speed patterns in their posts or, searching more broadly across a set of blogs, a particular pattern: that is, they start with blog X and then move down the line. That could be useful for dynamically configuring firewalls to block further content after they hit the first blog.

You have heard that passwords + keying patterns are used for personal identity?
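The typing-speed discriminator could be sketched roughly as follows (the field names, the 10-minute window, and the speed cutoff are all hypothetical assumptions; the post gives no code):

```python
from datetime import datetime, timedelta

def typing_speed(prev_comment, next_comment, window=timedelta(minutes=10)):
    """Approximate chars/sec for next_comment, assuming the user began
    writing it right after posting prev_comment. Both arguments are
    dicts with 'user', 'time', and 'text' keys (hypothetical fields)."""
    if prev_comment["user"] != next_comment["user"]:
        return None
    elapsed = next_comment["time"] - prev_comment["time"]
    if elapsed <= timedelta(0) or elapsed > window:
        return None  # not consecutive posts within the window
    return len(next_comment["text"]) / elapsed.total_seconds()

def looks_like_spam(speed, human_max_cps=15.0):
    # Hypothetical cutoff: sustained speeds well above plausible human
    # typing rates suggest pasted or bot-generated content.
    return speed is not None and speed > human_max_cps

a = {"user": "u1", "time": datetime(2011, 10, 25, 12, 0, 0), "text": "x" * 40}
b = {"user": "u1", "time": datetime(2011, 10, 25, 12, 0, 2), "text": "y" * 600}
print(looks_like_spam(typing_speed(a, b)))  # 300 chars/sec -> True
```

The same per-user speed profile could then be compared across blogs to look for the cross-site patterns mentioned above.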

ORCID (Open Researcher & Contributor ID)

Saturday, September 24th, 2011

ORCID (Open Researcher & Contributor ID)

From the About page:

ORCID, Inc. is a non-profit organization dedicated to solving the name ambiguity problem in scholarly research and brings together the leaders of the most influential universities, funding organizations, societies, publishers and corporations from around the globe. The ideal solution is to establish a registry that is adopted and embraced as the de facto standard by the whole of the community. A resolution to the systemic name ambiguity problem, by means of assigning unique identifiers linkable to an individual’s research output, will enhance the scientific discovery process and improve the efficiency of funding and collaboration. The organization is managed by a fourteen member Board of Directors.

ORCID’s principles will guide the initiative as it grows and operates. The principles confirm our commitment to open access, global communication, and researcher privacy.

Accurate identification of researchers and their work is one of the pillars for the transition from science to e-Science, wherein scholarly publications can be mined to spot links and ideas hidden in the ever-growing volume of scholarly literature. A disambiguated set of authors will allow new services and benefits to be built for the research community by all stakeholders in scholarly communication: from commercial actors to non-profit organizations, from governments to universities.

Thomson Reuters and Nature Publishing Group convened the first Name Identifier Summit in Cambridge, MA in November 2009, where a cross-section of the research community explored approaches to address name ambiguity. The ORCID initiative officially launched as a non-profit organization in August 2010 and is moving ahead with broad stakeholder participation (view participant gallery). As ORCID develops, we plan to engage researchers and other community members directly via social media and other activity. Participation from all stakeholders at all levels is essential to fulfilling the Initiative’s mission.

I am not altogether certain that elimination of ambiguity in identification will enable “…min[ing] to spot links and ideas hidden in the ever-growing volume of scholarly literature.” Or should I say there is no demonstrated connection between unambiguous identification of researchers and such gains?

True enough, the claim is made but I thought science was based on evidence, not simply making claims.

And, like most researchers, I have discovered unexpected riches when mistaking one researcher’s name for another’s. Reducing ambiguity in identification will reduce the incidence of, well, ambiguity in identification.

Jack Park forwarded this link to me.

Summing up Properties with subjectIdentifiers/URLs?

Thursday, September 8th, 2011

I was picking tomatoes in the garden when I thought about telling Carol (my wife) the plants are about to stop producing.

Those plants are at a particular address, in the backyard, middle garden bed of three, are of three different varieties, but I am going to sum up those properties by saying: “The tomatoes are about to stop producing.”

It occurred to me that a subjectIdentifier could be assigned to a topic element on the basis of summing up properties of the topic.* That would have the advantage of enabling merging on the basis of subjectIdentifiers as opposed to more complex tests upon properties of a topic.

Disclosure of the basis for assignment of a subjectIdentifier is an interesting question.

It could be that a service wishes to produce subjectIdentifiers and index information based upon complex property measures, producing, for consumption, the subjectIdentifiers and merge-capable indexes on one or more information sets. The basis for merging would be the competitive edge offered by the service.

A vendor promoting merging with its process or format, seeking to become the TCP/IP of some area, will supply the basis for merging and the tools to assist with it.

Or if you are an intelligence agency and you want an inward and outward facing interface that promotes merging of information but does not disclose your internal basis for identification, variants of this technique may be of interest.

*The notion of summing up imposes no prior constraints on the tests used or the location of the information subjected to those tests.
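One way to sketch the idea (the hashing scheme, base IRI, and property names are invented for illustration, not from any topic maps standard): canonicalize the chosen properties and derive a subjectIdentifier from their digest, so that merging reduces to comparing IRIs.

```python
import hashlib

def subject_identifier(props, identifying_keys, base="http://example.com/psi/"):
    """Derive a subjectIdentifier by summing up selected properties of a
    topic. The sorted-key canonical form makes the result independent of
    the order in which the identifying properties are listed."""
    summary = "|".join(f"{k}={props[k]}" for k in sorted(identifying_keys))
    digest = hashlib.sha256(summary.encode("utf-8")).hexdigest()[:16]
    return base + digest

bed = {"address": "123 Main St", "location": "backyard, middle bed of three",
       "varieties": 3, "status": "about to stop producing"}
# Two topics that agree on the summed-up properties now merge on simple
# IRI equality, instead of complex tests upon the properties themselves.
iri = subject_identifier(bed, ["address", "location"])
```

Whether to disclose which properties went into the digest is exactly the disclosure question raised above.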

Sowa on Watson

Friday, February 11th, 2011

John Sowa’s posting on Watson merits reproduction in its entirety (light editing to make it format for easy reading):


Thanks for the reminder:

Dave Ferrucci gave a talk on UIMA (the Unstructured Information Management Architecture) back in May-2006, entitled: “Putting the Semantics in the Semantic Web: An overview of UIMA and its role in Accelerating the Semantic Revolution”

I recommend that readers compare Ferrucci’s talk about UIMA in 2006 with his talk about the Watson system and Jeopardy in 2011. In less than 5 years, they built Watson on the UIMA foundation, which contained a reasonable amount of NLP tools, a modest ontology, and some useful tools for knowledge acquisition. During that time, they added quite a bit of machine learning, reasoning, statistics, and heuristics. But most of all, they added terabytes of documents.

For the record, following are Ferrucci’s slides from 2006: DavidFerrucci_20060511.pdf

Following is the talk that explains the slides: DavidFerrucci_20060511_Recording-2914992-460237.mp3

And following is his recent talk about the DeepQA project for building and extending that foundation for Jeopardy:

Compared to Ferrucci’s talks, the PBS Nova program was a disappointment. It didn’t get into any technical detail, but it did have a few cameo appearances from AI researchers. Terry Winograd and Pat Winston, for example, said that the problem of language understanding is hard.

But I thought that Marvin Minsky and Doug Lenat said more with their tone of voice than with their words. My interpretation (which could, of course, be wrong) is that both of them were seething with jealousy that IBM built a system that was competing with Jeopardy champions on national TV — and without their help.

In any case, the Watson project shows that terabytes of documents are far more important for commonsense reasoning than the millions of formal axioms in Cyc. That does not mean that the Cyc ontology is useless, but it undermines the original assumptions for the Cyc project: commonsense reasoning requires a huge knowledge base of hand-coded axioms together with a powerful inference engine.

An important observation by Ferrucci: The URIs of the Semantic Web are *not* useful for processing natural languages — not for ordinary documents, not for scientific documents, and especially not for Jeopardy questions:

1. For scientific documents, words like ‘H2O’ are excellent URIs. Adding an http address in front of them is pointless.

2. A word like ‘water’, which is sometimes a synonym for ‘H2O’, has an open-ended number of senses and microsenses.

3. Even if every microsense could be precisely defined and cataloged on the WWW, that wouldn’t help determine which one is appropriate for any particular context.

4. Any attempt to force human being(s) to specify or select a precise sense cannot succeed unless *every* human understands and consistently selects the correct sense at *every* possible occasion.

5. Given that point #4 is impossible to enforce and dangerous to assume, any software that uses URIs will have to verify that the selected sense is appropriate to the context.

6. Therefore, URIs found “in the wild” on the WWW can never be assumed to be correct unless they have been guaranteed to be correct by a trusted source.

These points taken together imply that annotations on documents can’t be trusted unless (a) they have been generated by your own system or (b) they were generated by a system which is at least as trustworthy as your own and which has been verified to be 100% compatible with yours.

In summary, the underlying assumptions for both Cyc and the Semantic Web need to be reconsidered.

You can see the post at:

I don’t always agree with Sowa but he has written extensively on conceptual graphs, knowledge representation and ontological matters. See

I missed the local showing but found the video at: Smartest Machine on Earth.

You will find a link to an interview with Minsky at that same location.

I don’t know that I would describe Minsky as “…seething with jealousy….”

While I enjoy Jeopardy and it is certainly more cerebral than say American Idol, I think Minsky is right in seeing the Watson effort as something other than artificial intelligence.

Q: In 2011, who was the only non-sentient contestant on the TV show Jeopardy?

A: What is IBM’s Watson?

Names, Identifiers, LOD, and the Semantic Web

Sunday, November 28th, 2010

I have been watching the identifier debate in the LOD community with its revisionists, personal accounts and other takes on what the problem is, if there is a problem and how to solve the problem if there is one.

I have a slightly different question: What happens when we have a name/identifier?

Short of being present when someone points to or touches an object, themselves, you (if the TSA) and says a name or identifier, what happens?

Try this experiment. Take a sheet of paper and write: George W. Bush.

Now write 10 facts about George W. Bush.

Please circle the ones you think must match to identify George W. Bush.

So, even though you knew the name George W. Bush, isn’t it fair to say that the circled facts are what you would use to identify George W. Bush?

Here’s the fun part: Get a colleague or co-worker to do the same experiment. (Substitute Lady Gaga if your friends don’t know enough facts about George W. Bush.)

Now compare several sets of answers for the same person.

Working from the same name, you most likely listed different facts and different ones you would use to identify that subject.

Even though most of you would agree that some or all of the facts listed go with that person.

It sounds like even though we use identifiers/names, those just clue us in on facts, some of which we use to make the identification.

That’s the problem isn’t it?

A name or identifier can make us think of different facts (possibly identifying different subjects) and even if the same subject, we may use different facts to identify the subject.

Assuming we are looking at a set of facts (RDF graph, whatever), we need to know: What facts identify the subject?

And a subject may have different identifying properties, depending on the context of identification.


  1. How to specify essential facts for identification as opposed to the extra ones?
  2. How to answer #1 for an RDF graph?
  3. How do you make others aware of your answer in #2?
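The experiment above can be sketched as a toy co-reference test (the structure and the facts are hypothetical, chosen only to illustrate): only the facts each person circles as identifying are used to decide whether two fact sets denote the same subject.

```python
def same_subject(facts_a, circled_a, facts_b, circled_b):
    """Two fact sets denote the same subject only if each side's
    circled (identifying) facts are satisfied by the other side."""
    return circled_a <= facts_b and circled_b <= facts_a

alice = {"43rd US president", "born 1946", "Texas governor", "painter"}
bob = {"43rd US president", "born 1946", "Yale graduate"}

# Alice and Bob listed different facts and circled different ones:
print(same_subject(alice, {"43rd US president"},
                   bob, {"43rd US president", "born 1946"}))  # True
print(same_subject(alice, {"painter"},
                   bob, {"Yale graduate"}))                   # False
```

The second call fails not because the subjects differ, but because each side's identifying facts are missing from the other's list, which is question #1 in miniature.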


Subject Identification Patterns

Thursday, November 4th, 2010

Does that sound like a good book title?

Thinking that since everyone is recycling old stuff under the patterns rubric that topic maps may as well jump on the bandwagon.

Instead of the three amigos (was that a movie?) we could have the dirty dozen honchos (or was that another movie?). I don’t get out much these days so I would probably need some help with current cultural references.

This ties into Lars Heuer’s effort to distinguish between Playboy Playmates and Astronauts, while trying to figure out why birds keep, well, let’s just say he has to wash his hair a lot.

When you have an entry from DBpedia, what do you have to know to identify it? Its URI is one thing but I rarely encounter URIs while shopping. (Or playmates for that matter.)

PSIs Going Viral?

Tuesday, October 26th, 2010

Publishing subject identifiers with node makes me wonder if PSIs (Published Subject Identifiers) are about to go viral?

This server software, written in JavaScript, is an early release and needs features and bug fixes (feel free to contribute comments/fixes).

As it matures we could see a proliferation of PSI servers.

Key to that is downloading, installing, breaking, complaining, return to downloading. 😉

As Graham Moore says on TopicMapMail: “This is very cool.”

Semantic Drift: What Are Linked Data/RDF and TMDM Topic Maps Missing?

Wednesday, October 13th, 2010

One RDF approach to semantic drift is to situate a vocabulary among other terms.

TMDM topic maps enable a user to gather up information that they considered as identifying the subject in question.

Additional information helps to identify a particular subject. (RDF/TMDM approaches)

Isn’t that the opposite of semantic drift?

What’s happening in both cases?

The RDF approach is guessing that it has the sense of the word as used by the author (if the right word at all).

Kelb reports approximately 48% precision.

So in 1 out of 2 emergency room situations we get the right term? (Not to knock Kelb’s work. It is an important approach that needs further development.)

Topic maps are guessing as well.

We don’t know what information in a subject identifier identifies a subject. Some of it? All of it? Under what circumstances?

Question: What information identifies a subject, at least to its author?

Answer: Ask the Author.

Asking authors what information identifies their subject(s) seems like an overlooked approach.

Domain-specific vocabularies with additional information about subjects, indicating which information identifies a subject versus merely supplements it, would be a good start.

That avoids inline syntax difficulties and enables authors to easily and quickly associate subject identification information with their documents.

Both RDF and TMDM Topic Maps could use the same vocabularies to improve their handling of associated document content.
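A minimal sketch of such a vocabulary (terms and structure are invented for illustration): each term declares whether it is identifying or supplemental, and any consumer, RDF or TMDM, applies the same filter before comparing subjects.

```python
# Hypothetical domain vocabulary: each term says whether it is
# identifying for its subjects or merely supplemental.
vocabulary = {
    "chemical-formula": {"role": "identifying"},
    "common-name": {"role": "supplemental"},
    "boiling-point": {"role": "supplemental"},
}

def identifying_info(statements):
    """Keep only the statements the vocabulary marks as identifying.
    An RDF pipeline and a TMDM topic map engine could apply this same
    filter before deciding whether two subjects are the same."""
    return {k: v for k, v in statements.items()
            if vocabulary.get(k, {}).get("role") == "identifying"}

water = {"chemical-formula": "H2O", "common-name": "water",
         "boiling-point": "100 C"}
print(identifying_info(water))  # {'chemical-formula': 'H2O'}
```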

Semantic Drift: A Topic Map Answer (sort-of)

Tuesday, October 12th, 2010

Topic maps took a different approach to the problem of identifying subjects (than RDF) and so looks at semantic drift differently.

In the original 13250, subject descriptor was defined as:

3.19 subject descriptor – Information which is intended to provide a positive, unambiguous indication of the identity of a subject, and which is the referent of an identity attribute of a topic link.

When 13250 was reformulated to focus on the XTM syntax and the legend known as the Topic Maps Data Model (TMDM), the subject descriptor of old became subject identifiers. (Clause 7, TMDM)

A subject identifier has information that identifies a subject.

The author of a topic uses information that identifies a subject to create a subject identifier. (Which is represented in a topic map by an IRI.)

Anyone can look at the subject identifier to see if they are talking about the same subject.

They are responsible for catching semantic drift if it occurs.

But, there is something missing from RDF and topic maps.

Something that would help with semantic drift, although they would use it differently.

Care to take a guess?

Public Interchangeable Identifier

Thursday, October 7th, 2010

I mentioned yesterday, in Recognizing an Interchangeable Identifier, that creating a public interchangeable identifier isn’t as easy as identifying identifiers and documenting them publicly.

What if I identified (by some means) “Patrick” as an identifier and posted it to my website (public documentation).

Is that now a “public interchangeable identifier?”

No. Why?

First, there has to be some agreed upon means to declare an identifier to be an identifier. When I say agreed upon, it need not be something as formal as a standard but it has to be recognized by a community of users.

Second, it is important to know in what context something is an identifier. Akin to what we talk about as “scope” in topic maps, but with the recognition that the notion of “unconstrained” scope is a pernicious fiction. Scope may be unspecified but it is never unconstrained.

I would argue that no identifier exists without some defined scope. It may not be known or specified but the essence of an identifier, that it identifies some subject, exists only within some scope.

More on means to declare identifiers and their context anon.
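As a rough sketch of where these posts seem headed (the registry structure and the scope values are entirely hypothetical): an identifier is declared as such, “published” in a registry, and carries an explicit scope.

```python
registry = {}  # token -> declaration; a stand-in for public documentation

def declare_identifier(token, subject, scope):
    """Declare a token to be an identifier (the agreed-upon means)
    and 'publish' the declaration. Scope must be stated, since scope
    may be unspecified in practice but is never unconstrained."""
    registry[token] = {"subject": subject, "scope": scope}

def resolve(token, scope):
    """Look a token up within a scope; the same token may identify
    something else, or nothing, in another scope."""
    decl = registry.get(token)
    if decl and decl["scope"] == scope:
        return decl["subject"]
    return None

declare_identifier("Patrick", "a particular person", scope="this weblog")
print(resolve("Patrick", "this weblog"))        # "a particular person"
print(resolve("Patrick", "a parish registry"))  # None
```

Merely posting “Patrick” to a website fails both steps: nothing marks it as an identifier, and no scope is declared.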

Recognizing an Interchangeable Identifier

Wednesday, October 6th, 2010

Subjects & Identifiers shows why we need interchangeable identifiers.

Q: How would you recognize an interchangeable identifier?

A: Oh, yeah, that’s right. Anything we can talk about has an identifier, so how to recognize an interchangeable identifier?

If two people agree on column headers for a database table, they have interchangeable identifiers for the columns, at least between the two of them.

There are two requirements for interchangeable identifiers:

  1. Identification as an identifier.
  2. Notice of the identifier.

Any token can be an identifier under some circumstances so identifiers must be identified for interchange.

Notice of an identifier is usually a matter of being part of a profession or discipline. Some term is an identifier because it was taught to you as one.

That works for local interchange, but public interchange requires publicly documented identifiers.

That’s it. Identify identifiers and document the identifiers publicly and you will have public interchangeable identifiers.

It can’t be that simple? Well, truthfully, it’s not.

More on public interchangeable identifiers forthcoming.

The General Case

Sunday, September 26th, 2010

The SciDB project illustrates that there is no general case solution for semantic identity.

If we distinguish between IRIs as addresses versus IRIs as identifiers, IRIs are useful for some cases of semantic identity. (IRIs can be used even if you don’t make that distinction, but they are less useful.)

But can you imagine an IRI for each tuple of values in the some 15 petabytes of data produced annually by the Large Hadron Collider? It may be very important to identify any number of those tuples, such as if (not when) they discover the Higgs boson.

Those tuples have semantic identity, as do subjects composed of those tuples.

Rather than seeking general solutions for all semantic identity, perhaps we should find solutions that work for particular cases.

A Logical Account of Lying

Friday, September 17th, 2010

A Logical Account of Lying by Chiaki Sakama, Martin Caminada and Andreas Herzig. Keywords: lying, lies, argumentation systems, artificial intelligence, multiagent systems, intelligent agents.


This paper aims at providing a formal account of lying – a dishonest attitude of human beings. We first formulate lying under propositional modal logic and present basic properties for it. We then investigate why one engages in lying and how one reasons about lying. We distinguish between offensive and defensive lies, or deductive and abductive lies, based on intention behind the act. We also study two weak forms of dishonesty, bullshit and deception, and provide their logical features in contrast to lying. We finally argue dishonesty postulates that agents should try to satisfy for both moral and self-interested reasons. (emphasis in original)

Be the first to have your topic map distinguish between:

  • offensive lies
  • defensive lies
  • deductive lies
  • abductive lies (Someone tweet John Sowa please.)
  • deception
  • bullshit (an identifier exists for the subject “bullshit,” but it does not reflect this latest analysis)

Lost In Translation – Article

Sunday, July 25th, 2010

Lost In Translation is a summary of recent research on language and its impact on our thinking by Lera Boroditsky (Professor of psychology at Stanford University and editor in chief of Frontiers in Cultural Psychology).

Read the article for the details but concepts such as causality, space and others aren’t as fixed as you may have thought.

Another teaser:

It turns out that if you change how people talk, that changes how they think. If people learn another language, they inadvertently also learn a new way of looking at the world. When bilingual people switch from one language to another, they start thinking differently, too.

Topic maps show different ways to identify the same subject. Put enough alternative identifications together and you will learn to think in another language.

Question: Should topic maps come with the following warning?

Caution: Topic Map – You May Start Thinking Differently