Archive for the ‘TMRM’ Category

SQL, NoSQL =? CoSQL? Category Theory to the Rescue

Wednesday, January 30th, 2013

A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman.

I missed this when it appeared in March of 2011.

From the conclusion:

The nascent noSQL market is extremely fragmented, with many competing vendors and technologies. Programming, deploying, and managing noSQL solutions requires specialized and low-level knowledge that does not easily carry over from one vendor’s product to another.

A necessary condition for the network effect to take off in the noSQL database market is the availability of a common abstract mathematical data model and an associated query language for noSQL that removes product differentiation at the logical level and instead shifts competition to the physical and operational level. The availability of such a common mathematical underpinning of all major noSQL databases can provide enough critical mass to convince businesses, developers, educational institutions, etc. to invest in noSQL.

In this article we developed a mathematical data model for the most common form of noSQL—namely, key-value stores as the mathematical dual of SQL’s foreign-/primary-key stores. Because of this deep and beautiful connection, we propose changing the name of noSQL to coSQL. Moreover, we show that monads and monad comprehensions (i.e., LINQ) provide a common query mechanism for both SQL and coSQL and that many of the strengths and weaknesses of SQL and coSQL naturally follow from the mathematics.

The ACM Digital Library reports only 3 citations, which is unfortunate for such an interesting proposal.

I have heard about key/value pairs somewhere else. I will have to think about that and get back to you. (Hint for the uninitiated, try the Topic Maps Reference Model (TMRM). A new draft of the TMRM is due to appear in a week or so.)
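A rough sketch of the duality the conclusion describes, in Python. The data and function names are made up for illustration and do not come from the paper. In the SQL style a child row points at its parent through a foreign key; in the key-value style the parent points at its children through keys. A list comprehension stands in for the monad comprehension, and the same query shape works over both.

```python
# Hypothetical data; none of it comes from the paper.
# SQL style: child rows hold a foreign key that points at the parent's primary key.
products = [{"id": 1, "title": "Widget"}]
ratings = [
    {"id": 10, "product_id": 1, "stars": 4},
    {"id": 11, "product_id": 1, "stars": 5},
]

# Key-value (coSQL) style: the parent holds keys that point at its children.
store = {
    "p1": {"title": "Widget", "ratings": ["r10", "r11"]},
    "r10": {"stars": 4},
    "r11": {"stars": 5},
}


def stars_sql(title):
    """Join children to the parent by matching foreign key to primary key."""
    return [
        r["stars"]
        for p in products if p["title"] == title
        for r in ratings if r["product_id"] == p["id"]
    ]


def stars_cosql(title):
    """Follow keys from the parent to its children."""
    return [
        store[key]["stars"]
        for value in store.values() if value.get("title") == title
        for key in value["ratings"]
    ]
```

The arrows run in opposite directions in the two layouts, but both queries are comprehensions of the same form.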

Appropriating IT: Glue Steps [Gluing Subject Representatives Together?]

Tuesday, October 9th, 2012

Appropriating IT: Glue Steps by Tony Hirst.

Over the years, I’ve been fortunate enough to have been gifted some very evocative, and powerful, ideas that immediately appealed to me when I first heard them and that I’ve been able to draw on, reuse and repurpose over and over again. One such example is “glue logic”, introduced to me by my original OU PhD supervisor George Kiss. The idea of glue logic is to provide a means by which two digital electronic circuits (two “logic” circuits) that don’t share a common interface can be “glued” together.

(diagrams and other material omitted)

This idea is powerful enough in its own right, but there was a second bit to it that made it really remarkable: the circuitry typically used to create the glue logic was a device known as a Field Programmable Gate Array, or FPGA. This is a type of digital circuit whose logical function can be configured, or programmed. That is, I can take my “shapeless” FPGA, and programme it so that it physically implements a particular digital circuit. Just think about that for a moment… You probably have a vague idea that the same computer can be reprogrammed to do particular things, using some vaguely mysterious and magical thing called software, instructions that computer processors follow in order to do incredible things. With an FPGA, the software actually changes the hardware: there is no processor that “runs a programme”; when you programme an FPGA, you change its hardware. FPGAs are, literally, programmable chips. (If you imagine digital circuits to be like bits of plastic, an FPGA is like polymorph.)

The notion of glue logic has stuck with me for two reasons, I think: firstly, because of what it made possible, the idea of flexibly creating an interface between two otherwise incompatible components; secondly, because of the way in which it could be achieved – using a flexible, repurposable, reprogrammable device – one that you could easily reprogramme if the mapping from one device to another wasn’t quite working properly.

If instead of “don’t share a common interface” you read “semantic diversity” and in place of Field Programmable Gate Array, or FPGA, you read “legend,” to “creat[e] an interface between two otherwise incompatible [subject representatives],” you would think Tony’s post was about the topic maps reference model.

Well, this post is and Tony’s is very close.

Particularly the part about being a “reprogrammable device.”

I can tell you: “black” = “schwarz,” but without more, you won’t be able to rely on or extend that statement.

For that, you need a “reprogrammable device” and some basis on which to do the reprogramming.

Legends anyone?

Subject Normalization

Thursday, September 29th, 2011

Another way to explain topic maps is in terms of Database normalization, except that I would call it subject normalization. That is, every subject that is explicitly represented in the topic map appears once and only once, with relations to other subjects recast to point to this single representative and all properties of the subject gathered in that one place.

One obvious advantage is that the shipping and accounting departments, for example, both have access to updated information for a customer as soon as entered by the other. And although they may gather different information about a customer, that information can be (doesn’t have to be) available to both of them.

Unlike database normalization, subject normalization in topic maps does not require rewriting of database tables, which can cause data access problems. Subject normalization (merging) occurs automatically, based on the presence of properties defined by the Topic Maps Data Model (TMDM).

And unlike owl:sameAs, subject normalization in topic maps does not require knowledge of the “other” subject representative. That is, I can insert an identifier that I know has been used for a subject, without knowing whether it has been used in this topic map, and topics representing that subject will automatically merge (or be normalized).
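A minimal sketch of that automatic merging, assuming a much simplified model in which a topic is just a set of identifiers plus a dictionary of properties. The TMDM merging rules are richer than this; the function name and the example identifiers are hypothetical.

```python
def merge_topics(topics):
    """Merge topics that share at least one identifier.

    Each topic is a dict with an "identifiers" set and a "properties" dict.
    When two merged topics disagree on a property, the later topic wins
    (an arbitrary choice made for this sketch).
    """
    merged = []
    for topic in topics:
        ids = set(topic["identifiers"])
        props = dict(topic["properties"])
        remaining = []
        for other in merged:
            if other["identifiers"] & ids:
                # Shared identifier: fold the earlier topic into this one.
                ids |= other["identifiers"]
                props = {**other["properties"], **props}
            else:
                remaining.append(other)
        remaining.append({"identifiers": ids, "properties": props})
        merged = remaining
    return merged


# Two departments describe the same customer without knowing about each other.
shipping = {
    "identifiers": {"http://example.org/customer/42"},
    "properties": {"address": "1 Main St"},
}
accounting = {
    "identifiers": {"http://example.org/customer/42", "urn:x-acct:0042"},
    "properties": {"balance": 100},
}
customers = merge_topics([shipping, accounting])
```

Neither department had to reference the other's topic; using the same identifier was enough to end up with one representative carrying both sets of properties.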

Subject normalization, in the terms of the TMDM, reduces the redundancy of information items. That is true enough, but it is not the primary experience of users with subject normalization. How many copies of a subject representative (information items) a system has is of little concern to an end-user.

What does concern end-users is getting the most complete and up-to-date information on a subject, however that is accomplished.

Topic maps accomplish that goal by empowering users to add identifiers to subject representatives that result in subject normalization. It doesn’t get any easier than that.

SAGA: A DSL for Story Management

Monday, September 12th, 2011

SAGA: A DSL for Story Management by Lucas Beyak and Jacques Carette (McMaster University).

Abstract:

Video game development is currently a very labour-intensive endeavour. Furthermore it involves multi-disciplinary teams of artistic content creators and programmers, whose typical working patterns are not easily meshed. SAGA is our first effort at augmenting the productivity of such teams.

Already convinced of the benefits of DSLs, we set out to analyze the domains present in games in order to find out which would be most amenable to the DSL approach. Based on previous work, we thus sought those sub-parts that already had a partially established vocabulary and at the same time could be well modeled using classical computer science structures. We settled on the ‘story’ aspect of video games as the best candidate domain, which can be modeled using state transition systems.

As we are working with a specific company as the ultimate customer for this work, an additional requirement was that our DSL should produce code that can be used within a pre-existing framework. We developed a full system (SAGA) comprised of a parser for a human-friendly language for ‘story events’, an internal representation of design patterns for implementing object-oriented state-transitions systems, an instantiator for these patterns for a specific ‘story’, and three renderers (for C++, C# and Java) for the instantiated abstract code.

I mention this only in part because of Jack Park’s long standing interest in narrative structures.

The other reason I mention this article is it is a model for how to transition between vocabularies in a useful way.

Transitioning between vocabularies is nearly as constant a theme in computer science as data storage. Not to mention that disciplines, domains, professions, etc., have been transitioning between vocabularies for thousands of years. Some more slowly than others; some terms in legal vocabularies date back centuries.

We need vocabularies and data structures, but with the realization that none of them are final. If you want blind interchange of topic maps I would strongly suggest that you use one of the standard syntaxes.

But with the realization that you will encounter data that isn’t in a standard topic map syntax. What subjects are represented there? How would you tell others about them? And those vocabularies are going to change over time, just as there were vocabularies before RDF and topic maps.

If you ask an honest MDM advocate, they will tell you that the current MDM effort is not really all that different from MDM in the ’90s. And MDM may be what you need; that depends on your requirements. (Sorry, master data management = MDM.)

The point being that there isn’t any place where a particular vocabulary or “solution” is going to freeze the creativity of users and even programmers, to say nothing of the rest of humanity. Change is the only constant and those who aren’t prepared to deal with it, will be the worse off for it.

Neo4j Enhanced API

Wednesday, August 10th, 2011

Neo4j Enhanced API

From the wiki page:

A Neo4J graph consists of the following element types:

  • Node
  • Relationship
  • RelationshipType
  • Property name
  • Property value

These five types of elements don’t share a common interface, except for Node and Relationship, which both extend the PropertyContainer interface.

The Enhanced API unifies all Neo4j elements under the common interface Vertex.

Which has the result:

Generalizations of database elements

The Vertex interface support methods for the manipulation of Properties and Edges, thereby providing all methods normally associated with Nodes. Properties and Edges (including their Types) are Vertices too. This allows for the generalization of all Neo4j database elements as if they were Nodes.

Due to generalization it is possible to create Edges involving regular Vertices, Edges of any kind, including BinaryEdges and Properties.

The generalization also makes it possible to set Properties on all Vertices. So it even becomes possible to set a Property on a Property.

Hmmm, properties on properties, where have I heard that? ;-)

Properties on properties and properties on values are what we need for robust data preservation, migration or even re-use.

I was reminded recently that SNOBOL turns 50 years old in 2012. Care to guess how many formats, schemas, data structures we have been through during just that time period? Some of them intended to be “legacy” formats, forever readable by those who follow. Except that people forget the meaning of the “properties” and their “values.”

If we had properties on properties and properties on values, we could at least record our present understandings of those items. And others could do the same to our properties and values.
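To make the idea concrete, here is a small Python sketch of a structure in which a property is itself something that can carry properties. This is not the Neo4j Enhanced API; the class names, method names and sample values are all hypothetical.

```python
class Vertex:
    """Anything that can carry properties, including a property itself."""

    def __init__(self, label):
        self.label = label
        self.properties = {}

    def set_property(self, name, value):
        """Attach a property and return it so it can be annotated in turn."""
        prop = Property(name, value)
        self.properties[name] = prop
        return prop


class Property(Vertex):
    """A name/value pair that is also a Vertex."""

    def __init__(self, name, value):
        super().__init__(name)
        self.value = value


# Record a value, then record our present understanding of what it means.
record = Vertex("personnel-record")
dob = record.set_property("dob", "1961-04-12")
dob.set_property("meaning", "date of birth, ISO 8601 calendar date")
dob.set_property("recorded_by", "archivist, 2011")
```

A later reader can follow the annotations on `dob` or ignore them, which is the kind of choice the post is after.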

Those mappings would not be universally useful to everyone. But if present, we would have the options to follow those mappings or not.

Perhaps that’s the key, topic maps are about transparent choice in the reuse of data.

Whether or not to exercise that choice is left up to the user.

This is a step in that direction.

Another Word For It at #2,000

Thursday, July 28th, 2011

According to my blogging software this is my 2,000th post!

During the search for content and ideas for this blog I have thought a lot about topic maps and how to explain them.

Or should I say how to explain topic maps without inventing new terminologies or notations? ;-)

Topic maps deal with a familiar problem:

People use different words when talking about the same subject and the same word when talking about different subjects.

Happens in conversations, newspapers, magazines, movies, videos, tv/radio, texts, and alas, electronic data.

The confusion caused by using different words for the same subject and same word for different subjects is a source of humor. (What does “nothing” stand for in Shakespeare’s “Much Ado About Nothing”?)

In searching electronic data, that confusion causes us to miss some data we want to find (different word for the same subject) and to find some data we don’t want (same word but different subject).

When searching old newspaper archives this can be amusing and/or annoying.

Potential outcomes of failure elsewhere:

medical literature: injury/death/liability
financial records: civil/criminal liability
patents: lost opportunities/infringement
business records: civil/criminal liability

Solving the problem of different words for the same subject and the same word but different subjects is important.

But how?

Topic maps and other solutions have one thing in common:

They use words to solve the problem of different words for the same subject and the same word but different subjects.

Oops!

The usual battle cry is “if everyone uses my words, we can end semantic confusion, have meaningful interchange for commerce, research, cultural enlightenment and so on and so forth.”

I hate to be the bearer of bad news but what about all the petabytes of data we already have on hand with zettabytes of previous interpretations? With more being added every day and no universal solution in sight? (If you don’t like any of the current solutions, wait a few months and new proposals, schemas, vocabularies, etc., will surface. Or you can take the most popular approach and start your own.)

Proposals to deal with semantic confusion are also frozen in time and place. Unlike the human semantics they propose to sort out, they do not change and evolve.

We have to use the source of semantic difficulty, words, in crafting a solution and our solution has to evolve over time even as our semantics do.

That’s a tall order.

Part of the solution, if you want to call it that, is to recognize when the benefits of solving semantic confusion outweigh the cost of the solution. We don’t need to solve semantic confusion everywhere and anywhere it occurs. In some cases, perhaps rather large cases, it isn’t worth the effort.

That triage of semantic confusion allows us to concentrate on cases where the investment of time and effort is worthwhile. In searching for the Hilton Hotel in Paris I may get “hits” for someone with underwear control issues but so what? Is that really a problem that needs a solution?

On the other hand, being able to resolve semantic confusion, such as underlies different accounting systems for businesses, could give investors a clearer picture of the potential risks and benefits of particular investments. Or doing the same for financial institutions so that regulators can “look down” into regulated systems with some semantic coherence (without requiring identical systems).

Having chosen some semantic confusion to resolve, we then have to choose a method to resolve it.

One method, probably the most popular one, is the “use my (insert vocabulary)” method for resolving semantic confusion. It works and, for some cases, may be all that you need. Databases with gigabyte-size tables (and larger) operate quite well using this approach. It can become problematic after acquisitions, when migration to other database systems is required. Undocumented semantics can prove costly in many situations.

Semantic Web techniques, leaving aside the fanciful notion of unique identifiers, do offer the capability of recording additional properties about terms or rather the subjects that terms represent. Problematically though, they don’t offer the capacity to specify which properties are required to distinguish one term from another.

No, I am not about to launch into a screed about why “my” system works better than all the others.

Recognition that all solutions are composed of semantic ambiguity is the most important lesson of the Topic Maps Reference Model (TMRM).

Keys (of key/value pairs) are pointers to subject representatives (proxies) and values may be such references. Other keys and/or values may point to other proxies that represent the same subjects. Which replicates the current dilemma.

The second important lesson of the TMRM is the use of legends to define what key/value pairs occur in a subject representative (proxy) and how to determine two or more proxies represent the same subject (subject identity).

Neither lesson ends semantic ambiguity, nor do they mandate any particular technology or methodology.

They do enable the creation and analysis of solutions, including legends, with an awareness they are all partial mappings, with costs and benefits.
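A minimal Python sketch of both lessons, under my own simplifying assumptions: a proxy is a plain dictionary of key/value pairs, and a legend is a small structure that discloses which keys stand for the same property and which properties establish subject identity. The key names and the shape of the legend are hypothetical; the TMRM itself does not prescribe them.

```python
# Three proxies written with different keys by different authors.
proxy_a = {"name": "Paris", "iso_country": "FR", "type": "city"}
proxy_b = {"label": "Paris", "country": "FR", "kind": "city"}
proxy_c = {"name": "Paris", "iso_country": "US", "type": "city"}

# A legend discloses which keys represent the same subject and which
# properties must match for two proxies to represent the same subject.
legend = {
    "key_synonyms": {"label": "name", "country": "iso_country", "kind": "type"},
    "identity_keys": ("name", "iso_country", "type"),
}


def normalize(proxy, legend):
    """Rewrite a proxy's keys to the legend's preferred key for each property."""
    synonyms = legend["key_synonyms"]
    return {synonyms.get(key, key): value for key, value in proxy.items()}


def same_subject(p, q, legend):
    """Apply the legend's identity rule to two proxies."""
    p, q = normalize(p, legend), normalize(q, legend)
    return all(
        key in p and key in q and p[key] == q[key]
        for key in legend["identity_keys"]
    )
```

Under this legend `proxy_a` and `proxy_b` represent the same subject and `proxy_c` does not. A different legend could decide otherwise, which is the point: the mapping is partial, and it is disclosed.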

I will continue the broad coverage of this blog on semantic issues but in the next 1,000 posts I will make a particular effort to cover:

  • Ex Parte Declaration of Legends for Data Sources (even using existing Linked Data where available)
  • Suggestions for explicit subject identity mapping in open source data integration software
  • Advances in graph algorithms
  • Sample topic maps using existing and proposed legends

Other suggestions?

From Technologist to Philosopher

Monday, July 25th, 2011

From Technologist to Philosopher: Why you should quit your technology job and get a Ph.D. in the humanities by Damon Horowitz.

Created a startup that was acquired by Google. That is some measure of success.

If you want to create an exceptional company, hire humanists.

Read the essay to find out why.

If the TMRM is a Data Model…

Wednesday, May 4th, 2011

Whenever I hear the TMRM referred to or treated like a data model, I feel like saying in a Darth Vader type voice:

If the TMRM is a data model, then where are its data types?

It is my understanding that data models, legends in TMRM-speak, define data types on which they base declarations of equivalence (in terms of the subjects represented).

Being somewhat familiar with the text of the TMRM, or at least the current draft, I don’t see any declaration of data types in the TMRM.

Nor do I see any declarations of where the recursion of keys ends. Another important aspect of legends.

Nor do I see any declarations of equivalence (on the absent data types).

Yes, there is an abstraction of a path language, which would depend upon the data types and recursion through keys and values, but that is only an abstraction of a path language. It awaits declaration of data types, etc., in order to be an implementable path language.

There is a reason for the TMRM being written at that level of abstraction. To support any number of legends, written with any range of data types and choices with regard to the composition of those data types and subsequently the paths supported.

Any legend is going to make those choices and they are all equally valid if not all equally useful for some use cases. Every legend closes off some choices and opens up others.

For example, in bioinformatics, why would I want to do the subjectIdentifier/subjectLocator shuffle when I am concerned with standard identifiers for genes for example?

BTW, before anyone rushes out to write the legend syntax, realize that its writing results in subjects that could also be the targets of topic maps with suitable legends.

It is important that syntaxes be subjects, for a suitable legend, because syntaxes come and go out of fashion.

The need to merge subjects represented by those syntaxes, however, awaits only the next person with a brilliant insight.

TMQL Slides for Prague 2011

Tuesday, March 1st, 2011

TMQL Slides for Prague 2011

TMQL slides with discussion points for Prague have been posted!

Please review even if you don’t plan on attending the Prague meeting to offer your comments and questions.

Comments and questions I am sure are always welcome, but are more useful if received prior to weeks if not months of preparing standards prose.

Since I ask, I have several questions (some of which will probably have to be answered post-Prague):

1st Question:

While I understand the utility of the illustrated syntax reflected on the slides, I am more concerned with the underlying formal model for TMQL. Syntax and its explanation for users is very important, but that can take many forms. Can you say a bit more about the formal model that underlies TMQL?

2nd Question:

See my blog post on Indexing by Properties. To what extent is TMQL going to support the use of multiple properties (occurrences) for the purposes of identifications?

3rd Question:

What datatypes will be supported by TMQL? How are additional datatypes declared?

4th Question:

What comparison operators are supported by TMQL?

Topic Maps, Google and the Billion Fact Parade

Thursday, February 10th, 2011

Andrew Hogue (Google) actually titled his presentation on Google’s plan for Freebase: The Structured Search Engine.

Several minutes into the presentation Hogue points out that to answer the question, “when was Martin Luther King, Jr. born?” that date of birth, date born, appeared, dob were all considered synonyms that expect the date type.

Hmmm, he must mean keys that represent the same subject and so subject to merging and possibly, depending on their role in a subject representative, further merging of those subject representatives. Can you say Steve Newcomb and the TMRM?

Yes, attribute names represent subjects just like collections of attributes are thought to represent subjects. And benefit from rules specifying subject identity, other properties and merging rules. (Some of those rules can be derived from mechanical analysis, others probably not.)
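A small Python sketch of treating attribute names as subjects in their own right. The registry structure and function names are hypothetical; the four names and the expected date type are the ones mentioned above from the presentation.

```python
from datetime import date

# Hypothetical registry: each attribute subject lists the names used for it
# and the value type those names are expected to carry.
ATTRIBUTE_SUBJECTS = {
    "birth_date": {
        "names": {"date of birth", "date born", "appeared", "dob"},
        "type": date,
    },
}


def resolve_attribute(name):
    """Return the attribute subject a name stands for, or None if unknown."""
    wanted = name.strip().lower()
    for subject, info in ATTRIBUTE_SUBJECTS.items():
        if wanted in info["names"]:
            return subject
    return None


def lookup(entity, attribute_name):
    """Find a value stored under any name for the same attribute subject."""
    subject = resolve_attribute(attribute_name)
    if subject is None:
        return None
    expected = ATTRIBUTE_SUBJECTS[subject]["type"]
    for key, value in entity.items():
        if resolve_attribute(key) == subject and isinstance(value, expected):
            return value
    return None


king = {"name": "Martin Luther King, Jr.", "Date Born": date(1929, 1, 15)}
```

Asking for `"dob"` finds the value stored under `"Date Born"` because both names resolve to the same attribute subject.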

Second, Hogue points out that Freebase had 13 million entities when purchased by Google. He speculates on taking that to 1 billion entities.

Let’s cut to the chase, I will see Hogue’s 1 billion entities and raise him 9 billion entities for a total pot of 10 billion entities.

Now what?

Let’s take a simple question that Hogue’s 10 billion entity Google/Freebase cannot usefully answer.

What is democracy?

Seems simple enough. (viewers at home can try this with their favorite search engine.)

1) United States State Department: Democracy means a state that supports Israel, keeps the Suez canal open and opposes people we don’t like in the U.S. Oh, and that protects the rights and social status of the wealthy, almost forgot that one. Sorry.

2) Protesters in Egypt (my view): Democracy probably does not include some or all of the points I mention for #1.

3) Turn of the century U.S.: Effectively only the white male population participates.

4) Early U.S. history: Land ownership is a requirement.

I am sure examples can be supplied from other “democracies” and their histories around the world.

This is a very important term and its differing use by different people in different contexts is going to make discussion and negotiations more difficult.

There are lots of terms where no single “entity” or “fact” is going to work for everyone.

Subject identity is a tough question and the identification of a subject changes over time, social context, etc. Not to mention that the subjects identified by particular identifications change as well.

Consider that at one time cab was not used to refer to a method of transportation but to a brothel. You may object that was “slang” usage but if I am searching an index of police reports for that time period for raids on brothels, your objection isn’t helpful. Doesn’t matter if the usage is “slang” or not, I need to obtain accurate results.

User expectations and needs cannot (or at least should not in my opinion) be adapted to the limitations of a particular approach or technology.

Particularly when we already know of strategies that can help with, not solve, the issues surrounding subject identity.

The first step that Hogue and Google have taken, recognizing that attribute names can have synonyms, is a good start. In topic map terms, recognizing that information structures are composed of subjects as well. So that we can map between information structures, rather than replacing one with another. (Or having religious discussions about which one is better, etc.)

Hogue and Google are already on the way to treating some subjects as worthy of more effort than others, but for those that merit the attention, solving the issue of reliable, repeatable subject identification is non-trivial.

Topic maps can make a number of suggestions that can help with that task.

CS Abstraction – Bridging Data Models – JSON and COBOL

Thursday, December 9th, 2010

I was reading Ullman’s Foundations of Computer Science on abstraction when it occurred to me:

A topic map legend is an abstraction that bridges some set of abstractions (read data models), to enable us to navigate and possibly combine data from them.

with the corollary:

Any topic map legend is itself an abstraction that is subject to being bridged for navigation or data combination purposes.

The first statement recognizes that there are no Ur abstractions that will dispel all others. Never have been, never will be.

If the history of CS teaches anything, it is the ephemeral nature of modeling.

The latest hot item is JSON but it was COBOL some, well, more years ago than I care to say. Nothing against JSON but in five years or less, it will either be fairly common or footnoted in dissertations.

The important thing is that we will have data stored in JSON for a very long time. Whether it gains in popularity or no.

We could say everyone will convert to the XXX format of years hence, but in fact that never happens.

Legacy systems still need the data, and conversion carries its own costs: the cost of the conversion itself, the cost of verification, etc. (Some systems at defense facilities still require punched data entry because it is simply not economical to re-write and debug a new system.)

The corollary recognizes that, once written, a topic map of a set of data models itself becomes a data model for navigation/aggregation.
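As a rough illustration of a legend bridging two data models, here is a Python sketch that reads one record from JSON and one from a fixed-width line of the sort a COBOL program might write, then groups them by the property the legend treats as identifying. The field names, the column layout and the sample customer are all hypothetical.

```python
import json

# The same customer as seen through two data models.
json_record = '{"customerId": "000042", "fullName": "Ada Lovelace"}'
# Hypothetical fixed-width layout: columns 0-5 hold the customer number,
# columns 6-25 hold the name, padded with spaces.
fixed_record = "000042" + "LOVELACE, ADA".ljust(20)


def from_json(text):
    """Read the JSON model into the bridging vocabulary."""
    data = json.loads(text)
    return {
        "source": "json",
        "customer_number": data["customerId"],
        "name": data["fullName"],
    }


def from_fixed_width(line):
    """Read the fixed-width model into the bridging vocabulary."""
    return {
        "source": "fixed-width",
        "customer_number": line[0:6],
        "name": line[6:26].rstrip(),
    }


def bridge(records):
    """Group records from any model by the identifying property."""
    subjects = {}
    for record in records:
        subjects.setdefault(record["customer_number"], []).append(record)
    return subjects


bridged = bridge([from_json(json_record), from_fixed_width(fixed_record)])
```

Nothing is converted or thrown away: both records stay in their original form of the name, and the result of `bridge` is itself just another data model that a later legend could bridge in turn.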

Otherwise we fall into the same trap as the data model paradigms that posit they will be the data model that dispels all others.

There are no cases where that has happened, either in digital times or in the millennia of data models that preceded digital times.

The emphasis on subject identity in topic maps facilitates the bridging of data models and having a useful result when we do.

What data models would you like to bridge today?

Foundations of Computer Science

Thursday, December 9th, 2010

Foundations of Computer Science

Introduction to theory in computer science by Alfred V. Aho and Jeffrey D. Ullman. (Free PDF of the entire text)

The turtle on the cover is said to be a reference to the turtle on which the world rests.

This particular turtle serves as the foundation for:

I point out this work because of its emphasis on abstraction.

Topic maps, at their best, are abstractions that bridge other abstractions and make use of information recorded in those abstractions.

*****
PS: The “rules of thumb” for programming in the introduction are equally applicable to writing topic maps. You will not encounter many instances of them being applied but they remain good guidance.

TMRM and a “universal information space”

Wednesday, November 24th, 2010

As an editor of the TMRM (Topic Maps Reference Model) I feel compelled to point out the TMRM is not a universal information space.

I bring up the universal issue because someone mentioned lately, mapping to the TMRM.

There is a lot to say about the TMRM but let’s start with the mapping issue.

There is no mapping to the TMRM. (full stop) The reason is that the TMRM is also not a data model. (full stop)

There is a simple reason why the TMRM was not, is not, nor ever will be a data model or universal information space.

There is no universal information space or data model.

Data models are an absolute necessity and more will be invented tomorrow.

But, to be a data model is to govern some larger or smaller slice of data.

We want to meaningfully access information across past, present and future data models in different information spaces.

Enter the TMRM, a model for disclosure of the subjects represented by a data model. Any data model, in any information space.

A model for disclosure, not a methodology, not a target, etc.

We used key and value because a key/value pair is the simplest expression of a property class.

The representative of the definition of a class (the key) and an instance of that class (the value).

That does not constrain or mandate any particular data model or information space.

Rather than mapping to the TMRM, we should say mapping using the principles of the TMRM.

I will say more in a later post, but for example, what subject does a topic represent?

With disclosure for the TMDM and RDF, we might not agree on the mapping, but it would be transparent. And useful.

Reducing Ambiguity, LOD, Ookaboo, TMRM

Tuesday, November 16th, 2010

While reading Resource Identity and Semantic Extensions: Making Sense of Ambiguity and In Defense of Ambiguity it occurred to me that reducing ambiguity has a hidden assumption.

That hidden assumption is the intended audience for whom I wish to reduce ambiguity.

For example, Ookaboo solves the problem of multiple vocabularies for its intended audience thusly:

Our strategy for dealing with multiple subject terminologies is to what we call a reference set, which in this case is

http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it

http://dbpedia.org/resource/Central_Air_Force_Museum

http://rdf.freebase.com/ns/m.0g_2bv

If we want to assert foaf:depicts we assert foaf:depicts against all of these. The idea is that not all clients are going to have the inferencing capabilities that I wish they’d have, so I’m trying to assert terms in the most “core” databases of the LOD cloud.

In a case like this we may have YAGO, OpenCyc, UMBEL and other terms available. Relationships like this are expressed as

<:Whatever> <ontology2:aka>
<http://mpii.de/yago/resource/Central_Air_Force_Museum> .

<ontology2:aka>, not dereferencable yet, means (roughly) that “some people use term X to refer to substantially the same thing as term Y.” It’s my own answer to the <owl:sameAs> problem and deliberately leaves the exact semantics to the reader. (It’s a lossy expression of the data structures that I use for entity management)

This is very like a TMRM solution since it gathers different identifications together, in hopes that at least one will be understood by a reader.

This is very unlike a TMRM solution because it has no legend to say how to compare these “values,” much less their “key.”

The lack of a legend makes integration in legal, technical, medical or intelligence applications, ah, difficult.
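The reference-set strategy quoted above can be sketched in a few lines of Python. The three terms are the ones listed in the quote and the predicate is the FOAF depicts property; the picture IRI and the function name are hypothetical, and the output is only an N-Triples-like string for illustration.

```python
FOAF_DEPICTS = "http://xmlns.com/foaf/0.1/depicts"

# The reference set from the quote: three terms for the same museum.
reference_set = [
    "http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it",
    "http://dbpedia.org/resource/Central_Air_Force_Museum",
    "http://rdf.freebase.com/ns/m.0g_2bv",
]


def assert_depicts(picture_iri, terms):
    """Assert foaf:depicts against every term in the reference set."""
    return [f"<{picture_iri}> <{FOAF_DEPICTS}> <{term}> ." for term in terms]


# Hypothetical picture IRI.
statements = assert_depicts("http://example.org/pictures/1", reference_set)
```

Every client gets at least one term it may recognize, but nothing here says how two terms should be compared. That rule is what a legend would add.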

Still, it is encouraging to see the better Linked Data applications moving in the direction of the TMRM.

Whose Logic Binds A Topic Map?

Tuesday, November 9th, 2010

An exchange with Lars Heuer over what the TMRM should say about “ako” and “isa” (see: A Guide to Publishing Linked Data Without Redirects) brings up an important but often unspoken issue.

The current draft of the Topic Maps Reference Model (TMRM) says that subclass-superclass relationships are reflexive and transitive. Moreover, “isa” relationships are non-reflexive and transitive.

Which is all well and good, assuming that accords with your definition of subclass-superclass and isa. The Topic Maps Data Model (TMDM) on the other hand defines “isa” as non-transitive.

Either one is a legitimate choice and I will cover the resolution of that difference elsewhere.
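To see what is at stake, here is a small Python sketch that applies the two readings of “isa” described above to the same stated facts. It only illustrates the properties as this post reports them; the helper functions and the example data are hypothetical and are not taken from either specification.

```python
def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are present, until nothing changes."""
    closure = set(pairs)
    while True:
        new = {
            (a, d)
            for (a, b) in closure
            for (c, d) in closure
            if b == c
        } - closure
        if not new:
            return closure
        closure |= new


def reflexive_closure(pairs):
    """Add (x, x) for every x that appears in the relation."""
    nodes = {x for pair in pairs for x in pair}
    return set(pairs) | {(n, n) for n in nodes}


# Stated facts (hypothetical).
ako = {("poodle", "dog"), ("dog", "mammal")}
isa = {("rex", "poodle"), ("poodle", "breed")}

# Subclass-superclass as reflexive and transitive.
ako_inferred = reflexive_closure(transitive_closure(ako))
# "isa" read as transitive, versus "isa" read as non-transitive.
isa_transitive = transitive_closure(isa)
isa_non_transitive = set(isa)
```

Under the transitive reading Rex is a breed; under the non-transitive reading he is not. Which answer is right depends on whose logic is applied to the data.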

My point here is to ask: “Whose logic binds a topic map?”

My impression is that here and in the Semantic Web, logical frameworks are being created, into which users are supposed to fit their data.

As a user I would take serious exception to fitting my data into someone else’s world view (read logic).

That is the real question, isn’t it?

Does IT/SW dictate to users the logic that will bind their data, or do users get to define their own “logics”?

Given the popularity of tagging and folksonomies, user “logics” look like the better bet.

The UMLS Metathesaurus: representing different views of biomedical concepts

Wednesday, October 27th, 2010

The UMLS Metathesaurus: representing different views of biomedical concepts

Abstract

The UMLS Metathesaurus is a compilation of names, relationships, and associated information from a variety of biomedical naming systems representing different views of biomedical practice or research. The Metathesaurus is organized by meaning, and the fundamental unit in the Metathesaurus is the concept. Differing names for a biomedical meaning are linked in a single Metathesaurus concept. Extensive additional information describing semantic characteristics, occurrence in machine-readable information sources, and how concepts co-occur in these sources is also provided, enabling a greater comprehension of the concept in its various contexts. The Metathesaurus is not a standardized vocabulary; it is a tool for maximizing the usefulness of existing vocabularies. It serves as a knowledge source for developers of biomedical information applications and as a powerful resource for biomedical information specialists.

Bull Med Libr Assoc. 1993 Apr;81(2):217-22.
Schuyler PL, Hole WT, Tuttle MS, Sherertz DD.
Medical Subject Headings Section, National Library of Medicine, Bethesda, MD 20894.

Questions:

  1. Did you notice the date on the citation?
  2. Map this article to the Topic Maps Data Model (3-5 pages, no citations)
  3. Where does the Topic Maps Data Model differ from this article? (3-5 pages, no citations)
  4. If concept = proxy, what concepts (subjects) don’t have proxies in the Metathesaurus?
  5. On what basis are “biomedical meanings” mapped to a single Metathesaurus “concept?” Describe in general but illustrate with at least five (5) examples

Key-Value Pairs

Monday, September 13th, 2010

The Topic Maps Reference Model can’t claim to have invented the key/value view of the world.

But it is interesting how much traction key/value pair approaches have been getting of late. From NoSQL in general to Neo4j and Redis in particular. (no offense to other NoSQL contenders, those are the two that came to mind)

Declare which key/value pairs identify a subject and you are on your way towards a subject-centric view of computing.

OK, there are some details but declaring how you identify a subject is the first step in enabling others to reliably identify the same subject.
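
A minimal sketch of that first step, assuming proxies are plain dictionaries and that the declaration is simply a tuple of key names. The key names, the sample records, and both helper functions are hypothetical.

```python
def identity_of(proxy, identifying_keys):
    """Return the identifying key/value pairs of a proxy, or None if any is missing."""
    if not all(key in proxy for key in identifying_keys):
        return None
    return frozenset((key, proxy[key]) for key in identifying_keys)


def same_subject(a, b, identifying_keys):
    """Two proxies identify the same subject if their declared pairs match."""
    identity_a = identity_of(a, identifying_keys)
    return identity_a is not None and identity_a == identity_of(b, identifying_keys)


# Hypothetical declaration: the "isbn" pair identifies the subject.
IDENTIFYING_KEYS = ("isbn",)

catalog_entry = {"isbn": "978-0-00-000000-2", "title": "An Example Book"}
store_entry = {"isbn": "978-0-00-000000-2", "price": "19.99"}
other_entry = {"isbn": "978-0-00-000001-9", "title": "An Example Book"}

print(same_subject(catalog_entry, store_entry, IDENTIFYING_KEYS))
print(same_subject(catalog_entry, other_entry, IDENTIFYING_KEYS))
```

Because the declaration is data rather than code, someone else can read it and apply the same identification to their own key/value pairs.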

Cartesian Products and Topic Maps

Sunday, September 12th, 2010

Using SQL Cross Join – the report writers secret weapon is a very clear explanation of the utility of cross-joins in SQL.

Cross-join = Cartesian product, something you will remember from the Topic Maps Reference Model.

Makes a robust where clause look important, doesn’t it?
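
A small illustration of the same idea outside SQL, using made-up data: the Cartesian product pairs every row of one list with every row of the other, and the filter plays the part of the where clause.

```python
from itertools import product

# Hypothetical report dimensions.
regions = ["North", "South"]
quarters = ["Q1", "Q2", "Q3"]

# Cross join: every region paired with every quarter.
cross_join = list(product(regions, quarters))

# The "where clause": keep only the combinations the report needs.
filtered = [(region, quarter) for region, quarter in cross_join
            if region == "North" and quarter != "Q3"]

print(len(cross_join))
print(filtered)
```

The product grows as the number of rows on one side times the number on the other, which is why the filter matters as soon as the inputs are of any real size.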

Set-Similarity and Topic Maps

Monday, July 12th, 2010

Set-similarity offers a useful way to think about merging in a topic maps context. The measure of self-similarity that we want for merging in topic maps is the same subject.

Self-similarity, in the TMDM, for topics is:

  • at least one equal string in their [subject identifiers] properties,
  • at least one equal string in their [item identifiers] properties,
  • at least one equal string in their [subject locators] properties,
  • an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
  • the same information item in their [reified] properties.
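
The five tests above can be sketched as a single function. This assumes topics are plain dictionaries whose key names (`subject_identifiers`, `item_identifiers`, `subject_locators`, `reified`) are hypothetical stand-ins for the TMDM properties; it is an illustration of the listed tests, not a TMDM implementation.

```python
def topics_equal(a, b):
    """Apply the five listed tests to two topics held as dictionaries."""
    sid_a = set(a.get("subject_identifiers", ()))
    sid_b = set(b.get("subject_identifiers", ()))
    iid_a = set(a.get("item_identifiers", ()))
    iid_b = set(b.get("item_identifiers", ()))
    loc_a = set(a.get("subject_locators", ()))
    loc_b = set(b.get("subject_locators", ()))
    reified_a = a.get("reified")
    reified_b = b.get("reified")

    return (
        bool(sid_a & sid_b)        # shared subject identifier
        or bool(iid_a & iid_b)     # shared item identifier
        or bool(loc_a & loc_b)     # shared subject locator
        or bool(sid_a & iid_b)     # subject identifier of one equals
        or bool(iid_a & sid_b)     # item identifier of the other
        # Same information item: object identity, not string equality.
        or (reified_a is not None and reified_a is reified_b)
    )


# Hypothetical topics.
t1 = {"subject_identifiers": ["http://example.org/id/puccini"]}
t2 = {"item_identifiers": ["http://example.org/id/puccini"]}
t3 = {"subject_locators": ["http://example.org/page/tosca"]}

print(topics_equal(t1, t2))
print(topics_equal(t1, t3))
```

Every one of these tests is string equality or object identity, which is what makes the contrast with the research literature on similarity so sharp.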

The research literature makes it clear that judging self-similarity isn’t subject to one test or even a handful of them for all purposes. Not to mention that more often than not, self-similarity is being judged on high dimensional data.

Despite clever approaches and quite frankly amazing results, I have yet to run across sustained discussion of how to interchange self-similarity tests. Perhaps it is my markup background but that seems like the sort of capability that would be widely desired.

The issue of interchangeable self-similarity tests looks like an area where JTC 1/SC 34/WG 3 could make a real contribution.

Second Class Citizens/Subjects

Thursday, April 29th, 2010

One of the difficulties that topic maps solve is the question of second class citizens (or subjects) in information systems.

The difficulty is one that Marijane raises when she quotes Michael Sperberg-McQueen wondering how topic maps differ from SQL databases, Prolog or colloquial XML?

One doesn’t have to read far to find that SQL databases, colloquial XML (and other information technologies) talk about real world subjects.*

The real world view leaves the subjects that comprise information systems out of the picture.

That creates an underclass of subjects that appear in information systems, but can never be identified or be declared to have more than one identification.

Mapping strategies, like topic maps, enable users to identify any subject. Any subject can have multiple identifiers. Users can declare what properties must be present to identify a subject, including the subjects that make up information systems.

*Note my omission of Prolog. Some programming languages may be more map friendly than others but I am unaware of any that cannot attribute properties to parts of a data structure (or its contents) for the purposes of mapping and declaring a mapping.

Implementing the TMRM (Part 2)

Wednesday, March 10th, 2010

Implementing the TMRM (Part 2)

I left off in Implementing the TMRM (Part 1) by saying that if the TMRM defined proxies for particular subjects, it would lack the generality needed to enable legends to be written between arbitrary existing systems.

The goal of the TMRM is not to be yet another semantic integration format but to enable users to speak meaningfully of the subjects their systems already represent and to know when the same subjects are being identified differently. The last thing we all need is another semantic integration format. Sorry, back to the main theme:

One reason why it isn’t possible to “implement” the TMRM is the lack of any subject identity equivalence rules.

String matching for IRIs is one test for equivalence of subject identification but not the only one. The TMRM places no restrictions on tests for subject equivalence so any implementation will only have a subset of all the possible subject equivalence tests. (Defining a subset of equivalence tests underlies the capacity for blind interchange of topic maps based on particular legends. More on that later.)

An implementation that compares IRIs, for example, would fail if a legend asked it to compare the equivalence of Feynman diagrams generated from the detector output from the Large Hadron Collider, even though equivalence of Feynman diagrams is a legitimate test for subject equivalence and well within the bounds of the TMRM.
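
A minimal sketch of an implementation that supports only a subset of equivalence tests. The test names, the registry, and the exception are all hypothetical; the TMRM itself defines none of them.

```python
class UnsupportedTest(Exception):
    """Raised when a legend asks for a test this implementation lacks."""


# The subset of equivalence tests this hypothetical implementation knows.
SUPPORTED_TESTS = {
    "iri-string-match": lambda a, b: a == b,
}


def equivalent(test_name, a, b):
    """Run the named subject equivalence test, if it is supported."""
    try:
        test = SUPPORTED_TESTS[test_name]
    except KeyError:
        raise UnsupportedTest(f"no such equivalence test: {test_name}") from None
    return test(a, b)


print(equivalent("iri-string-match",
                 "http://example.org/a", "http://example.org/a"))

try:
    equivalent("feynman-diagram-match", object(), object())
except UnsupportedTest as error:
    print(error)
```

The failure is not a bug in the implementation. It simply marks the edge of the subset of tests that the implementation chose to support.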

(It occurs to me that the real question to ask is why we don’t have more generalized legends with ranges of subject identity tests. Sort of like XML parsers only parse part of the universe of markup documents but do quite well within that subset. Apologies for the interruption, that will be yet another post.)

The TMRM is designed to provide the common heuristic through which the representation of any subject can be discussed. However, it does not define a processing model, which is another reason why it isn’t possible to “implement” the TMRM, but more on that in Implementing the TMRM (Part 3).

Implementing the TMRM (Part 1)

Monday, March 8th, 2010

There are two short pieces on the Topic Maps Reference Model (TMRM) that are helpful to read before talking about “implementing” the TMRM. Both are by Robert Barta, one of the co-editors of the TMRM: A 5 min Introduction into TMRM and TMRM Exegesis: Proxies.

The TMRM defines an abstract structure to enable us to talk about proxies, the generic representative for subjects. It does not define:

  1. Any rules for identifying subjects
  2. Any rules for comparing identifications of subjects
  3. Any rules for what happens if proxies represent the same subjects
  4. Any subjects for that matter

If that seems like a lot to not define, it was and it took a while to get there.

The TMRM does not define any of those things, not because they are not necessary, but because doing so would impair the ability of legends (the disclosures of all those things) to create views of information that merge diverse information resources.

Consider a recent call for help with the earthquake in Chile. Data was held by Google’s people finder service but the request was to convert it into RDF. Then do incremental dumps every hour.

So the data moves from one data silo to another data silo. As Ben Stein would say, “Wow.”

If we could identify the subjects, both structural and as represented, we could merge information about those subjects with information about the same subjects in any data silo, not just one in particular.

How is that for a business case? Pay to identify your subjects once versus paying that cost every time you move from one data silo to another one.

The generality of the TMRM is necessary to support the writing of a legend that identifies the subjects in more than one system and, more importantly, defines rules for when they are talking about the same subjects. (to be continued)
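
A minimal sketch of such a legend, assuming two systems that hold records as dictionaries. The system names, field names, sample records, and the comparison rule (case-insensitive name plus city) are all hypothetical and chosen only to show the shape of the idea.

```python
# The legend discloses, per system, how a record identifies its subject.
LEGEND = {
    "person_finder": lambda rec: (rec["full_name"].casefold(),
                                  rec["home_city"].casefold()),
    "rdf_dump": lambda rec: (rec["foaf:name"].casefold(),
                             rec["city"].casefold()),
}


def merge(records):
    """Group (system, record) pairs that the legend says share a subject."""
    merged = {}
    for system, record in records:
        identity = LEGEND[system](record)
        merged.setdefault(identity, []).append((system, record))
    return merged


# Hypothetical records from two silos.
RECORDS = [
    ("person_finder", {"full_name": "Ana Example", "home_city": "Talca",
                       "status": "safe"}),
    ("rdf_dump", {"foaf:name": "ana example", "city": "talca",
                  "foaf:mbox": "mailto:ana@example.org"}),
]

for identity, sources in merge(RECORDS).items():
    print(identity, [system for system, _ in sources])
```

Neither silo is converted into the other. The records stay where they are, and adding a third system means adding one entry to the legend.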

(BTW, using Robert Barta’s virtual topic map approach, hourly dumps/conversion would be unnecessary, unless there was some other reason for it. That is an approach that I hope continues in the next TMQL draft (see the current TMQL draft).)