Archive for the ‘TMRM’ Category

Coeffects: Context-aware programming languages – Subject Identity As Type Checking?

Tuesday, April 12th, 2016

Coeffects: Context-aware programming languages by Tomas Petricek.

From the webpage:

Coeffects are Tomas Petricek’s PhD research project. They are a programming language abstraction for understanding how programs access the context or environment in which they execute.

The context may be resources on your mobile phone (battery, GPS location or a network printer), IoT devices in a physical neighborhood or historical stock prices. By understanding the neighborhood or history, a context-aware programming language can catch bugs earlier and run more efficiently.

This page is an interactive tutorial that shows a prototype implementation of coeffects in a browser. You can play with two simple context-aware languages, see how the type checking works and how context-aware programs run.

This page is also an experiment in presenting programming language research. It is a live environment where you can play with the theory using the power of new media, rather than staring at dead pieces of wood (although we have those too).

(break from summary)

Programming languages evolve to reflect the changes in the computing ecosystem. The next big challenge for programming language designers is building languages that understand the context in which programs run.

This challenge is not easy to see. We are so used to working with context using the current cumbersome methods that we do not even see that there is an issue. We also do not realize that many programming features related to context can be captured by a simple unified abstraction. This is what coeffects do!

What if we extend the idea of context to include the context within which words appear?

For example, suppose the following sentence appeared in a police report:

There were 20 or more <proxy string="black" pos="noun" synonym="African" type="race"/>s in the group.

For display purposes, the string value “black” appears in the sentence:

There were 20 or more blacks in the group.

But a search for the color “black” would not return that report because type = color does not match type = race.

On the other hand, if I searched for African-American, that report would show up because “black” with type = race is recognized as a synonym for people of African extraction.

Inline proxies are the easiest to illustrate but that is only one way to serialize such a result.
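A minimal sketch of that search behavior, in Python. The names `Proxy`, `display` and `search` are hypothetical, and I have assumed that “African-American” is recorded alongside “African” as a synonym; none of this comes from the coeffects work or from any topic map standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proxy:
    """Hypothetical inline proxy: a display string plus author-chosen properties."""
    string: str
    pos: str
    synonyms: frozenset
    type: str


# The police report sentence, with one word carried by a proxy.
# Assumption: "African-American" is listed as a synonym in addition to "African".
report = [
    "There were 20 or more ",
    Proxy("black", "noun", frozenset({"African", "African-American"}), "race"),
    "s in the group.",
]


def display(doc):
    """Render the document using only each proxy's string value."""
    return "".join(p.string if isinstance(p, Proxy) else p for p in doc)


def search(doc, term, type_):
    """Match a proxy only if its type agrees and the term is its string or a synonym."""
    return any(
        isinstance(p, Proxy)
        and p.type == type_
        and (term == p.string or term in p.synonyms)
        for p in doc
    )
```

Here `search(report, "black", "color")` finds nothing, while `search(report, "African-American", "race")` finds the report, even though the displayed text is unchanged.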

If done in an authoring interface, such an approach would have the distinct advantage of offering the original author the choice of subject properties.

The advantage of involving the original author is that they have an interest in and awareness of the document in question, quite unlike automated processes that later attempt annotation by rote.

Spreadsheets – 90+ million End User Programmers…

Thursday, August 13th, 2015

Spreadsheets – 90+ million End User Programmers With No Comment Tracking or Version Control by Patrick Durusau and Sam Hunting.

From all available reports, Sam Hunting did a killer job presenting our paper at the Balisage conference on Wednesday of this week! Way to go Sam!

I will be posting the slides and the files shown in the presentation tomorrow.

BTW, development of the topic map for one or more Enron spreadsheets will continue.

Watch this blog for future developments!

Everyone is an IA [Information Architecture]

Wednesday, February 25th, 2015

Everyone is an IA [Information Architecture] by Dan Ramsden.

From the post:

This is a post inspired by my talk from World IA Day. On the day I had 20 minutes to fill – I did a magic trick and talked about an imaginary uncle. This post has the benefit of an edit, but recreates the central argument – everyone makes IA.

Information architecture is everywhere, it’s a part of every project, every design includes it. But I think there’s often a perception that because it requires a level of specialization to do the most complicated types of IA, people are nervous about how and when they engage with it – no-one likes to look out of their depth. And some IA requires a depth of thinking that deserves justification and explanation.

Even when you’ve built up trust with teams of other disciplines or clients, I think one of the most regular questions asked of an IA is probably, ‘Is it really that complicated?’ And if we want to be happier in ourselves, and spread happiness by creating meaningful, beautiful, wonderful things – we need to convince people that complex is different from complicated. We need to share our conviction that IA is a real thing and that thinking like an IA is probably one of the most effective ways of contributing to a more meaningful world.

But we have a challenge, IAs are usually the minority. At the BBC we have a team of about 140 in UX&D, and IAs are the minority – we’re not quite 10%. It’s my job to work out how those less than 1 in 10 can be as effective as possible and have the biggest positive impact on the work we do and the experiences we offer to our audiences. I don’t think this is unique. A lot of the time IAs don’t work together, or there’s not enough IAs to work on every project that could benefit from an IA mindset, which is every project.

This is what troubled me. How could I make sure that it is always designed? My solution to this is simple. We become the majority. And because we can’t do that just by recruiting a legion of IAs we do it another way. We turn everyone in the team into an information architect.

Now this is a bit contentious. There’s legitimate certainty that IA is a specialism and that there are dangers of diluting it. But last year I talked about an IA mindset, a way of approaching any design challenge from an IA perspective. My point then was that the way we tend to think and therefore approach design challenges is usually a bit different from other designers. But I don’t believe we’re that special. I think other people can adopt that mindset and think a little bit more like we do. I think if we work hard enough we can find ways to help designers to adopt that IA mindset more regularly.

And we know the benefits on offer when every design starts from the architecture up. Well-architected things work better. They are more efficient, connected, resilient and meaningful – they’re more useful.

Dan goes on to say that information is everywhere, much in the same way that I would say that subjects are everywhere.

Just as users must describe information architectures as they experience them, the same is true for users identifying the subjects that are important to them.

There is never a doubt that more IAs and more subjects exist, but the best anyone can do is to tell you about the ones that are important to them and how they have chosen to identify them.

To no small degree, I think terminology has been used to disenfranchise users from discussing subjects as they understand them.

From my own background, I remember a database project where the head of membership services, who ran reports by rote out of R&R, insisted on saying where data needed to reside in tables during a complete re-write of the database. I kept trying, with little success, to get them to describe what they wanted to store and what capabilities they needed.

In retrospect, I should have allowed membership services to use their terminology to describe the database because whether they understood the underlying data architecture or not wasn’t a design goal. The easier course would have been to provide them with a view that accorded with their idea of the database structure and to run their reports. That other “views” of the data existed would have been neither here nor there to them.

As “experts,” we should listen to the description of information architectures and/or identifications of subjects and their relationships as a voyage of discovery. We are discovering the way someone else views the world, not for our correction to the “right” way but so we can enable their view to be more productive and useful to them.

That approach takes more work on the part of “experts” but think of all the things you will learn along the way.

Rumors of Legends (the TMRM kind?)

Tuesday, September 24th, 2013

BioC: a minimalist approach to interoperability for biomedical text processing (numerous authors, see the article).

Abstract:

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/.

From the introduction:

With the proliferation of natural language text, text mining has emerged as an important research area. As a result many researchers are developing natural language processing (NLP) and information retrieval tools for text mining purposes. However, while the capabilities and the quality of tools continue to grow, it remains challenging to combine these into more complex systems. Every new generation of researchers creates their own software specific to their research, their environment and the format of the data they study; possibly due to the fact that this is the path requiring the least labor. However, with every new cycle restarting in this manner, the sophistication of systems that can be developed is limited. (emphasis added)

That is the experience with creating electronic versions of the Hebrew Bible. Every project has started from a blank screen, requiring re-proofing of the same text, etc. As a result, there is no electronic encoding of the masora magna (think long margin notes). Duplicated effort has a real cost to scholarship.

The authors stray into legend land when they write:

Our approach to these problems is what we would like to call a ‘minimalist’ approach. How ‘little’ can one do to obtain interoperability? We provide an extensible mark-up language (XML) document type definition (DTD) defining ways in which a document can contain text, annotations and relations. Major XML elements may contain ‘infon’ elements, which store key-value pairs with any desired semantic information. We have adapted the term ‘infon’ from the writings of Devlin (1), where it is given the sense of a discrete item of information. An associated ‘key’ file is necessary to define the semantics that appear in tags such as the infon elements. Key files are simple text files where the developer defines the semantics associated with the data. Different corpora or annotation sets sharing the same semantics may reuse an existing key file, thus representing an accepted standard for a particular data type. In addition, key files may describe a new kind of data not seen before. At this point we prescribe no semantic standards. BioC users are encouraged to create their own key files to represent their BioC data collections. In time, we believe, the most useful key files will develop a life of their own, thus providing emerging standards that are naturally adopted by the community.

The “key files” don’t specify subject identities for the purposes of merging. But defining the semantics of data is a first step in that direction.

I like the idea of popular “key files” (read legends) taking on a life of their own due to their usefulness. An economic activity based on reducing the friction in using or re-using data. That should have legs.

BTW, don’t overlook the authors’ data and code, available at: http://bioc.sourceforge.net/.

SQL, NoSQL =? CoSQL? Category Theory to the Rescue

Wednesday, January 30th, 2013

A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman.

I missed this when it appeared in March of 2011.

From the conclusion:

The nascent noSQL market is extremely fragmented, with many competing vendors and technologies. Programming, deploying, and managing noSQL solutions requires specialized and low-level knowledge that does not easily carry over from one vendor’s product to another.

A necessary condition for the network effect to take off in the noSQL database market is the availability of a common abstract mathematical data model and an associated query language for noSQL that removes product differentiation at the logical level and instead shifts competition to the physical and operational level. The availability of such a common mathematical underpinning of all major noSQL databases can provide enough critical mass to convince businesses, developers, educational institutions, etc. to invest in noSQL.

In this article we developed a mathematical data model for the most common form of noSQL—namely, key-value stores as the mathematical dual of SQL’s foreign-/primary-key stores. Because of this deep and beautiful connection, we propose changing the name of noSQL to coSQL. Moreover, we show that monads and monad comprehensions (i.e., LINQ) provide a common query mechanism for both SQL and coSQL and that many of the strengths and weaknesses of SQL and coSQL naturally follow from the mathematics.

The ACM Digital Library reports only 3 citations, which is unfortunate for such an interesting proposal.

I have heard about key/value pairs somewhere else. I will have to think about that and get back to you. (Hint for the uninitiated, try the Topic Maps Reference Model (TMRM). A new draft of the TMRM is due to appear in a week or so.)
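One way to picture the duality claimed in the conclusion is the sketch below, in Python. The sample data and the function names `sql_style` and `kv_style` are hypothetical and not taken from the article, and Python list comprehensions are only standing in for the monad comprehensions the authors discuss. In the relational half, child rows point at their parent through a foreign key; in the key-value half, the parent holds the keys of its children.

```python
# Hypothetical sample data; not taken from the Meijer and Bierman article.

# SQL style: child rows point at their parent through a foreign key.
products = [{"id": 1, "title": "Topic Maps Primer"}]
keywords = [
    {"id": 10, "product_id": 1, "word": "semantics"},
    {"id": 11, "product_id": 1, "word": "merging"},
]


def sql_style():
    """Join children to parents by matching foreign key to primary key."""
    return [
        (p["title"], k["word"])
        for p in products
        for k in keywords
        if k["product_id"] == p["id"]
    ]


# Key-value style: the parent holds the keys of its children.
store = {
    "p1": {"title": "Topic Maps Primer", "keywords": ["k10", "k11"]},
    "k10": {"word": "semantics"},
    "k11": {"word": "merging"},
}


def kv_style(product_keys):
    """Follow keys from each parent to its children; no join condition is needed."""
    return [
        (store[pk]["title"], store[kk]["word"])
        for pk in product_keys
        for kk in store[pk]["keywords"]
    ]
```

Both functions are written as comprehensions and return the same pairs; only the direction of the pointers differs.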

Appropriating IT: Glue Steps [Gluing Subject Representatives Together?]

Tuesday, October 9th, 2012

Appropriating IT: Glue Steps by Tony Hirst.

Over the years, I’ve been fortunate enough to have been gifted some very evocative, and powerful, ideas that immediately appealed to me when I first heard them and that I’ve been able to draw on, reuse and repurpose over and over again. One such example is “glue logic”, introduced to me by my original OU PhD supervisor George Kiss. The idea of glue logic is to provide a means by which two digital electronic circuits (two “logic” circuits) that don’t share a common interface can be “glued” together.

(diagrams and other material omitted)

This idea is powerful enough in its own right, but there was a second bit to it that made it really remarkable: the circuitry typically used to create the glue logic was a device known as a Field Programmable Gate Array, or FPGA. This is a type of digital circuit whose logical function can be configured, or programmed. That is, I can take my “shapeless” FPGA, and programme it so that it physically implements a particular digital circuit. Just think about that for a moment… You probably have a vague idea that the same computer can be reprogrammed to do particular things, using some vaguely mysterious and magical thing called software, instructions that computer processors follow in order to do incredible things. With an FPGA, the software actually changes the hardware: there is no processor that “runs a programme”; when you programme an FPGA, you change its hardware. FPGAs are, literally, programmable chips. (If you imagine digital circuits to be like bits of plastic, an FPGA is like polymorph.)

The notion of glue logic has stuck with me for two reasons, I think: firstly, because of what it made possible, the idea of flexibly creating an interface between two otherwise incompatible components; secondly, because of the way in which it could be achieved – using a flexible, repurposable, reprogrammable device – one that you could easily reprogramme if the mapping from one device to another wasn’t quite working properly.

If instead of “don’t share a common interface” you read “semantic diversity” and in place of Field Programmable Gate Array, or FPGA, you read “legend,” to “creat[e] an interface between two otherwise incompatible [subject representatives],” you would think Tony’s post was about the topic maps reference model.

Well, this post is and Tony’s is very close.

Particularly the part about being a “reprogrammable device.”

I can tell you: “black” = “schwarz,” but without more, you won’t be able to rely on or extend that statement.

For that, you need a “reprogrammable device” and some basis on which to do the reprogramming.

Legends anyone?

Subject Normalization

Thursday, September 29th, 2011

Another way to explain topic maps is in terms of database normalization, except that I would call it subject normalization. That is, every subject that is explicitly represented in the topic map appears once and only once, with relations to other subjects being recast to point to this single representative and all properties of the subject gathered to that one place.

One obvious advantage is that the shipping and accounting departments, for example, both have access to updated information for a customer as soon as entered by the other. And although they may gather different information about a customer, that information can be (doesn’t have to be) available to both of them.

Unlike database normalization, subject normalization in topic maps does not require rewriting of database tables, which can cause data access problems. Subject normalization (merging) occurs automatically, based on the presence of properties defined by the Topic Maps Data Model (TMDM).

And unlike owl:sameAs, subject normalization in topic maps does not require knowledge of the “other” subject representative. That is, I can insert an identifier that I know has been used for a subject, without knowing whether it has been used in this topic map, and topics representing that subject will automatically merge (or be normalized).
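A rough sketch of identifier-driven merging, in Python. It is deliberately simplified: each topic is a plain dict with one set of identifiers and one dict of properties, whereas the TMDM distinguishes several kinds of identifiers and has more detailed merging rules. The function name `normalize` and the sample data are hypothetical.

```python
def normalize(topics):
    """Merge topics that share any identifier.

    Each topic is a dict with "identifiers" (a set) and "properties" (a dict).
    When property names collide, the later topic's value wins (a simplification).
    """
    merged = []  # invariant: identifier sets of merged topics are pairwise disjoint
    for topic in topics:
        ids = set(topic["identifiers"])
        props = dict(topic["properties"])
        remaining = []
        for m in merged:
            if m["identifiers"] & ids:
                # Shared identifier: gather identifiers and properties in one place.
                ids |= m["identifiers"]
                props = {**m["properties"], **props}
            else:
                remaining.append(m)
        remaining.append({"identifiers": ids, "properties": props})
        merged = remaining
    return merged


# Shipping and accounting describe the same customer without knowing of each other.
shipping = {
    "identifiers": {"http://example.org/customer/42"},
    "properties": {"address": "12 Main St"},
}
accounting = {
    "identifiers": {"http://example.org/customer/42", "acct:0042"},
    "properties": {"balance": 100},
}
unrelated = {
    "identifiers": {"http://example.org/customer/7"},
    "properties": {"address": "9 Elm St"},
}

result = normalize([shipping, accounting, unrelated])
```

After normalization there are two subject representatives rather than three, and the first carries both the address and the balance.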

Subject normalization, in the terms of the TMDM, reduces the redundancy of information items. That is true enough but not the primary experience of users with subject normalization. How many copies of a subject representative (information items) a system has is of little concern to an end-user.

What does concern end-users is getting the most complete and up-to-date information on a subject, however that is accomplished.

Topic maps accomplish that goal by empowering users to add identifiers to subject representatives that result in subject normalization. It doesn’t get any easier than that.

SAGA: A DSL for Story Management

Monday, September 12th, 2011

SAGA: A DSL for Story Management by Lucas Beyak and Jacques Carette (McMaster University).

Abstract:

Video game development is currently a very labour-intensive endeavour. Furthermore it involves multi-disciplinary teams of artistic content creators and programmers, whose typical working patterns are not easily meshed. SAGA is our first effort at augmenting the productivity of such teams.

Already convinced of the benefits of DSLs, we set out to analyze the domains present in games in order to find out which would be most amenable to the DSL approach. Based on previous work, we thus sought those sub-parts that already had a partially established vocabulary and at the same time could be well modeled using classical computer science structures. We settled on the ‘story’ aspect of video games as the best candidate domain, which can be modeled using state transition systems.

As we are working with a specific company as the ultimate customer for this work, an additional requirement was that our DSL should produce code that can be used within a pre-existing framework. We developed a full system (SAGA) comprised of a parser for a human-friendly language for ‘story events’, an internal representation of design patterns for implementing object-oriented state-transitions systems, an instantiator for these patterns for a specific ‘story’, and three renderers (for C++, C# and Java) for the instantiated abstract code.

I mention this only in part because of Jack Park’s long standing interest in narrative structures.

The other reason I mention this article is it is a model for how to transition between vocabularies in a useful way.

Transitioning between vocabularies is nearly as constant a theme in computer science as data storage. Not to mention that disciplines, domains, professions, etc., have been transitioning between vocabularies for thousands of years. Some more slowly than others; some terms in legal vocabularies date back centuries.

We need vocabularies and data structures, but with the realization that none of them are final. If you want blind interchange of topic maps I would strongly suggest that you use one of the standard syntaxes.

But with the realization that you will encounter data that isn’t in a standard topic map syntax. What subjects are represented there? How would you tell others about them? And those vocabularies are going to change over time, just as there were vocabularies before RDF and topic maps.

If you ask an honest MDM advocate, they will tell you that the current MDM effort is not really all that different from MDM in the ’90s. And MDM may be what you need, depending on your requirements. (Sorry, master data management = MDM.)

The point being that there isn’t any place where a particular vocabulary or “solution” is going to freeze the creativity of users and even programmers, to say nothing of the rest of humanity. Change is the only constant and those who aren’t prepared to deal with it, will be the worse off for it.

Neo4j Enhanced API

Wednesday, August 10th, 2011

Neo4j Enhanced API

From the wiki page:

A Neo4J graph consists of the following element types:

  • Node
  • Relationship
  • RelationshipType
  • Property name
  • Property value

These five types of elements don’t share a common interface, except for Node and Relationship, which both extend the PropertyContainer interface.

The Enhanced API unifies all Neo4j elements under the common interface Vertex.

Which has the result:

Generalizations of database elements

The Vertex interface support methods for the manipulation of Properties and Edges, thereby providing all methods normally associated with Nodes. Properties and Edges (including their Types) are Vertices too. This allows for the generalization of all Neo4j database elements as if they were Nodes.

Due to generalization it is possible to create Edges involving regular Vertices, Edges of any kind, including BinaryEdges and Properties.

The generalization also makes it possible to set Properties on all Vertices. So it even becomes possible to set a Property on a Property.

Hmmm, properties on properties, where have I heard that? 😉

Properties on properties and properties on values are what we need for robust data preservation, migration or even re-use.

I was reminded recently that SNOBOL turns 50 years old in 2012. Care to guess how many formats, schemas and data structures we have been through during just that time period? Some of them were intended to be “legacy” formats, forever readable by those who follow. Except that people forget the meaning of the “properties” and their “values.”

If we had properties on properties and properties on values, we could at least record our present understandings of those items. And others could do the same to our properties and values.
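To make “properties on properties and properties on values” concrete, here is a small Python sketch. It is not the Neo4j Enhanced API; the classes `Vertex`, `Property` and `Value` and the sample field names are hypothetical, chosen only to show that a property and its value can each carry their own descriptive properties.

```python
class Vertex:
    """Hypothetical base element: anything that can carry properties."""

    def __init__(self):
        self.properties = {}  # property name -> Property

    def set_property(self, name, raw_value):
        """Attach a property and return it so it can be annotated in turn."""
        prop = Property(name, Value(raw_value))
        self.properties[name] = prop
        return prop


class Value(Vertex):
    """A value is a vertex too, so it can have properties of its own."""

    def __init__(self, raw):
        super().__init__()
        self.raw = raw


class Property(Vertex):
    """A property is a vertex too, so it can have properties of its own."""

    def __init__(self, name, value):
        super().__init__()
        self.name = name
        self.value = value


record = Vertex()
cust = record.set_property("cust_no", "0042")
# Record our present understanding of the property itself...
cust.set_property("meaning", "customer number assigned by membership services")
# ...and of the value.
cust.value.set_property("format", "four digits, zero padded")
```

Someone reading the data later can follow those annotations, ignore them, or add their own to the same property and value.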

Those mappings would not be useful to everyone. But if present, we would have the option to follow those mappings or not.

Perhaps that’s the key: topic maps are about transparent choice in the reuse of data.

That leaves the exercise of choice, or not, up to the user.

This is a step in that direction.

Another Word For It at #2,000

Thursday, July 28th, 2011

According to my blogging software this is my 2,000th post!

During the search for content and ideas for this blog I have thought a lot about topic maps and how to explain them.

Or should I say how to explain topic maps without inventing new terminologies or notations? 😉

Topic maps deal with a familiar problem:

People use different words when talking about the same subject and the same word when talking about different subjects.

Happens in conversations, newspapers, magazines, movies, videos, tv/radio, texts, and alas, electronic data.

The confusion caused by using different words for the same subject and same word for different subjects is a source of humor. (What does “nothing” stand for in Shakespeare’s “Much Ado About Nothing”?)

In searching electronic data, that confusion causes us to miss some data we want to find (different word for the same subject) and to find some data we don’t want (same word but different subject).

When searching old newspaper archives this can be amusing and/or annoying.

Potential outcomes of failure elsewhere:

  • medical literature: injury/death/liability
  • financial records: civil/criminal liability
  • patents: lost opportunities/infringement
  • business records: civil/criminal liability

Solving the problem of different words for the same subject and the same word but different subjects is important.

But how?

Topic maps and other solutions have one thing in common:

They use words to solve the problem of different words for the same subject and the same word but different subjects.

Oops!

The usual battle cry is “if everyone uses my words, we can end semantic confusion, have meaningful interchange for commerce, research, cultural enlightenment and so on and so forth.”

I hate to be the bearer of bad news but what about all the petabytes of data we already have on hand with zettabytes of previous interpretations? With more being added every day and no universal solution in sight? (If you don’t like any of the current solutions, wait a few months and new proposals, schemas, vocabularies, etc., will surface. Or you can take the most popular approach and start your own.)

Proposals to deal with semantic confusion are also frozen in time and place. Unlike the human semantics they propose to sort out, they do not change and evolve.

We have to use the source of semantic difficulty, words, in crafting a solution and our solution has to evolve over time even as our semantics do.

That’s a tall order.

Part of the solution, if you want to call it that, is to recognize when the benefits of solving semantic confusion outweigh the cost of the solution. We don’t need to solve semantic confusion everywhere and anywhere it occurs. In some cases, perhaps rather large cases, it isn’t worth the effort.

That triage of semantic confusion allows us to concentrate on cases where the investment of time and effort is worthwhile. In searching for the Hilton Hotel in Paris I may get “hits” for someone with underwear control issues but so what? Is that really a problem that needs a solution?

On the other hand, being able to resolve semantic confusion, such as underlies different accounting systems for businesses, could give investors a clearer picture of the potential risks and benefits of particular investments. Or doing the same for financial institutions so that regulators can “look down” into regulated systems with some semantic coherence (without requiring identical systems).

Having chosen some semantic confusion to resolve, we then have to choose a method to resolve it.

One method, probably the most popular one, is the “use my (insert vocabulary)” method for resolving semantic confusion. It works and, for some cases, may be all that you need. Databases with gigabyte-size tables (and larger) operate quite well using this approach. It can become problematic after acquisitions, when migration to other database systems is required. Undocumented semantics can prove to be costly in many situations.

Semantic Web techniques, leaving aside the fanciful notion of unique identifiers, do offer the capability of recording additional properties about terms or rather the subjects that terms represent. Problematically though, they don’t offer the capacity to specify which properties are required to distinguish one term from another.

No, I am not about to launch into a screed about why “my” system works better than all the others.

Recognition that all solutions are composed of semantic ambiguity is the most important lesson of the Topic Maps Reference Model (TMRM).

Keys (of key/value pairs) are pointers to subject representatives (proxies) and values may be such references. Other keys and/or values may point to other proxies that represent the same subjects. Which replicates the current dilemma.

The second important lesson of the TMRM is the use of legends to define what key/value pairs occur in a subject representative (proxy) and how to determine that two or more proxies represent the same subject (subject identity).

Neither lesson ends semantic ambiguity, nor do they mandate any particular technology or methodology.

They do enable the creation and analysis of solutions, including legends, with an awareness they are all partial mappings, with costs and benefits.
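A minimal sketch of what a legend does, in Python. The TMRM does not prescribe this or any other representation; the `Legend` class, its two fields and the sample proxies are hypothetical. A proxy is modeled as a plain dict of key/value pairs, and the legend says which keys may occur and which keys decide subject identity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Legend:
    """Hypothetical legend: which keys may occur and which ones establish identity."""
    allowed_keys: frozenset
    identity_keys: frozenset

    def is_valid(self, proxy):
        """A proxy is valid under this legend if it uses only allowed keys."""
        return set(proxy) <= self.allowed_keys

    def same_subject(self, a, b):
        """Two proxies represent the same subject if any identity key matches."""
        return any(
            k in a and k in b and a[k] == b[k] for k in self.identity_keys
        )


english = {"name": "black", "lang": "en", "concept": "color-001"}
german = {"name": "schwarz", "lang": "de", "concept": "color-001"}

keys = frozenset({"name", "lang", "concept"})
by_name = Legend(allowed_keys=keys, identity_keys=frozenset({"name"}))
by_concept = Legend(allowed_keys=keys, identity_keys=frozenset({"concept"}))
```

Under `by_name` the two proxies stay separate; under `by_concept` they represent the same subject. Neither legend is “the” answer. Each is a partial mapping with its own costs and benefits.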

I will continue the broad coverage of this blog on semantic issues but in the next 1,000 posts I will make a particular effort to cover:

  • Ex Parte Declaration of Legends for Data Sources (even using existing Linked Data where available)
  • Suggestions for explicit subject identity mapping in open source data integration software
  • Advances in graph algorithms
  • Sample topic maps using existing and proposed legends

Other suggestions?

From Technologist to Philosopher

Monday, July 25th, 2011

From Technologist to Philosopher: Why you should quit your technology job and get a Ph.D. in the humanities by Damon Horowitz.

Horowitz created a startup that was acquired by Google. That is some measure of success.

If you want to create an exceptional company, hire humanists.

Read the essay to find out why.

If the TMRM is a Data Model…

Wednesday, May 4th, 2011

Whenever I hear the TMRM referred to as, or treated like, a data model, I feel like saying in a Darth Vader-type voice:

If the TMRM is a data model, then where are its data types?

It is my understanding that data models, legends in TMRM-speak, define data types on which they base declarations of equivalence (in terms of the subjects represented).

Being somewhat familiar with the text of the TMRM, or at least the current draft, I don’t see any declaration of data types in the TMRM.

Nor do I see any declarations of where the recursion of keys ends, another important aspect of legends.

Nor do I see any declarations of equivalence (on the absent data types).

Yes, there is an abstraction of a path language, which would depend upon the data types and recursion through keys and values, but that is only an abstraction of a path language. It awaits declaration of data types, etc., in order to be an implementable path language.

There is a reason for the TMRM being written at that level of abstraction: to support any number of legends, written with any range of data types and choices with regard to the composition of those data types and subsequently the paths supported.

Any legend is going to make those choices and they are all equally valid if not all equally useful for some use cases. Every legend closes off some choices and opens up others.

For example, in bioinformatics, why would I want to do the subjectIdentifier/subjectLocator shuffle when I am concerned with standard identifiers for genes?
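As a rough illustration of what a legend has to supply and the TMRM deliberately does not, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Legend` class, the `gene_id` key and the sample values are hypothetical and appear nowhere in the TMRM. It shows a legend declaring a data type for a key and a rule for when two proxies represent the same subject.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Proxy = Dict[str, object]  # a proxy treated as a plain set of key/value pairs


@dataclass(frozen=True)
class Legend:
    """Hypothetical legend: declared value types plus a same-subject rule."""
    key_types: Dict[str, type]                    # data type declared for each key
    same_subject: Callable[[Proxy, Proxy], bool]  # declared equivalence rule

    def conforms(self, proxy: Proxy) -> bool:
        # A proxy conforms when each declared key it carries has the declared type.
        return all(
            isinstance(proxy[key], expected)
            for key, expected in self.key_types.items()
            if key in proxy
        )


# Assumed example: genes are identified solely by a standard identifier string.
gene_legend = Legend(
    key_types={"gene_id": str},
    same_subject=lambda a, b: "gene_id" in a and a.get("gene_id") == b.get("gene_id"),
)

a = {"gene_id": "BRCA1", "source": "lab A"}
b = {"gene_id": "BRCA1", "source": "lab B"}
print(gene_legend.conforms(a), gene_legend.same_subject(a, b))
```

A different legend could declare different keys, types and rules over the same data, which is the point of leaving them out of the TMRM itself.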

BTW, before anyone rushes out to write the legend syntax, realize that its writing results in subjects that could also be the targets of topic maps with suitable legends.

It is important that syntaxes be subjects, for a suitable legend, because syntaxes come and go out of fashion.

The need to merge subjects represented by those syntaxes, however, awaits only the next person with a brilliant insight.

TMQL Slides for Prague 2011

Tuesday, March 1st, 2011

TMQL slides with discussion points for Prague have been posted!

Please review even if you don’t plan on attending the Prague meeting to offer your comments and questions.

I am sure comments and questions are always welcome, but they are more useful if received before the weeks, if not months, spent preparing standards prose.

Since I ask, I have several questions (some of which will probably have to be answered post-Prague):

1st Question:

While I understand the utility of the illustrated syntax reflected on the slides, I am more concerned with the underlying formal model for TMQL. Syntax and its explanation for users is very important, but that can take many forms. Can you say a bit more about the formal model that underlies TMQL?

2nd Question:

See my blog post on Indexing by Properties. To what extent is TMQL going to support the use of multiple properties (occurrences) for the purpose of identification?

3rd Question:

What datatypes will be supported by TMQL? How are additional datatypes declared?

4th Question:

What comparison operators are supported by TMQL?

Topic Maps, Google and the Billion Fact Parade

Thursday, February 10th, 2011

Andrew Hogue (Google) actually titled his presentation on Google’s plan for Freebase: The Structured Search Engine.

Several minutes into the presentation Hogue points out that, to answer the question “when was Martin Luther King, Jr. born?”, date of birth, date born, appeared, dob were all considered synonyms that expect the date type.

Hmmm, he must mean keys that represent the same subject and so subject to merging and possibly, depending on their role in a subject representative, further merging of those subject representatives. Can you say Steve Newcomb and the TMRM?

Yes, attribute names represent subjects just like collections of attributes are thought to represent subjects. And benefit from rules specifying subject identity, other properties and merging rules. (Some of those rules can be derived from mechanical analysis, others probably not.)
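A minimal Python sketch of that idea, treating attribute names as keys that can be merged when they represent the same subject. The synonym table, the shared key name `birth_date` and the helper functions are assumptions made for illustration. They are not Freebase's or Google's actual mechanism.

```python
# Hypothetical synonym table: each attribute name maps to the key (subject) it represents.
KEY_SYNONYMS = {
    "date of birth": "birth_date",
    "date born": "birth_date",
    "dob": "birth_date",
}


def canonical_key(name):
    """Return the shared key for a known synonym, else the normalized name itself."""
    normalized = name.strip().lower()
    return KEY_SYNONYMS.get(normalized, normalized)


def merge_records(*records):
    """Merge records whose keys name the same subject, collecting every value seen."""
    merged = {}
    for record in records:
        for name, value in record.items():
            merged.setdefault(canonical_key(name), set()).add(value)
    return merged


merged = merge_records({"DOB": "1929-01-15"}, {"date born": "1929-01-15"})
print(merged)
```

A hand-written table like this is the simplest case. Rules derived from mechanical analysis would replace the table, not the merging step.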

Second, Hogue points out that Freebase had 13 million entities when purchased by Google. He speculates on taking that to 1 billion entities.

Let’s cut to the chase, I will see Hogue’s 1 billion entities and raise him 9 billion entities for a total pot of 10 billion entities.

Now what?

Let’s take a simple question that Hogue’s 10 billion entity Google/Freebase cannot usefully answer.

What is democracy?

Seems simple enough. (viewers at home can try this with their favorite search engine.)

1) United States State Department: Democracy means a state that supports Israel, keeps the Suez Canal open and opposes people we don’t like in the U.S. Oh, and that protects the rights and social status of the wealthy, almost forgot that one. Sorry.

2) Protesters in Egypt (my view): Democracy probably does not include some or all of the points I mention for #1.

3) Turn of the twentieth century U.S.: Effectively only the white male population participates.

4) Early U.S. history: Land ownership is a requirement.

I am sure examples can be supplied from other “democracies” and their histories around the world.

This is a very important term, and its differing use by different people in different contexts is going to make discussion and negotiations more difficult.

There are lots of terms for which no single “entity” or “fact” is going to work for everyone.

Subject identity is a tough question and the identification of a subject changes over time, social context, etc. Not to mention that the subjects identified by particular identifications change as well.

Consider that at one time cab was not used to refer to a method of transportation but to a brothel. You may object that was “slang” usage but if I am searching an index of police reports for that time period for raids on brothels, your objection isn’t helpful. Doesn’t matter if the usage is “slang” or not, I need to obtain accurate results.

User expectations and needs cannot (or at least should not in my opinion) be adapted to the limitations of a particular approach or technology.

Particularly when we already know of strategies that can help with, not solve, the issues surrounding subject identity.

The first step that Hogue and Google have taken, recognizing that attribute names can have synonyms, is a good start. In topic map terms, recognizing that information structures are composed of subjects as well. So that we can map between information structures, rather than replacing one with another. (Or having religious discussions about which one is better, etc.)

Hogue and Google are already on the way to treating some subjects as worthy of more effort than others, but for those that merit the attention, solving the issue of reliable, repeatable subject identification is non-trivial.

Topic maps can make a number of suggestions that can help with that task.

CS Abstraction – Bridging Data Models – JSON and COBOL

Thursday, December 9th, 2010

I was reading Ullman’s Foundations of Computer Science on abstraction when it occurred to me:

A topic map legend is an abstraction that bridges some set of abstractions (read data models), to enable us to navigate and possibly combine data from them.

with the corollary:

Any topic map legend is itself an abstraction that is subject to being bridged for navigation or data combination purposes.

The first statement recognizes that there are no Ur abstractions that will dispel all others. Never have been, never will be.

If the history of CS teaches anything, it is the ephemeral nature of modeling.

The latest hot item is JSON but it was COBOL some, well, more years ago than I care to say. Nothing against JSON but in five years or less, it will either be fairly common or footnoted in dissertations.

The important thing is that we will have data stored in JSON for a very long time, whether it gains in popularity or not.

We could say everyone will convert to the XXX format of years hence, but in fact that never happens.

Legacy systems need the data (at some defense facilities, systems still require punched data entry because it is simply not economical to re-write and debug a new system), and then there is the cost of the conversion, the cost of verification, etc.

The corollary recognizes that once a topic map of a set of data models is written, the topic map itself becomes a data model for navigation/aggregation.

Otherwise we fall into the same trap as the data model paradigms that posit they will be the data model that dispels all others.

There are no cases where that has happened, either in digital times or in the millennia of data models that preceded digital times.

The emphasis on subject identity in topic maps facilitates the bridging of data models and having a useful result when we do.

What data models would you like to bridge today?

Foundations of Computer Science

Thursday, December 9th, 2010

Foundations of Computer Science

Introduction to theory in computer science by Alfred V. Aho and Jeffrey D. Ullman. (Free PDF of the entire text)

The turtle on the cover is said to be a reference to the turtle on which the world rests.

This particular turtle serves as the foundation for:

I point out this work because of its emphasis on abstraction.

Topic maps, at their best, are abstractions that bridge other abstractions and make use of information recorded in those abstractions.

*****
PS: The “rules of thumb” for programming in the introduction are equally applicable to writing topic maps. You will not encounter many instances of them being applied but they remain good guidance.

TMRM and a “universal information space”

Wednesday, November 24th, 2010

As an editor of the TMRM (Topic Maps Reference Model) I feel compelled to point out the TMRM is not a universal information space.

I bring up the universal issue because someone lately mentioned mapping to the TMRM.

There is a lot to say about the TMRM but let’s start with the mapping issue.

There is no mapping to the TMRM. (full stop) The reason is that the TMRM is also not a data model. (full stop)

There is a simple reason why the TMRM was not, is not, nor ever will be a data model or universal information space.

There is no universal information space or data model.

Data models are an absolute necessity and more will be invented tomorrow.

But, to be a data model is to govern some larger or smaller slice of data.

We want to meaningfully access information across past, present and future data models in different information spaces.

Enter the TMRM, a model for disclosure of the subjects represented by a data model. Any data model, in any information space.

A model for disclosure, not a methodology, not a target, etc.

We used key and value because a key/value pair is the simplest expression of a property class.

The representative of the definition of a class (the key) and an instance of that class (the value).

That does not constrain or mandate any particular data model or information space.

Rather than mapping to the TMRM, we should say mapping using the principles of the TMRM.

I will say more in a later post, but for example, what subject does a topic represent?

With disclosure for the TMDM and RDF, we might not agree on the mapping, but it would be transparent. And useful.

Reducing Ambiguity, LOD, Ookaboo, TMRM

Tuesday, November 16th, 2010

While reading Resource Identity and Semantic Extensions: Making Sense of Ambiguity and In Defense of Ambiguity it occurred to me that reducing ambiguity has a hidden assumption.

That hidden assumption is the intended audience for whom I wish to reduce ambiguity.

For example, Ookaboo solves the problem of multiple vocabularies for its intended audience thusly:

Our strategy for dealing with multiple subject terminologies is to what we call a reference set, which in this case is

http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it
http://dbpedia.org/resource/Central_Air_Force_Museum
http://rdf.freebase.com/ns/m.0g_2bv

If we want to assert foaf:depicts we assert foaf:depicts against all of these. The idea is that not all clients are going to have the inferencing capabilities that I wish they’d have, so I’m trying to assert terms in the most “core” databases of the LOD cloud.

In a case like this we may have YAGO, OpenCyc, UMBEL and other terms available. Relationships like this are expressed as

<:Whatever> <ontology2:aka>
<http://mpii.de/yago/resource/Central_Air_Force_Museum> .

<ontology2:aka>, not dereferencable yet, means (roughly) that “some people use term X to refer to substantially the same thing as term Y.” It’s my own answer to the <owl:sameAs> problem and deliberately leaves the exact semantics to the reader. (It’s a lossy expression of the data structures that I use for entity management)

This is very like a TMRM solution since it gathers different identifications together, in hopes that at least one will be understood by a reader.

This is very unlike a TMRM solution because it has no legend to say how to compare these “values,” much less their “key.”
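To make the missing piece concrete, here is a small Python sketch that takes the reference set quoted above and adds one explicitly declared comparison rule. The rule (exact string match on a shared IRI) and the function name `same_subject` are assumptions chosen for illustration. Ookaboo does not publish such a rule, which is the point.

```python
# The reference set quoted above, as plain strings.
reference_set = {
    "http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it",
    "http://dbpedia.org/resource/Central_Air_Force_Museum",
    "http://rdf.freebase.com/ns/m.0g_2bv",
}


def same_subject(ids_a, ids_b):
    """Assumed legend rule: two sets identify one subject if they share an exact IRI string."""
    return not set(ids_a).isdisjoint(ids_b)


other = {"http://dbpedia.org/resource/Central_Air_Force_Museum"}
print(same_subject(reference_set, other))
```

With the rule written down, a second party can at least tell how the identifiers were meant to be compared, even if they would have chosen a different test.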

The lack of a legend makes integration in legal, technical, medical or intelligence applications, ah, difficult.

Still, it is encouraging to see the better Linked Data applications moving in the direction of the TMRM.

Whose Logic Binds A Topic Map?

Tuesday, November 9th, 2010

An exchange with Lars Heuer over what the TMRM should say about “ako” and “isa” (see: A Guide to Publishing Linked Data Without Redirects) brings up an important but often unspoken issue.

The current draft of the Topic Maps Reference Model (TMRM) says that subclass-superclass relationships are reflexive and transitive. Moreover, “isa” relationships are non-reflexive and transitive.

Which is all well and good, assuming that accords with your definition of subclass-superclass and isa. The Topic Maps Data Model (TMDM) on the other hand defines “isa” as non-transitive.
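For readers who want to see what “reflexive and transitive” commits a map to, here is a short Python sketch that computes every pair implied by those two properties. The class names are made up for illustration and the function is a naive fixed-point loop, not anything prescribed by the TMRM or the TMDM.

```python
def reflexive_transitive_closure(pairs):
    """Return all (sub, super) pairs implied by reflexivity and transitivity."""
    nodes = {node for pair in pairs for node in pair}
    # Reflexive: every class is a subclass of itself.
    closure = set(pairs) | {(node, node) for node in nodes}
    # Transitive: keep chaining pairs until no new pair appears.
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


subclass = {("poodle", "dog"), ("dog", "mammal")}
closure = reflexive_transitive_closure(subclass)
print(("poodle", "mammal") in closure, ("dog", "dog") in closure)
```

Under a non-transitive reading, only the stated pairs would hold and ("poodle", "mammal") would not follow. The same data yields different answers depending on whose logic is applied.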

Either one is a legitimate choice and I will cover the resolution of that difference elsewhere.

My point here is to ask: “Whose logic binds a topic map?”

My impression is that here and in the Semantic Web, logical frameworks are being created, into which users are supposed to fit their data.

As a user I would take serious exception to fitting my data into someone else’s world view (read logic).

That is the real question, isn’t it?

Whether IT/SW dictates to users the logic that will bind their data or if users get to define their own “logics?”

Given the popularity of tagging and folksonomies, user “logics” look like the better bet.

The UMLS Metathesaurus: representing different views of biomedical concepts

Wednesday, October 27th, 2010

The UMLS Metathesaurus: representing different views of biomedical concepts

Abstract

The UMLS Metathesaurus is a compilation of names, relationships, and associated information from a variety of biomedical naming systems representing different views of biomedical practice or research. The Metathesaurus is organized by meaning, and the fundamental unit in the Metathesaurus is the concept. Differing names for a biomedical meaning are linked in a single Metathesaurus concept. Extensive additional information describing semantic characteristics, occurrence in machine-readable information sources, and how concepts co-occur in these sources is also provided, enabling a greater comprehension of the concept in its various contexts. The Metathesaurus is not a standardized vocabulary; it is a tool for maximizing the usefulness of existing vocabularies. It serves as a knowledge source for developers of biomedical information applications and as a powerful resource for biomedical information specialists.

Bull Med Libr Assoc. 1993 Apr;81(2):217-22.
Schuyler PL, Hole WT, Tuttle MS, Sherertz DD.
Medical Subject Headings Section, National Library of Medicine, Bethesda, MD 20894.

Questions:

  1. Did you notice the date on the citation?
  2. Map this article to the Topic Maps Data Model (3-5 pages, no citations)
  3. Where does the Topic Maps Data Model differ from this article? (3-5 pages, no citations)
  4. If concept = proxy, what concepts (subjects) don’t have proxies in the Metathesaurus?
  5. On what basis are “biomedical meanings” mapped to a single Metathesaurus “concept?” Describe in general but illustrate with at least five (5) examples

Key-Value Pairs

Monday, September 13th, 2010

The Topic Maps Reference Model can’t claim to have invented the key/value view of the world.

But it is interesting how much traction key/value pair approaches have been getting of late. From NoSQL in general to Neo4j and Redis in particular. (no offense to other NoSQL contenders, those are the two that came to mind)

Declare which key/value pairs identify a subject and you are on your way towards a subject-centric view of computing.

OK, there are some details but declaring how you identify a subject is the first step in enabling others to reliably identify the same subject.
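A minimal Python sketch of that first step, assuming a made-up data set in which the key `employee_id` is declared to identify a subject. The key name, the sample values and both helper functions are hypothetical.

```python
# Hypothetical declaration: which keys identify a subject in this data set.
IDENTIFYING_KEYS = ("employee_id",)


def identity(proxy):
    """Project a proxy onto its identifying key/value pairs."""
    return tuple((key, proxy[key]) for key in IDENTIFYING_KEYS if key in proxy)


def merge_by_identity(proxies):
    """Group proxies that share an identity and union their properties."""
    merged = {}
    for proxy in proxies:
        ident = identity(proxy)
        if not ident:
            continue  # no identifying pair present, so nothing can be said about its subject
        merged.setdefault(ident, {}).update(proxy)
    return list(merged.values())


people = [
    {"employee_id": "e-17", "name": "Ada"},
    {"employee_id": "e-17", "dept": "R&D"},
    {"name": "unidentified"},
]
print(merge_by_identity(people))
```

Because the declaration is data rather than code, someone else can read it and apply the same identification to their own records.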

Cartesian Products and Topic Maps

Sunday, September 12th, 2010

Using SQL Cross Join – the report writers secret weapon is a very clear explanation of the utility of cross-joins in SQL.

Cross-join = Cartesian product, something you will remember from the Topic Maps Reference Model.

Makes a robust where clause look important, doesn’t it?
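A small Python sketch of the same point, using `itertools.product` from the standard library as a stand-in for a SQL cross join. The two tiny tables are made up for illustration.

```python
from itertools import product

# Two tiny, made-up tables.
customers = ["ann", "bob"]
months = ["jan", "feb", "mar"]

# CROSS JOIN: every row of one table paired with every row of the other.
cross = list(product(customers, months))
print(len(cross))  # 2 * 3 pairs

# A WHERE clause is simply a filter over that product.
filtered = [(customer, month) for customer, month in cross if month != "feb"]
print(len(filtered))
```

The product grows multiplicatively with table size, so the filter is what keeps the result useful.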

Set-Similarity and Topic Maps

Monday, July 12th, 2010

Set-similarity offers a useful way to think about merging in a topic maps context. The measure of self-similarity that we want for merging in topic maps is the same subject.

Self-similarity, in the TMDM, for topics is:

  • at least one equal string in their [subject identifiers] properties,
  • at least one equal string in their [item identifiers] properties,
  • at least one equal string in their [subject locators] properties,
  • an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
  • the same information item in their [reified] properties.
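The five conditions above can be sketched as a single test in Python. The `Topic` class and `tmdm_equal` function are hypothetical names for illustration, properties are modeled as frozensets of strings, and the fourth condition is checked in both directions since either topic may be “the one”.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class Topic:
    """Hypothetical stand-in for a TMDM topic item (only the identity properties)."""
    subject_identifiers: FrozenSet[str] = frozenset()
    item_identifiers: FrozenSet[str] = frozenset()
    subject_locators: FrozenSet[str] = frozenset()
    reified: Optional[object] = None


def tmdm_equal(a, b):
    """True if any one of the five listed conditions holds for topics a and b."""
    return bool(
        a.subject_identifiers & b.subject_identifiers
        or a.item_identifiers & b.item_identifiers
        or a.subject_locators & b.subject_locators
        or a.subject_identifiers & b.item_identifiers
        or a.item_identifiers & b.subject_identifiers
        or (a.reified is not None and a.reified is b.reified)
    )


t1 = Topic(subject_identifiers=frozenset({"http://example.org/a"}))
t2 = Topic(item_identifiers=frozenset({"http://example.org/a"}))
t3 = Topic(subject_locators=frozenset({"http://example.org/a"}))
print(tmdm_equal(t1, t2), tmdm_equal(t1, t3))
```

Note that every branch is an exact string or object match, which is a narrow notion of similarity compared with what the research literature explores.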

The research literature makes it clear that judging self-similarity isn’t subject to one test or even a handful of them for all purposes. Not to mention that more often than not, self-similarity is being judged on high dimensional data.

Despite clever approaches and quite frankly amazing results, I have yet to run across sustained discussion of how to interchange self-similarity tests. Perhaps it is my markup background but that seems like the sort of capability that would be widely desired.

The issue of interchangeable self-similarity tests looks like an area where JTC 1/SC 34/WG 3 could make a real contribution.

Second Class Citizens/Subjects

Thursday, April 29th, 2010

One of the difficulties that topic maps solve is the question of second class citizens (or subjects) in information systems.

The difficulty is one that Marijane raises when she quotes Michael Sperberg-McQueen wondering how topic maps differ from SQL databases, Prolog or colloquial XML?

One doesn’t have to read far to find that SQL databases, colloquial XML (and other information technologies) talk about real world subjects.*

The real world view leaves the subjects that comprise information systems out of the picture.

That creates an underclass of subjects that appear in information systems, but can never be identified or be declared to have more than one identification.

Mapping strategies, like topic maps, enable users to identify any subject. Any subject can have multiple identifiers. Users can declare what properties must be present to identify a subject, including the subjects that make up information systems.

*Note my omission of Prolog. Some programming languages may be more map friendly than others but I am unaware of any that cannot attribute properties to parts of a data structure (or its contents) for the purposes of mapping and declaring a mapping.

Implementing the TMRM (Part 2)

Wednesday, March 10th, 2010

I left off in Implementing the TMRM (Part 1) by saying that if the TMRM defined proxies for particular subjects, it would lack the generality needed to enable legends to be written between arbitrary existing systems.

The goal of the TMRM is not to be yet another semantic integration format but to enable users to speak meaningfully of the subjects their systems already represent and to know when the same subjects are being identified differently. The last thing we all need is another semantic integration format. Sorry, back to the main theme:

One reason why it isn’t possible to “implement” the TMRM is the lack of any subject identity equivalence rules.

String matching for IRIs is one test for equivalence of subject identification but not the only one. The TMRM places no restrictions on tests for subject equivalence so any implementation will only have a subset of all the possible subject equivalence tests. (Defining a subset of equivalence tests underlies the capacity for blind interchange of topic maps based on particular legends. More on that later.)

An implementation that compares IRIs for example, would fail if a legend asked it to compare the equivalence of Feynman diagrams generated from the detector output from the Large Hadron Collider. Equivalence of Feynman diagrams being a legitimate test for subject equivalence and well within the bounds of the TMRM.
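A minimal Python sketch of an implementation that supports only a subset of equivalence tests. The registry, the test names `iri` and `feynman-diagram`, and the `equivalent` function are all assumptions made for illustration.

```python
# Registry of the equivalence tests this (hypothetical) implementation supports.
EQUIVALENCE_TESTS = {
    "iri": lambda a, b: a == b,  # exact string match on IRIs
}


def equivalent(kind, a, b):
    """Apply the test a legend asks for, or fail loudly if it is not implemented."""
    try:
        test = EQUIVALENCE_TESTS[kind]
    except KeyError:
        raise NotImplementedError(f"no equivalence test for {kind!r}") from None
    return test(a, b)


print(equivalent("iri", "http://example.org/x", "http://example.org/x"))
try:
    equivalent("feynman-diagram", object(), object())
except NotImplementedError as err:
    print(err)
```

Adding a test means adding an entry to the registry. The set of entries an implementation ships with is, in effect, the set of legends it can honor.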

(It occurs to me that the real question to ask is why we don’t have more generalized legends with ranges of subject identity tests. Sort of like XML parsers only parse part of the universe of markup documents but do quite well within that subset. Apologies for the interruption, that will be yet another post.)

The TMRM is designed to provide the common heuristic through which the representation of any subject can be discussed. However, it does not define a processing model, which is another reason why it isn’t possible to “implement” the TMRM, but more on that in Implementing the TMRM (Part 3).

Implementing the TMRM (Part 1)

Monday, March 8th, 2010

There are two short pieces on the Topic Maps Reference Model (TMRM) that are helpful to read before talking about “implementing” the TMRM. Both are by Robert Barta, one of the co-editors of the TMRM: A 5 min Introduction into TMRM and TMRM Exegesis: Proxies.

The TMRM defines an abstract structure to enable us to talk about proxies, the generic representative for subjects. It does not define:

  1. Any rules for identifying subjects
  2. Any rules for comparing identifications of subjects
  3. Any rules for what happens if proxies represent the same subjects
  4. Any subjects for that matter

If that seems like a lot to not define, it was and it took a while to get there.

The TMRM does not define any of those things, not because they are not necessary, but because doing so would impair the ability of legends (the disclosures of all those things) to create views of information that merge diverse information resources.

Consider a recent call for help with the earthquake in Chile. Data was held by Google’s people finder service but the request was to convert it into RDF. Then do incremental dumps every hour.

So the data moves from one data silo to another data silo. As Ben Stein would say, “Wow.”

If we could identify the subjects, both structural and as represented, we could merge information about those subjects with information about the same subjects in any data silo, not just one in particular.

How is that for a business case? Pay to identify your subjects once versus paying that cost every time you move from one data silo to another one.

The generality of the TMRM is necessary to support the writing of a legend that identifies the subjects in more than one system and, more importantly, defines rules for when they are talking about the same subjects. (to be continued)

(BTW, using Robert Barta’s virtual topic map approach, hourly dumps/conversion would be unnecessary, unless there was some other reason for it. That is an approach that I hope continues in the next TMQL draft (see the current TMQL draft).)