Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 23, 2010

KNIME – Professional Open-Source Software

Filed under: Heterogeneous Data,Mapping,Software,Subject Identity — Patrick Durusau @ 7:27 pm

KNIME – Professional Open-Source Software is another effort by the domain-bridging folks I mentioned yesterday.

From the homepage:

KNIME (Konstanz Information Miner) is a user-friendly and comprehensive Open-Source data integration, processing, analysis, and exploration platform. From day one, KNIME has been developed using rigorous software engineering practices and is currently being used actively by over 6.000 professionals all over the world, both in industry and academia.

Read the KNIME features page for a very long list of potentially useful subject identity tests.

There is a place for string matching IRIs, but there is a world of subject identity beyond that as well.

July 27, 2010

Federation and Business Intelligence Applications

Filed under: Mapping,Subject Identity,Topic Map Software,Topic Maps — Patrick Durusau @ 8:09 pm

Federated Stream Processing Support for Real-Time Business Intelligence Applications by Irina Botan, Younggoo Cho, Roozbeh Derakhshan, Nihal Dindar, Laura Haas, Kihong Kim, and Nesime Tatbul, argues realtime BI has two critical requirements:

  1. reducing latency
  2. providing rich contextual data that is directly actionable

Topic maps enable you to (reliably) endow subjects in streams with rich contextual data that is directly actionable — across streams. (Not to mention that it will remain re-usable when your current IT department turns over.)

Selling topic maps means casting them in terms of fixing issues of interest to customers.

I think this is another opportunity that awaits some clever topic map company.

July 25, 2010

Dependency and Subject Identity

Filed under: Information Retrieval,Subject Identity,Topic Maps — Patrick Durusau @ 5:44 am

Dependence language model for information retrieval by Jianfeng Gao, Jian-yun Nie, Guangyuan Wu, and Guihong Cao is a good introduction to dependency analysis in information retrieval.

The theory is that terms (words) in a document depend upon other words and that those dependencies can be used to improve the results of information retrieval efforts.

Beyond its own merits, I find the analogy of dependency analysis to subject identification interesting: any subject identification depends upon other subjects being identified, whether those identifications are explicit or not.

If not explicit, we have the traditional IR problem of trying to determine what subjects were meant. We can see the patterns of usage but the reasons for the patterns lie just beyond our reach.

Dependency analysis does not seek an explicit identification but identifies patterns that appear to be associated with a particular term. That improves our “guesses” to a degree.

Topic maps enable us to make explicit what subjects the identification of a particular subject depends upon. Or rather to make explicit our identifications of subjects upon which an identification depends.

Whether the same subject is being identified, even by use of the same dependent identifications, is a question best answered by a user.

July 24, 2010

Dangers of Renaming

Filed under: Subject Identifiers,Subject Identity,Topic Maps — Patrick Durusau @ 3:49 pm

Topic maps and the semantic web share problems and dangers in their rush to re-name things with IRIs.

The problems include the number of subjects, the propagation (enforcement?) of new names, the emergence of new subjects, and others.

Re-naming has a graver danger, identified by Michael Shara, curator of Astrophysics, American Museum of Natural History, when asked why the heaviest star in the universe, R136a**, doesn’t have a better name.* He responded:

…partly because it [R136a] refers back to the original catalog, and once you go back to the original catalog, you can find all the literature that refers to it, so naming it John’s star or Betty’s Bright Object, would take that away from us.

So would renaming it to an IRI.

Request of the topic maps and semantic web communities:

Please let us keep our identifiers (as identifiers) and our history.

*****
*Weekend Edition – 24 July 2010 – Biggest Star Still Managed To Hide Until Just Now

**Astronomers find a 300 mass star (Royal Astronomical Society)

July 20, 2010

Subjects, Sets and Identifications

Filed under: Subject Identifiers,Subject Identity,Topic Maps — Patrick Durusau @ 6:18 am

There was a lively discussion on the topicmapmail discussion list about books and whether they have any universal identifiers. (Look in the archives for July, 2010 and messages with MARC in the subject line.)

There are known problems with ISBNs, such as publishers re-using them or assigning duplicate ISBNs to different books or simply making mistakes with the numbers themselves.

It was reported by one participant that Amazon uses its own unique identifier for books.

The United States Library of Congress has its own internal identifier for books in its collection.

Not to mention that other library systems have their own identifiers for their collections.

At a minimum, it is possible for a book, considered as a subject, to have an ISBN, an identifier from Amazon, another identifier at the Library of Congress and still others in other systems. Perhaps even a unique identifier from a book jobber that sells books to libraries.

If you think about that for a moment, it becomes clear that a book as a subject has a *set* of identifiers, all of which identify the same subject. Moreover, each of those identifiers works best in a particular context, dare we say the identifier has a scope?

If I had a representative (a topic) for this subject (book) that had a set of identifiers (ISBN, ASIN, LOC, etc.) and each of those identifiers had a scope, I could reliably import information from any source that used at least one of those identifiers.

The originators of those identifiers can continue to use their identifiers and yet enjoy the benefits of information that was generated or collected using other identifiers.
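A minimal sketch of that idea, not any particular topic map engine: a topic holds a set of (scope, value) identifiers, and an incoming record is merged into every topic that shares at least one of them. The names `Topic` and `import_record` and the placeholder identifier values are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A representative for one subject, e.g. a book."""
    # Each identifier is a (scope, value) pair such as ("isbn", "...").
    identifiers: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


def import_record(topics, identifiers, properties):
    """Add a record, merging it with every topic that shares an identifier."""
    identifiers = set(identifiers)
    merged = Topic(set(identifiers), dict(properties))
    remaining = []
    for topic in topics:
        if topic.identifiers & identifiers:
            merged.identifiers |= topic.identifiers
            # Simplification: existing values win on conflict. A real
            # system would probably keep both values and their sources.
            merged.properties = {**merged.properties, **topic.properties}
        else:
            remaining.append(topic)
    topics[:] = remaining + [merged]
    return merged


# Hypothetical sources, each using its own identifier for the same book.
topics = []
import_record(topics, {("isbn", "isbn-example")}, {"title": "Example Book"})
import_record(topics, {("asin", "asin-example")}, {"rating": 4})
# A record that carries both identifiers ties the two topics together.
book = import_record(
    topics,
    {("isbn", "isbn-example"), ("asin", "asin-example")},
    {"shelf": "A1"},
)
print(len(topics), sorted(book.properties))
```

Each source keeps its own identifier; the merge only needs one identifier in common.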

Topic maps anyone?

July 18, 2010

This Means This, This Means That: A User’s Guide to Semiotics

Filed under: Semantics,Semiotics,Subject Identity — Patrick Durusau @ 10:25 am

This Means This, This Means That: A User’s Guide to Semiotics was “recommended” to me by Amazon.

From the product description:

Divided into 75 key semiotic concepts, each section of the book begins with a single image or sign, accompanied by a question that invites us to interpret what we are seeing. Turning the page, we can compare our response with the theory behind the sign. In this way, we actively engage in creative thinking. Read straight through or dipped into regularly, this book provides practical examples of how meaning is made in contemporary culture.

I probably have better stuff on Semiotics on my bookshelf but what interests me is the approach taken to explaining the concepts.

I don’t have a copy (yet) but would like to hear from anyone who has used it in a classroom setting.

Wondering if something similar would prove useful as an introduction to subject analysis in general or for some area in particular?

Perhaps showing documented cases where mistakes in subject identity lead to spectacular outcomes?

A “cost” of mis-interpretation to hook users into thinking about subject identity before they get to the hard part.

July 14, 2010

Are simplified hadoop interfaces the next web cash cow? – Post

Filed under: Hadoop,Legends,MapReduce,Semantic Diversity,Subject Identity — Patrick Durusau @ 12:06 pm

Are simplified hadoop interfaces the next web cash cow? is a question that Brian Breslin is asking these days.

It isn’t hard to imagine not only Hadoop interfaces being cash cows but also canned analyses of public data sets that can be incorporated into those interfaces.

But then the semantics question comes back up when you want to join that canned analysis to your own. What did they mean by X? Or Y? Or for that matter, what are the semantics of the data set?

But we can solve that issue by explicit subject identification! Did I hear someone say topic maps? 😉 So our identifications of subjects in public data sets will themselves become a commodity. There could be competing set-similarity analyses of public data sets.

If a simplified Hadoop interface is the next cash cow, we need to be ready to stuff it with data mapped to subject identifications to make it grow even larger. A large cash cow is a good thing, a larger cash cow is better and a BP-sized cash cow is just about right.

July 2, 2010

Rough Fuzzies, and Beyond?

Filed under: Fuzzy Sets,Rough Sets,Semantic Diversity,Subject Identity — Patrick Durusau @ 8:18 pm

I was reading Rough Sets: Theoretical Aspects of Reasoning about Data by Zdzislaw Pawlak when I ran across this comparison of rough versus fuzzy sets:

Rough sets has often been compared to fuzzy sets, sometimes with a view to introduce them as competing models of imperfect knowledge. Such a comparison is unfounded. Indiscernibility and vagueness are distinct facets of imperfect knowledge. Indiscernibility refers to the granularity of knowledge, that affects the definition of universes of discourse. Vagueness is due to the fact that categories of natural language are often gradual notions, and refer to sets with smooth boundaries. Borrowing an example from image processing, rough set theory is about the size of pixels, fuzzy set theory is about the existence of more than two levels of grey. (pp. ix-x)

It occurred to me that the precision of our identifications or perhaps better, the fixed precision of our identifications is a real barrier to semantic integration. Because the precision I need for semantic integration is going to vary from subject to subject, depending upon what I already know, what I need to know and for what purpose. Very coarse identification may be acceptable for some purposes but not others.

I don’t know what it would look like to have varying degrees of precision to subject identification or even how that would be represented. But, I suspect solving those problems will be involved in any successful approach to semantic integration.
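One way to picture the "size of pixels" point, offered as a sketch of the standard rough-set lower and upper approximations rather than an answer to the open question above: the same target set is approximated under a coarse and a fine description of the records. The function name `approximations` and the sample records are hypothetical.

```python
from collections import defaultdict


def approximations(universe, target, key):
    """Return the (lower, upper) rough-set approximations of `target`.

    `key` maps each object to the description we can observe; objects
    with the same description are indiscernible from one another.
    """
    classes = defaultdict(set)
    for obj in universe:
        classes[key(obj)].add(obj)
    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:
            lower |= members   # every member is certainly in the target
        if members & target:
            upper |= members   # some member is possibly in the target
    return lower, upper


# Hypothetical records: (kind, year).
records = {("report", 2009), ("report", 2010), ("memo", 2010)}
wanted = {("report", 2010)}

# Coarse identification: only the kind is observed.
print(approximations(records, wanted, key=lambda r: r[0]))
# Fine identification: kind and year are both observed.
print(approximations(records, wanted, key=lambda r: r))
```

With the coarse key nothing is certainly the wanted record and both reports are possible matches; with the fine key the two approximations coincide. Which granularity is acceptable depends on the purpose, which is the point made above.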

June 20, 2010

“What Is I.B.M.’s Watson?” – Review

Filed under: Data Mining,Semantic Diversity,Subject Identity — Patrick Durusau @ 7:34 pm

What Is I.B.M.’s Watson? appears in the New York Times Magazine on 20 June 2010. IBM, or more precisely David Ferrucci and his team at IBM, have made serious progress towards a useful question-answering machine. (On Ferrucci see, Ferrucci – DBLP, Ferrucci – Scientific Commons)

It won’t spoil the article to say that raw computing horsepower (BlueGene servers) plays a role in the success of the Watson project. But, there is another aspect of the project that makes it relevant to topic maps.

Rather than relying on a few algorithms to analyze questions, Watson uses more than a hundred and as summarized by the article:

Another set of algorithms ranks these answers according to plausibility; for example, if dozens of algorithms working in different directions all arrive at the same answer, it’s more likely to be the right one. In essence, Watson thinks in probabilities. It produces not one single “right” answer, but an enormous number of possibilities, then ranks them by assessing how likely each one is to answer the question.

Transpose that into a topic maps setting and imagine that you are using probabilistic merging algorithms that are applied interactively by a user in real time.

Suddenly we are not talking about a technology for hand curated information resources but an assistive technology that would enable human users to go deep knowledge diving into the sea of information resources, while generating buoys and markers for others to follow.

Our ability to do that will depend on processing power, creative use and development of “probabilistic merging” algorithms and a Topic Maps Query Language that supports querying of non-topic map data and creation of content based on the results of those queries.
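As a rough illustration of what "probabilistic merging" might look like, and nothing like the scale of Watson: several independent identity tests each vote on whether two records describe the same subject, and candidates are ranked by the share of tests that agree. Equal weighting of the tests, the function name `rank_merge_candidates` and the sample records are all assumptions.

```python
def rank_merge_candidates(record, candidates, tests):
    """Rank candidates by the fraction of tests that say 'same subject'."""
    scored = []
    for candidate in candidates:
        votes = [test(record, candidate) for test in tests]
        scored.append((sum(votes) / len(votes), candidate))
    # Highest agreement first. The ranking is only a suggestion:
    # the user, not the code, decides whether to merge.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical identity tests; a real system would have many more.
tests = [
    lambda a, b: a["name"].lower() == b["name"].lower(),
    lambda a, b: a["city"] == b["city"],
]

record = {"name": "j. smith", "city": "Oslo"}
candidates = [
    {"name": "J. Jones", "city": "Bergen"},
    {"name": "J. Smith", "city": "Oslo"},
    {"name": "J. Smith", "city": "Bergen"},
]

for score, candidate in rank_merge_candidates(record, candidates, tests):
    print(f"{score:.2f}", candidate)
```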

****

PS: For more information on the Watson project, see: What Is Watson?, part of IBM’s DeepQA project.

June 15, 2010

Comparing Models – Exercise

Filed under: Exercises,Subject Identity — Patrick Durusau @ 10:45 am

The Library of Congress record for Meaning and mental representations illustrates why topic maps can be different from other information resources.

The record offers a default display, but also MARCXML, MODS, DUBLINCORE formats.

Each display is unique to that format.

Exercise: Requires pencil/pen, paper, scissors, tape.

Draw 4 unfolded cubes, ;-), just draw double lines across the paper and divide into 4 equal spaces.

Write down one of the values you see on the default page, say the title, Meaning and mental representations.

In the first box to your left (my right), write “Main Title.” Then go to each of the alternative formats and write down what subjects “contain” the title.

First difference, a topic map can treat the containers of subjects as subjects in their own right. (Important for mapping between systems and disclosing that mapping to others.)

Second difference, with the topic “unfolded” as it were, you can either view the other subjects that contain the subject of interest, or, you can cut the cube out and fold it up and display only one set of subjects at a time. You should fill out another set of boxes and make such cubes in preparation for the next difference.

Third difference, assuming that you have cut out two or more cubes and taped them together.

Rotate one of the cubes for a particular piece of information to a different face than the others.

Now we can see “Main Title” in the default system while seeing the author listing in Dublin Core. Our information system has become as heterogeneous as the data that it consumes.

Assignment: Do this exercise for 5 items in the LOC catalog (at least 3 fields) (your choice of items and fields) and prepare to discuss what insights this gives you about the items, their cataloging, the systems for classification or similar themes. Or a theme of your own. This entire area is very much in discovery mode.

June 12, 2010

The LibraryThing

Filed under: Collocation,Examples,Marketing,Subject Identity — Patrick Durusau @ 3:42 pm

The LibraryThing is the home of OverCat, a collection of 32 million library records.

It is a nifty illustration of re-using identifiers, not re-inventing them.

I put in an ISBN, for example, and the system searches for that work. It does not ask me to create a “cool” URI for it.

It also demonstrates some of the characteristics of a topic map in that it does not return multiple matches for all the libraries that hold a work, but only one. (You can still view the other records as well.)

I am not sure I have the time to enter, even by ISBN, all the books that line the walls of my office but maybe I will start with the new ones as they come in and the older ones as I use them. The result is a catalog of my books, but more importantly, additional information about those works entered by others.

Maybe that could be a marketing pitch for topic maps? That topic maps enable users to coordinate their information with others, without prior agreement. Sort of like asking for a ride to town and at the same time, someone in a particular area says they are going to town but need to share gas expenses. (Treating a circumference around a set of geographic coordinates as a subject. Users neither know nor care about the details, just expressing their needs.)

June 7, 2010

Datasets Galore! (Data.gov)

Filed under: Data Integration,Data Source,Linked Data,Subject Identity,Topic Maps — Patrick Durusau @ 9:56 am

Data.gov hosts 272,677 datasets.

LinkingOpenData will point you to a subset of 400 that is available as “Linked Data.”

I guess that means that the other 272,277 datasets are not “Linked Data.”

Fertile ground for topic maps.

Topic Maps don’t limit users to “[u]se URIs as names for things.” (Linked Data)

A topic map can use the identifiers that are in place in one or more of the 272,277 datasets and create mappings to one or more of the 400 datasets in “Linked Data.”

Without creating “Linked Data” or the overhead of the “303 Cloud.”

Which datasets look the most promising to you?

The Value of Indexing

Filed under: Citation Indexing,Indexing,Subject Identity — Patrick Durusau @ 8:46 am

The Value of Indexing (2001) by Jan Sykes is a promotion piece for Factiva, a Dow Jones and Reuters Company, but is also a good overview of the value of indexing.

I find it interesting in its description of the use of a taxonomy for indexing purposes. You may remember from reading a print index the use of the term “see also.” This paper appears to argue that the indexing process consists of mapping one or more terms to a single term in the controlled vocabulary.

A single entry from the controlled vocabulary represents a particular concept no matter how it was referred to in the original article. (page 5)

I assume the mapping between the terms in the article and the term in the controlled vocabulary is documented. That mapping may be of more interest to the professionals who create the indexes and power users than the typical user.
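A small sketch of that split, under the assumption (mine, not the Factiva paper's) that each mapping carries a short note explaining it: the typical user sees only the controlled term, while an indexer or power user can ask for the documentation as well. The class and function names and the sample terms are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermMapping:
    article_term: str      # the term as it appeared in the article
    controlled_term: str   # the single entry in the controlled vocabulary
    reason: str            # documentation for indexers and power users


def build_index(mappings):
    """Index mappings by lower-cased article term."""
    return {m.article_term.lower(): m for m in mappings}


def lookup(index, term, show_reason=False):
    """Return the controlled term, plus the reason when asked for it."""
    mapping = index.get(term.lower())
    if mapping is None:
        return None
    if show_reason:
        return mapping.controlled_term, mapping.reason
    return mapping.controlled_term


index = build_index([
    TermMapping("heart attack", "Myocardial infarction", "lay synonym"),
    TermMapping("MI", "Myocardial infarction", "common abbreviation"),
])

print(lookup(index, "Heart Attack"))            # what a typical user sees
print(lookup(index, "MI", show_reason=True))    # what an indexer can see
```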

Perhaps that is a lesson in terms of what is presented to users of topic maps.

Delivery of the information a user wants/needs in their context is more important than demonstrating our cleverness.

That was one of the mistakes in promoting markup, too much emphasis on the cool, new, paradigm shifting and too little emphasis on the benefit to users. With office products that use markup in a non-visible manner to the average user, markup usage has spread rapidly around the world.

Suggestions on how to make that happen for topic maps?

PS: Obviously this is an old piece so in fairness I am contacting Factiva to advise them of this post and to ask if they have an updated paper, etc. that they might want me to post. I will take the opportunity to plug topic maps as well. 😉

June 3, 2010

Connecting The Dots

Filed under: Subject Identity,Topic Maps — Patrick Durusau @ 2:20 pm

I have listened to and tried to help refine marketing for topic maps. The one possible slogan is that topic maps make vendor X’s software suck less. Hardly a ringing endorsement of topic maps. 😉

There is the venerable “connecting the dots” theme, but I can connect dots with a pen and one of those puzzle books they sell at the airports. I don’t need a topic map to connect dots. Besides, I am the one who does the connecting of the dots, I just use a topic map to write my connecting of the dots down.

Maybe that is part of the answer.

Topic maps give us a way to write down our connecting of the dots. I can’t think of any search engine that allows you to store your connecting of any dots you find. True enough, applications like Talend help you write down your mapping of dots from one data source to another. But with one important difference from topic maps.

You can’t share your dots or their connections with others. Not and expect them to make sense to anyone else. It is the original topic map dilemma. No one knows what dots you have identified or connected and you don’t have any way to tell them.

With topic maps you can identify your dots, say how they are connected, and share them with others.

That sounds pretty close to being an elevator speech to me. Suggestions?

PS: I like the idea of connecting dots that can later be extended by others. Remember the original European mapping expeditions in Africa or South America? They were all partial and all later extended by others. If that were to happen today, the argument would be how to best map the entire territory all at once. Which is doable, but only with omitting a lot of detail, such as meeting the actual residents.

Think of “exploring” one of the document archives that Jason Baron maintains at the U.S. National Archives and Records Administration and connecting a set of dots, that are later extended or perhaps merged with dots identified and connected by others. Eventually, with enough people connecting the dots, the “dark” areas become fewer and fewer. Not unlike what news reporters, lawyers and researchers do now, with the exception that the connected dots become useful to others. Collaborative discovery anyone?

June 1, 2010

Enhancing navigation in biomedical databases by community voting and database-driven text classification

Enhancing navigation in biomedical databases by community voting and database-driven text classification demonstrates improvement of automatic classification of literature by harnessing community knowledge.

From the authors:

Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases.

The system can be seen at: PepBank.

You need to read the article in full to appreciate what the authors have done but a couple of quick points to notice:

1) The use of heat maps to assist users in determining the relevance of a given abstract. (Domain specific facts.)

2) The user interface allows yes/no voting on the same facts as appear in the heat map.

Voting results in reclassification of the entries.

Equally important is a user interface that enables immediate evaluation of relevance and quick user feedback on relevance.

The user is not asked a series of questions or given complex rating choices; it is yes or no. That may seem coarse, but the project demonstrates that, with proper design, it can be very useful.

May 31, 2010

Authoritative Identifications?

Filed under: Semantic Web,Subject Identity — Patrick Durusau @ 3:10 pm

Sam Hunting reminded me that if a method of identification becomes authoritative, that can lead to massive loss of data (prior methods of identification). We were discussing the Semantic Web Challenge. That assumes systems that do not support multiple “authoritative” and alternative identifications.

While I can understand the concern, I think it is largely unwarranted.

Natural language and consequently identification have been taking care of themselves in the face of “planned” language proposals for centuries. According to Klaus Schubert in the introduction to: Interlinguistics: Aspects of the Science of Planned Languages, Berlin: Mouton de Gruyter, 1989, there are almost 1,000 such projects, most since the second half of the 19th century. I suspect the count was too low by the time it was published.

The welter of identifications has continued merrily along for more than the last 20 years so I don’t feel like we are in any imminent danger of uniformity.

And, as a practical matter, more than a billion speakers of Chinese, Japanese and Korean are bringing their concerns and identifications of subjects to the WWW in a way that will be hard to ignore. (Nor should they be.)

Systems that support multiple authoritative and alternative identifications will be the future of the WWW.

PS: The use of owl:sameAs is a pale glimmer of what needs to be possible for reliable mappings of identifications. The reason for any mapping remains unknown.

Semantic Web Challenge

The Semantic Web Challenge 2010 details landed in my inbox this morning. My first reaction was to refine my spam filter. 😉 Just teasing. My second and more considered reaction was to think about the “challenge” in terms of topic maps.

Particularly because a posting from the Ontology Alignment Evaluation Initiative arrived the same day, in response to a posting from sameas.org.

I freely grant that URIs that cannot distinguish between identifiers and resources without 303 overhead are poor design. But the fact remains that there are many data sets, representing large numbers of subjects that have even poorer subject identification practices. And there are no known approaches that are going to result in the conversion of those data sets.

Personally I am unwilling to wait until some new “perfect” language for data sweeps the planet and results in all data being converted into the “perfect” format. Anyone who thinks that is going to happen needs to stand with the end-of-the-world-in-2012 crowd. They have a lot in common. Magical thinking being one common trait.

The question for topic mappers to answer is how do we attribute to whatever data language we are confronting, characteristics that will enable us to reliably merge information about subjects in that format either with other information in the same or another data language? Understanding that the necessary characteristics may vary from data language to data language.

Take the lack of a distinction between identifier and resource in the Semantic Web for instance. One easy step towards making use of such data would be to attribute to each URI the status of either being an identifier or a resource. I suspect, but cannot say, that the authors/users of those URIs know the answer to that question. It seems even possible that some sets of such URIs are all identifiers and if so marked/indicated in some fashion, they automatically become useful as just that, identifiers (without 303 overhead).
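A minimal sketch of that "easy step", assuming someone who knows the data set has recorded a role for each URI: once the role is marked, the identifiers can be picked out without dereferencing anything. The enum `UriRole`, the function `identifiers_only` and the example.org URIs are hypothetical.

```python
from enum import Enum


class UriRole(Enum):
    IDENTIFIER = "identifier"   # names a subject; need not be fetched
    RESOURCE = "resource"       # is itself the thing being addressed


def identifiers_only(annotated_uris):
    """Return the URIs marked as identifiers; no HTTP request is made."""
    return sorted(
        uri for uri, role in annotated_uris.items()
        if role is UriRole.IDENTIFIER
    )


# Hypothetical annotations supplied by whoever knows the data set.
annotated = {
    "http://example.org/id/alice": UriRole.IDENTIFIER,
    "http://example.org/doc/alice.html": UriRole.RESOURCE,
    "http://example.org/id/bob": UriRole.IDENTIFIER,
}

print(identifiers_only(annotated))
```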

As identifiers they may lack the resolution that topic maps provide to the human user, which enables them to better understand what subject is being identified. But, since topic maps can map additional identifiers together, when you encounter a deficient identifier, simply create another one for the same subject and map them together.

I think we need to view the Semantic Web data sets as opportunities to demonstrate how understanding subject identity, however that is indicated, is the linchpin to meaningful integration of data about subjects.

Bearing in mind that all our identifications, Semantic Web, topic map or otherwise, are always local, provisional and subject to improvement, in the eye of another.

May 26, 2010

Ontological Emptiness

Filed under: Mapping,Ontological Emptiness,Subject Identity — Patrick Durusau @ 12:48 pm

Bernard Vatant’s ontological emptiness comment on mapping of identifiers continues to haunt me.

I am tempted to say that if the identifiers are unambiguous, then an ontologically empty mapping is sufficient. What more is there to say than each of two or more identifiers do in fact identify the same subject?

That begs the identification question, doesn’t it? To say that two or more identifiers identify the same subject presumes a judgment on some basis that the identifiers do in fact represent the same subject. Bernard is asserting that a mapping in the absence of a basis for mapping is sufficient.

When put that way, “mapping in the absence of a basis for mapping,” then Bernard’s proposal seems deeply problematic, at least for human users.

For computers a mapping is always just a mapping. There may be reasons to include or exclude the basis for a mapping, but at the end of the day, the result is a mapping. (There may be values that trigger mappings but that isn’t the same as a “reason” for a mapping.)

For the human user, on the other hand, the information “behind” each identifier, is what they use to form a judgment about the subject an identifier represents. That enables them to form a judgment about the mapping of identifiers. And whether they wish to follow the same mapping.

Perhaps we should separate the question of how to communicate to a user why a mapping has occurred from the simple fact of mapping in an information system? The information system is incapable of caring by definition and perhaps the basis for mapping is simply clutter from its perspective. The human user, on the other hand, needs the information that is meaningless to the information system.

May 25, 2010

A Mapmaker’s Manifesto

Filed under: Maps,Search Engines,Search Interface,Searching,Subject Identity,Usability — Patrick Durusau @ 3:48 pm

Search Patterns by Peter Morville and Jeffery Callender should be on your must read list. Their “Mapmaker’s Manifesto” will give you an idea of why I like the book:

  1. Search is a problem too big to ignore.
  2. Browsing doesn’t scale, even on an iPhone.
  3. Size matters. Linear growth compels a step change in design.
  4. Simple, fast, and relevant are table stakes.
  5. One size won’t fit all. Search must adapt to context.
  6. Search is iterative, social, and multisensory.
  7. Increments aren’t enough. Even Google must innovate or die.
  8. It’s not just about findability. It’s not just about the Web.
  9. The challenge is radically multidisciplinary.
  10. We must engage engineers and executives in design.
  11. We can learn from the past. Library science is still relevant.
  12. We can learn from behavior. Interaction design affords actionable results.
  13. We can learn from one user. Analytics is enriched by ethnography.
  14. Some patterns, we should study and reuse.
  15. Some patterns, we should break like a bad habit.
  16. Search is a complex adaptive system.
  17. Emergence, cocreation, and self-organization are in play.
  18. To discover the seeds of change, go outside.
  19. In science, fiction, and search, the map invents the territory.
  20. The future isn’t just unwritten—it’s unsearched.

I also like Search Patterns because the authors concede there are vast unknowns as opposed to saying: “If you just use our (insert paradigm/syntax/ontology/language) then all those nasty problems go away.”

I think we need to accept their invitation to face the vast unknowns head on.

May 22, 2010

CoReference Service

Filed under: Conferences,Ontological Emptiness,Subject Identity — Patrick Durusau @ 3:04 pm

Coreference as Service by Bernard Vatant says the ontological emptiness of an identifier mapping service determines its usefulness.

I wonder how to know when that will be true?

That is, I can imagine use cases where empty mapping of identifiers is good enough for some purpose.

In the case Bernard is talking about the identifiers are of geographic locations. Perhaps there is a common enough frame of reference for that to work.

On the other hand, I can imagine coreference services with mappings based upon “attributes” associated with identifiers.

How to judge between which one to use seems like an open question to me.

May 12, 2010

Time, Tide and Identifiers Wait for No One

Filed under: Subject Identifiers,Subject Identity — Patrick Durusau @ 10:38 am

The earliest record of time and tide wait for no man dates from 1225 and reads in modern English:

the tide abides for, tarrieth for no man, stays no man, tide nor time tarrieth no man

Meaning no one can command time. The same is true for identifiers.

What do you think “tide” means in the title? Ocean tide perhaps?

In the original phrase, “tide” meant a period of time. The identifier persisted, but its meaning changed.

Identifiers for subjects and their meanings change.

Topic maps can follow those changes.  Can you?

April 29, 2010

Second Class Citizens/Subjects

Filed under: Semantic Diversity,Subject Identity,TMRM — Patrick Durusau @ 6:39 pm

One of the difficulties that topic maps solve is the question of second class citizens (or subjects) in information systems.

The difficulty is one that Marijane raises when she quotes Michael Sperberg-McQueen wondering how topic maps differ from SQL databases, Prolog or colloquial XML?

One doesn’t have to read far to find that SQL databases, colloquial XML (and other information technologies) talk about real world subjects.*

The real world view leaves the subjects that comprise information systems out of the picture.

That creates an underclass of subjects that appear in information systems, but can never be identified or be declared to have more than one identification.

Mapping strategies, like topic maps, enable users to identify any subject. Any subject can have multiple identifiers. Users can declare what properties must be present to identify a subject. That includes the subjects that make up information systems.

*Note my omission of Prolog. Some programming languages may be more map friendly than others but I am unaware of any that cannot attribute properties to parts of a data structure (or its contents) for the purposes of mapping and declaring a mapping.

April 28, 2010

URIs As Shorthand

Filed under: PSI,Subject Identity — Patrick Durusau @ 3:49 pm

Inge Henriksen made me realize that URIs are being used as a shorthand for the {set of properties} that identify subjects.

A user recognizes a subject by observing/recognizing some {set of properties}.

They then choose a URI as the shorthand for the {set of properties} they recognized.

To interchange a URI with others, the other users need to know what {set of properties} map to the URI.

Corollary: If no {set of properties} maps to a URI, there is no interchange.

Well, no reliable interchange.

Inge could use http://psi.ontopedia.net/Inge_Henriksen to identify himself. I could use http://psi.ontopedia.net/Inge_Henriksen to identify Gjetost cheese. If the URI did not map to a set of properties, how would you choose between them?

(Detail: Less than all the properties in the {set of properties} may identify a subject. I’ll talk about that at a later point.)
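
A minimal sketch of the point above, in Python. The property names and values are invented for illustration, and comparing whole property sets for equality is a simplifying assumption (as noted, fewer than all the properties may be enough to identify a subject). Two uses of the same URI can only be told apart when each use comes with a {set of properties}:

```python
from typing import FrozenSet, Optional, Tuple

Properties = FrozenSet[Tuple[str, str]]

# Two users attach different property sets to the same URI,
# http://psi.ontopedia.net/Inge_Henriksen (property values are invented).
inge: Properties = frozenset({("type", "person"), ("name", "Inge Henriksen")})
cheese: Properties = frozenset({("type", "cheese"), ("name", "Gjetost")})


def same_subject(props_a: Optional[Properties],
                 props_b: Optional[Properties]) -> Optional[bool]:
    """Compare two uses of one URI by the property sets behind them.

    Returns None when either use has no property set: the question
    cannot be answered, so there is no reliable interchange.
    """
    if props_a is None or props_b is None:
        return None
    return props_a == props_b


print(same_subject(inge, cheese))  # the two uses can be told apart
print(same_subject(None, cheese))  # nothing to compare against
```

With property sets on both sides the conflict is at least detectable. With a bare URI on either side, the function has nothing to decide on.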

April 27, 2010

Use My Model/Language Mister!

Filed under: Authoring Topic Maps,Semantics,Subject Identity,Topic Maps — Patrick Durusau @ 6:48 pm

“Use My Model/Language Mister!” is the cry of markup, modeling and semantics projects.

They are all equally sincere and if you don’t like any of them, wait another six months or so for additional choices.

I don’t remember if it was after the 75th or 100th or somewhere past the 100th “true” model that I began to suspect something was amiss.

Models and languages change over time and can be barriers to discussion and discovery of badly needed information.

Rather than arguing for this or that model, as though it were some final answer, why not ask which model suits our present purposes?

With topic maps, once the subjects under discussion are identified, how they are represented for some purpose is a detail. A very important detail but a detail none the less.

If, or rather when, our requirements change, the same subject can be represented in a different way. The subjects can be identified, again, to create a new representation, or, if identified using topic maps, our job of moving to another model just got a whole lot easier.

April 26, 2010

Are Topic Maps News?

Filed under: Semantic Diversity,Subject Identity,Topic Maps — Patrick Durusau @ 4:24 pm

Jack Park, co-editor of XML Topic Maps, likes to tell me: “topic maps are not news.” I respond with a variety of explanations/defenses.

Today I wrote the following:

Topic maps are a representation of what people have been doing for as long as they have been able to communicate and have had different ways to identify things they wanted to talk about.

Some people were able to recognize the same subjects were being identified differently, so they created a mental mapping of the different identifiers. When we reached the age of recorded information, that mental mapping enabled them to find information recorded under different identifications for the same subject.

Topic maps, like thesauri and indexes before them, enable people to write down their mappings. And say on what basis those mappings were done. The first act enables people to use mappings done by others, like thesauri and indexes. The second act, recording the reason for the mapping (subject identity), enables the re-use of a mapping.

So, no news. Saving time, money, resources, enabling auditability/transparency, preserving institutional memory, re-use of mappings (reliably), making more information available to more people, but alas, no news.

April 23, 2010

Are Data Mediators Topic Maps?

Filed under: Mapping,Subject Identity — Patrick Durusau @ 8:32 pm

Data mediators are similar to topic maps. They take heterogeneous data and present a common interface to it.

Does that make a data mediator mapping a topic map?

No!

Here is a test to see if a data mediator mapping qualifies as a topic map:

  1. Create two mappings of the same data using mediators that use different vocabularies.
  2. State on what basis you would combine components from the two mappings. (Or even the basis for mapping from the data source to the target, but I digress.)

What a data mediator creates is a useful but blind mapping. That is, the reason for the mapping is not apparent from the map.

That prevents re-use of the mapping by others. Such as combining it with other mappings.

The reason for the mapping, subject identification in topic map terms, remains in the head of the person who created the mapping.

What happens when they move up, on or simply retire?

The business case question is:

How many times have you paid to have subjects in your information system identified?

Or even better:

How many more times are you going to pay to have subjects in your information system identified?

Is this time going to be the last time? Could be if you were using topic maps.
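
The two-step test above can be sketched in Python. This is only an illustration: the `Mapping` class, the field names and the property values are all hypothetical, and "the bases agree" is reduced to simple set equality. A blind mapping records source and target but no basis, so there is nothing to combine it on:

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Mapping:
    source: str   # field in the original data
    target: str   # term in the mediator's vocabulary
    # Identifying properties that justify the mapping; empty means "blind".
    basis: FrozenSet[Tuple[str, str]] = frozenset()


def combinable(a: Mapping, b: Mapping) -> bool:
    """Two mappings can be combined only if both state a basis and the bases agree."""
    return bool(a.basis) and bool(b.basis) and a.basis == b.basis


# Two mediators, two vocabularies, no recorded reason for either mapping.
blind_a = Mapping("cust_nm", "clientName")
blind_b = Mapping("kunde", "buyer")

# The same mappings with the (invented) identifying properties written down.
basis = frozenset({("role", "purchasing party"), ("key", "tax id")})
stated_a = Mapping("cust_nm", "clientName", basis)
stated_b = Mapping("kunde", "buyer", basis)

print(combinable(blind_a, blind_b))    # False: the reason stayed in someone's head
print(combinable(stated_a, stated_b))  # True: the recorded basis can be compared
```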

April 20, 2010

Data Virtualization

Filed under: Data Integration,Heterogeneous Data,Subject Identity — Patrick Durusau @ 6:47 pm

I ran across a depressing quote today on data virtualization:

But data is distributed, heterogeneous, and often full of errors. Simply federating it insufficient. IT organizations must build a single, accurate, and consistent view of data, and deliver it precisely when it’s needed. Data virtualization needs to take this complexity into account.*

It is very important to have a single view of data for some purposes, but what happens when circumstances change and we need a different view than the one before?

Without explicit identification of subjects, all the IT effort that went into the first data integration project gets repeated in the next data integration project.

You would think that after sixty years of data migration, largely repeating the efforts of prior migrations, even business types would have caught on by this point.

Without explicit identification of subjects, there isn’t any way to “know” what subjects were being identified. Or to create reliable new mappings. So the cycle of data migrations goes on and on.

Break the cycle of data migrations, choose topic maps!

*Look under webinars at: http://www.informatica.com/Pages/index.aspx There wasn’t a direct link that I could post to lead you to the quote.

April 18, 2010

Maps and Territories

Filed under: Maps,Subject Identity — Patrick Durusau @ 6:26 pm

All maps are territories.

The question for comparing SQL (or any other system) to topic maps is:

Can SQL (or other system) recognize one of its own mappings/models as a territory for mapping? If so, how?

I reviewed Chapter 14, “Semantic Modeling,” of C. J. Date’s An Introduction To Database Systems and modeling/mapping there refers to objects in “the real world.”

I take it that Date would exclude SQL schemas as the objects of modeling or mapping with a relational database.

Does anyone have a different impression?

April 15, 2010

What Is Your TFM (To Find Me) Score?

Filed under: Information Retrieval,Recall,Search Engines,Subject Identity — Patrick Durusau @ 10:54 am

I have talked about TFM (To Find Me) scores before. Take a look at How Can I Find Thee? Let me count the ways… for example.

So, you have looked at your OPAC, database, RDF datastore, topic map. What is your average TFM score?

What do you think it needs to be for 60 to 80% retrieval?

The Furnas article from 1983 is the key to this series of posts. See the full citation in Are You Designing a 10% Solution?.

Would you believe 15 ways to identify a subject? Or aliases, to use the common terminology.

Say it slowly, 15 ways to identify a subject gets on average 60 to 80% retrieval. If you are in the range of 3 – 5 ways to identify a subject on your ecommerce site, you are leaving money on the table. Lots of money on the table.

Want to leave less money on the table? Use topic maps and try for 15 aliases for a subject or more.
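
A rough way to check where you stand, sketched in Python. It assumes a TFM score is simply the number of distinct aliases recorded for a subject, and the catalog below is invented for illustration:

```python
from statistics import mean

# Hypothetical catalog: subject -> aliases recorded for it.
catalog = {
    "sofa": {"sofa", "couch", "settee", "davenport"},
    "sneakers": {"sneakers", "trainers", "tennis shoes"},
}


def tfm_scores(catalog):
    """Number of distinct aliases per subject."""
    return {subject: len(aliases) for subject, aliases in catalog.items()}


def average_tfm(catalog):
    """Average number of aliases across all subjects."""
    return mean(tfm_scores(catalog).values())


def below_target(catalog, target=15):
    """Subjects with fewer aliases than the target, sorted by name."""
    return sorted(s for s, n in tfm_scores(catalog).items() if n < target)


print(average_tfm(catalog))   # 3.5
print(below_target(catalog))  # ['sneakers', 'sofa']
```

Both subjects in this toy catalog sit in the 3 to 5 range, well short of 15.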

April 11, 2010

Texts and Topic Maps

Filed under: Subject Identity,Topic Maps — Patrick Durusau @ 8:50 pm

Topic maps are composed of representatives of subjects, that is, representatives of:

anything whatsoever, regardless of whether it exists or has any other specific characteristics, about which anything whatsoever may be asserted by any means whatsoever (TMDM, 3.14)

Every text is composed of representatives of subjects as well.

Does that make every text a topic map? The answer to that is “no” but why?

Comparing a Text and a Topic Map:

Property | Text | Topic Map
Subject Representatives | yes | yes
Explicit Rules for Identification/Representation | no | yes
Explicit Rules for Merging | no | yes

I waver between saying that the explicit rules for Identification/Representation are sufficient by themselves and adding explicit rules for Merging. Certainly the rules for merging presume the first but without rules for merging, the rules for identification/representation are nugatory.

Following both sets of rules does not necessarily result in merging all the subject representatives for the same subject. The most any topic map application can claim is that a set of rules for identification/representation has been followed by a particular map and that specified rules for merging have been applied.

Whether a topic map has in fact properly “merged” all the subject representatives is a judgment only a human reader can make, alongside whatever texts they happen to be reading.

PS: Merging means that a single representative for a subject results, containing all the different identifications for that subject and any properties of that subject.
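
The PS can be sketched in Python. The merging rule used here (two representatives merge when they share at least one identifier) is one possible explicit rule, chosen for illustration. The identifiers and properties are invented, and a representative is modeled as a plain dict:

```python
def merge(representatives):
    """Merge representatives that share at least one identifier.

    Each representative is a dict with two sets: "identifiers" and
    "properties". The result holds one representative per subject,
    carrying every identifier and property of the ones it absorbed.
    """
    merged = []
    for rep in representatives:
        ids = set(rep["identifiers"])
        props = set(rep["properties"])
        remaining = []
        for other in merged:
            if ids & other["identifiers"]:
                # Same subject under this rule: absorb the other representative.
                ids |= other["identifiers"]
                props |= other["properties"]
            else:
                remaining.append(other)
        remaining.append({"identifiers": ids, "properties": props})
        merged = remaining
    return merged


reps = [
    {"identifiers": {"http://example.org/id/a"}, "properties": {("name", "A")}},
    {"identifiers": {"http://example.org/id/b"}, "properties": {("kind", "x")}},
    {"identifiers": {"http://example.org/id/a", "http://example.org/id/b"},
     "properties": {("note", "bridge")}},
]

result = merge(reps)
print(len(result))  # 1: the third representative links the first two
```

The code only shows that the stated rule was applied. Whether the three representatives really stand for one subject is still the reader's call.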
