Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 7, 2014

Filtering: Seven Principles

Filed under: Filters,Legends,Merging — Patrick Durusau @ 5:29 pm

Filtering: Seven Principles by JP Rangaswami.

When you read “filters” in the seven rules, think merging rules.

From the post:

  1. Filters should be built such that they are selectable by subscriber, not publisher.
  2. Filters should intrinsically be dynamic, not static.
  3. Filters should have inbuilt “serendipity” functionality.
  4. Filters should be interchangeable, exchangeable, even tradeable.
  5. The principal filters should be by choosing a variable and a value (or range of values) to include or exclude.
  6. Secondary filters should then be about routing.
  7. Network-based filters, “collaborative filtering” should then complete the set.

Nat Torkington comments on this list:

I think the basic is: 0: Customers should be able to run their own filters across the information you’re showing them.

+1!

And it should be simpler than hunting for .config/google-chrome/Default/User Stylesheets/Custom.css (for Chrome on Ubuntu).

Ideally, you would make a selection (on a webpage) and choose an action.

The ability to dynamically select properties for merging would greatly enhance a user’s ability to explore and mine a topic map.
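Here is a minimal sketch (in Python, with made-up property names and values) of what subscriber-selected, swappable merging filters might look like. This is not any particular engine's API; the point is only that the reader, not the publisher, chooses which properties count toward merging, and can change that choice at will.

```python
# A minimal sketch of subscriber-selected, swappable merging filters.
# Property names and values below are invented for illustration.

from typing import Callable, Dict, List, Set

Proxy = Dict[str, Set[str]]  # property name -> set of values

def make_filter(properties: List[str]) -> Callable[[Proxy, Proxy], bool]:
    """Build a merging filter: two proxies are treated as the same subject
    when they share a value on any of the subscriber-chosen properties."""
    def same_subject(a: Proxy, b: Proxy) -> bool:
        return any(a.get(p, set()) & b.get(p, set()) for p in properties)
    return same_subject

# Two descriptions of (perhaps) the same hotel, from different publishers.
a = {"name": {"Hilton Paris Opera"}, "city": {"Paris"}, "registry_id": {"X123"}}
b = {"name": {"Hilton Paris Opéra"}, "city": {"Paris"}, "registry_id": {"X123"}}

by_registry = make_filter(["registry_id"])   # one subscriber's choice
by_name = make_filter(["name"])              # another subscriber's choice

print(by_registry(a, b))   # True  -- merged under this filter
print(by_name(a, b))       # False -- not merged under that one
```

The filters are ordinary values, so they can be exchanged, combined or replaced at run time, which is the "dynamic, interchangeable" part of the seven principles.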

I first saw this in Nat Torkington’s Four short links: 6 January 2014.

September 24, 2013

Rumors of Legends (the TMRM kind?)

Filed under: Bioinformatics,Biomedical,Legends,Semantics,TMRM,XML — Patrick Durusau @ 3:42 pm

BioC: a minimalist approach to interoperability for biomedical text processing (numerous authors, see the article).

Abstract:

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/.

From the introduction:

With the proliferation of natural language text, text mining has emerged as an important research area. As a result many researchers are developing natural language processing (NLP) and information retrieval tools for text mining purposes. However, while the capabilities and the quality of tools continue to grow, it remains challenging to combine these into more complex systems. Every new generation of researchers creates their own software specific to their research, their environment and the format of the data they study; possibly due to the fact that this is the path requiring the least labor. However, with every new cycle restarting in this manner, the sophistication of systems that can be developed is limited. (emphasis added)

That is the experience with creating electronic versions of the Hebrew Bible. Every project has started from a blank screen, requiring re-proofing of the same text, etc. As a result, there is no electronic encoding of the masora magna (think long margin notes). Duplicated effort has a real cost to scholarship.

The authors stray into legend land when they write:

Our approach to these problems is what we would like to call a ‘minimalist’ approach. How ‘little’ can one do to obtain interoperability? We provide an extensible mark-up language (XML) document type definition (DTD) defining ways in which a document can contain text, annotations and relations. Major XML elements may contain ‘infon’ elements, which store key-value pairs with any desired semantic information. We have adapted the term ‘infon’ from the writings of Devlin (1), where it is given the sense of a discrete item of information. An associated ‘key’ file is necessary to define the semantics that appear in tags such as the infon elements. Key files are simple text files where the developer defines the semantics associated with the data. Different corpora or annotation sets sharing the same semantics may reuse an existing key file, thus representing an accepted standard for a particular data type. In addition, key files may describe a new kind of data not seen before. At this point we prescribe no semantic standards. BioC users are encouraged to create their own key files to represent their BioC data collections. In time, we believe, the most useful key files will develop a life of their own, thus providing emerging standards that are naturally adopted by the community.

The “key files” don’t specify subject identities for the purposes of merging. But defining the semantics of data is a first step in that direction.

I like the idea of popular “key files” (read legends) taking on a life of their own due to their usefulness. An economic activity based on reducing the friction in using or re-using data. That should have legs.
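To make the "infon plus key file" idea concrete, here is a hedged sketch in Python. The XML below is illustrative only, not the actual BioC DTD, and the identifier is just an example; the point is that key/value pairs travel with the annotation while a separate, human-readable key file records what the keys mean.

```python
# A hedged sketch of the "infon" idea: key/value pairs ride along with an
# annotation, while a separate key file documents what those keys mean.
# Illustrative XML only, not the actual BioC DTD.

import xml.etree.ElementTree as ET

doc = ET.Element("document")
passage = ET.SubElement(doc, "passage")
ET.SubElement(passage, "text").text = "BRCA1 mutations increase cancer risk."

ann = ET.SubElement(passage, "annotation", id="1")
ET.SubElement(ann, "infon", key="type").text = "gene"
ET.SubElement(ann, "infon", key="identifier").text = "example-db:672"
ET.SubElement(ann, "text").text = "BRCA1"

print(ET.tostring(doc, encoding="unicode"))

# The matching "key file" is simply documentation of the semantics in use:
KEY_FILE = """
type:        kind of annotation (gene, disease, ...)
identifier:  database identifier for the annotated entity
"""
```

Two collections that share the same key file share the same semantics, which is exactly the sense in which a popular key file could become an emerging standard.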

BTW, don’t overlook the authors’ data and code, available at: http://bioc.sourceforge.net/.

October 9, 2012

Appropriating IT: Glue Steps [Gluing Subject Representatives Together?]

Filed under: Legends,Proxies,Semantic Diversity,Semantic Inconsistency,TMRM — Patrick Durusau @ 4:39 pm

Appropriating IT: Glue Steps by Tony Hirst.

Over the years, I’ve been fortunate enough to have been gifted some very evocative, and powerful, ideas that immediately appealed to me when I first heard them and that I’ve been able to draw on, reuse and repurpose over and over again. One such example is “glue logic”, introduced to me by my original OU PhD supervisor George Kiss. The idea of glue logic is to provide a means by which two digital electronic circuits (two “logic” circuits) that don’t share a common interface can be “glued” together.

(diagrams and other material omitted)

This idea is powerful enough in its own right, but there was a second bit to it that made it really remarkable: the circuitry typically used to create the glue logic was a device known as a Field Programmable Gate Array, or FPGA. This is a type of digital circuit whose logical function can be configured, or programmed. That is, I can take my “shapeless” FPGA, and programme it so that it physically implements a particular digital circuit. Just think about that for a moment… You probably have a vague idea that the same computer can be reprogrammed to do particular things, using some vaguely mysterious and magical thing called software, instructions that computer processors follow in order to do incredible things. With an FPGA, the software actually changes the hardware: there is no processor that “runs a programme”; when you programme an FPGA, you change its hardware. FPGAs are, literally, programmable chips. (If you imagine digital circuits to be like bits of plastic, an FPGA is like polymorph.)

The notion of glue logic has stuck with me for two reasons, I think: firstly, because of what it made possible, the idea of flexibly creating an interface between two otherwise incompatible components; secondly, because of the way in which it could be achieved – using a flexible, repurposable, reprogrammable device – one that you could easily reprogramme if the mapping from one device to another wasn’t quite working properly.

If you read “semantic diversity” instead of “don’t share a common interface,” and “legend” in place of Field Programmable Gate Array (FPGA), so that the glue “creat[es] an interface between two otherwise incompatible [subject representatives],” you would think Tony’s post was about the topic maps reference model.

Well, this post is, and Tony’s comes very close.

Particularly the part about being a “reprogrammable device.”

I can tell you: “black” = “schwarz,” but without more, you won’t be able to rely on or extend that statement.

For that, you need a “reprogrammable device” and some basis on which to do the reprogramming.

Legends anyone?
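As a rough sketch (mine, not Tony's, and a toy rather than a real implementation) of what a "reprogrammable" legend might look like in code: the equivalence "black" = "schwarz" only holds relative to a mapping, and the mapping itself can be extended or swapped, FPGA-style, without touching the data.

```python
# A hedged sketch of a "reprogrammable" legend: equivalences hold only
# relative to a mapping, and the mapping can be reprogrammed at any time.

class Legend:
    """Toy grouping of terms into equivalence classes; a real implementation
    would use union-find or a full identity rule set."""
    def __init__(self):
        self._canon = {}          # term -> canonical key

    def equate(self, *terms):
        key = self._canon.get(terms[0], terms[0])
        for t in terms:
            self._canon[t] = key

    def same_subject(self, a, b):
        return self._canon.get(a, a) == self._canon.get(b, b)

color_legend = Legend()
color_legend.equate("black", "schwarz", "noir")

print(color_legend.same_subject("black", "schwarz"))  # True, under this legend
print(color_legend.same_subject("black", "blanc"))    # False, under this legend

# "Reprogramming": extend the legend and the same data reads differently.
color_legend.equate("black", "nero")
print(color_legend.same_subject("schwarz", "nero"))   # True, after reprogramming
```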

October 1, 2011

DSL for the Uninitiated (Legends?)

Filed under: DSL,Legends — Patrick Durusau @ 8:26 pm

DSL for the Uninitiated by Debasish Ghosh

From the post:

One of the main reasons why software projects fail is the lack of communication between the business users, who actually know the problem domain, and the developers who design and implement the software model. Business users understand the domain terminology, and they speak a vocabulary that may be quite alien to the software people; it’s no wonder that the communication model can break down right at the beginning of the project life cycle.

A DSL (domain-specific language) bridges the semantic gap between business users and developers by encouraging better collaboration through shared vocabulary. The domain model that the developers build uses the same terminologies as the business. The abstractions that the DSL offers match the syntax and semantics of the problem domain. As a result, users can get involved in verifying business rules throughout the life cycle of the project.

This article describes the role that a DSL plays in modeling expressive business rules. It starts with the basics of domain modeling and then introduces DSLs, which are classified according to implementation techniques. The article then explains in detail the design and implementation of an embedded DSL from the domain of securities trading operations.

The subject identity and merging requirements of a particular domain are certainly issues where users, who actually know the problem domain, should be in the lead. Moreover, if users object to the result of some merging operation, that objection will call attention to possibly unintended consequences of an identity or merging rule.

Perhaps the rule is incorrect, perhaps there are assumptions yet to be explored, but the focus is on the user’s understanding of the domain, where it should be (assuming the original coding is correct).

This sounds like a legend to me.
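To make that concrete, here is a hedged sketch, not Ghosh's DSL and with invented property names, of an embedded identity/merging rule that a business user could read, and object to, directly:

```python
# A hedged sketch of an embedded rule language for identity/merging rules.
# Not Ghosh's DSL; the properties (isin, trade_currency) are illustrative.

class Rule:
    def __init__(self, name):
        self.name, self.required = name, []

    def same(self, *properties):
        """Two records represent the same subject when ALL listed
        properties are present and match."""
        self.required.extend(properties)
        return self

    def applies(self, a: dict, b: dict) -> bool:
        return all(a.get(p) is not None and a.get(p) == b.get(p)
                   for p in self.required)

# Reads close to the domain vocabulary, so the domain expert can challenge it:
security_identity = Rule("security").same("isin").same("trade_currency")

a = {"isin": "US0378331005", "trade_currency": "USD"}
b = {"isin": "US0378331005", "trade_currency": "EUR"}
print(security_identity.applies(a, b))   # False: currencies differ, no merge
```

Whether requiring the trade currency to match is correct is precisely the kind of question the domain expert, not the developer, should settle.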

BTW, the comments point to Lisp resources that got to DSLs first (as is the case with most/all programming concepts):

Matthias Felleisen | Thu, 04 Aug 2011 22:26:46 UTC

DSLs have been around in the LISP world forever. The tools for building them and for integrating them into the existing toolchain are far more advanced than in the JAVA world. For an example, see

http://www.ccs.neu.edu/scheme/pubs/#pldi11-thacff for a research-y introduction

or

http://hashcollision.org/brainfudge/ for a hands-on introduction.

You might also want to simply start at the Racket homepage.

July 28, 2011

Another Word For It at #2,000

Filed under: Legends,Subject Identity,TMRM,Topic Maps — Patrick Durusau @ 6:55 pm

According to my blogging software this is my 2,000th post!

During the search for content and ideas for this blog I have thought a lot about topic maps and how to explain them.

Or should I say how to explain topic maps without inventing new terminologies or notations? 😉

Topic maps deal with a familiar problem:

People use different words when talking about the same subject and the same word when talking about different subjects.

Happens in conversations, newspapers, magazines, movies, videos, tv/radio, texts, and alas, electronic data.

The confusion caused by using different words for the same subject and same word for different subjects is a source of humor. (What does “nothing” stand for in Shakespeare’s “Much Ado About Nothing”?)

In searching electronic data, that confusion causes us to miss some data we want to find (different word for the same subject) and to find some data we don’t want (same word but different subject).

When searching old newspaper archives this can be amusing and/or annoying.

Potential outcomes of failure elsewhere:

  • medical literature: injury/death/liability
  • financial records: civil/criminal liability
  • patents: lost opportunities/infringement
  • business records: civil/criminal liability

Solving the problem of different words for the same subject and the same word but different subjects is important.

But how?

Topic maps and other solutions have one thing in common:

They use words to solve the problem of different words for the same subject and the same word but different subjects.

Oops!

The usual battle cry is “if everyone uses my words, we can end semantic confusion, have meaningful interchange for commerce, research, cultural enlightenment and so on and so forth.”

I hate to be the bearer of bad news, but what about all the petabytes of data we already have on hand, with zettabytes of previous interpretations? With more being added every day and no universal solution in sight? (If you don’t like any of the current solutions, wait a few months and new proposals, schemas, vocabularies, etc., will surface. Or you can take the most popular approach and start your own.)

Proposals to deal with semantic confusion are also frozen in time and place. Unlike the human semantics they propose to sort out, they do not change and evolve.

We have to use the source of semantic difficulty, words, in crafting a solution and our solution has to evolve over time even as our semantics do.

That’s a tall order.

Part of the solution, if you want to call it that, is to recognize when the benefits of solving semantic confusion outweigh the cost of the solution. We don’t need to solve semantic confusion everywhere and anywhere it occurs. In some cases, perhaps rather large cases, it isn’t worth the effort.

That triage of semantic confusion allows us to concentrate on cases where the investment of time and effort is worthwhile. In searching for the Hilton Hotel in Paris I may get “hits” for someone with underwear control issues, but so what? Is that really a problem that needs a solution?

On the other hand, being able to resolve semantic confusion, such as underlies different accounting systems for businesses, could give investors a clearer picture of the potential risks and benefits of particular investments. Or doing the same for financial institutions so that regulators can “look down” into regulated systems with some semantic coherence (without requiring identical systems).

Having chosen some semantic confusion to resolve, we then have to choose a method to resolve it.

One method, probably the most popular one, is the “use my (insert vocabulary)” method for resolving semantic confusion. It works, and for some cases it may be all that you need. Databases with gigabyte-sized tables (and larger) operate quite well using this approach. It can become problematic after acquisitions, when migration to other database systems is required. Undocumented semantics can prove costly in many situations.

Semantic Web techniques, leaving aside the fanciful notion of unique identifiers, do offer the capability of recording additional properties about terms or rather the subjects that terms represent. Problematically though, they don’t offer the capacity to specify which properties are required to distinguish one term from another.

No, I am not about to launch into a screed about why “my” system works better than all the others.

Recognition that all solutions are composed of semantic ambiguity is the most important lesson of the Topic Maps Reference Model (TMRM).

Keys (of key/value pairs) are pointers to subject representatives (proxies), and values may themselves be such references. Other keys and/or values may point to other proxies that represent the same subjects, which replicates the current dilemma.

The second important lesson of the TMRM is the use of legends to define what key/value pairs occur in a subject representative (proxy) and how to determine two or more proxies represent the same subject (subject identity).

Neither lesson ends semantic ambiguity, nor do they mandate any particular technology or methodology.

They do enable the creation and analysis of solutions, including legends, with an awareness that they are all partial mappings, with costs and benefits.
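A minimal sketch, not the TMRM itself and with invented keys, of proxies as key/value pairs plus a legend that says which keys count toward subject identity:

```python
# A hedged sketch, not the TMRM: proxies as sets of key/value pairs,
# plus a legend naming the keys that establish subject identity.

from itertools import combinations

Proxy = dict  # key -> value (values could themselves reference other proxies)

class Legend:
    def __init__(self, identity_keys):
        self.identity_keys = identity_keys

    def same_subject(self, a: Proxy, b: Proxy) -> bool:
        # Two proxies represent the same subject when they agree on every
        # identity key that both of them carry (and share at least one).
        shared = [k for k in self.identity_keys if k in a and k in b]
        return bool(shared) and all(a[k] == b[k] for k in shared)

legend = Legend(identity_keys=["taxpayer_id", "passport_no"])

p1 = {"name": "J. Smith", "taxpayer_id": "123-45-6789"}
p2 = {"name": "John Smith", "taxpayer_id": "123-45-6789", "city": "Austin"}
p3 = {"name": "J. Smith"}   # no shared identity keys: not merged

for a, b in combinations([p1, p2, p3], 2):
    print(legend.same_subject(a, b))   # True, False, False
```

A different legend, with different identity keys, would produce a different (and equally partial) mapping over the same proxies, which is the point.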

I will continue the broad coverage of this blog on semantic issues but in the next 1,000 posts I will make a particular effort to cover:

  • Ex Parte Declaration of Legends for Data Sources (even using existing Linked Data where available)
  • Suggestions for explicit subject identity mapping in open source data integration software
  • Advances in graph algorithms
  • Sample topic maps using existing and proposed legends

Other suggestions?

November 3, 2010

Managing Semantic Ambiguity

Filed under: Legends,TMDM,Topic Maps — Patrick Durusau @ 6:52 pm

Topic maps do not and cannot eliminate semantic ambiguity. What topic maps can do is assist users in managing semantic ambiguity with regard to identification of particular subjects.

Consider the well-known ambiguity of whether a URI is an identifier or an address.

The Topic Maps Data Model (TMDM) provides a way to manage that ambiguity by offering a means to declare whether a URI is being used as an identifier or as an address.

That is only “managing” the ambiguity because there is no way to prevent incorrect use of that declaration, which would reintroduce ambiguity or even leave the mechanism misunderstood entirely.
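A hedged sketch (not a TMDM implementation, with an example.org URI) of the declaration in question: the same URI string means different things depending on which property it is placed in, and nothing stops an author from choosing the wrong one.

```python
# A hedged sketch of the TMDM distinction between a URI used as a subject
# identifier and a URI used as a subject locator (an address).

from dataclasses import dataclass, field

@dataclass
class Topic:
    subject_identifiers: set = field(default_factory=set)  # URI identifies the subject
    subject_locators: set = field(default_factory=set)     # URI *is* the subject (a resource)

uri = "http://example.org/paris"

about_the_city = Topic(subject_identifiers={uri})   # the city of Paris
about_the_page = Topic(subject_locators={uri})      # the web page at that address

# Same string, two different subjects; the ambiguity is managed by the
# author's declaration, not eliminated by the model.
print(about_the_city.subject_identifiers == about_the_page.subject_locators)  # True
```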

Identification by saying a subject representative (proxy) must have properties X…Xn is a collection of possible ambiguities that an author hopes will be understood by a reader.

Since we are trying to communicate with other people, there isn’t any escape from semantic ambiguity. Ever.

Topic maps provide the ability to offer more complete descriptions of subjects in hopes of being understood by others.

They also allow descriptions of subjects from others to be added, offering users a variety of descriptions of the same subject.

We have had episodic forays into “certainty,” the Semantic Web being only one of the more recent failures in that direction. Ambiguity anyone?

July 14, 2010

Are simplified hadoop interfaces the next web cash cow? – Post

Filed under: Hadoop,Legends,MapReduce,Semantic Diversity,Subject Identity — Patrick Durusau @ 12:06 pm

Are simplified hadoop interfaces the next web cash cow? is a question that Brian Breslin is asking these days.

It isn’t hard to imagine not only Hadoop interfaces becoming cash cows, but also canned analyses of public data sets that can be incorporated into those interfaces.

But then the semantics question comes back up when you want to join that canned analysis to your own. What did they mean by X? Or Y? Or for that matter, what are the semantics of the data set?

But we can solve that issue by explicit subject identification! Did I hear someone say topic maps? 😉 So our identifications of subjects in public data sets will themselves become a commodity. There could be competing set-similarity analyses of public data sets.
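A hedged sketch, with invented field names and example.org identifiers, of what "explicit subject identification" could look like before joining someone else's canned analysis with your own:

```python
# A hedged sketch: map each source's field names to explicit subject
# identifiers so a downstream join compares subjects, not column names.
# Field names and identifier URIs are hypothetical.

PUBLIC_FIELD_MAP = {"rev": "http://example.org/subject/quarterly-revenue",
                    "co":  "http://example.org/subject/company"}
LOCAL_FIELD_MAP  = {"revenue": "http://example.org/subject/quarterly-revenue",
                    "company": "http://example.org/subject/company"}

def to_subjects(record: dict, field_map: dict) -> dict:
    """Re-key a record by subject identifier instead of local field name."""
    return {field_map[k]: v for k, v in record.items() if k in field_map}

public_row = {"co": "ACME", "rev": 1_000_000}
local_row  = {"company": "ACME", "revenue": 1_200_000}

print(to_subjects(public_row, PUBLIC_FIELD_MAP))
print(to_subjects(local_row, LOCAL_FIELD_MAP))
# Both rows now share keys, so later joins compare like with like.
```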

If a simplified Hadoop interface is the next cash cow, we need to be ready to stuff it with data mapped to subject identifications to make it grow even larger. A large cash cow is a good thing, a larger cash cow is better and a BP-sized cash cow is just about right.

March 11, 2010

In Praise of Legends (and the TMDM in particular)

Filed under: Legends,TMDM — Patrick Durusau @ 8:50 pm

Legends enable topic maps to have different representations of the same subject. Standard legends, like the Topic Maps Data Model (TMDM), are what enable blind interchange of topic maps.

Legends do a number of things but, among the more important, they define the rules for the contents of subject representatives and the rules for comparing them. The TMDM defines three representatives for subjects: topics, associations and occurrences. It also defines how to compare those representatives to see if they represent the same subjects.

Just to pull one of those rules out, if two or more topics have an equal string in their [subject identifiers] property, the two topics are deemed to represent the same subject. (TMDM 5.3.5 Properties) The [subject identifiers] property is a set so a topic could have two or more different strings in that property to match other topics.
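A hedged sketch of just that rule, with illustrative identifiers and nothing like a full TMDM merge:

```python
# A hedged sketch of the rule quoted above: two topics represent the same
# subject when their [subject identifiers] sets share at least one string.

def should_merge(topic_a: dict, topic_b: dict) -> bool:
    return bool(topic_a["subject_identifiers"] & topic_b["subject_identifiers"])

t1 = {"name": "Puccini",
      "subject_identifiers": {"http://en.wikipedia.org/wiki/Giacomo_Puccini"}}
t2 = {"name": "Giacomo Puccini",
      "subject_identifiers": {"http://en.wikipedia.org/wiki/Giacomo_Puccini",
                              "http://example.org/id/giacomo-puccini"}}

print(should_merge(t1, t2))   # True: one shared identifier string is enough
```

Because the property is a set, a topic can carry several identifier strings and so match topics from several different sources.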

It is the definition of a basis for comparison (see the TMDM for the other rules for comparing topics) of topics that enables the blind interchange of topic maps that follow the TMDM. That is to say that I can author a topic map in XTM (one syntax that follows the TMDM) and reasonably expect that other users will be able to successfully process it.

I am mindful of Robert Cerny’s recent comment on encoding strings as URLs but don’t think that covers the case where the identifications of the subjects are dynamic. That is to say that the strings themselves are composed of strings that are subject to change as additional items are merged into the topic map.

The best use case that comes to mind is the current concern in the United States over the non-sharing of intelligence data. You know, someone calls up and says their son is a terrorist and is coming to the United States to commit a terrorist act. That sort of intelligence. That isn’t passed on to anyone. At least not to anyone who cared enough to share it with, I don’t know, the airlines perhaps?

If I can author a subject identification that includes a previously overlooked source of information, say the parent of a potential terrorist, in addition to the usual categories (paid informants, current/former drug lords, etc.), then the lights aren’t simply blinking red; there is actual information in addition to the blinking lights.

I really should wait for Robert to make his own arguments, but if you think of URLs as simply strings, without any need for resolution, you could compose a dynamic identification, freeze it into a URL, then pass it along to a TMDM-based system. You don’t get any additional information, but that would be one way to input such information into a TMDM-based system. If you control the server, you could provide a resolution back into the dynamic subject identification system. (Will have to think about that one.)
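A hedged sketch of that suggestion, with a hypothetical example.org domain: compose the identification as key/value pairs, freeze it into a URL-shaped string, and hand that string to a TMDM-based system as a subject identifier.

```python
# A hedged sketch: freeze a dynamic key/value identification into a
# URL-shaped string. The base domain and property names are hypothetical.

from urllib.parse import urlencode

def freeze_identification(props: dict, base="http://example.org/id?") -> str:
    """Serialize the identification canonically (sorted keys) so the same
    properties always freeze to the same string."""
    return base + urlencode(sorted(props.items()))

ident = {"source": "informant-report", "name": "A. Example", "date": "2010-01-01"}
frozen = freeze_identification(ident)
print(frozen)
# If you control example.org, the frozen URL could also resolve back into
# the dynamic identification system it came from.
```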

I think of it as the TMDM using sets of immutable strings for subject identification and one of the things the TMRM authorizes, but does not mandate, is the use of mutable strings as subject identifiers.
