Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 4, 2011

Pragmatic Philosophical Technology for Text Mining

Filed under: Subject Identity,Topic Maps — Patrick Durusau @ 6:16 pm

Pragmatic Philosophical Technology for Text Mining

Mathew Hurst writes:

In text mining applications, we often work with some form of raw input (web pages, web sites, emails, etc.) and attempt to organize it in terms of the concepts that are mentioned or introduced in the documents.

This process of interpretation can take the form of ‘normalization’ or ‘canonicalization’ (in which many expressions are associated with a singular expression as an exemplar of a set). This happens, for example, when we map ‘Barack Obama’, ‘President Obama’, etc. to a unique string ‘President Barack Obama’. This is convenient when we want to retrieve all documents about the president.

In this process, we are associating elements within the same language (language in the sense of sets of symbols and the rules that govern their legal generation).

Another approach is to map (or associate) the terms in the original document with some structured record. For example, we might interpret the phrase ‘Starbucks’ as relating to a record of key value pairs {name=starbucks, address=123 main street, …}. In this case, the structure of the record has a semantics (or model) other than that of the original document. In other words, we are mapping from one language to another.

Of course, what we want to do is denote the thing in the real world. It is, however, impossible to represent this as all we can do is shuffle bits around inside the computer. We can’t attach a label to the real world and somehow transcend the reality/representation barrier. However, we can start to look at the modeling process with some pragmatics.
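To keep the two kinds of mapping straight, here is a minimal sketch in Python; the canonical strings, synonyms and record fields are mine, chosen only for illustration:

```python
# Normalization/canonicalization: many expressions -> one exemplar string.
CANONICAL = {
    "Barack Obama": "President Barack Obama",
    "President Obama": "President Barack Obama",
}

def normalize(mention):
    # Fall back to the mention itself when no exemplar is known.
    return CANONICAL.get(mention, mention)

# Mapping into another language: a mention -> a structured record whose
# semantics (the model) live outside the source document.
RECORDS = {
    "Starbucks": {"name": "starbucks", "address": "123 main street"},
}

def interpret(mention):
    return RECORDS.get(mention)

print(normalize("President Obama"))  # -> President Barack Obama
print(interpret("Starbucks"))        # -> {'name': 'starbucks', 'address': '123 main street'}
```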

Wrestling with subject identity issues. Worth your time to read and comment.

July 28, 2011

Another Word For It at #2,000

Filed under: Legends,Subject Identity,TMRM,Topic Maps — Patrick Durusau @ 6:55 pm

According to my blogging software this is my 2,000th post!

During the search for content and ideas for this blog I have thought a lot about topic maps and how to explain them.

Or should I say how to explain topic maps without inventing new terminologies or notations? 😉

Topic maps deal with a familiar problem:

People use different words when talking about the same subject and the same word when talking about different subjects.

Happens in conversations, newspapers, magazines, movies, videos, tv/radio, texts, and alas, electronic data.

The confusion caused by using different words for the same subject and same word for different subjects is a source of humor. (What does “nothing” stand for in Shakespeare’s “Much Ado About Nothing”?)

In searching electronic data, that confusion causes us to miss some data we want to find (different word for the same subject) and to find some data we don’t want (same word but different subject).

When searching old newspaper archives this can be amusing and/or annoying.

Potential outcomes of failure elsewhere:

  • medical literature: injury/death/liability
  • financial records: civil/criminal liability
  • patents: lost opportunities/infringement
  • business records: civil/criminal liability

Solving the problem of different words for the same subject and the same word but different subjects is important.

But how?

Topic maps and other solutions have one thing in common:

They use words to solve the problem of different words for the same subject and the same word but different subjects.

Oops!

The usual battle cry is “if everyone uses my words, we can end semantic confusion, have meaningful interchange for commerce, research, cultural enlightenment and so on and so forth.”

I hate to be the bearer of bad news but what about all the petabytes of data we already have on hand with zettabytes of previous interpretations? With more being added every day and no universal solution in sight? (If you don’t like any of the current solutions, wait a few months and new proposals, schemas, vocabularies, etc., will surface. Or you can take the most popular approach and start your own.)

Proposals to deal with semantic confusion are also frozen in time and place. Unlike the human semantics they propose to sort out, they do not change and evolve.

We have to use the source of semantic difficulty, words, in crafting a solution and our solution has to evolve over time even as our semantics do.

That’s a tall order.

Part of the solution, if you want to call it that, is to recognize when the benefits of solving semantic confusion outweigh the cost of the solution. We don’t need to solve semantic confusion everywhere and anywhere it occurs. In some cases, perhaps rather large cases, it isn’t worth the effort.

That triage of semantic confusion allows us to concentrate on cases where the investment of time and effort are worthwhile. In searching for the Hilton Hotel in Paris I may get “hits” for someone with underwear control issues but so what? Is that really a problem that needs a solution?

On the other hand, being able to resolve semantic confusion, such as underlies different accounting systems for businesses, could give investors a clearer picture of the potential risks and benefits of particular investments. Or doing the same for financial institutions so that regulators can “look down” into regulated systems with some semantic coherence (without requiring identical systems).

Having chosen some semantic confusion to resolve, we then have to choose a method to resolve it.

One method, probably the most popular one, is the “use my (insert vocabulary)” method for resolving semantic confusion. It works, and for some cases may be all that you need. Databases with gigabyte size tables (and larger) operate quite well using this approach. It can become problematic after acquisitions when migration to other database systems is required. Undocumented semantics can prove to be costly in many situations.

Semantic Web techniques, leaving aside the fanciful notion of unique identifiers, do offer the capability of recording additional properties about terms or rather the subjects that terms represent. Problematically though, they don’t offer the capacity to specify which properties are required to distinguish one term from another.

No, I am not about to launch into a screed about why “my” system works better than all the others.

Recognition that all solutions are composed of semantic ambiguity is the most important lesson of the Topic Maps Reference Model (TMRM).

Keys (of key/value pairs) are pointers to subject representatives (proxies) and values may be such references. Other keys and/or values may point to other proxies that represent the same subjects. Which replicates the current dilemma.

The second important lesson of the TMRM is the use of legends to define what key/value pairs occur in a subject representative (proxy) and how to determine when two or more proxies represent the same subject (subject identity).

Neither lesson ends semantic ambiguity, nor do they mandate any particular technology or methodology.

They do enable the creation and analysis of solutions, including legends, with an awareness they are all partial mappings, with costs and benefits.
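A rough sketch of both lessons, with proxies as plain key/value maps and a legend supplying the identity rule. The keys and the rule below are my invention for illustration, not anything the TMRM mandates:

```python
# Proxies: key/value collections standing in for subjects.  (In a full
# TMRM treatment the keys and values could themselves point to proxies.)
proxy_a = {"name": "Paris", "kind": "city", "country": "France"}
proxy_b = {"name": "Paris", "kind": "city", "population": "2.2 million"}
proxy_c = {"name": "Paris", "kind": "person", "surname": "Hilton"}

# The legend: which key/value pairs establish subject identity.
IDENTITY_KEYS = ("name", "kind")

def same_subject(p, q):
    """Legend rule: both proxies carry every identity key and agree on all of them."""
    return all(k in p and k in q and p[k] == q[k] for k in IDENTITY_KEYS)

def merge(p, q):
    merged = dict(p)
    merged.update(q)
    return merged

print(same_subject(proxy_a, proxy_b))   # True  -> merge them
print(same_subject(proxy_a, proxy_c))   # False -> different subjects
print(merge(proxy_a, proxy_b))
```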

I will continue the broad coverage of this blog on semantic issues but in the next 1,000 posts I will make a particular effort to cover:

  • Ex Parte Declaration of Legends for Data Sources (even using existing Linked Data where available)
  • Suggestions for explicit subject identity mapping in open source data integration software
  • Advances in graph algorithms
  • Sample topic maps using existing and proposed legends

Other suggestions?

July 21, 2011

Oracle, Sun Burned, and Solr Exposure

Filed under: Data Mining,Database,Facets,Lucene,SQL,Subject Identity — Patrick Durusau @ 6:27 pm

Oracle, Sun Burned, and Solr Exposure

From the post:

Frankly we wondered when Oracle would move off the dime in faceted search. “Faceted search”, in my lingo, is showing users categories. You can fancy up the explanation, but a person looking for a subject may hit a dead end. The “facet” angle displays links to possibly related content. If you want to educate me, use the comments section for this blog, please.

We are always looking for a solution to our clients’ Oracle “findability” woes. It’s not just relevance. Think performance. Query and snack is the operative mode for at least one of our technical baby geese. Well, Oracle is a bit of a red herring. The company is not looking for a solution to SES11g functionality. Lucid Imagination, a company offering enterprise grade enterprise search solutions, is.

If “findability” is an issue at Oracle, I would be willing to bet that subject identity is as well. Rumor has it that they have paying customers.

July 17, 2011

Hadoop & Startups: Where Open Source Meets Business Data

Filed under: Hadoop,Marketing,Subject Identity — Patrick Durusau @ 7:28 pm

Hadoop & Startups: Where Open Source Meets Business Data

From the post:

A decade ago, the open-source LAMP (Linux, Apache, MySQL, PHP/Python) stack began to transform web startup economics. As new open-source webservers, databases, and web-friendly programming languages liberated developers from proprietary software and big iron hardware, startup costs plummeted. This lowered the barrier to entry, changed the startup funding game, and led to the emergence of the current Angel/Seed funding ecosystem. In addition, of course, to enabling a generation of webapps we all use everyday.

This same process is now unfolding in the Big Data space, with an open-source ecosystem centered around Hadoop displacing the expensive, proprietary solutions. Startups are creating more intelligent businesses and more intelligent products as a result. And perhaps even more importantly, this technological movement has the potential to blur the sharp line between traditional business and traditional web startups, dramatically expanding the playing field for innovation.

So, how do we create an open-source subject identity ecosystem?

Note that I said “subject identity ecosystem” and not URLs pointing at arbitrary resources. URLs are useful, but subject identity, to be re-usable, requires more than that.

July 7, 2011

Wordnick

Filed under: Graphs,MongoDB,Subject Identity — Patrick Durusau @ 4:15 pm

Wordnick – Building a Directed Graph with MongoDB

Tony Tam slide deck on directed graphs and MongoDB.

Emphasizes that what graph you build depends on your application needs. Much like using your tests for subject identity. You could always use mine but never quite as well or as accurately as your own.

May 3, 2011

The Human – Computer Chasm & Topic Maps

Filed under: Authoring Topic Maps,Subject Identity,Topic Maps — Patrick Durusau @ 1:33 pm

Someone asked the other day why I thought adoption of topic maps hasn’t “set the woods on fire,” as my parents’ generation would say.

I am in the middle of composing a longer response with suggestions for marketing strategies but I wanted to stop and share something about the human – computer chasm that is relevant to topic maps.

Over the years the topic map community has debated various syntaxes, models, data models, recursive subject representation, query languages and the like. All of which have been useful and sometimes productive debates.

But in those debates, we sometimes (always?) over-looked the human – computer chasm when talking about subject identity.

Take a simple example:

When I see my mother-in-law I don’t think:

  1. http://www.durusau.net/general/Ethel-Holt.html
  2. Wife of Fred Holt
  3. Mother of Carol Holt (my wife)
  4. Mother-in-law of Patrick Durusau
  5. etc….

I know all those things but they aren’t how I recognize Ethel Holt.

I have known Ethel for more than thirty (30) years and have been her primary care-giver for the last decade or so.

To be honest, I don’t know how I recognize Ethel but suspect it is a collage of factors both explicit and implicit.

But topic maps don’t record our recognition of subjects. They record our after the fact explanations of how we think we recognized subjects. To be matched with circumstances that would lead to the same explanation.

I think part of the lack of progress with topic maps is that we failed to recognize the gap between how we recognize subjects and what we write down so computers can detect when two statements are about the same subject.

What topic maps are mapping isn’t between properties of subjects (although it can be expressed that way) but between the reasons given by some person for identifying a subject.

The act of recognition is human, complex and never fully explained.

Detecting subject sameness is mechanical and based on recorded explanations.

That distinction makes it clear the choices of properties, syntax, etc., for subject sameness, are a matter of convenience, nothing more.

April 29, 2011

Duolingo: The Next Chapter in Human Communication

Duolingo: The Next Chapter in Human Communication

By one of the co-inventors of CAPTCHA and reCAPTCHA, Luis von Ahn, so his arguments should give us pause.

Luis wants to address the problem of translating the web into multiple languages.

Yes, you heard that right, translate the web into multiple languages.

Whatever you think now, watch the video and decide if you still feel the same way.

My question is how to adapt his techniques to subject identification?

April 28, 2011

Reference and Response

Filed under: Subject Identity — Patrick Durusau @ 3:22 pm

Reference and Response by Louis deRosset. Australasian Journal of Philosophy, March 2011, Vol. 89, No. 1.

Before you skip this entry, realize that this article may shine light on why Linked Data works at all and quite possibly how to improve subject identification for Linked Data and topic maps as well.

Abstract:

A standard view of reference holds that a speaker’s use of a name refers to a certain thing in virtue of the speaker’s associating a condition with that use that singles the referent out. This view has been criticized by Saul Kripke as empirically inadequate. Recently, however, it has been argued that a version of the standard view, a response-based theory of reference, survives the charge of empirical inadequacy by allowing that associated conditions may be largely or even entirely implicit. This paper argues that response-based theories of reference are prey to a variant of the empirical inadequacy objection, because they are ill-suited to accommodate the successful use of proper names by pre-school children. Further, I argue that there is reason to believe that normal adults are, by and large, no different from children with respect to how the referents of their names are determined. I conclude that speakers typically refer positionally: the referent of a use of a proper name is typically determined by aspects of the speaker’s position, rather than by associated conditions present, however implicitly, in her psychology.

With apologies to the author, I would sum up his position (sorry) on referents as follows: we use proper nouns to identify particular people because we have learned those references from others, that is, from our position in a community of users of that referent.

That is to say that all the characteristics that we can recite when called upon to say why we have identified a particular person are much like logic that justifies, after the fact, mathematical theorems and insights. Mathematical theorems and insights being “seen” first and then “proved” as justification for others.

Interesting. Another reason why computers do so poorly at subject identification. Computers are asked to act as we imagine ourselves identifying subjects and not how we identify them in fact.

How does that help with Linked Data and topic maps?

First, I would extend the author’s argument to all referents.

Second, it reveals that the URI/URL versus properties debate over how to identify a subject is really a canard.

What is important, in terms of subject identification, is the origin of the identification.

For example, if “positionally” I am using .lg as used in Unix in a Nutshell, page 12-7, that is all you need to know to distinguish its reference from all the trash that a web search engine returns.

Adding up properties of “Ligature mode” of Nroff/Troff isn’t going to get you any closer to the referent of .lg. Because that isn’t how anyone who used .lg in the same sense I did identified it.

The hot question is how to capture our positional identification of subjects.

Which would include when two or more references are for the same subject.


PS: I rather like deRosset’s conclusion:

Someone, long ago, was well-placed to refer to Cicero. Now, because of our de facto historical position, we are well-placed to refer to Cicero, even though we (or those of us without classical education) wouldn’t know Cicero from Seneca. We don’t need to be able to point to him, or apprehend some condition which singles him out (other, perhaps, than being Cicero). Possessing an appropriately-derived use of ‘Cicero’ suffices. According to the theory of evolution by natural selection, so long as we are appropriately situated (i.e., so long as our local environment is relevantly similar to our ancestors’), we benefit from our biological ancestors’ reproductive successes. Similarly, when we refer positionally, so long as we are appropriately situated, we benefit from our linguistic ancestors’ referential successes. In neither case do the conditions by which we benefit have to be present, even implicitly, in our psychology.

April 26, 2011

Data Beats Math

Filed under: Data,Mathematics,Subject Identity — Patrick Durusau @ 2:17 pm

Data Beats Math

A more recent post by Jeff Jonas.

Topic maps can capture observations, judgments, conclusions from human analysts.

Do those beat math as well?

April 20, 2011

Giving a Single Name a Single Identity

Filed under: Marketing,Subject Identity — Patrick Durusau @ 2:15 pm

Giving a Single Name a Single Identity

This was just too precious to pass up.

The securities industry, parts of it anyway, would like to identify what is being traded in a reliable way.

Answer: Well, we’ll just pick a single identifier, etc. Read the article for the details but see near the end:

If you are running a worldwide trading desk in search of buyers or sellers in every corner of the world, you’re going to have a hard time finding them, in a single universal manner, says Robin Strong, Director of Buy-side Market Strategy, at Fidessa Group, a supplier of trading systems.

That is “primarily because the parties involved at the end of the various bits of wire from a single buy-side dealing desk don’t tend to cooperate. They’re all competitors. They want to own their piece of the value chain,’’ whether it’s a coding system or an order management system. “They’ve built a market that they own” and want to protect, he said.

With a topic map you could create a mapping into other markets.

Topic maps: Enhance the market you own with a part of someone else’s.

How is that for a marketing slogan?

Should by some mischance a single identifier come about, topic maps can help preserve insider semantics to maintain the unevenness of the playing field.

April 19, 2011

Detexify2 – LaTeX symbol classifier

Filed under: Searching,Subject Identity,TeX/LaTeX — Patrick Durusau @ 9:34 am

Detexify2 – LaTeX symbol classifier

Clever interface that enables the search for a LaTeX symbol by drawing the symbol in a provided box.

Aside from its usefulness to the TeX/LaTeX community, I mention this because it illustrates that not all searches (or subject identities) are text based.

April 18, 2011

Perception and Action: An Introduction to Clojure’s Time Model

Filed under: Clojure,Subject Identity,Time — Patrick Durusau @ 1:54 pm

Perception and Action: An Introduction to Clojure’s Time Model

Summary:

Stuart Halloway discusses how we use a total control time model, proposing a different one that represents the world more accurately, helping to solve some of the concurrency and parallelism problems.

To tempt you into watching this video, consider the following slide:

identity

  • continuity over time
    • built by minds
  • sameness across a series of perceptions
  • not a name, but can be named
  • can be composite

I will be posting other material from this presentation (as well as watching the video more than once).

(BTW, I saw the reference to this presentation in a tweet from Alex Popescu, myNoSQL.)

April 7, 2011

Third Workshop on Massive Data Algorithmics (MASSIVE 2011)

Filed under: Algorithms,BigData,Subject Identity — Patrick Durusau @ 7:26 pm

Third Workshop on Massive Data Algorithmics (MASSIVE 2011)

From the website:

Tremendous advances in our ability to acquire, store and process data, as well as the pervasive use of computers in general, have resulted in a spectacular increase in the amount of data being collected. This availability of high-quality data has led to major advances in both science and industry. In general, society is becoming increasingly data driven, and this trend is likely to continue in the coming years.

The increasing number of applications processing massive data means that in general focus on algorithm efficiency is increasing. However, the large size of the data, and/or the small size of many modern computing devices, also means that issues such as memory hierarchy architecture often play a crucial role in algorithm efficiency. Thus the availability of massive data also means many new challenges for algorithm designers.

Forgive me for mentioning it, but what is the one thing all algorithms have in common? Whether for massive data or no?

Ah, yes, some presumption about the identity of the subjects to be processed.

Would be rather difficult to efficiently process anything unless you knew where you were starting and with what?

Making the subjects processed by algorithms efficiently interchangeable seems like a good thing to me.

March 24, 2011

Twice in the Same Semantic Stream?

Filed under: Semantics,Subject Identity — Patrick Durusau @ 7:54 pm

I don’t think anyone disagrees with the proposition that the meaning, semantics of words, changes over time. And across social groups and settings.

It is like the stream described by Heraclitus, in which we can never step twice.

What meanings we assign to words, one medium of communication, are chosen from that stream at various points.

Note that I said chosen and not caught.

The words continue downstream where they may be chosen by other people with other meanings.

The notion that we can somehow fix the meaning of words is contrary to our common and universal experience.

I wonder then why anyone would think that data structures, which are, after all, composed of words and liable to the same shifting semantics as any other words, could have a fixed semantic.

That somehow data structures reside outside what we know to be the ebb and flow of semantics.

Both the words that we think of as being “data” and the words that we assign structures to hold or describe data (metadata if you like), are all part and parcel of the same stream.

The well known case of the shifting semantics of owl:sameAs is a case in point.

But you could as well pick terminology from any other vocabulary, semantic or not, to illustrate the same point.

That isn’t to say that RDF or OWL aren’t useful. They are. For any number of purposes.

But, like any vocabulary, whether for data or structure, they should be used with two cautions:

1) Any term in a vocabulary stands for a subject that is also represented by other terms in other vocabularies.

That is to say that a term that is used for a subject is a matter of convenience and custom, not some underlying truth.

2) Any term in a vocabulary exists in the context of other terms that represent other subjects.

A term can be best understood and communicated to others if it is documented or explained in the context of other subjects.

To say nothing of mapping terms for a subject to other terms for the same subject.

To act otherwise, as though semantics are fixed, is an attempt to step twice in the same location in a semantic stream.

Wasn’t possible for Heraclitus, isn’t possible now.

March 18, 2011

Complex Indexing?

Filed under: Indexing,Subject Identity,Topic Maps — Patrick Durusau @ 6:52 pm

The post The Joy of Indexing made me think about the original use case for topic maps, the merging of indexes prepared by different authors.

Indexing that relies either on a token in the text (simple indexing) or on a contextual clue (the compound indexing mentioned in the Joy of Indexing post) falls short in terms of enabling the merging of indexes.

Why?

In my comments on the Joy of Indexing I mentioned that what we need is a subject indexing engine.

That is an engine that indexes the subjects that appear in a text and not merely the manner of their appearance.

(Jack Park, topic map advocate and my friend would say I am hand waving at this point so perhaps an example will help.)

Say that I have a text where I use the words George Washington.

That could be a reference to the first president of the United States or it could be a reference to George Washington rabbit (my wife is a children’s librarian).

A simple indexing engine could not distinguish one from the other.

A compound indexing engine might list one under Presidents and the other under Characters but without more in the example we don’t know for sure.

A complex indexing engine, that is one that took into account more than simply the token in the text, say that it created its entry from that token plus other attributes of the subject it represents, would not mistake a president for a rabbit or vice versa.

Take Lucene for example. For any word in a text, it records

The position increment, start, and end offsets and payload are the only additional metadata associated with the token that is recorded in the index.

That pretty much puts the problem in a nutshell. If that is all the metadata we get, which isn’t much, the likelihood we are going to do any reliable subject matching is pretty low.

Not to single Lucene out, I think all the search engines operate pretty much the same way.

To return to our example, what if while indexing, when we encounter George Washington, instead of the bare token we record, respectively:

George Washington – Class = Mammalia

George Washington – Class = Mammalia

Hmmm, that didn’t help much did it?

How about:

George Washington – Class = Mammalia Order = Primate

George Washington – Class = Mammalia Order = Lagomorpha

So that I can distinguish these two cases but can also ask for all instances of class = Mammalia.
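Here is a toy sketch of such a complex (subject) index alongside a bare token index. The Class/Order properties mirror the example above; where they come from is, of course, the hard part:

```python
from collections import defaultdict

token_index = defaultdict(list)     # token -> positions (all a simple indexer sees)
subject_index = defaultdict(list)   # (token, properties) -> positions

def index(position, token, **properties):
    token_index[token].append(position)
    subject_index[(token, frozenset(properties.items()))].append(position)

index(17, "George Washington", Class="Mammalia", Order="Primate")
index(42, "George Washington", Class="Mammalia", Order="Lagomorpha")

# The token index conflates the president and the rabbit:
print(token_index["George Washington"])            # [17, 42]

# The subject index can still answer "all instances of Class = Mammalia":
print(sorted(pos for (tok, props), positions in subject_index.items()
             if ("Class", "Mammalia") in props for pos in positions))   # [17, 42]

# ...but keeps the president and the rabbit apart:
print(subject_index[("George Washington",
                     frozenset({"Class": "Mammalia", "Order": "Primate"}.items()))])  # [17]
```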

Of course the trick is that no automated system is likely to make that sort of judgement reliably, at least left to its own devices.

But it doesn’t have to does it?

Imagine that I am interested in U.S. history and want to prepare an index of the Continental Congress proceedings. I could simply create an index by tokens but that will encounter all the problems we know come from merging indexes. Or searching across tokens as seen by such indexes. See Google for example.

But, what if I indexed the Continental Congress proceedings using more complex tokens? Ones that had multiple properties that could be indexed for one subject and that could exist in relationship to other subjects?

That is, for some body of material, I declare the subjects that will be identified and what will be known about them post-identification.

A declarative model of subject identity. (There are other, equally legitimate, models of identity, that I will be covering separately.)

More on the declarative model anon.

March 17, 2011

The Joy of Indexing

Filed under: Indexing,MongoDB,NoSQL,Subject Identity — Patrick Durusau @ 6:52 pm

The Joy of Indexing

Interesting blog post on indexing by Kyle Banker of MongoDB.

Recommended in part to understanding the limits of traditional indexing.

Ask yourself, what is the index in Kyle’s examples indexing?

Kyle says the examples are indexing recipes but is that really true?

Or is it the case that the index is indexing the occurrence of a string at a location in the text?

Not exactly the same thing.

That is to say there is a difference between a token that appears in a text and a subject we think about when we see that token.

It is what enables us to say that two or more words that are spelled differently are synonyms.

Something other than the two words as strings is what we are relying on to make the claim they are synonyms.

A traditional indexing engine, of the sort described here, can only index the strings it encounters in the text.

What would be more useful would be an indexing engine that indexed the subjects in a text.

I think we would call such a subject-indexing engine a topic map engine. Yes?

Questions:

  1. Do you agree/disagree that a word indexing engine is not a subject indexing engine? (3-5 pages, no citations)
  2. What would you change about a word indexing engine (if anything) to make it a subject indexing engine? (3-5 pages, no citations)
  3. What texts/subjects would you use as test cases for your engine? (3-5 pages, citations of the test documents)

March 16, 2011

KNIME – 4th Annual User Group Meeting

Filed under: Data Analysis,Heterogeneous Data,Mapping,Subject Identity — Patrick Durusau @ 3:14 pm

KNIME – 4th Annual User Group Meeting

From the website:

The 4th KNIME Workshop and Users Meeting at Technopark in Zurich, Switzerland took place between February 28th and March 4th, 2011 and was a huge success.

The meeting was very well attended by more than 130 participants. The presentations ranged from customer intelligence and applications of KNIME in soil and fuel research through to high performance data analytics and KNIME applications in the Life Science industry. The second meeting of the special interest group attracted more than 50 attendees and was filled with talks about how KNIME can be put to use in this fast growing research area.

Presentations are available.

A new version of KNIME is available for download with the features listed in ChangeLog 2.3.3.

Focused on data analytics and work flow, another software package that could benefit from an interchangeable subject-oriented approach.

March 4, 2011

Table competition at ICDAR 2011

Filed under: Dataset,Subject Identity — Patrick Durusau @ 10:40 am

I first noticed this item at Mathew Hurst’s blog Table Competition at ICDAR 2011.

As a markup person with some passing familiarity with table encoding issues, this is just awesome!

Update: March 10, 2011 is the deadline for competition registration, which consists of expressing interest in competing, by email, to the competition organisers.

The basic description is OK:

Motivation: Tables are a prominent element of communication in documents, often containing information that would take many a paragraph to write otherwise. The first step to table understanding is to draw the table’s physical model, i.e. identify its location and component cells, rows and columns. Several authors have dedicated themselves to these tasks, using diverse methods, however it is difficult to know which methods work best under which circumstance because of the diverse testing conditions used by each. This competition aims at addressing this lacuna in our field.

Tasks: This competition will involve two independent sub-competitions. Authors may choose to compete for one task or the other or both.

1. Table location sub-competition:

This task consists of identifying which lines in the document belong to one same table area or not;

2. Table segmentation sub-competition:

This task consists of identifying which column the cells of each table belong to, i.e. identifying which cells belong to one same column. Each cell should be attributed a start and end column index (which will be different from each other for spanning cells). Identifying row spanning cells is not relevant for this competition.

But what I think will excite markup folks (and possibly topic map advocates) is the description of the data sets:

Description of the datasets: We have gathered 22 PDF financial statements. Our documents have lengths varying between 13 and 235 pages with very diverse page layouts, for example, pages can be organised in one or two columns and page headers and footers are included; each document contains between 3 and 162 tables. In Appendix A, we present some examples of pages in our dataset with tables that we consider hard to locate or segment. We randomly chose 19 documents for training and 3 for validation; our tougher cases turned out to be in the training set.

We then converted all files to ASCII using the pdftotext Linux utility (Red Hat Linux 7.2 (Enigma), October 22, 2001, Linux 2.4.7-10, pdftotext version 0.92, copyright 1996-2000 Derek B. Noonburg). As a result of the conversion, each line of each document became a line of ASCII, which when imported into a database becomes a record in a relational table. Apart from this, we collected an extra 19 PDF financial statements to form the test set; these were converted into ASCII using the same tool as the training set.

Table 1 underneath shows the resulting dimensions of the datasets and how they compare to those used by other authors (Wang et al. (2002)’s tables were automatically generated and Pinto et al. (2003)’s belong to the same government statistics website). The sizes of the datasets in other papers are not distant from ours. An exception would be Cafarella et al. (2008), who created the first large repository of HTML tables, with 154 million tables. These consist of non-marked up HTML tables detected using Wang and Hu (2002)’s algorithm, which is naturally subject to mistakes.

We have then manually created the ground-truth for this data, which involved: a) identifying which lines belong to tables and which do not; b) for each line, identifying how it should be clipped into cells; c) for each cell, identifying which table column it belongs to.

Whether you choose to compete or not, this should prove to be very interesting.

Sorry, left off the dates from the original post:

Important dates:

  • February 26, 2011 Training set is made available on the Competition Website
  • March 10, 2011 Competition registration, which consists of expressing interest in competing, by email, to the competition organisers
  • May 13, 2011 Validation set is made available on the Competition Website
  • May 15, 2011 Submission of results by competitors, which should be executable files; if at all impossible, the test data will be given out to competitors, but results must be submitted within no more than one hour (negotiable)
  • June 15, 2011 Submission of summary paper for ICDAR’s proceedings, already including the identification of the competition’s winner
  • September, 2011 Test set is made available on the Competition Website
  • September, 2011 Announcement of the results will be made during ICDAR’2011, the competition session

February 20, 2011

A thought on Hard vs Soft – Post (nonIdentification vs. multiIdentification?)

Filed under: Marketing,Subject Identity,Topic Maps — Patrick Durusau @ 10:39 am

A thought on Hard vs Soft by Dru Sellers starts off with:

With the move from RDBMS to NoSQL are we seeing the same shift that we saw when we moved from Hardware to Software. Are we seeing a shift from Harddata to Softdata? (emphasis in original)

See his post for the rest of the post and the replies.

Do topic maps address a similar hardIdentification vs. softIdentification?

By hardIdentification I mean a single identification.

But it goes further than that doesn’t it?

There isn’t even a single identification in most information systems.

Think about it. You and I both see the same column names and have different ideas of what they mean.

I remember reading in Doan’s dissertation (see Auditable Reconciliation) that a schema reconciliation project would have taken 12 person years but for the original authors being available.

We don’t have any idea what has been identified in most systems and no way to compare it to other “identifications.”

What is this? Write once, Wonder Many Times (WOWMT)?

So, topic maps really are a leap from nonIdentification to multiIdentification.

No wonder it is such a hard sell!

People aren’t accustomed to avoiding the cost of nonIdentification and here we are pitching the advantages of multiIdentification.

Pull two tables at random from your database and have a contest to see who outside the IT department can successfully identify what the column headers represent. No data, just the column headers.*
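For what it is worth, here is a small, entirely made-up sketch of what moving from bare headers to recorded identifications might look like; the headers and identifications are invented:

```python
# nonIdentification: all the system has are the bare header strings.
table_a = ["cust_no", "dt", "amt"]
table_b = ["customer_id", "txn_date", "total"]

# multiIdentification: each header is mapped to a subject, and a single
# header can carry identifications from more than one source.
identifications = {
    ("table_a", "cust_no"):     {"subject": "customer identifier",
                                 "identified_by": ["DBA", "analyst"]},
    ("table_b", "customer_id"): {"subject": "customer identifier",
                                 "identified_by": ["vendor documentation"]},
    ("table_a", "dt"):          {"subject": "transaction date",
                                 "identified_by": ["analyst"]},
}

def same_column_subject(col1, col2):
    a, b = identifications.get(col1), identifications.get(col2)
    return bool(a and b) and a["subject"] == b["subject"]

print(same_column_subject(("table_a", "cust_no"), ("table_b", "customer_id")))  # True
print(same_column_subject(("table_a", "dt"), ("table_b", "total")))             # False
```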

What other ways can we illustrate the issue of nonIdentification?

Interested in hearing your suggestions.

*****
*I will be posting column headers from public data sets and asking you to guess their identifications.

BTW, some will argue that documentation exists for at least some of these columns.

True enough, but from a processing standpoint it may as well be on a one way mission to Mars.

If the system doesn’t have access to it, it doesn’t exist. (full stop)

Gives you an idea of how impoverished our systems truly are.

IBM’s Watson (the computer, not IBM’s founder, who was also soulless) has been described as deaf and blind. Not only that, but it has no more information than it is given. It cannot ask for more. The life of a pocket calculator, if it had emotions, is sad.

February 18, 2011

Large Scale Packet Dump Analysis with MongoDB

Filed under: Marketing,MongoDB,Subject Identity — Patrick Durusau @ 6:51 am

Large Scale Packet Dump Analysis with MongoDB

I mention this because it occurs to me that distributed topic maps could be a way to track elusive web traffic that passes through any number of servers from one location to another.

I will have to pull Stevens’ TCP/IP Illustrated off the shelf to look up the details.

Thinking that subject identity in this case would be packet content and not the usual identifiers.

And that with distributed topic maps, no one map would have to process all the load.

Instead, upon request, delivering up proxies to be merged with other proxies, which could then be displayed as partial paths through the networks with the servers where changes took place being marked.

The upper level topic maps being responsible for processing summaries of summaries of data, but with the ability to drill back down into the actual data.
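A minimal sketch of what a content-based packet proxy might look like, assuming a digest of the payload (rather than the usual header identifiers) serves as the basis of subject identity; all field names are mine:

```python
import hashlib

def packet_proxy(server, timestamp, payload):
    # Identity comes from the packet content, so the same content seen
    # on different servers resolves to the same subject.
    return {"content_id": hashlib.sha256(payload).hexdigest(),
            "seen_at": [(timestamp, server)]}

def merge(p, q):
    # Same content observed elsewhere: one subject, more sightings,
    # which can be read off as a partial path through the network.
    assert p["content_id"] == q["content_id"]
    return {"content_id": p["content_id"],
            "seen_at": sorted(p["seen_at"] + q["seen_at"])}

p1 = packet_proxy("server-1", 1000, b"GET /index.html")
p2 = packet_proxy("server-7", 1003, b"GET /index.html")
print(merge(p1, p2)["seen_at"])   # [(1000, 'server-1'), (1003, 'server-7')]
```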

True, there is a lot of traffic, but simply by dumping all the porn, that reduces the problem by a considerable percentage. I am sure there are other data collection improvements that could be made.

February 11, 2011

Sowa on Watson

Filed under: Cyc,Ontology,Semantic Web,Subject Identifiers,Subject Identity — Patrick Durusau @ 6:43 am

John Sowa’s posting on Watson merits reproduction in its entirety (light editing to make it format for easy reading):

Peter,

Thanks for the reminder:

Dave Ferrucci gave a talk on UIMA (the Unstructured Information Management Architecture) back in May-2006, entitled: “Putting the Semantics in the Semantic Web: An overview of UIMA and its role in Accelerating the Semantic Revolution”

I recommend that readers compare Ferrucci’s talk about UIMA in 2006 with his talk about the Watson system and Jeopardy in 2011. In less than 5 years, they built Watson on the UIMA foundation, which contained a reasonable amount of NLP tools, a modest ontology, and some useful tools for knowledge acquisition. During that time, they added quite a bit of machine learning, reasoning, statistics, and heuristics. But most of all, they added terabytes of documents.

For the record, following are Ferrucci’s slides from 2006:

http://ontolog.cim3.net/file/resource/presentation/DavidFerrucci_20060511/UIMA-SemanticWeb–DavidFerrucci_20060511.pdf

Following is the talk that explains the slides:

http://ontolog.cim3.net/file/resource/presentation/DavidFerrucci_20060511/UIMA-SemanticWeb–DavidFerrucci_20060511_Recording-2914992-460237.mp3

And following is his recent talk about the DeepQA project for building and extending that foundation for Jeopardy:

http://www-943.ibm.com/innovation/us/watson/watson-for-a-smarter-planet/building-a-jeopardy-champion/how-watson-works.html

Compared to Ferrucci’s talks, the PBS Nova program was a disappointment. It didn’t get into any technical detail, but it did have a few cameo appearances from AI researchers. Terry Winograd and Pat Winston, for example, said that the problem of language understanding is hard.

But I thought that Marvin Minsky and Doug Lenat said more with their tone of voice than with their words. My interpretation (which could, of course, be wrong) is that both of them were seething with jealousy that IBM built a system that was competing with Jeopardy champions on national TV — and without their help.

In any case, the Watson project shows that terabytes of documents are far more important for commonsense reasoning than the millions of formal axioms in Cyc. That does not mean that the Cyc ontology is useless, but it undermines the original assumptions for the Cyc project: commonsense reasoning requires a huge knowledge base of hand-coded axioms together with a powerful inference engine.

An important observation by Ferrucci: The URIs of the Semantic Web are *not* useful for processing natural languages — not for ordinary documents, not for scientific documents, and especially not for Jeopardy questions:

1. For scientific documents, words like ‘H2O’ are excellent URIs. Adding an http address in front of them is pointless.

2. A word like ‘water’, which is sometimes a synonym for ‘H2O’, has an open-ended number of senses and microsenses.

3. Even if every microsense could be precisely defined and cataloged on the WWW, that wouldn’t help determine which one is appropriate for any particular context.

4. Any attempt to force human being(s) to specify or select a precise sense cannot succeed unless *every* human understands and consistently selects the correct sense at *every* possible occasion.

5. Given that point #4 is impossible to enforce and dangerous to assume, any software that uses URIs will have to verify that the selected sense is appropriate to the context.

6. Therefore, URIs found “in the wild” on the WWW can never be assumed to be correct unless they have been guaranteed to be correct by a trusted source.

These points taken together imply that annotations on documents can’t be trusted unless (a) they have been generated by your own system or (b) they were generated by a system which is at least as trustworthy as your own and which has been verified to be 100% compatible with yours.

In summary, the underlying assumptions for both Cyc and the Semantic Web need to be reconsidered.

You can see the post at: http://ontolog.cim3.net/forum/ontolog-forum/2011-02/msg00114.html

I don’t always agree with Sowa but he has written extensively on conceptual graphs, knowledge representation and ontological matters. See http://www.jfsowa.com/

I missed the local showing but found the video at: Smartest Machine on Earth.

You will find a link to an interview with Minsky at that same location.

I don’t know that I would describe Minsky as “…seething with jealousy….”

While I enjoy Jeopardy and it is certainly more cerebral than say American Idol, I think Minsky is right in seeing the Watson effort as something other than artificial intelligence.

Q: In 2011, who was the only non-sentient contestant on the TV show Jeopardy?

A: What is IBM’s Watson?

February 10, 2011

Topic Maps, Google and the Billion Fact Parade

Filed under: Freebase,Subject Identity,TMRM,Topic Maps — Patrick Durusau @ 2:54 pm

Andrew Hogue (Google) actually titled his presentation on Google’s plan for Freebase: The Structured Search Engine.

Several minutes into the presentation Hogue points out that to answer the question, “when was Martin Luther King, Jr. born?” that date of birth, date born, appeared, dob were all considered synonyms that expect the date type.

Hmmm, he must mean keys that represent the same subject and so subject to merging and possibly, depending on their role in a subject representative, further merging of those subject representatives. Can you say Steve Newcomb and the TMRM?

Yes, attribute names represent subjects just like collections of attributes are thought to represent subjects. And benefit from rules specifying subject identity, other properties and merging rules. (Some of those rules can be derived from mechanical analysis, others probably not.)
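A bare-bones sketch of that first step, treating the attribute names themselves as subjects with a (made-up) synonym table, so that records using different keys for the same property can be compared:

```python
# Keys are subjects too: map each surface key to a canonical key subject.
KEY_SUBJECTS = {
    "date of birth": "dob",
    "date born": "dob",
    "born": "dob",
    "dob": "dob",
}

def normalize_record(record):
    return {KEY_SUBJECTS.get(key, key): value for key, value in record.items()}

r1 = {"name": "Martin Luther King, Jr.", "date of birth": "1929-01-15"}
r2 = {"name": "Martin Luther King, Jr.", "born": "1929-01-15"}

# After the key subjects merge, the two records can be seen to agree.
print(normalize_record(r1) == normalize_record(r2))   # True
```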

Second, Hogue points out that Freebase had 13 million entities when purchased by Google. He speculates on taking that to 1 billion entities.

Let’s cut to the chase, I will see Hogue’s 1 billion entities and raise him 9 billion entities for a total pot of 10 billion entities.

Now what?

Let’s take a simple question that Hogue’s 10 billion entity Google/Freebase cannot usefully answer.

What is democracy?

Seems simple enough. (viewers at home can try this with their favorite search engine.)

1) United States State Department: Democracy means a state that supports Israel, keeps the Suez canal open and opposes people we don’t like in the U.S. Oh, and that protects the rights and social status of the wealthy, almost forgot that one. Sorry.

2) Protesters in Egypt (my view): Democracy probably does not include some or all of the points I mention for #1.

3) Turn of the century U.S.: Effectively only the white male population participates.

4) Early U.S. history: Land ownership is a requirement.

I am sure examples can be supplied from other “democracies” and their histories around the world.

This is a very important term and its differing use by different people in different contexts is going to make discussion and negotiations more difficult.

There are lots of terms for which no single “entity” or “fact” is going to work for everyone.

Subject identity is a tough question and the identification of a subject changes over time, social context, etc. Not to mention that the subjects identified by particular identifications change as well.

Consider that at one time cab was not used to refer to a method of transportation but to a brothel. You may object that was “slang” usage but if I am searching an index of police reports for that time period for raids on brothels, your objection isn’t helpful. Doesn’t matter if the usage is “slang” or not, I need to obtain accurate results.

User expectations and needs cannot (or at least should not in my opinion) be adapted to the limitations of a particular approach or technology.

Particularly when we already know of strategies that can help with, not solve, the issues surrounding subject identity.

The first step that Hogue and Google have taken, recognizing that attribute names can have synonyms, is a good start. In topic map terms, recognizing that information structures are composed of subjects as well. So that we can map between information structures, rather than replacing one with another. (Or having religious discussions about which one is better, etc.)

Hogue and Google are already on the way to treating some subjects as worthy of more effort than others, but for those that merit the attention, solving the issue of reliable, repeatable subject identification is non-trivial.

Topic maps can make a number of suggestions that can help with that task.

The unreasonable effectiveness of simplicity

Filed under: Authoring Topic Maps,Crowd Sourcing,Data Analysis,Subject Identity — Patrick Durusau @ 1:50 pm

The unreasonable effectiveness of simplicity from Panos Ipeirotis suggests that simplicity should be considered in the construction of information resources.

The simplest aggregation technique: Use the majority vote as the correct answer.
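A minimal sketch of that aggregation technique, with invented labels:

```python
from collections import Counter

def majority_vote(labels):
    # The most common answer wins; ties fall to whichever was counted first.
    return Counter(labels).most_common(1)[0][0]

# Crowd answers to a yes/no identification question.
answers = ["yes", "yes", "no", "yes", "no", "yes"]
print(majority_vote(answers))   # yes
```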

I am mindful of the discussion several years ago about visual topic maps. Which was a proposal to use images as identifiers. Certainly doable now but the simplicity angle suggests an interesting possibility.

Would not work for highly abstract subjects, but what if users were presented with images when called upon to make identification choices for a topic map?

For example, marking entities in a newspaper account, the user is presented with images near each marked entity and chooses yes/no.

Or in legal discovery or research, a similar mechanism, along with the ability to annotate any string with an image/marker and that image/marker appears with that string in the rest of the corpus.

Unknown to the user is further information about the subject they have identified that forms the basis for merging identifications, linking into associations, etc.

A must read!

February 9, 2011

Oyster: A Configurable ER Engine

Filed under: Entity Resolution,Record Linkage,Semantic Web,Subject Identity — Patrick Durusau @ 4:55 pm

Oyster: A Configurable ER Engine

John Talburt writes a very enticing overview of an entity resolution engine he calls Oyster.

From the post:

OYSTER will be unique among freely available systems in that it supports identity management and identity capture. This allows the user to configure OYSTER to not only run as a typical merge-purge/record linking system, but also as an identity capture and identity resolution system. (Emphasis added)

Yes, record linking we have had since the late 1950’s in a variety of guises and over twenty (20) different names that I know of.

Adding identity management and identity capture (FYI, SW uses universal identifier assignment) will be something truly different.

As in topic map different.

Will be keeping a close watch on this project and suggest that you do the same.

February 5, 2011

Subject (defined)

Filed under: Subject Identity — Patrick Durusau @ 6:13 am

Just in case you were thinking that ISO has a handle on the definition of subject in ISO/IEC 13250:

  • subject
    • person in whose ear canal the hearing aid performance is being characterized (ISO 12124:2001)
    • in the most generic sense, a “subject” is any thing whatsoever, regardless of whether it exists or has any other specific characteristics, about which anything whatsoever may be asserted by any means whatsoever (ISO/IEC 13250:2003)
    • Any concept or combination of concepts representing a theme in a document. (ISO 5963:1985)
    • an entity within the TSC that causes operations to be performed. (ISO/IEC 15408-1:2005)
    • anything whatsoever, regardless of whether it exists or has any other specific characteristics, about which anything whatsoever may be asserted by any means whatsoever (ISO/IEC 13250-2:2006)
    • particular information item which corresponds to the object of interest of the natural-language assertions and typically is matched by the context expression of a rule (ISO/IEC 19757-3:2006, yes, the DSDL standard)
    • entity whose public key is certified in a public key certificate (ISO 15782-2:2001)
    • condition under which two or more entities separately have key fragments which, individually, convey no knowledge of the resultant cryptographic key entity whose public key is certified in a public key certificate [split knowledge subject] (ISO 15782-1:2003)
    • individual who participates in a clinical investigation, either as a recipient of the device under investigation or as a control (ISO 14155-1:2003)
    • end-user whose biometric data is intended to be enrolled or compared (ISO/IEC 24713-1:2008)
    • entity whose public key is certified in the certificate (ISO/TS 21091:2005)
    • entity whose public key is certified in a public key certificate (ISO 21188:2006)
    • entity whose public key is certified in a public key certificate (ISO 15782-1:2009)
    • active entity in the TOE that performs operations on objects (ISO/IEC 15408-1:2009)

Questions:

  1. How would you distinguish these uses of subject in a topic map?
  2. How would these uses impact searching across texts?
  3. What if anything would you suggest to minimize the impact of these definitions on searching?

This website, The ISO Concept Database (ISO/CDB), apparently powered by Apache CentOS, is a good example of inappropriate use of open source software.

Perform a search, then go to an item returned by that search, then choose Back to previous search. The application will fail. Close the tab. Then try again from the homepage. That is where you get the CentOS pages.

I said inappropriate, perhaps the better term is poor. It reflects badly on open source software to have it poorly used.

January 31, 2011

Pseudo-Code: A New Definition

Filed under: Machine Learning,Sets,Subject Identity,Topic Maps — Patrick Durusau @ 7:24 am

How to Speed up Machine Learning using a Set-Oriented Approach

The detail article for Need faster machine learning? Take a set-oriented approach, which I mentioned in a separate post.

Well, somewhat more detail.

Gives new meaning to pseudo-code:

The application side becomes:

Computing the model:

Fetch “compute-model over data items”

Classifying new items:

Fetch “classify over data items”

I am reminded of the cartoon with two people at a blackboard, where one of them says, “I think you should be more explicit in step two,” and the board reads, “Then a miracle occurs.”

How about you?
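To be a bit more explicit about step two myself, here is a rough sketch in Python (not the set-oriented database operations the article actually relies on) of what “compute-model over data items” and “classify over data items” might conceal, assuming a simple word-count model:

```python
from collections import Counter, defaultdict

def compute_model(labeled_items):
    # "compute-model over data items": one aggregation pass over the
    # training set, collecting word counts per class.
    model = defaultdict(Counter)
    for text, label in labeled_items:
        model[label].update(text.lower().split())
    return model

def classify(model, text):
    # "classify over data items": score an item against every class.
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    return max(scores, key=scores.get)

training = [("open source hadoop cluster", "tech"),
            ("quarterly earnings and revenue", "finance")]
model = compute_model(training)
print(classify(model, "hadoop revenue earnings"))   # finance (2 overlaps vs. 1)
```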

January 29, 2011

Need faster machine learning? Take a
set-oriented approach

Filed under: Machine Learning,Sets,Subject Identity — Patrick Durusau @ 5:00 pm

Need faster machine learning? Take a set-oriented approach.

Roger Magoulas, using not-so-small iron, reports:

The result: The training set was processed and the sample data set classified in six seconds. We were able to classify the entire 400,000-record data set in under six minutes — more than a four-orders-of-magnitude records processed per minute (26,000-fold) improvement. A process that would have run for days, in its initial implementation, now ran in minutes! The performance boost let us try out different feature options and thresholds to optimize the classifier. On the latest run, a random sample showed the classifier working with 92% accuracy.

or

set-oriented machine learning makes for:

  • Handling larger and more diverse data sets
  • Applying machine learning to a larger set of problems
  • Faster turnarounds
  • Less risk
  • Better focus on a problem
  • Improved accuracy, greater understanding and more usable results
Seems to me sameness of subject representation is a classification task. Yes?

Going from days to minutes sounds attractive to me.

How about you?

R & Subject Identity/Identification

Filed under: R,Subject Identity — Patrick Durusau @ 4:13 pm

While posting R Books for Undergraduates, it occurred to me that having examples of using R for subject identity/identification would be helpful.

I could create examples in the first instance, but that would be a lot of work.

Not to mention limiting me to domains in which I have some interest and expertise.

What if I were to re-cast existing R examples as subject identity/identification issues?

That saves me the time of creating new examples.

More importantly, it gives me a ready made audience to chime in on how I did with subject identity:

  • correct
  • close but incorrect
  • incorrect
  • incorrect and far away
  • incoherent
  • what subject did I think I was talking about?
  • etc.

More than one answer is possible for any one example. 😉

January 24, 2011

Ambiguity and Charity

Filed under: Authoring Topic Maps,Subject Identity,Topic Maps — Patrick Durusau @ 9:06 am

John McCarthy, in Notes on Formalizing Context, says in Entering and Leaving Contexts:

Human natural language risks ambiguity by not always specifying such assumptions, relying on the hearer or reader to guess what contexts makes sense. The hearer employs a principle of charity and chooses an interpretation that assumes the speaker is making sense. In AI usage we probably don’t usually want computers to make assertions that depend on principles of charity for their interpretation.

Natural language statements, outside formal contexts, almost never specify their assumptions. And even when they attempt to specify assumptions, such as in formal contexts, it is always a partial specification.

Complete specification of context or assumptions isn’t possible. That would require recursive enumeration of all the information that forms a context and the context of that information and so on.

It really is a question of the degree of charity that is being practiced to resolve any potential ambiguity.

If AI chooses to avoid charity altogether, I think that says a lot about its chances for success.

Topic maps, on the other hand, can specify both the result of the charitable assumption, the subject recognized, as well as the charitable assumption itself. Which could be (but will not necessarily be) expressed as scope.

For example, if I see the token who and I specify the scope as being rock-n-roll-bands, that avoids any potential ambiguity, at least from my perspective. I could be wrong, or it could have some other scope, but at least you know my charitable assumption.

What is particularly clever about topic maps is that other users can combine my charitable assumptions with their own as they merge topic maps together.

Think of it as stitching together a fabric of interpretation with a thread of charitable assumptions. A fabric that AI applications will never know.
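To make the who example above a little more concrete, here is a small sketch of recording the charitable assumption as scope; the scopes and subjects are invented:

```python
# Each reading of a token carries the scope under which it was made.
occurrences = [
    {"token": "who", "scope": "rock-n-roll-bands", "subject": "The Who (band)"},
    {"token": "who", "scope": "public-health", "subject": "World Health Organization"},
]

def in_scope(occs, scope):
    # Merging another reader's map simply adds their scoped assumptions;
    # nothing is lost and each interpretation can be queried by scope.
    return [o["subject"] for o in occs if o["scope"] == scope]

print(in_scope(occurrences, "rock-n-roll-bands"))   # ['The Who (band)']
```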

January 20, 2011

Crowdsourcing for Search Evaluation

Filed under: Subject Identity — Patrick Durusau @ 6:22 am

Crowdsourcing for Search Evaluation

An interesting workshop held in connection with the 33rd Annual ACM SIGIR Conference.

Think of this in terms of crowdsourcing subject identification instead of search evaluation and its relevance to topic maps becomes clearer.

Comments to follow on some of the specific papers.
