Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 4, 2011

Pragmatic Philosophical Technology for Text Mining

Filed under: Subject Identity,Topic Maps — Patrick Durusau @ 6:16 pm

Matthew Hurst writes:

In text mining applications, we often work with some form of raw input (web pages, web sites, emails, etc.) and attempt to organize it in terms of the concepts that are mentioned or introduced in the documents.

This process of interpretation can take the form of ‘normalization’ or ‘canonicalization’ (in which many expressions are associated with a singular expression as an exemplar of a set). This happens, for example, when we map ‘Barack Obama’, ‘President Obama’, etc. to a unique string ‘President Barack Obama’. This is convenient when we want to retrieve all documents about the president.
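A minimal sketch of this kind of normalization, using a hypothetical alias table (the entries here are illustrative, not from any system described in the post):

```python
from typing import Dict

# Hypothetical alias table: many surface forms map to one canonical exemplar.
ALIASES: Dict[str, str] = {
    "Barack Obama": "President Barack Obama",
    "President Obama": "President Barack Obama",
    "Obama": "President Barack Obama",
}

def normalize(mention: str) -> str:
    """Return the canonical exemplar for a mention, or the mention unchanged
    if no canonical form is known."""
    return ALIASES.get(mention, mention)

print(normalize("President Obama"))  # President Barack Obama
```

Note that both sides of the mapping live in the same language of strings, which is exactly the point of the paragraph above: normalization relates expressions to other expressions, not to things.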

In this process, we are associating elements within the same language (language in the sense of sets of symbols and the rules that govern their legal generation).

Another approach is to map (or associate) the terms in the original document with some structured record. For example, we might interpret the phrase ‘Starbucks’ as relating to a record of key value pairs {name=starbucks, address=123 main street, …}. In this case, the structure of the record has a semantics (or model) other than that of the original document. In other words, we are mapping from one language to another.
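The second approach can be sketched the same way; here the target of the mapping is a key/value record rather than another string. The record fields and values below are hypothetical, echoing the ‘Starbucks’ example:

```python
from typing import Dict, Optional

# Hypothetical linking table from surface phrases to structured records.
# The record has its own semantics (a model), distinct from the document's language.
RECORDS: Dict[str, Dict[str, str]] = {
    "starbucks": {"name": "starbucks", "address": "123 main street"},
}

def link(phrase: str) -> Optional[Dict[str, str]]:
    """Map a phrase from the document's language into the record language,
    or return None when no record is known."""
    return RECORDS.get(phrase.lower())

print(link("Starbucks"))  # {'name': 'starbucks', 'address': '123 main street'}
```

The asymmetry between the two sketches is the author's point: `normalize` stays inside one language, while `link` crosses into a second language with its own structure.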

Of course, what we want to do is denote the thing in the real world. That, however, is impossible to represent, as all we can do is shuffle bits around inside the computer. We can’t attach a label to the real world and somehow transcend the reality/representation barrier. However, we can start to look at the modeling process with some pragmatics.

Wrestling with subject identity issues. Worth your time to read and comment.

2 Comments

  1. I’m actually working on something very similar today: how to explain to an entity extractor which names are unique and go with what IDs. We’re using Duke to say what labels are unique for what entities, and, if they’re not unique, what more information is needed to disambiguate. For now we seem to be stuck on issues with the extractor.

    Comment by larsga@garshol.priv.no — August 5, 2011 @ 4:18 am

  2. Cool! You may find Sentiment Analysis: Machines Are Like Us interesting.

    People are probably the best “general case” entity extractors, while more specialized entity extractors are tireless and nearly as accurate.

    It has always puzzled me that search engines don’t have (or don’t appear to have) specialized entity recognition. There aren’t that many CS sites, for example: a goodly number, but not “web scale.” Use that as a starter set for CS entity recognition. Same for other areas.

    Comment by Patrick Durusau — August 5, 2011 @ 8:20 pm
