Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 21, 2015

Are You Deep Mining Shallow Data?

Filed under: BigData,Semantics — Patrick Durusau @ 6:55 pm

Do you remember this verse of Simple Simon?

Simple Simon went a-fishing,

For to catch a whale;

All the water he had got,

Was in his mother’s pail.

[Image: Simple Simon fishing]

Shallow data?

To illustrate, fill in the following statement:

My mom makes the best _____.

Before completing that statement, you resolved the common noun, “mom,” differently than I did.

The string carries no clue as to the resolution of “mom” by any reader.

The string also gives no clues as to how it would be written in another language.

With a string, all you get is the string, or in other words:

All strings are shallow.

That applies even to the strings we use to add depth to other strings, but we will reach that issue shortly.

One of the few things that RDF got right was:

…RDF puts the information in a formal way that a machine can understand. The purpose of RDF is to provide an encoding and interpretation mechanism so that resources can be described in a way that particular software can understand it; in other words, so that software can access and use information that it otherwise couldn’t use. (quote from Wikipedia on RDF)

In addition to the string, RDF posits an identifier in the form of a URI which you can follow to discover more information about that portion of string.
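As a rough sketch of that idea, and not of any particular RDF library, the pairing of a string with an identifier can be modeled with plain Python tuples. The example.org URIs, the `LABEL` predicate, and the `subjects_labeled` helper are all hypothetical names made up for illustration.

```python
# The bare string: nothing in it says whose "mom" is meant.
bare_string = "mom"

# Hypothetical identifiers; example.org is a placeholder domain,
# not a real vocabulary.
AUTHOR_MOM = "http://example.org/people/author-mom"
READER_MOM = "http://example.org/people/reader-mom"
LABEL = "http://example.org/vocab/label"

# RDF-style (subject, predicate, object) statements as plain tuples.
statements = [
    (AUTHOR_MOM, LABEL, "mom"),
    (READER_MOM, LABEL, "mom"),
]


def subjects_labeled(label, graph):
    """Return every identifier that carries the given string as its label."""
    return sorted(s for s, p, o in graph if p == LABEL and o == label)


# One string, two distinct subjects: the identifier carries the
# distinction that the string alone cannot.
candidates = subjects_labeled(bare_string, statements)
```

Here `candidates` holds two different identifiers for the same four-character string, which is the depth the string by itself does not have.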

Unfortunately, RDF was burdened by the need for entirely new identifiers to replace those already in place, an inability to easily distinguish identifier URIs from URIs that lead to subjects of conversation, and encoding requirements that reduced the population of potential RDF authors to a righteous remnant.

Despite its limitations and architectural flaws, RDF is evidence that strings are indeed shallow. Not to mention that if we could give strings depth, their usefulness would be greatly increased.

One method for imputing more depth to strings is natural language processing (NLP). Modern NLP techniques are based on statistical analysis of large data sets and are most accurate for very common cases. The statistical nature of NLP makes applying those techniques to very small amounts of text, or to text with unusual styles of usage, problematic.

The limits of statistical techniques aren’t a criticism of NLP but rather an observation that, depending on your data and the level of accuracy you need, such techniques may or may not be useful.

What is acceptable for imputing depth to strings in movie reviews is unlikely to be thought so when deciphering a manual for disassembling an atomic weapon. The question isn’t whether NLP can impute depth to strings but whether that imputation is sufficiently accurate for your use case.
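To make the statistical point concrete, here is a toy most-frequent-sense baseline, a deliberately simple stand-in for real NLP systems rather than a description of how any of them work. The training pairs, the sense names, and the helper functions are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy, invented observations: (word, intended sense) pairs.
observations = [
    ("bank", "financial institution"),
    ("bank", "financial institution"),
    ("bank", "financial institution"),
    ("bank", "river edge"),
]


def train(pairs):
    """Count how often each sense was observed for each word."""
    counts = defaultdict(Counter)
    for word, sense in pairs:
        counts[word][sense] += 1
    return counts


def most_likely_sense(word, counts):
    """Return (sense, estimated probability), or (None, 0.0) if unseen."""
    if word not in counts:
        return None, 0.0
    sense, n = counts[word].most_common(1)[0]
    return sense, n / sum(counts[word].values())


model = train(observations)
common_guess = most_likely_sense("bank", model)  # backed by 3 of 4 observations
unseen_guess = most_likely_sense("mom", model)   # no data, so no depth to impute
```

The guess for the common case is right three times out of four on this toy data, always wrong for the rare sense, and empty for a word it has never seen. Whether that is good enough depends on the use case.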

Of course, RDF and NLP aren’t the only two means for imputing depth to strings.

We will take up another method for giving strings depth tomorrow.
