Archive for the ‘Semantics’ Category

Metron – A Fist Full of Subjects

Monday, April 24th, 2017

Metron – Apache Incubator

From the description:

Metron integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis. Metron provides capabilities for log aggregation, full packet capture indexing, storage, advanced behavioral analytics and data enrichment, while applying the most current threat-intelligence information to security telemetry within a single platform.

Metron can be divided into 4 areas:

  1. A mechanism to capture, store, and normalize any type of security telemetry at extremely high rates. Because security telemetry is constantly being generated, it requires a method for ingesting the data at high speeds and pushing it to various processing units for advanced computation and analytics.
  2. Real time processing and application of enrichments such as threat intelligence, geolocation, and DNS information to telemetry being collected. The immediate application of this information to incoming telemetry provides the context and situational awareness, as well as the “who” and “where” information that is critical for investigation.
  3. Efficient information storage based on how the information will be used:
    1. Logs and telemetry are stored such that they can be efficiently mined and analyzed for concise security visibility
    2. The ability to extract and reconstruct full packets helps an analyst answer questions such as who the true attacker was, what data was leaked, and where that data was sent
    3. Long-term storage not only increases visibility over time, but also enables advanced analytics such as machine learning techniques to be used to create models on the information. Incoming data can then be scored against these stored models for advanced anomaly detection.
  4. An interface that gives a security investigator a centralized view of data and alerts passed through the system. Metron’s interface presents alert summaries with threat intelligence and enrichment data specific to that alert on one single page. Furthermore, advanced search capabilities and full packet extraction tools are presented to the analyst for investigation without the need to pivot into additional tools.

Big data is a natural fit for powerful security analytics. The Metron framework integrates a number of elements from the Hadoop ecosystem to provide a scalable platform for security analytics, incorporating such functionality as full-packet capture, stream processing, batch processing, real-time search, and telemetry aggregation. With Metron, our goal is to tie big data into security analytics and drive towards an extensible centralized platform to effectively enable rapid detection and rapid response for advanced security threats.

Some useful links:

Metron (website)

Metron wiki

Metron Jira

Metron Git

Security threats aren’t going to assign themselves unique and immutable IDs. Which means they will be identified by characteristics and associated with particular acts (think associations), which are composed of other subjects, such as particular malware, dates, etc.

Being able to robustly share such identifications (unlike the “we’ve seen this before at some unknown time, with unknown characteristics,” typical of Russian attribution reports) would be a real plus.

Looks like a great opportunity for topic maps-like thinking.

Yes?
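A minimal sketch of what I mean, in Python. Everything in it is an assumption for illustration: the report structure, the `merge_reports` name and the made-up hashes and domains have nothing to do with Metron's actual data model. The point is only that subjects are identified by sets of characteristics, and that two reports sharing any one characteristic are treated as being about the same subject.

```python
from typing import Dict, List, Set

Report = Dict[str, Set[str]]


def merge_reports(reports: List[Report]) -> List[Report]:
    """Merge reports that share at least one identifying characteristic."""
    merged: List[Report] = []
    for report in reports:
        current = {
            "identifiers": set(report["identifiers"]),
            "notes": set(report["notes"]),
        }
        remaining: List[Report] = []
        for subject in merged:
            if subject["identifiers"] & current["identifiers"]:
                # Same subject: pool what is known about it.
                current["identifiers"] |= subject["identifiers"]
                current["notes"] |= subject["notes"]
            else:
                remaining.append(subject)
        remaining.append(current)
        merged = remaining
    return merged


# Hypothetical reports; the identifiers are invented.
reports = [
    {"identifiers": {"sha256:aaa", "c2:evil.example"}, "notes": {"seen 2017-03-01"}},
    {"identifiers": {"sha256:bbb"}, "notes": {"unrelated sample"}},
    {"identifiers": {"c2:evil.example", "mutex:xyz"}, "notes": {"seen 2017-04-10"}},
]

for subject in merge_reports(reports):
    print(sorted(subject["identifiers"]), sorted(subject["notes"]))
```

The first and third reports collapse into one subject because they share a command-and-control domain, and the merged subject carries both dates. No one had to agree on an ID in advance.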

Topic Maps: On the Cusp of Success (Curate in Place/Death of ETL?)

Tuesday, February 9th, 2016

The Bright Future of Semantic Graphs and Big Connected Data by Alex Woodie.

From the post:

Semantic graph technology is shaping up to play a key role in how organizations access the growing stores of public data. This is particularly true in the healthcare space, where organizations are beginning to store their data using so-called triple stores, often defined by the Resource Description Framework (RDF), which is a model for storing metadata created by the World Wide Web Consortium (W3C).

One person who’s bullish on the prospects for semantic data lakes is Shawn Dolley, Cloudera’s big data expert for the health and life sciences market. Dolley says semantic technology is on the cusp of breaking out and being heavily adopted, particularly among healthcare providers and pharmaceutical companies.

“I have yet to speak with a large pharmaceutical company where there’s not a small group of IT folks who are working on the open Web and are evaluating different technologies to do that,” Dolley says. “These are visionaries who are looking five years out, and saying we’re entering a world where the only way for us to scale….is to not store it internally. Even with Hadoop, the data sizes are going to be too massive, so we need to learn and think about how to federate queries.”

By storing healthcare and pharmaceutical data as semantic triples using graph databases such as Franz’s AllegroGraph, it can dramatically lower the hurdles to accessing huge stores of data stored externally. “Usually the primary use case that I see for AllegroGraph is creating a data fabric or a data ecosystem where they don’t have to pull the data internally,” Dolley tells Datanami. “They can do seamless queries out to data and curate it as it sits, and that’s quite appealing.”

….

This is leading-edge stuff, and there are few mission-critical deployments of semantic graph technologies being used in the real world. However, there are a few of them, and the one that keeps popping up is the one at Montefiore Health System in New York City.

Montefiore is turning heads in the healthcare IT space because it was the first hospital to construct a “longitudinally integrated, semantically enriched” big data analytic infrastructure in support of “next-generation learning healthcare systems and precision medicine,” according to Franz, which supplied the graph database at the heart of the health data lake. Cloudera’s free version of Hadoop provided the distributed architecture for Montefiore’s semantic data lake (SDL), while other components and services were provided by tech big wigs Intel (NASDAQ: INTC) and Cisco Systems (NASDAQ: CSCO).

This approach to building an SDL will bring about big improvements in healthcare, says Dr. Parsa Mirhaji MD. PhD., the director of clinical research informatics at Einstein College of Medicine and Montefiore Health System.

“Our ability to conduct real-time analysis over new combinations of data, to compare results across multiple analyses, and to engage patients, practitioners and researchers as equal partners in big-data analytics and decision support will fuel discoveries, significantly improve efficiencies, personalize care, and ultimately save lives,” Dr. Mirhaji says in a press release. (emphasis added)

If I hadn’t known better, reading passages like:

the only way for us to scale….is to not store it internally

learn and think about how to federate queries

seamless queries out to data and curate it as it sits

I would have sworn I was reading a promotion piece for topic maps!

Of course, it doesn’t mention how to discover valuable data not written in your terminology, but you have to hold something back for the first presentation to the CIO.

The growth of data sets too large for ETL is icing on the cake for topic maps.

Why ETL when the data “appears” as I choose to view it? My topic map may be quite small, at least in relationship to the data set proper.

[Image: computer-money]

OK, truth-in-advertising moment, it won’t be quite that easy!

And I don’t take small bills. 😉 Diamonds, other valuable commodities, foreign deposit arrangements can be had.

People are starting to think in a “topic mappish” sort of way. Or at least a way where topic maps deliver what they are looking for.

That’s the key: What do they want?

Then use a topic map to deliver it.

Interpretation Under Ambiguity [First Cut Search Results]

Sunday, February 7th, 2016

Interpretation Under Ambiguity by Peter Norvig.

From the paper:

Introduction

This paper is concerned with the problem of semantic and pragmatic interpretation of sentences. We start with a standard strategy for interpretation, and show how problems relating to ambiguity can confound this strategy, leading us to a more complex strategy. We start with the simplest of strategies:

Strategy 1: Apply syntactic rules to the sentence to derive a parse tree, then apply semantic rules to get a translation into some logical form, and finally do a pragmatic interpretation to arrive at the final meaning.

Although this strategy completely ignores ambiguity, and is intended as a sort of strawman, it is in fact a commonly held approach. For example, it is approximately the strategy assumed by Montague grammar, where `pragmatic interpretation’ is replaced by `model theoretic interpretation.’ The problem with this strategy is that ambiguity can strike at the lexical, syntactic, semantic, or pragmatic level, introducing multiple interpretations. The obvious way to counter this problem is as follows:

Strategy 2: Apply syntactic rules to the sentence to derive a set of parse trees, then apply semantic rules to get a set of translations in some logical form, discarding any inconsistent formulae. Finally compute pragmatic interpretation scores for each possibility, to arrive at the `best’ interpretation (i.e. `most consistent’ or `most likely’ in the given context).

In this framework, the lexicon, grammar, and semantic and pragmatic interpretation rules determine a mapping between sentences and meanings. A string with exactly one interpretation is unambiguous, one with no interpretation is anomalous, and one with multiple interpretations is ambiguous. To enumerate the possible parses and logical forms of a sentence is the proper job of a linguist; to then choose from the possibilities the one “correct” or “intended” meaning of an utterance is an exercise in pragmatics or Artificial Intelligence.

One major problem with Strategy 2 is that it ignores the difference between sentences that seem truly ambiguous to the listener, and those that are only found to be ambiguous after careful analysis by the linguist. For example, each of (1-3) is technically ambiguous (with could signal the instrument or accompanier case, and port could be a harbor or the left side of a ship), but only (3) would be seen as ambiguous in a neutral context.

(1) I saw the woman with long blond hair.
(2) I drank a glass of port.
(3) I saw her duck.

Lotfi Zadeh (personal communication) has suggested that ambiguity is a matter of degree. He assumes each interpretation has a likelihood score attached to it. A sentence with a large gap between the highest and second ranked interpretation has low ambiguity; one with nearly-equal ranked interpretations has high ambiguity; and in general the degree of ambiguity is inversely proportional to the sharpness of the drop-off in ranking. So, in (1) and (2) above, the degree of ambiguity is below some threshold, and thus is not noticed. In (3), on the other hand, there are two similarly ranked interpretations, and the ambiguity is perceived as such. Many researchers, from Hockett (1954) to Jackendoff (1987), have suggested that the interpretation of sentences like (3) is similar to the perception of visual illusions such as the Necker cube or the vase/faces or duck/rabbit illusion. In other words, it is possible to shift back and forth between alternate interpretations, but it is not possible to perceive both at once. This leads us to Strategy 3:

Strategy 3: Do syntactic, semantic, and pragmatic interpretation as in Strategy 2. Discard the low-ranking interpretations, according to some threshold function. If there is more than one interpretation remaining, alternate between them.

Strategy 3 treats ambiguity seriously, but it leaves at least four problems untreated. One problem is the practicality of enumerating all possible parses and interpretations. A second is how syntactic and lexical preferences can lead the reader to an unlikely interpretation. Third, we can change our mind about the meaning of a sentence-“at first I thought it meant this, but now I see it means that.” Finally, our affectual reaction to ambiguity is variable. Ambiguity can go unnoticed, or be humorous, confusing, or perfectly harmonious. By `harmonious,’ I mean that several interpretations can be accepted simultaneously, as opposed to the case where one interpretation is selected. These problems will be addressed in the following sections.

Apologies for the long introduction quote but I want to entice you to read Norvig’s essay in full and if you have the time, the references that he cites.
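To make the quoted Strategy 3 and Zadeh's "degree of ambiguity" a little more concrete, here is a rough Python sketch. It is not from Norvig's paper. The likelihood scores are invented, the ratio of second-best to best score is just one possible way to measure the drop-off in ranking, and the function names and the 0.5 threshold are my assumptions.

```python
from typing import Dict, List


def degree_of_ambiguity(scores: Dict[str, float]) -> float:
    """Ratio of second-best to best score: near 0 is clear, near 1 is ambiguous."""
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        return 0.0
    return ranked[1] / ranked[0]


def surviving_interpretations(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Strategy 3: discard readings scoring below `threshold` times the best one."""
    if not scores:
        return []
    best = max(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [reading for reading, score in ranked if score >= threshold * best]


# Invented scores for sentences (2) and (3) in the quote.
port = {"wine": 0.97, "harbor": 0.02, "left side of a ship": 0.01}
duck = {"noun: the duck she owns": 0.52, "verb: she ducked": 0.48}

for name, scores in (("port", port), ("duck", duck)):
    print(name, round(degree_of_ambiguity(scores), 2), surviving_interpretations(scores))
```

With these numbers only one reading of "port" survives the threshold, while both readings of "duck" do, which is the case where Strategy 3 says to alternate between them.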

It’s the literature you will have to master to use search engines and develop indexing strategies.

At least for one approach to search and indexing.

That within a language there is enough commonality for automated indexing or searching to be useful has been proven over and over again by Internet search engines.

But at the same time, the first twenty or so results typically leave you wondering what interpretation the search engine put on your words.

As I said, Peter’s approach is useful, at least for a first cut at search results.

The problem is that the first cut has become the norm for “success” of search results.

That works if I want to pay lawyers, doctors, teachers and others to find the same results as others have found before (past tense).

That cost doesn’t appear as a line item in any budget but repetitive “finding” of the same information over and over again is certainly a cost to any enterprise.

First cut on semantic interpretation, follow Norvig.

Saving re-finding costs, and the cost of not-finding, requires something more robust than one model to find words and in the search darkness bind them to particular meanings.

PS: See Peter@norvig.com for an extensive set of resources, papers, presentations, etc.

I first saw this in a tweet by James Fuller.

Spontaneous Preference for their Own Theories (SPOT effect) [SPOC?]

Thursday, February 4th, 2016

The SPOT Effect: People Spontaneously Prefer their Own Theories by Aiden P. Gregg, Nikhila Mahadevan, and Constantine Sedikides.

Abstract:

People often exhibit confirmation bias: they process information bearing on the truth of their theories in a way that facilitates their continuing to regard those theories as true. Here, we tested whether confirmation bias would emerge even under the most minimal of conditions. Specifically, we tested whether drawing a nominal link between the self and a theory would suffice to bias people towards regarding that theory as true. If, all else equal, people regard the self as good (i.e., engage in self-enhancement), and good theories are true (in accord with their intended function), then people should regard their own theories as true; otherwise put, they should manifest a Spontaneous Preference for their Own Theories (i.e., a SPOT effect). In three experiments, participants were introduced to a theory about which of two imaginary alien species preyed upon the other. Participants then considered in turn several items of evidence bearing on the theory, and each time evaluated the likelihood that the theory was true versus false. As hypothesized, participants regarded the theory as more likely to be true when it was arbitrarily ascribed to them as opposed to an “Alex” (Experiment 1) or to no one (Experiment 2). We also found that the SPOT effect failed to converge with four different indices of self-enhancement (Experiment 3), suggesting it may be distinctive in character.

I can’t give you the details on this article because it is fire-walled.

But the catch phrase, “Spontaneous Preference for their Own Theories (i.e., a SPOT effect)” certainly fits every discussion of semantics I have ever read or heard.

With a little funding you could prove the corollary, Spontaneous Preference for their Own Code (the SPOC effect) among programmers. 😉

There are any number of formulations for how to fight confirmation bias but Jeremy Dean puts it this way:


The way to fight the confirmation bias is simple to state but hard to put into practice.

You have to try and think up and test out alternative hypothesis. Sounds easy, but it’s not in our nature. It’s no fun thinking about why we might be misguided or have been misinformed. It takes a bit of effort.

It’s distasteful reading a book which challenges our political beliefs, or considering criticisms of our favourite film or, even, accepting how different people choose to live their lives.

Trying to be just a little bit more open is part of the challenge that the confirmation bias sets us. Can we entertain those doubts for just a little longer? Can we even let the facts sway us and perform that most fantastical of feats: changing our minds?

I wonder if that includes imagining using JSON? (shudder) 😉

Hard to do, particularly when we are talking about semantics and what we “know” to be the best practices.

Examples of trying to escape the confirmation bias trap and the results?

Perhaps we can encourage each other.

The Semasiology of Open Source [How Do You Define Source?]

Wednesday, January 20th, 2016

The Semasiology of Open Source by Robert Lefkowitz (Then, VP Enterprise Systems & Architecture, AT&T Wireless) 2004. Audio file.

Robert’s keynote from the Open Source Convention (OSCON) 2004 in Portland, Oregon.

From the description:

Semasiology, n. The science of meanings or sense development (of words); the explanation of the development and changes of the meanings of words. Source: Webster’s Revised Unabridged Dictionary, 1996, 1998 MICRA, Inc. “Open source doesn’t just mean access to the source code.” So begins the Open Source Definition. What then, does access to the source code mean? Seen through the lens of an Enterprise user, what does open source mean? When is (or isn’t) it significant? And a catalogue of open source related arbitrage opportunities.

If you haven’t heard this keynote (I hadn’t), do yourself a favor and make time to listen to it.

I do have one complaint: It’s not long enough. 😉

Enjoy!

Street-Fighting Mathematics – Free Book – Lesson For Semanticists?

Friday, January 1st, 2016

Street-Fighting Mathematics: The Art of Educated Guessing and Opportunistic Problem Solving by Sanjoy Mahajan.

From the webpage:

In problem solving, as in street fighting, rules are for fools: do whatever works—don’t just stand there! Yet we often fear an unjustified leap even though it may land us on a correct result. Traditional mathematics teaching is largely about solving exactly stated problems exactly, yet life often hands us partly defined problems needing only moderately accurate solutions. This engaging book is an antidote to the rigor mortis brought on by too much mathematical rigor, teaching us how to guess answers without needing a proof or an exact calculation.

In Street-Fighting Mathematics, Sanjoy Mahajan builds, sharpens, and demonstrates tools for educated guessing and down-and-dirty, opportunistic problem solving across diverse fields of knowledge—from mathematics to management. Mahajan describes six tools: dimensional analysis, easy cases, lumping, picture proofs, successive approximation, and reasoning by analogy. Illustrating each tool with numerous examples, he carefully separates the tool—the general principle—from the particular application so that the reader can most easily grasp the tool itself to use on problems of particular interest. Street-Fighting Mathematics grew out of a short course taught by the author at MIT for students ranging from first-year undergraduates to graduate students ready for careers in physics, mathematics, management, electrical engineering, computer science, and biology. They benefited from an approach that avoided rigor and taught them how to use mathematics to solve real problems.

I have just started reading Street-Fighting Mathematics but I wonder if there is a parallel between mathematics and the semantics that everyone talks about capturing from information systems.

Consider this line:

Traditional mathematics teaching is largely about solving exactly stated problems exactly, yet life often hands us partly defined problems needing only moderately accurate solutions.

And re-cast it for semantics:

Traditional semantics (Peirce, FOL, SUMO, RDF) is largely about solving exactly stated problems exactly, yet life often hands us partly defined problems needing only moderately accurate solutions.

What if the semantics we capture and apply are sufficient for your use case? Complete with ROI for that use case.

Is that sufficient?

‘Picard and Dathon at El-Adrel’

Saturday, December 26th, 2015

Machines, Lost In Translation: The Dream Of Universal Understanding by Anne Li.

From the post:

It was early 1954 when computer scientists, for the first time, publicly revealed a machine that could translate between human languages. It became known as the Georgetown-IBM experiment: an “electronic brain” that translated sentences from Russian into English.

The scientists believed a universal translator, once developed, would not only give Americans a security edge over the Soviets but also promote world peace by eliminating language barriers.

They also believed this kind of progress was just around the corner: Leon Dostert, the Georgetown language scholar who initiated the collaboration with IBM founder Thomas Watson, suggested that people might be able to use electronic translators to bridge several languages within five years, or even less.

The process proved far slower. (So slow, in fact, that about a decade later, funders of the research launched an investigation into its lack of progress.) And more than 60 years later, a true real-time universal translator — a la C-3PO from Star Wars or the Babel Fish from The Hitchhiker’s Guide to the Galaxy — is still the stuff of science fiction.

How far are we from one, really? Expert opinions vary. As with so many other areas of machine learning, it depends on how quickly computers can be trained to emulate human thinking.

The Star Trek: The Next Generation episode Darmok was set during a five-year mission that began in 2364, some 349 years in our future. Faster than light travel, teleportation, etc. are day to day realities. One expects machine translation to have improved at least as much.

As Li reports, exciting progress is being made with neural networks for translation, but transposing words from one language to another, as illustrated in Darmok, isn’t a guarantee of “universal understanding.”

In fact, the transposition may be as opaque as the statement in its original language: “Darmok and Jalad at Tanagra” leaves the hearer to wonder what happened at Tanagra, what was the relationship between Darmok and Jalad, etc.

In the early lines of The Story of the Shipwrecked Sailor, a Middle Kingdom (Egypt, 2000 BCE – 1700 BCE) story, there is a line that describes the sailor returning home and words to the effect “…we struck….” Then the next sentence picks up.

The words necessary to complete that statement don’t occur in the text. You have to know that mooring boats on the Nile did not involve piers, etc. but simply banking your boat and then driving a post (the unstated subject of “we struck”) to secure the vessel.

Transposition from Middle Egyptian to English leaves you without a clue as to the meaning of that passage.

To be sure, neural networks may clear away some of the rote work of transposition between languages but that is a far cry from “universal understanding.”

Both now and likely to continue into the 24th century.

Apache Ignite – In-Memory Data Fabric – With No Semantics

Friday, December 25th, 2015

I saw a tweet from the Apache Ignite project pointing to its contributors page: Start Contributing.

The documentation describes Apache Ignite™ as:

Apache Ignite™ In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash-based technologies.

If you think that is impressive, here’s a block representation of Ignite:

[Image: apache-ignite block diagram]

Or a more textual view:

You can view Ignite as a collection of independent, well-integrated, in-memory components geared to improve performance and scalability of your application. Some of these components include:

Imagine my surprise when a search on “semantics” said

“No Results Found.”

Even without data, whose semantics could be documented, there should be hooks for documenting the semantics of future data.

I’m not advocating Apache Ignite jury-rig some means of documenting the semantics of data and Ignite processes.

The need for semantic documentation varies: what is sufficient for one case will be wholly inadequate for another. Not to mention that documentation and semantics often require different skills than those possessed by most developers.

What semantics do you need documented with your Apache Ignite installation?

An XQuery Module For Simplifying Semantic Namespaces

Wednesday, December 23rd, 2015

An XQuery Module For Simplifying Semantic Namespaces by Kurt Cagle.

From the post:

While I enjoy working with the MarkLogic 8 server, there are a number of features about the semantics library there that I still find a bit problematic. Declaring namespaces for semantics in particular is a pain—I normally have trouble remembering the namespaces for RDF or RDFS or OWL, even after working with them for several years, and once you start talking about namespaces that are specific to your own application domain, managing this list can get onerous pretty quickly.

I should point out however, that namespaces within semantics can be very useful in helping to organize and design an ontology, even a non-semantic ontology, and as such, my applications tend to be namespace rich. However, when working with Turtle, Sparql, RDFa, and other formats of namespaces, the need to incorporate these namespaces can be a real showstopper for any developer. Thus, like any good developer, I decided to automate my pain points and create a library that would allow me to simplify this process.

The code given here is in turtle and xquery, but I hope to build out similar libraries for use in JavaScript shortly. When I do, I’ll update this article to reflect those changes.

If you are forced to use a MarkLogic 8 server, great post on managing semantic namespaces.

If you have a choice of tools, something to consider before you willingly choose to use a MarkLogic 8 server.
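For anyone who hasn't felt the namespace pain Kurt describes, here is a small Python sketch of the general idea: keep the prefixes in one registry and let code expand them. This is not Kurt's XQuery module and makes no claims about the MarkLogic API. The function names and the `ex` namespace are made up; the `rdf`, `rdfs` and `owl` URIs are the standard W3C ones.

```python
from typing import Dict

# Standard W3C namespaces, plus "ex" as a made-up application namespace.
PREFIXES: Dict[str, str] = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "ex": "http://example.org/ns#",
}


def expand(curie: str) -> str:
    """Expand a prefixed name such as 'rdf:type' into a full IRI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"unknown or missing prefix in {curie!r}")
    return PREFIXES[prefix] + local


def sparql_prologue(*prefixes: str) -> str:
    """Build the PREFIX declarations a SPARQL query would otherwise repeat by hand."""
    return "\n".join(f"PREFIX {p}: <{PREFIXES[p]}>" for p in prefixes)


print(expand("rdf:type"))
print(sparql_prologue("rdfs", "owl"))
```

One registry, no remembering whether the RDF namespace ends in a hash mark or a slash.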

I first saw this in a tweet by XQuery.

Debugging with the Scientific Method [Debugging Search Semantics]

Tuesday, November 17th, 2015

Debugging with the Scientific Method by Stuart Halloway.

This webpage points to a video of Stuart’s keynote address at Clojure/conj 2015 with the same title and has pointers to other resources on debugging.

Stuart summarizes the scientific method for debugging in his closing as:


know where you are going

make well-founded choices

write stuff down

Programmers, using Clojure or not, will profit from Stuart’s advice on debugging program code.

A group that Stuart does not mention, those of us interested in creating search interfaces for users, will benefit as well.

We have all had a similar early library experience: facing (in my youth) what seemed like an endless rack of card files, with the desire to find information on a subject.

Of course the first problem, from Stuart’s summary, is that we don’t know where we are going. At best we have an ill-defined topic on which we are supposed to produce a report. Let’s say “George Washington, father of our country” for example. (Yes, U.S. specific but I wasn’t in elementary school outside of the U.S. Feel free to post or adapt this with other examples.)

The first step, with help from a librarian, is to learn the basic author, subject, title organization of the card catalog, and things like the fact that looking for “George Washington” under “George” isn’t likely to produce a useful result. Skipping over the other details that a librarian would convey, you are somewhat equipped to move to step two.

Understanding the basic organization and mechanics of a library card catalog, you can develop a plan to search for information on George Washington. Such a plan would include excluding works over the reading level of the searcher, for example.

The third step of course is to capture all the information that is found from the resources located by using the library card catalog.

I mention that scenario not just out of nostalgia for card catalogs but to illustrate the difference between a card catalog and its electronic counterparts, which have an externally defined schema and search interfaces with no disclosed search semantics.

That is to say, if a user doesn’t find an expected result for their search, how do you debug that failure?

You could say the user should have used “term X” instead of “term Y” but that isn’t solving the search problem, that is fixing the user.

Fixing users, as any 12-step program can attest, is a very difficult process, fraught with failure.

Fixing search semantics, debugging search semantics as it were, can fix the search results for a large number of users with little or no effort on their part.

There are any number of examples of debugging or fixing search semantics, but the most prominent one that comes to mind is spelling correction by search engines, which return results with the “correct” spelling and offer the user an opportunity to pursue their “incorrect” spelling.

At one time search engines returned “no results” in the event of mis-spelled words.
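A minimal sketch of that kind of fix, in Python, assuming a toy in-memory index. The index contents, the `search` function and the 0.8 similarity cutoff are all made up for illustration; real search engines do something far more elaborate. The only library call is `difflib.get_close_matches` from the standard library.

```python
import difflib
from typing import Dict, List, Optional, Tuple

# Hypothetical toy index: search term -> titles.
INDEX: Dict[str, List[str]] = {
    "washington": ["George Washington: A Life", "Washington Crossing the Delaware"],
    "jefferson": ["Thomas Jefferson and Monticello"],
}


def search(term: str) -> Tuple[List[str], Optional[str]]:
    """Return (results, corrected_term); corrected_term is None if the term matched as typed."""
    key = term.lower()
    if key in INDEX:
        return INDEX[key], None
    # Instead of "no results," fall back to the closest indexed term, if any is close enough.
    candidates = difflib.get_close_matches(key, list(INDEX), n=1, cutoff=0.8)
    if candidates:
        return INDEX[candidates[0]], candidates[0]
    return [], None


query = "Washingtin"
results, corrected = search(query)
if corrected:
    print(f"Showing results for '{corrected}'. Search instead for '{query}'?")
for title in results:
    print(title)
```

The user typed the "wrong" term and still got an answer. The fix went into the search semantics, not into the user.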

The reason I mention this is that you are likely to be debugging search semantics on a less than global scale, but the same principle applies, as does Stuart’s scientific method.

Treat complaints about search results as an opportunity to debug the search semantics of your application. Follow up with users and test your improved search semantics.

Recalling that, in all events, some user signs your check, not your application.

Text Mining Meets Neural Nets: Mining the Biomedical Literature

Wednesday, October 28th, 2015

Text Mining Meets Neural Nets: Mining the Biomedical Literature by Dan Sullivan.

From the webpage:

Text mining and natural language processing employ a range of techniques from syntactic parsing, statistical analysis, and more recently deep learning. This presentation presents recent advances in dense word representations, also known as word embedding, and their advantages over sparse representations, such as the popular term frequency-inverse document frequency (tf-idf) approach. It also discusses convolutional neural networks, a form of deep learning that is proving surprisingly effective in natural language processing tasks. Reference papers and tools are included for those interested in further details. Examples are drawn from the bio-medical domain.

Basically an abstract for the 58 slides you will find here: http://www.slideshare.net/DanSullivan10/text-mining-meets-neural-nets.

The best thing about these slides is the wealth of additional links to other resources. There is only so much you can say on a slide so links to more details should be a standard practice.
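If tf-idf is new to you, here is a small Python sketch of the sparse representation the abstract contrasts with word embeddings. It uses one common variant (term frequency times the log of inverse document frequency). The three "documents" are invented, and none of this code comes from Dan's slides.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by (count / doc length) * log(N / number of docs containing it)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Invented bio-medical flavored snippets.
docs = [
    "gene expression in tumor cells".split(),
    "protein expression in yeast".split(),
    "tumor suppressor gene mutation".split(),
]

for vector in tf_idf(docs):
    print({term: round(weight, 3) for term, weight in vector.items()})
```

Each vector has entries only for the terms that occur in that document, which is the "sparse" part. Note that nothing in these weights says "tumor" and "suppressor" are related in meaning; that gap is what dense word representations are meant to fill.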

Slide 53, “Formalize a Mathematical Model of Semantics,” seems a bit ambitious to me, considering that mathematics is a subset of natural language. It is difficult to see how the lesser could model the greater.

You could create a mathematical model of some semantics and say it was all that is necessary, but that’s been done before. Always strive to make new mistakes.

Goodbye to True: Advancing semantics beyond the black and white

Thursday, October 15th, 2015

Goodbye to True: Advancing semantics beyond the black and white by Chris Welty.

Abstract:

The set-theoretic notion of truth proposed by Tarski is the basis of most work in machine semantics and probably has its roots in the work and influence of Aristotle. We take it for granted that the world can be described, not in shades of grey, but in terms of statements and propositions that are either true or false – and it seems most of western science stands on the same principle. This assumption at the core of our training as scientists should be questioned, because it stands in direct opposition to our human experience. Is there any statement that can be made that can actually be reduced to true or false? Only, it seems, in the artificial human-created realms of mathematics, games, and logic. We have been investigating a different mode of truth, inspired by results in Crowdsourcing, which allows for a highly dimension notion of semantic interpretation that makes true and false look like a childish simplifying assumption.

Chris was the keynote speaker at the Third International Workshop on Linked Data for Information Extraction (LD4IE2015). (Proceedings)

I wasn’t able to find a video for that presentation but I did find “Chris Welty formerly IBM Watson Team – Cognitive Computing GDG North Jersey at MSU” from about ten months ago.

Great presentation on “cognitive computing.”

Enjoy!

Are You Deep Mining Shallow Data?

Monday, September 21st, 2015

Do you remember this verse of Simple Simon?

Simple Simon went a-fishing,

For to catch a whale;

All the water he had got,

Was in his mother’s pail.

Shallow data?

To illustrate, fill in the following statement:

My mom makes the best _____.

Before completing that statement, you resolved the common noun, “mom,” differently than I did.

The string carries no clue as to the resolution of “mom” by any reader.

The string also gives no clues as to how it would be written in another language.

With a string, all you get is the string, or in other words:

All strings are shallow.

That applies to the strings we use to add depth to strings but we will reach that issue shortly.

One of the few things that RDF got right was:

…RDF puts the information in a formal way that a machine can understand. The purpose of RDF is to provide an encoding and interpretation mechanism so that resources can be described in a way that particular software can understand it; in other words, so that software can access and use information that it otherwise couldn’t use. (quote from Wikipedia on RDF)

In addition to the string, RDF posits an identifier in the form of a URI which you can follow to discover more information about that portion of string.
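A minimal sketch of the difference between a bare string and a string paired with an identifier. The `Term` class and both `example.org` URIs are hypothetical placeholders of my own, and this is not RDF syntax, only the idea behind it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Term:
    text: str        # the shallow string a reader sees
    identifier: str  # a URI to follow for more about what the string means

# Both URIs are hypothetical placeholders, not real identifiers.
my_mom = Term("mom", "http://example.org/people/author-mother")
your_mom = Term("mom", "http://example.org/people/reader-mother")

same_string = my_mom.text == your_mom.text
same_subject = my_mom.identifier == your_mom.identifier
```

The two strings compare equal while the identifiers do not, which is the distinction a string alone cannot carry.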

Unfortunately RDF was burdened by the need for all new identifiers to replace those already in place, an inability to easily distinguish identifier URIs from URIs that lead to subjects of conversation, and encoding requirements that reduced the population of potential RDF authors to a righteous remnant.

Despite its limitations and architectural flaws, RDF is evidence that strings are indeed shallow. Not to mention that if we could give strings depth, their usefulness would be greatly increased.

One method for imputing more depth to strings is natural language processing (NLP). Modern NLP techniques are based on statistical analysis of large data sets and are the most accurate for very common cases. The statistical nature of NLP makes application of those techniques to very small amounts of text or ones with unusual styles of usage problematic.

Noting the limits of statistical techniques isn’t a criticism of NLP but rather an observation that, depending on the level of accuracy desired and your data, such techniques may or may not be useful.

What is acceptable for imputing depth to strings in movie reviews is unlikely to be thought so when deciphering a manual for disassembling an atomic weapon. The question isn’t whether NLP can impute depth to strings but whether that imputation is sufficiently accurate for your use case.

Of course, RDF and NLP aren’t the only two means for imputing depth to strings.

We will take up another method for giving strings depth tomorrow.

Web Page Structure, Without The Semantic Web

Saturday, May 30th, 2015

Could a Little Startup Called Diffbot Be the Next Google?

From the post:


Diffbot founder and CEO Mike Tung started the company in 2009 to fix a problem: there was no easy, automated way for computers to understand the structure of a Web page. A human looking at a product page on an e-commerce site, or at the front page of a newspaper site, knows right away which part is the headline or the product name, which part is the body text, which parts are comments or reviews, and so forth.

But a Web-crawler program looking at the same page doesn’t know any of those things, since these elements aren’t described as such in the actual HTML code. Making human-readable Web pages more accessible to software would require, as a first step, a consistent labeling system. But the only such system to be seriously proposed, Tim Berners-Lee’s Semantic Web, has long floundered for lack of manpower and industry cooperation. It would take a lot of people to do all the needed markup, and developers around the world would have to adhere to the Resource Description Framework prescribed by the World Wide Web Consortium.

Tung’s big conceptual leap was to dispense with all that and attack the labeling problem using computer vision and machine learning algorithms—techniques originally developed to help computers make sense of edges, shapes, colors, and spatial relationships in the real world. Diffbot runs virtual browsers in the cloud that can go to a given URL; suck in the page’s HTML, scripts, and style sheets; and render it just as it would be shown on a desktop monitor or a smartphone screen. Then edge-detection algorithms and computer-vision routines go to work, outlining and measuring each element on the page.

Using machine-learning techniques, this geometric data can then be compared to frameworks or “ontologies”—patterns distilled from training data, usually by humans who have spent time drawing rectangles on Web pages, painstakingly teaching the software what a headline looks like, what an image looks like, what a price looks like, and so on. The end result is a marked-up summary of a page’s important parts, built without recourse to any Semantic Web standards.

The irony here, of course, is that much of the information destined for publication on the Web starts out quite structured. The WordPress content-management system behind Xconomy’s site, for example, is built around a database that knows exactly which parts of this article should be presented as the headline, which parts should look like body text, and (crucially, to me) which part is my byline. But these elements get slotted into a layout designed for human readability—not for parsing by machines. Given that every content management system is different and that every site has its own distinctive tags and styles, it’s hard for software to reconstruct content types consistently based on the HTML alone.

There are several themes here that are relevant to topic maps.

First, it is true that most data starts with some structure, styles if you will, before it is presented for user consumption. Imagine an authoring application that automatically, and unknown to its user, captures metadata that can then provide semantics for its data.

Second, the recognition of structure approach being used by Diffbot is promising in the large but should be promising in the small as well. Local documents of a particular type are unlikely to have the variance of documents across the web. Meaning that with far less effort, you can build recognition systems that enable more powerful searching of local document repositories.

Third, and perhaps most importantly, while the results may not be 100% accurate, the question for any such project should be how much accuracy is required? If I am mining social commentary blogs, a 5% error rate on recognition of speakers might be acceptable, because for popular threads or speakers, those errors are going to be quickly corrected. Unpopular threads or authors never followed, does that come under no harm/no foul?

Highly recommended for reading/emulation.

KDE and The Semantic Desktop

Saturday, March 14th, 2015

KDE and The Semantic Desktop by Vishesh Handa.

From the post:

During the KDE4 years the Semantic Desktop was one of the main pillars of KDE. Nepomuk was a massive, all encompassing, and integrated with many different part of KDE. However few people know what The Semantic Desktop was all about, and where KDE is heading.

History

The Semantic Desktop as it was originally envisioned comprised of both the technology and the philosophy behind The Semantic Web.

The Semantic Web is built on top of RDF and Graphs. This is a special way of storing data which focuses more on understanding what the data represents. This was primarily done by carefully annotating what everything means, starting with the definition of a resource, a property, a class, a thing, etc.

This process of all data being stored as RDF, having a central store, with applications respecting the store and following the ontologies was central to the idea of the Semantic Desktop.

The Semantic Desktop cannot exist without RDF. It is, for all intents and purposes, what the term “semantic” implies.

A brief post-mortem on the KDE Semantic Desktop which relied upon NEPOMUK (Networked Environment for Personal, Ontology-based Management of Unified Knowledge) for RDF-based features. (NEPOMUK was an EU project.)

The post mentions complexity more than once. A friend recently observed that RDF was all about supporting AI and not capturing arbitrary statements by a user.

Such as providing alternative identifiers for subjects. With enough alternative identifications (including context, which “scope” partially captures in topic maps), I suspect a deep learning application could do pretty well at subject recognition, including appropriate relationships (associations).

But that would not be by trying to guess or formulate formal rules (a la RDF/OWL) but by capturing the activities of users as they provide alternative identifications of and relationships for subjects.

Hmmm, merging then would be a learned behavior by our applications. Will have to give that some serious thought!

I first saw this in a tweet by Stefano Bertolo.

The Myth of Islamic/Muslim Terrorism

Friday, January 9th, 2015

The recent Charlie Hebdo attacks have given the media a fresh opportunity to refer to “Islamic or Muslim terrorists.” While there is no doubt those who attacked Charlie Hebdo were in fact Muslims, that does not justify referring to them as “Islamic or Muslim terrorists.”

Use of “Islamic or Muslim terrorists” reflects the underlying bigotry of the speaker and/or a failure to realize they are repeating the bigotry of others.

If you doubt my take on Islam since I am not a Muslim, consider What everyone gets wrong about Islam and cartoons of Mohammed by Amanda Taub, who talks to Muslims about the Charlie Hebdo event.

If you want to call the attackers of Charlie Hebdo, well, attackers, murderers, etc., all of that is true and isn’t problematic.

Do you use “Christian terrorists” to refer to American service personnel who kill women and children with cruise missiles, drones and bombs? Or perhaps you would prefer “American terrorists,” or “Israeli terrorists,” as news labels?

Using Islamic or Muslim, you aren’t identifying a person’s motivation, you are smearing a historic and honorable religion with the outrages of the few. Whether that is your intention or not.

I’m not advocating politically correct speech. You can wrap yourself in a cocoon of ignorance and intolerance and speak freely from that position.

But before you are beyond the reach of reasonable speech, let me make a suggestion.

Contact a local mosque and make arrangements to attend outreach events/programs at the mosque. Not just once but go often enough to be a regular participant for several months. You will find Muslims are very much like other people you know. Some you will like and some perhaps not. But it will be as individuals that you like/dislike them, not because of their religion.

As a bonus, in addition to meeting Muslims, you will have an opportunity to learn about Islam first hand.

After such experiences, you will be able to distinguish the acts of a few criminals from a religion with more than a billion followers.

Google’s Secretive DeepMind Startup Unveils a “Neural Turing Machine”

Wednesday, December 31st, 2014

Google’s Secretive DeepMind Startup Unveils a “Neural Turing Machine”

From the post:

One of the great challenges of neuroscience is to understand the short-term working memory in the human brain. At the same time, computer scientists would dearly love to reproduce the same kind of memory in silico.

Today, Google’s secretive DeepMind startup, which it bought for $400 million earlier this year, unveils a prototype computer that attempts to mimic some of the properties of the human brain’s short-term working memory. The new computer is a type of neural network that has been adapted to work with an external memory. The result is a computer that learns as it stores memories and can later retrieve them to perform logical tasks beyond those it has been trained to do.

Of particular interest to topic mappers and folks looking for realistic semantic solutions for big data. In particular the concept of “recoding,” which is how the human brain collapses multiple chunks of data into one chunk for easier access/processing.

It sounds close to referential transparency to me but where the transparency is optional. That is, you don’t have to look unless you need the details.

The full article will more than repay the time it takes to read it:

Neural Turing Machines by Alex Graves, Greg Wayne, Ivo Danihelka.

Abstract:

We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.
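The “attentional processes” in the abstract include content-based addressing: compare a key against every memory row, turn the similarities into weights, and read a weighted blend of the rows. Below is a minimal sketch of just that read step. The function names, the 3x3 memory, and the `beta` value are my own assumptions, and the learned controller, writing, and location-based addressing from the paper are all left out.

```python
import math

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def content_weights(memory, key, beta=5.0):
    """Softmax over beta-scaled cosine similarity between key and each row."""
    scores = [beta * cosine(row, key) for row in memory]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read(memory, weights):
    """Weighted sum of memory rows: a soft read rather than a single lookup."""
    width = len(memory[0])
    return [sum(w * row[i] for w, row in zip(weights, memory)) for i in range(width)]

# Hypothetical memory contents and key.
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
key = [0.9, 0.1, 0.0]
weights = content_weights(memory, key)
vector = read(memory, weights)
```

Because every step is a smooth function of the key and the memory, the whole read can be trained with gradient descent, which is the “differentiable end-to-end” point of the abstract. A larger `beta` sharpens the focus on the best-matching row.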

The paper was revised on 10 December 2014 so if you read an earlier version, you may want to read it again. Whether Google cracks this aspect of the problem of intelligence or not, it sounds like an intriguing technique with applications in topic map/semantic processing.

The Semantics of Victory

Tuesday, December 30th, 2014

NATO holds ceremony closing Afghan mission

From the post:

NATO has held a ceremony in Kabul formally ending its war in Afghanistan, officials said, after 13 years of conflict and gradual troop withdrawals that have left the country in the grip of worsening conflicts with armed groups.

The event was carried out on Sunday in secret due to the threat of Taliban strikes in the Afghan capital, which has been hit by repeated suicide bombings and gun attacks over recent years.

Compare that description to the AP story that appeared in the New York Times under: U.S. and NATO Formally End War in Afghanistan:

The war in Afghanistan, fought for 13 bloody years and still raging, came to a formal end Sunday with a quiet flag-lowering ceremony in Kabul that marked the transition of the fighting from U.S.-led combat troops to the country’s own security forces.

In front of a small, hand-picked audience at the headquarters of the NATO mission, the green-and-white flag of the International Security Assistance Force was ceremonially rolled up and sheathed, and the flag of the new international mission called Resolute Support was hoisted.

I assume from the dates and locations described these two accounts are describing the same event. Yes?

Does “…hand-picked audience…” translate to “…carried out … in secret due to the threat of Taliban strikes…?”

Bias isn’t unique to the United States press or other U.S. sources, but it is easier for me to spot there. Examples from other sources are welcome.

The inevitable loss in Afghanistan is another example of failing to understand the semantics and culture of an opponent. (See my comments about Vietnam in Rare Find: Honest General Speaks Publicly About IS (ISIL, ISIS))

Let me summarize that lesson this way: An opponent cannot be “defeated” until you understand what “defeat” means to that opponent. And, you are capable of inflicting your opponent’s definition of “defeat” upon them.

It’s a two part requirement: 1) Opponent’s understanding of “defeat,” and 2) Inflicting opponent’s understanding of defeat. Fail on either requirement and your opponent has not been defeated.

Semantics are as important in war as in peace, if not more so.

Rare Find: Honest General Speaks Publicly About IS (ISIL, ISIS)

Monday, December 29th, 2014

In Battle to Defang ISIS, U.S. Targets Its Psychology by Eric Schmitt.

From the post:

Maj. Gen. Michael K. Nagata, commander of American Special Operations forces in the Middle East, sought help this summer in solving an urgent problem for the American military: What makes the Islamic State so dangerous?

Trying to decipher this complex enemy — a hybrid terrorist organization and a conventional army — is such a conundrum that General Nagata assembled an unofficial brain trust outside the traditional realms of expertise within the Pentagon, State Department and intelligence agencies, in search of fresh ideas and inspiration. Business professors, for example, are examining the Islamic State’s marketing and branding strategies.

“We do not understand the movement, and until we do, we are not going to defeat it,” he said, according to the confidential minutes of a conference call he held with the experts. “We have not defeated the idea. We do not even understand the idea.” (emphasis added)

An honest member of any administration in Washington is so unusual that I wanted to draw your attention to Maj. General Michael K. Nagata.

His problem, as you will quickly recognize, is one of a diversity of semantics. What is heard one way by a Western audience is heard completely differently by an audience with a different tradition.

The general may not think of it as “progress,” but getting Washington policy makers to acknowledge that there is a legitimate semantic gap between Western policy makers and IS is a huge first step. It can’t be grudging or half-hearted. Western policy makers have to acknowledge that there are honest views of the world that are different from their own. IS isn’t practicing dishonesty, deception, perversely refusing to acknowledge the truth of Western statements, etc. Members of IS have an honest but different semantic view of the world.

If the good general can get policy makers to take that step, then and only then can the discussion of what that “other” semantic is and how to map it into terms comprehensible to Western policy makers begin. If that step isn’t taken, then the resources necessary to explore and map that “other” semantic are never going to be allocated. And even if allocated, the results will never figure into policy making with regard to IS.

Failing on any of those three points: failing to concede the legitimacy of the IS semantic, failing to allocate resources to explore and understand the IS semantic, failing to incorporate an understanding of the IS semantic into policy making, is going to result in a failure to “defeat” IS, if that remains a goal after understanding its semantic.

Need an example? Consider the Vietnam War, in which approximately 58,220 Americans died and millions of Vietnamese, Laotians and Cambodians died, not counting long term injuries among all of the aforementioned. In case you have not heard, the United States lost the Vietnam War.

The reasons for that loss are wide and varied but let me suggest two semantic differences that may have played a role in that defeat. First, the Vietnamese have a long term view of repelling foreign invaders. Consider that Vietnam was occupied by the Chinese from 111 BCE until 938 CE, a period of more than one thousand (1,000) years. American war planners had a war semantic of planning for the next presidential election, not a winning strategy for a foe with a semantic that was two hundred and fifty (250) times longer.

The other semantic difference (among many others) was the understanding of “democracy,” which is usually heralded by American policy makers as a grand prize resulting from American involvement. In Vietnam, however, the villages and hamlets already had what some would consider democracy for centuries. (Beyond Hanoi: Local Government in Vietnam) Different semantic for “democracy” to be sure but one that was left unexplored in the haste to import a U.S. semantic of the concept.

Fighting a war where you don’t understand the semantics in play for the “other” side is risky business.

General Nagata has taken the first step towards such an understanding by admitting that he and his advisors don’t understand the semantics of IS. The next step should be to find someone who does. May I suggest talking to members of IS under informal meeting arrangements? Such that diplomatic protocols and news reporting don’t interfere with honest conversations? I suspect IS members are as ignorant of U.S. semantics as U.S. planners are of IS semantics so there would be some benefit for all concerned.

Such meetings would yield more accurate understandings than those of U.S.-born analysts who live in upper middle-class Western enclaves and attempt to project themselves into foreign cultures. The understanding derived from such meetings could well contradict current U.S. policy assessments and objectives. Whether any administration has the political will to act upon assessments that aren’t the product of a shared post-Enlightenment semantic remains to be seen. But such assessments must be obtained first to answer that question.

Would topic maps help in such an endeavor? Perhaps, perhaps not. The most critical aspect of such a project would be conceding for all purposes, the legitimacy of the “other” semantic, where “other” depends on what side you are on. That is a topic map “state of mind” as it were, where all semantics are treated equally and not any one as more legitimate than any other.


PS: A litmus test for Major General Michael K. Nagata to use in assembling a team to attempt to understand IS semantics: Have each applicant write their description of the 9/11 hijackers in thirty (30) words or less. Any applicant who uses any variant of coward, extremist, terrorist, fanatic, etc. should be wished well and sent on their way. Not a judgement on their fitness for other tasks but they are not going to be able to bridge the semantic gap between current U.S. thinking and that of IS.

The CIA has a report on some of the gaps but I don’t know if it will be easier for General Nagata to ask the CIA for a copy or to just find a copy on the Internet. It illustrates, for example, why the American strategy of killing IS leadership is non-productive if not counter-productive.

If you have the means, please forward this post to General Nagata’s attention. I wasn’t able to easily find a direct means of contacting him.

Semantics of Shootings

Monday, December 22nd, 2014

Depending on how slow news is over the holidays, shootings will be the new hype category. Earlier today I saw a tweet by Sally Kohn that neatly summarizes the semantics of shootings in the United States (your mileage may vary in other places):


Muslim shooter = entire religion guilty

Black shooter = entire race guilty

White shooter = mentally troubled lone wolf

You should print that out and paste it to your television. To keep track of how reporters, elected officials and others react to different types of shootings. Or your own reaction.

PS: This is an example of sarcasm.

Monte-Carlo Tree Search for Multi-Player Games [Semantics as Multi-Player Game]

Saturday, December 20th, 2014

Monte-Carlo Tree Search for Multi-Player Games by Joseph Antonius Maria Nijssen.

From the introduction:

The topic of this thesis lies in the area of adversarial search in multi-player zero-sum domains, i.e., search in domains having players with conflicting goals. In order to focus on the issues of searching in this type of domains, we shift our attention to abstract games. These games provide a good test domain for Artificial Intelligence (AI). They offer a pure abstract competition (i.e., comparison), with an exact closed domain (i.e., well-defined rules). The games under investigation have the following two properties. (1) They are too complex to be solved with current means, and (2) the games have characteristics that can be formalized in computer programs. AI research has been quite successful in the field of two-player zero-sum games, such as chess, checkers, and Go. This has been achieved by developing two-player search techniques. However, many games do not belong to the area where these search techniques are unconditionally applicable. Multi-player games are an example of such domains. This thesis focuses on two different categories of multi-player games: (1) deterministic multi-player games with perfect information and (2) multi-player hide-and-seek games. In particular, it investigates how Monte-Carlo Tree Search can be improved for games in these two categories. This technique has achieved impressive results in computer Go, but has also shown to be beneficial in a range of other domains.

This chapter is structured as follows. First, an introduction to games and the role they play in the field of AI is provided in Section 1.1. An overview of different game properties is given in Section 1.2. Next, Section 1.3 defines the notion of multi-player games and discusses the two different categories of multi-player games that are investigated in this thesis. A brief introduction to search techniques for two-player and multi-player games is provided in Section 1.4. Subsequently, Section 1.5 defines the problem statement and four research questions. Finally, an overview of this thesis is provided in Section 1.6.

This thesis is great background reading on the use of Monte-Carlo tree search in games. While reading the first chapter, I realized that assigning semantics to a token is an instance of a multi-player game with hidden information. That is, the “semantic” of any token doesn’t exist in some Platonic universe but rather is the result of some N number of players who also accept a particular semantic for some given token in a particular context. And we lack knowledge of the semantic and the reasons for it that will be assigned by some N number of players, which may change over time and context.

The semiotic triangle of Ogden and Richards (The Meaning of Meaning), which relates a symbol, a thought or reference, and a referent, represents, for any given symbol, the view of a single speaker. But as Ogden and Richards note, what is heard by listeners should be represented by multiple semiotic triangles:

Normally, whenever we hear anything said we spring spontaneously to an immediate conclusion, namely, that the speaker is referring to what we should be referring to were we speaking the words ourselves. In some cases this interpretation may be correct; this will prove to be what he has referred to. But in most discussions which attempt greater subtleties than could be handled in a gesture language this will not be so. (The Meaning of Meaning, page 15 of the 1923 edition)

Is RDF/OWL more subtle than can be handled by a gesture language? If you think so then you have discovered one of the central problems with the Semantic Web and any other universal semantic proposal.

Not that topic maps escape a similar accusation, but with topic maps you can encode additional semiotic triangles in an effort to avoid confusion, at least to the extent of funding and interest. And if you aren’t trying to avoid confusion, you can supply semiotic triangles that reach across understandings to convey additional information.

You can’t avoid confusion altogether nor can you achieve perfect communication with all listeners. But, for some defined set of confusions or listeners, you can do more than simply repeat your original statements in a louder voice.

Whether Monte-Carlo Tree searches will help deal with the multi-player nature of semantics isn’t clear but it is an alternative to repeating “…if everyone would use the same (my) system, the world would be better off…” ad nauseam.
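For readers new to Monte-Carlo Tree Search, the selection step of its common UCT variant picks the next move to explore by balancing average reward against how rarely a move has been tried. Here is a minimal sketch of that one step using the UCB1 formula. The move statistics are made up, the function names are mine, and nothing here is taken from the thesis or handles more than the single selection decision.

```python
import math

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Average reward plus an exploration bonus that shrinks as visits grow."""
    if visits == 0:
        return math.inf  # always try an unvisited move first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select(children):
    """children maps move -> (total_reward, visits); return the move to explore next."""
    parent_visits = sum(visits for _, visits in children.values())
    return max(children, key=lambda move: ucb1(*children[move], parent_visits))

# Hypothetical statistics for three candidate moves after 30 simulations.
stats = {"a": (12.0, 20), "b": (5.0, 8), "c": (1.0, 2)}
choice = select(stats)
```

With these numbers the rarely tried move "c" is selected even though its average reward is the lowest, because its exploration bonus is the largest. Read as a semantics game, that amounts to continuing to test an interpretation you have little evidence about instead of settling on the one that has worked so far.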

I first saw this in a tweet by Ebenezer Fogus.

Semantic Parsing with Combinatory Categorial Grammars (Videos)

Thursday, December 11th, 2014

Semantic Parsing with Combinatory Categorial Grammars by Yoav Artzi, Nicholas FitzGerald and Luke Zettlemoyer. (Tutorial)

Abstract:

Semantic parsers map natural language sentences to formal representations of their underlying meaning. Building accurate semantic parsers without prohibitive engineering costs is a long-standing, open research problem.

The tutorial will describe general principles for building semantic parsers. The presentation will be divided into two main parts: modeling and learning. The modeling section will include best practices for grammar design and choice of semantic representation. The discussion will be guided by examples from several domains. To illustrate the choices to be made and show how they can be approached within a real-life representation language, we will use λ-calculus meaning representations. In the learning part, we will describe a unified approach for learning Combinatory Categorial Grammar (CCG) semantic parsers, that induces both a CCG lexicon and the parameters of a parsing model. The approach learns from data with labeled meaning representations, as well as from more easily gathered weak supervision. It also enables grounded learning where the semantic parser is used in an interactive environment, for example to read and execute instructions.

The ideas we will discuss are widely applicable. The semantic modeling approach, while implemented in λ-calculus, could be applied to many other formal languages. Similarly, the algorithms for inducing CCGs focus on tasks that are formalism independent, learning the meaning of words and estimating parsing parameters. No prior knowledge of CCGs is required. The tutorial will be backed by implementation and experiments in the University of Washington Semantic Parsing Framework (UW SPF).
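To make “map natural language sentences to formal representations” concrete, here is a toy sketch of the two CCG application rules with λ-calculus meanings written as Python lambdas. The three-word lexicon, the tuple encoding of categories, and the greedy shift-reduce loop are all my own assumptions. This is not the UW SPF API, and a real parser would use chart parsing, more combinators, and a learned lexicon.

```python
# A category is either an atom such as "NP" or a tuple (result, slash, argument).
NP, S = "NP", "S"
TRANSITIVE_VERB = ((S, "\\", NP), "/", NP)  # the category (S\NP)/NP

# Hypothetical lexicon: word -> (category, lambda-calculus meaning).
LEXICON = {
    "Texas": (NP, "texas"),
    "Oklahoma": (NP, "oklahoma"),
    "borders": (TRANSITIVE_VERB, lambda obj: lambda subj: ("borders", subj, obj)),
}

def combine(left, right):
    """Try forward, then backward application; return None if neither applies."""
    (lcat, lsem), (rcat, rsem) = left, right
    if isinstance(lcat, tuple) and lcat[1] == "/" and lcat[2] == rcat:
        return lcat[0], lsem(rsem)  # forward application: X/Y followed by Y gives X
    if isinstance(rcat, tuple) and rcat[1] == "\\" and rcat[2] == lcat:
        return rcat[0], rsem(lsem)  # backward application: Y followed by X\Y gives X
    return None

def parse(sentence):
    """Greedy shift-reduce: push each word, then merge the top two items while possible."""
    stack = []
    for word in sentence.split():
        stack.append(LEXICON[word])
        while len(stack) >= 2:
            merged = combine(stack[-2], stack[-1])
            if merged is None:
                break
            stack[-2:] = [merged]
    return stack

result = parse("Texas borders Oklahoma")
```

A successful parse leaves a single item on the stack: the category S paired with the meaning `("borders", "texas", "oklahoma")`. The verb first consumes its object to the right, then its subject to the left, and the meaning is built by applying the lambdas in the same order.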

I previously linked to the complete slide set for this tutorial.

This page offers short videos (twelve (12) currently) and links into the slide set. More videos are forthcoming.

The goal of the project is to “recover complete meaning representation,” where “Complete meaning is sufficient to complete the task.” (from video 1).

That definition of “complete meaning” dodges a lot of philosophical as well as practical issues with semantic parsing.

Take the time to watch the videos. Yoav is a good presenter.

Enjoy!

AMR: Not semantics, but close (? maybe ???)

Wednesday, November 5th, 2014

AMR: Not semantics, but close (? maybe ???) by Hal Daumé.

From the post:

Okay, necessary warning. I’m not a semanticist. I’m not even a linguist. Last time I took semantics was twelve years ago (sigh.)

Like a lot of people, I’ve been excited about AMR (the “Abstract Meaning Representation”) recently. It’s hard not to get excited. Semantics is all the rage. And there are those crazy people out there who think you can cram meaning of a sentence into a !#$* vector [1], so the part of me that likes Language likes anything that has interesting structure and calls itself “Meaning.” I effluviated about AMR in the context of the (awesome) SemEval panel.

There is an LREC paper this year whose title is where I stole the title of this post from: Not an Interlingua, But Close: A Comparison of English AMRs to Chinese and Czech by Xue, Bojar, Hajič, Palmer, Urešová and Zhang. It’s a great introduction to AMR and you should read it (at least skim).

What I guess I’m interested in discussing is not the question of whether AMR is a good interlingua but whether it’s a semantic representation. Note that it doesn’t claim this: it’s not called ASR. But as semantics is the study of the relationship between signifiers and denotation, [Edit: it’s a reasonable place to look; see Emily Bender’s comment.] it’s probably the closest we have.

Deeply interesting work, particularly given the recent interest in Enhancing open data with identifiers. Be sure to read the comments to the post as well.

Who knew? Semantics are important!

😉

Topic maps take that a step further and capture your semantics, not necessarily the semantics of some expert unfamiliar with your domain.

Madison: Semantic Listening Through Crowdsourcing

Tuesday, October 28th, 2014

Madison: Semantic Listening Through Crowdsourcing by Jane Friedhoff.

From the post:

Our recent work at the Labs has focused on semantic listening: systems that obtain meaning from the streams of data surrounding them. Chronicle and Curriculum are recent examples of tools designed to extract semantic information (from our corpus of news coverage and our group web browsing history, respectively). However, not every data source is suitable for algorithmic analysis–and, in fact, many times it is easier for humans to extract meaning from a stream. Our new projects, Madison and Hive, are explorations of how to best design crowdsourcing projects for gathering data on cultural artifacts, as well as provocations for the design of broader, more modular kinds of crowdsourcing tools.

(image omitted)

Madison is a crowdsourcing project designed to engage the public with an under-viewed but rich portion of The New York Times’s archives: the historical ads neighboring the articles. News events and reporting give us one perspective on our past, but the advertisements running alongside these articles provide a different view, giving us a sense of the culture surrounding these events. Alternately fascinating, funny and poignant, they act as commentary on the technology, economics, gender relations and more of that time period. However, the digitization of our archives has primarily focused on news, leaving the ads with no metadata–making them very hard to find and impossible to search for them. Complicating the process further is that these ads often have complex layouts and elaborate typefaces, making them difficult to differentiate algorithmically from photographic content, and much more difficult to scan for text. This combination of fascinating cultural information with little structured data seemed like the perfect opportunity to explore how crowdsourcing could form a source of semantic signals.

From the project’s homepage:

Help preserve history with just one click.

The New York Times archives are full of advertisements that give glimpses into daily life and cultural history. Help us digitize our historic ads by answering simple questions. You’ll be creating a unique resource for historians, advertisers and the public — and leaving your mark on history.

Get started with our collection of ads from the 1960s (additional decades will be opened later)!

I would like to see a Bible transcription project that was that user friendly!

But, then the goal of the New York Times is to include as many people as possible.

Looking forward to more news on Madison!

The Pretence of Knowledge

Friday, October 24th, 2014

The Pretence of Knowledge by Friedrich August von Hayek. (Nobel Prize Lecture in Economics, December 11, 1974)

From the lecture:

The particular occasion of this lecture, combined with the chief practical problem which economists have to face today, have made the choice of its topic almost inevitable. On the one hand the still recent establishment of the Nobel Memorial Prize in Economic Science marks a significant step in the process by which, in the opinion of the general public, economics has been conceded some of the dignity and prestige of the physical sciences. On the other hand, the economists are at this moment called upon to say how to extricate the free world from the serious threat of accelerating inflation which, it must be admitted, has been brought about by policies which the majority of economists recommended and even urged governments to pursue. We have indeed at the moment little cause for pride: as a profession we have made a mess of things.

It seems to me that this failure of the economists to guide policy more successfully is closely connected with their propensity to imitate as closely as possible the procedures of the brilliantly successful physical sciences – an attempt which in our field may lead to outright error. It is an approach which has come to be described as the “scientistic” attitude – an attitude which, as I defined it some thirty years ago, “is decidedly unscientific in the true sense of the word, since it involves a mechanical and uncritical application of habits of thought to fields different from those in which they have been formed.”1 I want today to begin by explaining how some of the gravest errors of recent economic policy are a direct consequence of this scientistic error.

If you have some time for serious thinking over the weekend, visit or re-visit this lecture.

Substitute “computistic” for “scientistic” and capturing semantics as the goal.

Google and other search engines are overwhelming proof that some semantics can be captured by computers, but they are equally evidence of a semantic capture gap.

Any number of proposals exist to capture semantics (ontologies, Description Logic, RDF, OWL), but none are based on an empirical study of how semantics originate, change and function in human society. Such proposals are snapshots of a small group’s understanding of semantics. Your mileage may vary.

Depending on your goals and circumstances, one or more proposals may be useful. But capturing and maintaining semantics without a basis in empirical study of semantics seems like a hit-or-miss proposition.

Or at least historical experience with capturing and maintaining semantics points in that direction.

I first saw this in a tweet by Chris Diehl.

Analyzing Schema.org

Thursday, October 23rd, 2014

Analyzing Schema.org by Peter F. Patel-Schneider.

Abstract:

Schema.org is a way to add machine-understandable information to web pages that is processed by the major search engines to improve search performance. The definition of schema.org is provided as a set of web pages plus a partial mapping into RDF triples with unusual properties, and is incomplete in a number of places. This analysis of and formal semantics for schema.org provides a complete basis for a plausible version of what schema.org should be.

Peter’s analysis is summarized when he says:

The lack of a complete definition of schema.org limits the possibility of extracting the correct information from web pages that have schema.org markup.

Ah, yes, “…the correct information from web pages….”

I suspect the lack of semantic precision has powered the success of schema.org. Each user of schema.org markup has their private notion of the meaning of their use of the markup and there is no formal definition to disabuse them of that notion. Not that formal definitions were enough to save owl:sameAs from varying interpretations.

Schema.org empowers varying interpretations without requiring users to ignore OWL or description logic.

For the domains that schema.org covers, eateries, movies, bars, whore houses, etc., the semantic slippage permitted by schema.org lowers the bar to usage of its markup, which has resulted in wider adoption than other proposals.

The lesson of schema.org is that the degree of semantic slippage you can tolerate depends upon your domain. For pharmaceuticals, I would assume that degree of slippage is as close to zero as possible. For movie reviews, not so much.

Any effort to impose the same degree of semantic slippage across all domains is doomed to failure.
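To make the slippage concrete, here is a minimal sketch in which two publishers use the same schema.org type and properties while holding different private notions of what a value means. The two sites and their values are invented. Restaurant and priceRange are schema.org terms, but nothing here is taken from Peter’s paper.

```python
import json

# Two hypothetical publishers using identical schema.org vocabulary.
site_a = {
    "@context": "https://schema.org",
    "@type": "Restaurant",
    "name": "Example Diner",
    "priceRange": "$$",
}
site_b = {
    "@context": "https://schema.org",
    "@type": "Restaurant",
    "name": "Example Bistro",
    "priceRange": "cheap for this neighborhood",
}

def same_markup_shape(a, b):
    """True when two items use the same type and the same set of properties."""
    return a["@type"] == b["@type"] and set(a) == set(b)

# The markup is indistinguishable in shape...
print(same_markup_shape(site_a, site_b))
# ...but the values reflect each publisher's own reading of the property.
print(site_a["priceRange"] == site_b["priceRange"])
print(json.dumps(site_b, indent=2))
```

A consumer of this markup cannot tell from the vocabulary alone whose notion of price is in play. That is tolerable for restaurants and would not be for drug dosages.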

I first saw this in a tweet by Bob DuCharme.

ADW (Align, Disambiguate and Walk) [Semantic Similarity]

Tuesday, October 14th, 2014

ADW (Align, Disambiguate and Walk) version 1.0 by Mohammad Taher Pilehvar.

From the webpage:

This package provides a Java implementation of ADW, a state-of-the-art semantic similarity approach that enables the comparison of lexical items at different lexical levels: from senses to texts. For more details about the approach please refer to: http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf

The abstract for the paper reads:

Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-of-the-art performance on three tasks: semantic textual similarity, word similarity, and word sense coarsening.

Online Demo.

The strength of this approach is the use of multiple levels of semantic similarity. It relies on WordNet but the authors promise to extend their approach to named entities and other tokens not appearing in WordNet (like your company or industry’s internal vocabulary).
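The abstract’s “common probabilistic representation over word senses” can be pictured roughly as follows. This is not ADW’s API or its algorithm: the sense labels, the probabilities, the averaging step and the choice of cosine as the comparison are all stand-ins chosen for the sketch. The point is only that a word and a text end up in the same kind of representation, so one comparison function serves both levels.

```python
import math

def cosine(p, q):
    """Cosine similarity between two sparse sense distributions (dicts)."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)

def combine(distributions):
    """Average several sense distributions into one text-level distribution."""
    total = {}
    for dist in distributions:
        for sense, prob in dist.items():
            total[sense] = total.get(sense, 0.0) + prob / len(distributions)
    return total

# Invented word-level distributions over invented sense labels.
bank = {"bank#finance": 0.7, "bank#river": 0.3}
money = {"money#currency": 0.8, "bank#finance": 0.2}
river = {"river#stream": 0.8, "bank#river": 0.2}

# A "text" made of two words gets the same kind of representation as a word.
text = combine([bank, money])

print(cosine(text, money) > cosine(text, river))
```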

The bibliography of the paper cites much of the recent work in this area so that will be an added bonus for perusing the paper.

I first saw this in a tweet by Gregory Piatetsky.

Microsoft’s Quantum Mechanics

Saturday, October 11th, 2014

Microsoft’s Quantum Mechanics by Tom Simonite.

From the post:

In 2012, physicists in the Netherlands announced a discovery in particle physics that started chatter about a Nobel Prize. Inside a tiny rod of semiconductor crystal chilled cooler than outer space, they had caught the first glimpse of a strange particle called the Majorana fermion, finally confirming a prediction made in 1937. It was an advance seemingly unrelated to the challenges of selling office productivity software or competing with Amazon in cloud computing, but Craig Mundie, then heading Microsoft’s technology and research strategy, was delighted. The abstruse discovery—partly underwritten by Microsoft—was crucial to a project at the company aimed at making it possible to build immensely powerful computers that crunch data using quantum physics. “It was a pivotal moment,” says Mundie. “This research was guiding us toward a way of realizing one of these systems.”

Microsoft is now almost a decade into that project and has just begun to talk publicly about it. If it succeeds, the world could change dramatically. Since the physicist Richard Feynman first suggested the idea of a quantum computer in 1982, theorists have proved that such a machine could solve problems that would take the fastest conventional computers hundreds of millions of years or longer. Quantum computers might, for example, give researchers better tools to design novel medicines or super-efficient solar cells. They could revolutionize artificial intelligence.

Fairly upbeat review of current efforts to build a quantum computer.

You may want to offset it by reading Scott Aaronson’s blog, Shtetl-Optimized, which has the following header note:

If you take just one piece of information from this blog:
Quantum computers would not solve hard search problems
instantaneously by simply trying all the possible solutions at once. (emphasis added)

See in particular: Speaking Truth to Parallelism at Cornell

Whatever speedups are possible with quantum computers, getting a semantically incorrect answer faster isn’t an advantage.

Assumptions about faster computing platforms include an assumption of correct semantics. There have been no proofs of default correct handling of semantics by present day or future computing platforms.

I first saw this in a tweet by Peter Lee.

PS: I saw the reference to Scott Aaronson’s blog in a comment to Tom’s post.

Lingo of Lambda Land

Tuesday, September 30th, 2014

Lingo of Lambda Land by Katie Miller.

From the post:

Comonads, currying, compose, and closures
This is the language of functional coders
Equational reasoning, tail recursion
Lambdas and lenses and effect aversion
Referential transparency and pure functions
Pattern matching for ADT deconstructions
Functors, folds, functions that are first class
Monoids and monads, it’s all in the type class
Infinite lists, so long as they’re lazy
Return an Option or just call it Maybe
Polymorphism and those higher kinds
Monad transformers, return and bind
Catamorphisms, like from Category Theory
You could use an Either type for your query
Arrows, applicatives, continuations
IO actions and partial applications
Higher-order functions and dependent types
Bijection and bottom, in a way that’s polite
Programming of a much higher order
Can be found just around the jargon corner

I posted about Katie Miller’s presentation, Coder Decoder: Functional Programmer Lingo Explained, with Pictures, but wanted to draw your attention to the poem she wrote to start the presentation.

In part because it is an amusing poem but also for you to attempt an experiment that Stanley Fish reports on interpretation of poems.

Stanley’s experiment is recounted in “How to Recognize a Poem When You See One,” which appears as chapter 14 in Is There A Text In This Class? The Authority of Interpretive Communities by Stanley Fish.

As functional programmers or wannabe functional programmers, you are probably not the “right” audience for this experiment. (But, feel free to try it.)

Stanley’s experiment came about from a list of authors given to one class, centered on a blackboard (yes, many years ago) to which, for the second class, Stanley drew a box around the list of names and inserted “p. 43” on the board. Those were the only changes between the classes.

The second class was one on interpretation of religious poetry. The students were told that this list was a religious poem and that they should apply the techniques learned in the class to its interpretation.

Stanley’s account of this experiment is masterful and I urge you to read his account in full.

At the same time, you will learn a lot about semantics if you ask a poetry professor to have one of their classes produce an interpretation of this poem. You will discover that “not knowing the meaning of the terms” is no barrier to the production of an interpretation. Sit in the back of the classroom and don’t betray the experiment by offering explanations of the terms.

The question to ask yourself at the end of the experiment is: Where did the semantics of the poem originate? Did Katie Miller imbue it with semantics that would be known to all readers? Or do the terms themselves carry semantics and Katie just selected them? If either answer is yes, how did the poetry class arrive at its rather divergent and colorful explanation of the poem?

Hmmm, if you were scanning this text with a parser, whose semantics would your parser attribute to the text? Katie’s? Any programmer’s? The class’s?

Worthwhile to remember that data processing chooses “a” semantic, not “the” semantic in any given situation.

From Frequency to Meaning: Vector Space Models of Semantics

Thursday, September 18th, 2014

From Frequency to Meaning: Vector Space Models of Semantics by Peter D. Turney and Patrick Pantel.

Abstract:

Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term–document, word–context, and pair–pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.

At forty-eight (48) pages with a thirteen (13) page bibliography, this survey of vector space models (VSMs) of semantics should keep you busy for a while. You will have to fill in VSM developments since 2010 but mastery of this paper will certainly give you the foundation to do so. Impressive work.

I do disagree with the authors when they say:

Computers understand very little of the meaning of human language.

Truth be told, I would say:

Computers have no understanding of the meaning of human language.

What happens with a VSM of semantics is that we as human readers choose a model we think represents semantics we see in a text. Our computers blindly apply that model to text and report the results. We as human readers choose results that we think are closer to the semantics we see in the text, and adjust the model accordingly. Our computers then blindly apply the adjusted model to the text again and so on. At no time does the computer have any “understanding” of the text or of the model that it is applying to the text. Any “understanding” in such a model is from a human reader who adjusted the model based on their perception of the semantics of a text.
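A minimal sketch of that loop, using a term–document model of the kind the survey describes. The three “documents” are invented, and raw word counts with cosine similarity are only one of many modeling choices a human might make.

```python
from collections import Counter
import math

# Invented corpus; each document becomes one column of a term-document matrix.
docs = {
    "d1": "the bank approved the loan",
    "d2": "the loan officer at the bank",
    "d3": "we sat on the river bank",
}

def term_vector(text):
    """The human-chosen model: a document is a bag of raw word counts."""
    return Counter(text.split())

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0

# The computer applies the model blindly: it counts tokens and multiplies.
vectors = {name: term_vector(text) for name, text in docs.items()}

print(round(cosine(vectors["d1"], vectors["d2"]), 2))
print(round(cosine(vectors["d1"], vectors["d3"]), 2))
```

If a reader decides that “the” is inflating the scores and drops it, that judgment is the human adjustment described above. The program only recounts and recomputes.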

I don’t dispute that VSMs have been incredibly useful and, like the authors, I think there is much mileage left in their development for text processing. That is not the same thing as imputing “understanding” of human language to devices that in fact have none at all. (full stop)

Enjoy!

I first saw this in a tweet by Christopher Phipps.

PS: You probably recall that VSMs are based on creating a metric space for semantics, which have no preordained metric space. Transitioning from a non-metric space to a metric space isn’t subject to validation, at least in my view.