Archive for the ‘Cyc’ Category

Algebraic and Analytic Programming

Monday, March 10th, 2014

Algebraic and Analytic Programming by Luke Palmer.

In a short post Luke does a great job contrasting algebraic versus analytic approaches to programming.

In an even shorter summary, I would say the difference is “truth” versus “acceptable results.”

Oddly enough, that difference shows up in other areas as well.

The major ontology projects, including linked data, are pushing one and only one “truth.”

Versus other approaches, such as topic maps (at least in my view), that tend towards “acceptable results.”

I am not sure what measure of success, for a semantic technology or anything else, you could have other than “acceptable results.”

Whether the universal-truth-of-the-world folks admit it or not, they simply have a different definition of “acceptable results”: their “acceptable results” means their world view.

I appreciate the work they put into their offer but I have to decline. I already have a world view of my own.

You?

I first saw this in a tweet by Computer Science.

Topic Maps as Jigsaw Puzzles?

Tuesday, January 17th, 2012

I ran across:

How could a data governance framework possibly predict how you will assemble the puzzle pieces? Or how the puzzle pieces will fit together within your unique corporate culture? Or which of the many aspects of data governance will turn out to be the last (or even the first) piece of the puzzle to fall into place in your organization? And, of course, there is truly no last piece of the puzzle, since data governance is an ongoing program because the business world constantly gets jumbled up by change.

So, data governance frameworks are useful, but only if you realize that data governance frameworks are like jigsaw puzzles. (emphasis added)

in A Data Governance Framework Jigsaw Puzzle by Jim Harris.

I rather liked the comparison to a jigsaw puzzle and the argument that the last piece seems magical only because it is the last piece. You could jumble them up and some other piece would be the last piece.

The other part that I liked was the conclusion that “…the business world constantly gets jumbled up by change.”

Might want to read that again: “…the business world constantly gets jumbled up by change.”

I will boldly generalize that to: the world constantly gets jumbled by change.

Well, perhaps not such a bold statement as I think anyone old enough to be reading this blog realizes the world of today isn’t the world it was ten years ago. Or five years ago. Or in many cases one year ago.

I think that may explain some of my unease with ontologies that claim to have captured something fundamental rather than something fit for a particular use.

At one time an ontology based on earth, air, fire and water would have been sufficient for most purposes. It isn’t necessary to claim more than fitness for use, and claiming no more leaves us the ready option to change should a new use come along, one that isn’t served by the old ontology.

Interchange is one use case, and claiming that Cyc or SUMO is appropriate for a particular case of interchange is a factual claim that can be evaluated. So is the claim that either one is sufficient for “reasoning” about a particular domain. Again, a factual question subject to evaluation.

But the world that produced both Cyc and SUMO isn’t the world of today. Both remain useful, but the times they are a-changing. Enough change and both ontologies and topic maps will need to change to suit your present needs.

Ontologies and topic maps are jigsaw puzzles with no final piece.

Using Apache Hadoop to Find Signal in the Noise: Analyzing Adverse Drug Events

Wednesday, November 23rd, 2011

Using Apache Hadoop to Find Signal in the Noise: Analyzing Adverse Drug Events

From the post:

Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools – R, Gephi, and Cloudera’s Distribution Including Apache Hadoop (CDH) – to solve an old problem in a new way.

Background: Adverse Drug Events

An adverse drug event (ADE) is an unwanted or unintended reaction that results from the normal use of one or more medications. The consequences of ADEs range from mild allergic reactions to death, with one study estimating that 9.7% of adverse drug events lead to permanent disability. Another study showed that each patient who experiences an ADE remains hospitalized for an additional 1-5 days and costs the hospital up to $9,000.

Some adverse drug events are caused by drug interactions, where two or more prescription or over-the-counter (OTC) drugs taken together leads to an unexpected outcome. As the population ages and more patients are treated for multiple health conditions, the risk of ADEs from drug interactions increases. In the United States, roughly 4% of adults older than 55 are at risk for a major drug interaction.

Because clinical trials study a relatively small number of patients, both regulatory agencies and pharmaceutical companies maintain databases in order to track adverse events that occur after drugs have been approved for market. In the United States, the FDA uses the Adverse Event Reporting System (AERS), where healthcare professionals and consumers may report the details of ADEs they experienced. The FDA makes a well-formatted sample of the reports available for download from their website, to the benefit of data scientists everywhere.

Methodology

Identifying ADEs is primarily a signal detection problem: we have a collection of events, where each event has multiple attributes (in this case, the drugs the patient was taking) and multiple outcomes (the adverse reactions that the patient experienced), and we would like to understand how the attributes correlate with the outcomes. One simple technique for analyzing these relationships is a 2×2 contingency table:

For All Drugs/Reactions:

                 Reaction = Rj    Reaction != Rj    Total
  Drug = Di            A                B           A + B
  Drug != Di           C                D           C + D
  Total              A + C            B + D     A + B + C + D

Based on the values in the cells of the tables, we can compute various measures of disproportionality to find drug-reaction pairs that occur more frequently than we would expect if they were independent.
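As a concrete illustration, here is one common disproportionality measure, the proportional reporting ratio (PRR). The post does not say which measure was used for the 2×2 case, so treat this as an example of the idea, not the team’s actual method:

```python
def prr(a, b, c, d):
    """Proportional reporting ratio for a 2x2 contingency table.

    a: reports with drug Di and reaction Rj
    b: reports with drug Di, other reactions
    c: reports without drug Di, reaction Rj
    d: reports without drug Di, other reactions

    PRR = (a / (a + b)) / (c / (c + d)); values well above 1 suggest
    the reaction is reported disproportionately often with the drug.
    """
    return (a / (a + b)) / (c / (c + d))

# Toy numbers: 20 of 1,000 reports with the drug mention the reaction,
# versus 10 of 9,000 reports without it -- a ratio of about 18.
print(prr(20, 980, 10, 8990))
```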

For this project, we analyzed interactions involving multiple drugs, using a generalization of the contingency table method that is described in the paper, “Empirical bayes screening for multi-item associations” by DuMouchel and Pregibon. Their model computes a Multi-Item Gamma-Poisson Shrinkage (MGPS) estimator for each combination of drugs and outcomes, and gives us a statistically sound measure of disproportionality even if we only have a handful of observations for a particular combination of drugs. The MGPS model has been used for a variety of signal detection problems across multiple industries, such as identifying fraudulent phone calls, performing market basket analyses and analyzing defects in automobiles.
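The quantity that MGPS shrinks is the observed-to-expected ratio for an item combination. A minimal sketch of that ratio for the two-way case follows; the gamma-Poisson shrinkage itself, which DuMouchel and Pregibon describe, is deliberately not implemented here:

```python
def observed_over_expected(a, b, c, d):
    """Observed count of a (drug, reaction) pair divided by the count
    expected if drug and reaction were independent.

    MGPS shrinks this ratio toward 1 with a gamma-Poisson prior, which
    matters most when counts are small; fitting that prior via EM is
    omitted from this sketch.
    """
    n = a + b + c + d
    expected = (a + b) * (a + c) / n  # row total * column total / grand total
    return a / expected
```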

Apologies for the long setup:

Solving the Hard Problem with Apache Hadoop

At first glance, it doesn’t seem like we would need anything beyond a laptop to analyze ADEs, since the FDA only receives about one million reports a year. But when we begin to examine these reports, we discover a problem that is similar to what happens when we attempt to teach computers to play chess: a combinatorial explosion in the number of possible drug interactions we must consider. Even restricting ourselves to analyzing pairs of drugs, there are more than 3 trillion potential drug-drug-reaction triples in the AERS dataset, and tens of millions of triples that we actually see in the data. Even including the iterative Expectation Maximization algorithm that we use to fit the MGPS model, the total runtime of our analysis is dominated by the process of counting how often the various interactions occur.
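To make the counting step concrete, here is a toy, single-machine version of it. The record layout is hypothetical (the real AERS files are more involved), and the real pipeline distributes exactly this kind of counting across a Hadoop cluster:

```python
from collections import Counter
from itertools import combinations

def count_triples(reports):
    """Count (drug_i, drug_j, reaction) triples across ADE reports.

    Each report is a (drugs, reactions) pair; this layout is a stand-in
    for the real AERS record format. Sorting the drug pair makes
    (d1, d2) and (d2, d1) count as the same interaction.
    """
    counts = Counter()
    for drugs, reactions in reports:
        for pair in combinations(sorted(set(drugs)), 2):
            for reaction in set(reactions):
                counts[pair + (reaction,)] += 1
    return counts
```

In the MapReduce version, the loop body becomes the map step (emit each triple with a count of 1) and the Counter becomes the reduce step (sum the counts per triple), which is what lets the counting parallelize across machines.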

The good news is that MapReduce running on a Hadoop cluster is ideal for this problem. By creating a pipeline of MapReduce jobs to clean, aggregate, and join our data, we can parallelize the counting problem across multiple machines to achieve a linear speedup in our overall runtime. The faster runtime for each individual analysis allows us to iterate rapidly on smaller models and tackle larger problems involving more drug interactions than anyone has ever looked at before.

Where have I heard about combinatorial explosions before?

If you think about it, semantic environments (except for artificial ones) are inherently noisy and the signal we are looking for to trigger merging may be hard to find.

Semantic environments like Cyc are noise free, but they are also not the semantic environments in which most data exists and in which we have to make decisions.

Questions: To what extent are “clean” semantic environments artifacts of adapting to the capacities of existing hardware/software? What aspects of then current hardware/software would you point to in making that case?

Sowa on Watson

Friday, February 11th, 2011

John Sowa’s posting on Watson merits reproduction in its entirety (lightly edited for easy reading):

Peter,

Thanks for the reminder:

Dave Ferrucci gave a talk on UIMA (the Unstructured Information Management Architecture) back in May-2006, entitled: “Putting the Semantics in the Semantic Web: An overview of UIMA and its role in Accelerating the Semantic Revolution”

I recommend that readers compare Ferrucci’s talk about UIMA in 2006 with his talk about the Watson system and Jeopardy in 2011. In less than 5 years, they built Watson on the UIMA foundation, which contained a reasonable amount of NLP tools, a modest ontology, and some useful tools for knowledge acquisition. During that time, they added quite a bit of machine learning, reasoning, statistics, and heuristics. But most of all, they added terabytes of documents.

For the record, following are Ferrucci’s slides from 2006:

http://ontolog.cim3.net/file/resource/presentation/DavidFerrucci_20060511/UIMA-SemanticWeb–DavidFerrucci_20060511.pdf

Following is the talk that explains the slides:

http://ontolog.cim3.net/file/resource/presentation/DavidFerrucci_20060511/UIMA-SemanticWeb–DavidFerrucci_20060511_Recording-2914992-460237.mp3

And following is his recent talk about the DeepQA project for building and extending that foundation for Jeopardy:

http://www-943.ibm.com/innovation/us/watson/watson-for-a-smarter-planet/building-a-jeopardy-champion/how-watson-works.html

Compared to Ferrucci’s talks, the PBS Nova program was a disappointment. It didn’t get into any technical detail, but it did have a few cameo appearances from AI researchers. Terry Winograd and Pat Winston, for example, said that the problem of language understanding is hard.

But I thought that Marvin Minsky and Doug Lenat said more with their tone of voice than with their words. My interpretation (which could, of course, be wrong) is that both of them were seething with jealousy that IBM built a system that was competing with Jeopardy champions on national TV — and without their help.

In any case, the Watson project shows that terabytes of documents are far more important for commonsense reasoning than the millions of formal axioms in Cyc. That does not mean that the Cyc ontology is useless, but it undermines the original assumptions for the Cyc project: commonsense reasoning requires a huge knowledge base of hand-coded axioms together with a powerful inference engine.

An important observation by Ferrucci: The URIs of the Semantic Web are *not* useful for processing natural languages — not for ordinary documents, not for scientific documents, and especially not for Jeopardy questions:

1. For scientific documents, words like ‘H2O’ are excellent URIs. Adding an http address in front of them is pointless.

2. A word like ‘water’, which is sometimes a synonym for ‘H2O’, has an open-ended number of senses and microsenses.

3. Even if every microsense could be precisely defined and cataloged on the WWW, that wouldn’t help determine which one is appropriate for any particular context.

4. Any attempt to force human being(s) to specify or select a precise sense cannot succeed unless *every* human understands and consistently selects the correct sense at *every* possible occasion.

5. Given that point #4 is impossible to enforce and dangerous to assume, any software that uses URIs will have to verify that the selected sense is appropriate to the context.

6. Therefore, URIs found “in the wild” on the WWW can never be assumed to be correct unless they have been guaranteed to be correct by a trusted source.

These points taken together imply that annotations on documents can’t be trusted unless (a) they have been generated by your own system or (b) they were generated by a system which is at least as trustworthy as your own and which has been verified to be 100% compatible with yours.

In summary, the underlying assumptions for both Cyc and the Semantic Web need to be reconsidered.

You can see the post at: http://ontolog.cim3.net/forum/ontolog-forum/2011-02/msg00114.html

I don’t always agree with Sowa, but he has written extensively on conceptual graphs, knowledge representation, and ontological matters. See http://www.jfsowa.com/

I missed the local showing but found the video at: Smartest Machine on Earth.

You will find a link to an interview with Minsky at that same location.

I don’t know that I would describe Minsky as “…seething with jealousy….”

While I enjoy Jeopardy, and it is certainly more cerebral than, say, American Idol, I think Minsky is right in seeing the Watson effort as something other than artificial intelligence.

Q: In 2011, who was the only non-sentient contestant on the TV show Jeopardy?

A: What is IBM’s Watson?