Archive for the ‘UIMA’ Category

Running a UIMA Analysis Engine in a Lucene Analyzer Chain

Tuesday, July 31st, 2012

Running a UIMA Analysis Engine in a Lucene Analyzer Chain by Sujit Pal.

From the post:

Last week, I wrote about a UIMA Aggregate Analysis Engine (AE) that annotates keywords in a body of text, optionally inserting synonyms, using a combination of pattern matching and dictionary lookups. The idea is that this analysis will be done on text on its way into a Lucene index. So this week, I describe the Lucene Analyzer chain that I built around the AE I described last week.

A picture is worth a thousand words, so here is one that shows what I am (or will be soon, in much greater detail) talking about.

[Graphic omitted]

As you can imagine, most of the work happens in the UimaAETokenizer. The tokenizer is a buffering (non-streaming) Tokenizer, i.e., the entire text is read from the Reader and analyzed by the UIMA AE, then individual tokens are returned on successive calls to its incrementToken() method. I decided to use the new (to me) AttributeSource.State object to keep track of the tokenizer’s state between calls to incrementToken() (found out about it by grokking through the Synonym filter example in the LIA2 book).
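To make the buffering idea concrete, here is a minimal, library-free sketch. It is not the post's UimaAETokenizer and does not use Lucene or UIMA: the class name BufferingTokenizerSketch, the term() accessor, and the whitespace split standing in for the UIMA analysis step are all assumptions for illustration. A plain queue also replaces the AttributeSource.State bookkeeping the post describes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in that only mimics the shape of an incrementToken() loop.
class BufferingTokenizerSketch {
    private final Reader input;
    private Deque<String> pending;  // null until the whole input has been analyzed
    private String currentTerm;     // stands in for a term attribute

    BufferingTokenizerSketch(Reader input) {
        this.input = input;
    }

    static BufferingTokenizerSketch fromString(String text) {
        return new BufferingTokenizerSketch(new StringReader(text));
    }

    // Advances to the next token; returns false once the buffer is exhausted.
    boolean incrementToken() {
        if (pending == null) {
            // First call: consume the entire Reader, then analyze it in one go.
            pending = analyze(readAll(input));
        }
        if (pending.isEmpty()) {
            currentTerm = null;
            return false;
        }
        currentTerm = pending.poll();
        return true;
    }

    String term() {
        return currentTerm;
    }

    // Reading everything up front is what makes the tokenizer non-streaming.
    private static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumption: a whitespace split takes the place of the real analysis step.
    private static Deque<String> analyze(String text) {
        Deque<String> tokens = new ArrayDeque<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

A caller would loop with while (tok.incrementToken()) and read tok.term() on each pass. The cost of this design is memory: the whole document is held before the first token comes back.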

After (UIMA) analysis, the annotated tokens are marked as Keyword, and any transformed values for the annotation are set into the SynonymMap (for use by the synonym filter, next in the chain). Text that is not annotated is split up (by punctuation and whitespace) and returned as plain Lucene Term (or CharTerm since Lucene 3.x) tokens. Here is the code for the Tokenizer class.
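The post's actual Tokenizer code is not reproduced in this excerpt. What follows is only a rough sketch of the splitting step just described, under stated assumptions: keyword annotations arrive as sorted, non-overlapping [begin, end) character offsets, and the names KeywordSplitSketch and Token are hypothetical. Synonym handling is left out.

```java
import java.util.ArrayList;
import java.util.List;

class KeywordSplitSketch {
    // Hypothetical token: the text plus a flag a downstream stemmer could check.
    static final class Token {
        final String text;
        final boolean keyword;

        Token(String text, boolean keyword) {
            this.text = text;
            this.keyword = keyword;
        }
    }

    // spans: sorted, non-overlapping {begin, end} offsets of annotated keywords.
    static List<Token> tokenize(String text, int[][] spans) {
        List<Token> out = new ArrayList<>();
        int pos = 0;
        for (int[] span : spans) {
            // Text before the annotation is split into plain tokens.
            addPlain(text.substring(pos, span[0]), out);
            // The annotated span is kept whole and flagged as a keyword.
            out.add(new Token(text.substring(span[0], span[1]), true));
            pos = span[1];
        }
        addPlain(text.substring(pos), out);
        return out;
    }

    // Splits unannotated text on runs of punctuation and whitespace.
    private static void addPlain(String gap, List<Token> out) {
        for (String piece : gap.split("[\\p{Punct}\\s]+")) {
            if (!piece.isEmpty()) {
                out.add(new Token(piece, false));
            }
        }
    }
}
```

For the input "Take Vitamin C, daily." with a single span covering "Vitamin C", the sketch keeps the multi-word keyword intact while the surrounding text is broken into ordinary tokens.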

The second of two posts brought to my attention by Jack Park.

Part of my continuing interest in indexing. In part because we know that indexing scales. Seriously scales.

UIMA Analysis Engine for Keyword Recognition and Transformation

Tuesday, July 31st, 2012

UIMA Analysis Engine for Keyword Recognition and Transformation by Sujit Pal.

From the post:

You have probably noticed that I’ve been playing with UIMA lately, perhaps a bit aimlessly. One of my goals with UIMA is to create an Analysis Engine (AE) that I can plug into the front of the Lucene analyzer chain for one of my applications. The AE would detect and mark keywords in the input stream so they would be exempt from stemming by downstream Lucene analyzers.

So a couple of weeks ago, I picked up the bits and pieces of UIMA code that I had written and started to refactor them to form a sequence of primitive AEs that detected keywords in text using pattern and dictionary recognition. Each primitive AE places new KeywordAnnotation objects into an annotation index.
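As an illustration of pattern-based recognition feeding an annotation index, here is a small sketch using only java.util.regex. It does not use the UIMA API, and it is not the post's PatternAnnotator: the KeywordSpan class, the plain list acting as the index, and the example regex are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternAnnotatorSketch {
    // Hypothetical annotation: character offsets plus the text they cover.
    static final class KeywordSpan {
        final int begin;
        final int end;
        final String coveredText;

        KeywordSpan(int begin, int end, String coveredText) {
            this.begin = begin;
            this.end = end;
            this.coveredText = coveredText;
        }
    }

    // Scans the text once and records every regex match as an annotation.
    static List<KeywordSpan> annotate(String text, String regex) {
        List<KeywordSpan> index = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            index.add(new KeywordSpan(m.start(), m.end(), m.group()));
        }
        return index;
    }
}
```

Calling annotate with a regex such as "\\b[A-Z]{2,}-\\d+\\b" would pick out identifiers like "BRCA-1" together with their offsets, which is the information a later stage needs to keep those spans away from the stemmer.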

The primitive AEs I came up with are pretty basic, but offer a surprising amount of bang for the buck. There are just two annotators – the PatternAnnotator and DictionaryAnnotator – that do the processing for my primitive AEs listed below. Obviously, more can be added (and will, eventually) as required.

  • Pattern based keyword recognition
  • Pattern based keyword recognition and transformation
  • Dictionary based keyword recognition, case sensitive
  • Dictionary based keyword recognition and transformation, case sensitive
  • Dictionary based keyword recognition, case insensitive
  • Dictionary based keyword recognition and transformation, case insensitive
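The dictionary variants in the list differ along two axes: case sensitivity, and whether a match is also transformed. A minimal sketch of both, assuming a simple word-to-replacement map and whitespace-delimited words (the class DictionaryAnnotatorSketch and its "surface=>value" output format are hypothetical, not the post's DictionaryAnnotator):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class DictionaryAnnotatorSketch {
    private final Map<String, String> entries = new HashMap<>();
    private final boolean ignoreCase;

    // dictionary maps a keyword to its transformed value; for recognition
    // without transformation, map the keyword to itself.
    DictionaryAnnotatorSketch(Map<String, String> dictionary, boolean ignoreCase) {
        this.ignoreCase = ignoreCase;
        for (Map.Entry<String, String> e : dictionary.entrySet()) {
            entries.put(normalize(e.getKey()), e.getValue());
        }
    }

    // Case-insensitive lookups compare lowercased forms on both sides.
    private String normalize(String s) {
        return ignoreCase ? s.toLowerCase(Locale.ROOT) : s;
    }

    // Returns "surface=>value" for each whitespace-delimited word in the dictionary.
    List<String> annotate(String text) {
        List<String> hits = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            String key = normalize(word);
            if (entries.containsKey(key)) {
                hits.add(word + "=>" + entries.get(key));
            }
        }
        return hits;
    }
}
```

With a single entry for "NSAID", the case-sensitive variant matches only the exact form, while the case-insensitive variant also matches "nsaid". A real annotator would also need multi-word entries and offsets, which this sketch omits.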

The first of two posts that I missed from last year, recently brought to my attention by Jack Park.

The ability to annotate implies, among other things, the ability to create synonym annotations for keywords.

UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

Thursday, August 11th, 2011

From the post:

Over the past few weeks (months?) I have been trying to build a system for concept mapping text against a structured vocabulary of concepts stored in an RDBMS. The concepts (and their associated synonyms) are passed through a custom Lucene analyzer chain and stored in a Neo4j database with a Lucene Index retrieval backend. The concept mapping interface is a UIMA aggregate Analysis Engine (AE) that uses this Neo4j/Lucene combo to annotate HTML, plain text and query strings with concept annotations.
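The core idea, that a concept's names and synonyms are normalized the same way at index time and at lookup time, can be sketched without Lucene or Neo4j. In the sketch below an in-memory map stands in for the datastore and a few string operations stand in for the analyzer chain; the class ConceptIndexSketch and the concept ID "C001" are made up for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class ConceptIndexSketch {
    private final Map<String, String> conceptIdByName = new HashMap<>();

    // Stand-in for an analyzer chain: lowercase, punctuation to spaces,
    // then collapse runs of whitespace.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT)
                .replaceAll("\\p{Punct}+", " ")
                .trim()
                .replaceAll("\\s+", " ");
    }

    // Every name and synonym of a concept points at the same concept ID.
    void addConcept(String conceptId, String... namesAndSynonyms) {
        for (String name : namesAndSynonyms) {
            conceptIdByName.put(normalize(name), conceptId);
        }
    }

    // Incoming text goes through the same normalization before lookup.
    Optional<String> lookup(String text) {
        return Optional.ofNullable(conceptIdByName.get(normalize(text)));
    }
}
```

After addConcept("C001", "Heart attack", "myocardial infarction"), both lookup("HEART-ATTACK") and lookup("Myocardial  Infarction") resolve to the same ID, because the differences disappear during normalization.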

Sounds interesting.

In particular:

…concepts (and their associated synonyms) are passed through….

sounds like topic map talk to me, partner!

Depends on where you put your emphasis.

Watson and Healthcare

Monday, April 18th, 2011

From the webpage:

IBM Watson’s stellar performance in the Jeopardy! show captured the world’s imagination. The first real world application for Watson involves healthcare. How does Watson address issues that previous generations of tools have not been able to address? What are the technical approaches Watson takes to set it apart from other systems? This article answers how Watson takes on those questions and gives you a sneak peek of the technology behind Watson based on scientific papers, specifications, articles published by the IBM team, and interviews with university collaborators.

One of the things I find the most interesting about the Watson project is that Apache UIMA (Unstructured Information Management Architecture) lies at its very core.

In part because I don’t think a significant part of all data is ever going to appear in structured format, whether that be Linked Data, some fuller/lesser form of RDF, Topic Maps or some other structured format. That being the case, we are going to have to deal with data as it presents itself and not the soft-pitch form of, say, Linked Data.

That will include “merging” based on assumptions we impose on or derive from data and later verify or revise.

How skillful we are at building evolving systems with “memories” of past choices will determine how useful our systems are to users.

The article has a number of resources and pointers.