Archive for the ‘UIMA’ Category

Leveraging UIMA in Spark

Wednesday, December 17th, 2014

Leveraging UIMA in Spark by Philip Ogren.


Much of the Big Data that Spark wielders tackle is unstructured text that requires text processing techniques. For example, named entity extraction on tweets and sentiment analysis on customer reviews are common activities. The Unstructured Information Management Architecture (UIMA) framework is an Apache project that provides APIs and infrastructure for building complex and robust text analytics systems. A typical system built on UIMA defines a collection of analysis engines (e.g., a tokenizer, part-of-speech tagger, and named entity recognizer) which are executed according to arbitrarily complex flow control definitions. The framework makes it possible to have interoperable components in which best-of-breed solutions can be mixed and matched and chained together to create sophisticated text processing pipelines. However, UIMA can seem like a heavyweight solution that has a sprawling API, is cumbersome to configure, and is difficult to execute. Furthermore, UIMA provides its own distributed computing infrastructure and runtime processing engines that overlap, in their own way, with Spark functionality. In order for Spark to benefit from UIMA, the latter must be lightweight and nimble and not impose its architecture and tooling onto Spark.

In this talk, I will introduce a project that I started called uimaFIT, which is now part of the UIMA project. With uimaFIT it is possible to adopt UIMA in a very light-weight way and leverage it for what it does best: text processing. An entire UIMA pipeline can be encapsulated inside a single function call that takes, for example, a string input parameter and returns named entities found in the input string. This allows one to call a Spark RDD transform (e.g. map) that performs named entity recognition (or whatever text processing tasks your UIMA components accomplish) on string values in your RDD. This approach requires little UIMA tooling or configuration and effectively reduces UIMA to a text processing library that can be called rather than requiring full-scale adoption of another platform. I will prepare a companion resource for this talk that will provide a complete, self-contained, working example of how to leverage UIMA using uimaFIT from within Spark.
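To make the "pipeline inside a single function call" idea concrete, here is a minimal Python sketch of the shape only. It is not uimaFIT: `extract_entities` is a hypothetical stand-in whose body is a toy capitalized-word heuristic, where a real version would run the UIMA analysis engines. The built-in `map` stands in for the RDD transform.

```python
import re
from typing import List

# Hypothetical stand-in for a UIMA pipeline. A real version would build the
# analysis engines once and run them on the text inside extract_entities.
_CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def extract_entities(text: str) -> List[str]:
    """Return candidate entity strings found in text (toy heuristic)."""
    return _CAPITALIZED_RUN.findall(text)


docs = [
    "we met Philip Ogren in Boulder",
    "nothing to see here",
]

# Plain Python map over a list; with PySpark the same function could be
# passed to an RDD's map() instead (assumption: a SparkContext is available).
entities = list(map(extract_entities, docs))
print(entities)
```

The point of the design is that the caller sees only a string-to-list function, so nothing about the text processing framework leaks into the Spark job.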

The necessity of creating light-weight ways to bridge the gaps between applications and frameworks is a signal that every solution is trying to be the complete solution. Since we have different views of what any “complete” solution would look like, wheels are re-invented time and time again, along with all the parts necessary to use those wheels. The result is a tremendous duplication of effort.

A component-based approach attempts to do one thing. Doing any one thing well is challenging enough. (Self-test: How many applications do more than one thing well? Assuming they do one thing well. BTW, for programmers, the test isn’t that other programs fail to do it any better.)

Until more demand results in easy-to-pipeline components, Philip’s uimaFIT is a great way to incorporate text processing from UIMA into Spark.


Running a UIMA Analysis Engine in a Lucene Analyzer Chain

Tuesday, July 31st, 2012

Running a UIMA Analysis Engine in a Lucene Analyzer Chain by Sujit Pal.

From the post:

Last week, I wrote about a UIMA Aggregate Analysis Engine (AE) that annotates keywords in a body of text, optionally inserting synonyms, using a combination of pattern matching and dictionary lookups. The idea is that this analysis will be done on text on its way into a Lucene index. So this week, I describe the Lucene Analyzer chain that I built around the AE I described last week.

A picture is worth a thousand words, so here is one that shows what I am (or will be soon, in much greater detail) talking about.

[Graphic omitted]

As you can imagine, most of the work happens in the UimaAETokenizer. The tokenizer is a buffering (non-streaming) Tokenizer, i.e., the entire text is read from the Reader and analyzed by the UIMA AE, then individual tokens are returned on successive calls to its incrementToken() method. I decided to use the new (to me) AttributeSource.State object to keep track of the tokenizer’s state between calls to incrementToken() (found out about it by grokking through the Synonym filter example in the LIA2 book).

After (UIMA) analysis, the annotated tokens are marked as Keyword, and any transformed values for the annotation are set into the SynonymMap (for use by the synonym filter, next in the chain). Text that is not annotated is split up (by punctuation and whitespace) and returned as plain Lucene Term (or CharTerm since Lucene 3.x) tokens. Here is the code for the Tokenizer class.
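The Tokenizer code from the post is not reproduced in this archive. As a rough illustration of the buffering pattern described above, and not Sujit Pal's implementation, here is a small Python sketch. The names `BufferingTokenizer`, `Token`, and `increment_token` are hypothetical, and exact substring matching against a keyword set stands in for the UIMA AE. Lucene's incrementToken() returns a boolean and fills in attributes; returning the token or `None` is a simplification.

```python
import re
from typing import List, NamedTuple, Optional, Set


class Token(NamedTuple):
    text: str
    is_keyword: bool


class BufferingTokenizer:
    """Analyzes all input up front, then hands out one token per call."""

    def __init__(self, text: str, keywords: Set[str]) -> None:
        # Stand-in for the UIMA AE: keyword spans are found by exact match.
        self._tokens: List[Token] = self._analyze(text, keywords)
        self._position = 0

    @staticmethod
    def _split_plain(chunk: str) -> List[Token]:
        # Unannotated text is split on punctuation and whitespace.
        return [Token(t, False) for t in re.findall(r"\w+", chunk)]

    @classmethod
    def _analyze(cls, text: str, keywords: Set[str]) -> List[Token]:
        if not keywords:
            return cls._split_plain(text)
        # Longest keywords first so multi-word phrases win over their parts.
        alternatives = sorted(keywords, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(k) for k in alternatives))
        tokens: List[Token] = []
        cursor = 0
        for match in pattern.finditer(text):
            tokens.extend(cls._split_plain(text[cursor:match.start()]))
            tokens.append(Token(match.group(), True))
            cursor = match.end()
        tokens.extend(cls._split_plain(text[cursor:]))
        return tokens

    def increment_token(self) -> Optional[Token]:
        """Return the next buffered token, or None when none are left."""
        if self._position >= len(self._tokens):
            return None
        token = self._tokens[self._position]
        self._position += 1
        return token


tokenizer = BufferingTokenizer("Treat heart attack with aspirin.", {"heart attack"})
tokens = []
token = tokenizer.increment_token()
while token is not None:
    tokens.append(token)
    token = tokenizer.increment_token()
print(tokens)
```

The only state carried between calls is the buffered token list and a position, which is the role the post assigns to AttributeSource.State.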

The second of two posts brought to my attention by Jack Park.

Part of my continuing interest in indexing. In part because we know that indexing scales. Seriously scales.

UIMA Analysis Engine for Keyword Recognition and Transformation

Tuesday, July 31st, 2012

UIMA Analysis Engine for Keyword Recognition and Transformation by Sujit Pal.

From the post:

You have probably noticed that I’ve been playing with UIMA lately, perhaps a bit aimlessly. One of my goals with UIMA is to create an Analysis Engine (AE) that I can plug into the front of the Lucene analyzer chain for one of my applications. The AE would detect and mark keywords in the input stream so they would be exempt from stemming by downstream Lucene analyzers.

So a couple of weeks ago, I picked up the bits and pieces of UIMA code that I had written and started to refactor them to form a sequence of primitive AEs that detected keywords in text using pattern and dictionary recognition. Each primitive AE places new KeywordAnnotation objects into an annotation index.

The primitive AEs I came up with are pretty basic, but offer a surprising amount of bang for the buck. There are just two annotators – the PatternAnnotator and DictionaryAnnotator – that do the processing for my primitive AEs listed below. Obviously, more can be added (and will, eventually) as required.

  • Pattern based keyword recognition
  • Pattern based keyword recognition and transformation
  • Dictionary based keyword recognition, case sensitive
  • Dictionary based keyword recognition and transformation, case sensitive
  • Dictionary based keyword recognition, case insensitive
  • Dictionary based keyword recognition and transformation, case insensitive
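The six variants above reduce to two functions with optional arguments. The Python sketch below is only an illustration under assumptions, not the post's UIMA code: `KeywordAnnotation` borrows the name from the post, but its fields, the function names, the regex replacement template, and the restriction to single-word dictionary terms are all hypothetical choices.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class KeywordAnnotation:
    begin: int
    end: int
    text: str
    transformed: Optional[str] = None  # None means recognition only


def pattern_annotate(
    text: str, pattern: str, template: Optional[str] = None
) -> List[KeywordAnnotation]:
    """Annotate regex matches; template (e.g. r"\\1\\2") adds a transformation."""
    annotations = []
    for match in re.finditer(pattern, text):
        transformed = match.expand(template) if template is not None else None
        annotations.append(
            KeywordAnnotation(match.start(), match.end(), match.group(), transformed)
        )
    return annotations


def dictionary_annotate(
    text: str, entries: Dict[str, Optional[str]], ignore_case: bool = False
) -> List[KeywordAnnotation]:
    """Annotate single-word dictionary terms; values are transformed forms or None."""
    lookup = {(k.lower() if ignore_case else k): v for k, v in entries.items()}
    annotations = []
    for match in re.finditer(r"\w+", text):
        key = match.group().lower() if ignore_case else match.group()
        if key in lookup:
            annotations.append(
                KeywordAnnotation(match.start(), match.end(), match.group(), lookup[key])
            )
    return annotations


text = "Call 555-1234 about the MRI scan or an mri follow-up."
print(pattern_annotate(text, r"(\d{3})-(\d{4})", r"\1\2"))
print(dictionary_annotate(text, {"MRI": "magnetic resonance imaging"}))
print(dictionary_annotate(text, {"MRI": "magnetic resonance imaging"}, ignore_case=True))
```

Each annotation records character offsets, so a downstream component can exempt the span from stemming without re-scanning the text.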

The first of two posts that I missed from last year, recently brought to my attention by Jack Park.

The ability to annotate implies, among other things, the ability to create synonym annotations for keywords.

UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

Thursday, August 11th, 2011

UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

From the post:

Over the past few weeks (months?) I have been trying to build a system for concept mapping text against a structured vocabulary of concepts stored in an RDBMS. The concepts (and their associated synonyms) are passed through a custom Lucene analyzer chain and stored in a Neo4j database with a Lucene Index retrieval backend. The concept mapping interface is a UIMA aggregate Analysis Engine (AE) that uses this Neo4j/Lucene combo to annotate HTML, plain text and query strings with concept annotations.
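A toy version of that flow, offered only as a sketch under assumptions: a Python dict stands in for the Neo4j/Lucene store, `analyze` is a hypothetical two-step analyzer (lowercase, split on non-word characters), and the concept ids are invented for the example. The same analyzer is applied to synonyms at index time and to text at lookup time.

```python
import re
from typing import Dict, List, Tuple

Index = Dict[Tuple[str, ...], str]


def analyze(phrase: str) -> Tuple[str, ...]:
    """Toy analyzer chain: lowercase and split on non-word characters."""
    return tuple(re.findall(r"\w+", phrase.lower()))


def build_index(concepts: Dict[str, List[str]]) -> Index:
    """Map each analyzed name or synonym to its concept id."""
    index: Index = {}
    for concept_id, names in concepts.items():
        for name in names:
            index[analyze(name)] = concept_id
    return index


def map_concepts(text: str, index: Index) -> List[Tuple[str, str]]:
    """Return (matched phrase, concept id) pairs, preferring longer matches."""
    words = analyze(text)
    longest = max((len(key) for key in index), default=0)
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at word i first.
        for size in range(min(longest, len(words) - i), 0, -1):
            candidate = words[i:i + size]
            if candidate in index:
                found.append((" ".join(candidate), index[candidate]))
                i += size
                break
        else:
            i += 1  # no concept starts at this word
    return found


# Hypothetical concept ids and synonyms, for illustration only.
concepts = {
    "C001": ["myocardial infarction", "heart attack"],
    "C002": ["aspirin", "acetylsalicylic acid"],
}
index = build_index(concepts)
print(map_concepts("Heart attack patients often take Aspirin.", index))
```

Because every synonym resolves to the same concept id, "heart attack" and "myocardial infarction" end up as the same annotation.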

Sounds interesting.

In particular:

…concepts (and their associated synonyms) are passed through….

sounds like topic map talk to me, partner!

Depends on where you put your emphasis.

Watson and Healthcare

Monday, April 18th, 2011

Watson and Healthcare

From the webpage:

IBM Watson’s stellar performance in the Jeopardy! show captured the world’s imagination. The first real world application for Watson involves healthcare. How does Watson address issues that previous generations of tools have not been able to address? What are the technical approaches Watson takes to set it apart from other systems? This article answers how Watson takes on those questions and gives you a sneak peek of the technology behind Watson based on scientific papers, specifications, articles published by the IBM team, and interviews with university collaborators.

One of the things I find the most interesting about the Watson project is that Apache UIMA (Unstructured Information Management Architecture) lies at its very core.

In part because I don’t think a significant part of all data is ever going to appear in structured format, whether that be Linked Data, some fuller/lesser form of RDF, Topic Maps or some other structured format. That being the case, we are going to have to deal with data as it presents itself and not the soft pitch form of, say, Linked Data.

That will include “merging” based on assumptions we impose on or derive from data and later verify or revise.

How skillful we are at building evolving systems with “memories” of past choices will determine how useful our systems are to users.

The article has a number of resources and pointers.