Parsing Drug Dosages in text…

Parsing Drug Dosages in text using Finite State Machines by Sujit Pal.

From the post:

Someone recently pointed out an issue with the Drug Dosage FSM in Apache cTakes on the cTakes mailing list. Looking at the code for it revealed a fairly complex implementation based on a hierarchy of Finite State Machines (FSMs). The intuition behind the implementation is that Drug Dosage text in doctors’ notes tends to follow a standard-ish format, and FSMs can be used to exploit this structure and pull relevant entities out of this text. The paper Extracting Structured Medication Event Information from Discharge Summaries has more information about this problem. The authors provide their own solution, called the Merki Medication Parser. Here is a link to their Online Demo and source code (Perl).

I’ve never used FSMs myself, although I have seen them used to model (more structured) systems. So the idea of using FSMs for parsing semi-structured text such as this seemed interesting and I decided to try it out myself. The implementation I describe here is nowhere near as complex as the one in cTakes, but on the flip side, it is not as accurate, broad, or bulletproof either.

My solution uses drug dosage phrase data provided in this Pattern Matching article by Erin Rhode (which also comes with a Perl-based solution), as well as its dictionaries (with additions by me), to model the phrases with the state diagram below. I built the diagram by eyeballing the outputs from Erin Rhode’s program. I then implemented the state diagram with a home-grown FSM implementation based on ideas from Electric Monk’s post on FSMs in Python and the documentation for the Java library Tungsten FSM. I initially tried to use Tungsten FSM, but ended up with extremely verbose Scala code because of Scala’s stricter generics system.
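
To make the idea concrete, here is a minimal table-driven FSM sketch in Python. It is not the implementation from the post (which is in Scala and follows a state diagram not reproduced here). The states, the tiny dictionaries, and the names classify and parse_dosage are all assumptions made up for illustration.

```python
import re

# Hypothetical mini-dictionaries; the post relies on much larger ones.
UNITS = {"mg", "mcg", "g", "ml", "tablet", "tablets"}
ROUTES = {"po", "iv", "im", "oral", "orally"}
FREQS = {"qd", "bid", "tid", "qid", "daily", "prn"}


def classify(token):
    """Map a raw token to the coarse class the FSM uses as its input symbol."""
    t = token.lower()
    if re.fullmatch(r"\d+(\.\d+)?", t):
        return "NUM"
    if t in UNITS:
        return "UNIT"
    if t in ROUTES:
        return "ROUTE"
    if t in FREQS:
        return "FREQ"
    return "WORD"


# (current state, token class) -> (next state, output field for the token)
TRANSITIONS = {
    ("START", "WORD"): ("DRUG", "drug"),
    ("DRUG", "WORD"): ("DRUG", "drug"),
    ("DRUG", "NUM"): ("DOSE", "dosage"),
    ("DOSE", "UNIT"): ("UNIT", "unit"),
    ("UNIT", "ROUTE"): ("ROUTE", "route"),
    ("UNIT", "FREQ"): ("FREQ", "frequency"),
    ("ROUTE", "FREQ"): ("FREQ", "frequency"),
}

# States in which a phrase may legally end.
ACCEPTING = {"UNIT", "ROUTE", "FREQ"}


def parse_dosage(phrase):
    """Run the FSM over a phrase; return extracted fields, or None on failure."""
    state = "START"
    fields = {}
    for token in phrase.split():
        key = (state, classify(token))
        if key not in TRANSITIONS:
            return None  # no transition defined: phrase does not fit the model
        state, field = TRANSITIONS[key]
        fields.setdefault(field, []).append(token)
    if state not in ACCEPTING:
        return None  # phrase ended before a complete dosage was seen
    return {name: " ".join(tokens) for name, tokens in fields.items()}


print(parse_dosage("Aspirin 81 mg PO daily"))
```

Keeping the transitions in a dictionary means new phrase shapes can be supported by adding rows rather than code, which is one reason a home-grown FSM can stay small.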

This caught my attention because I was recently looking at a data import handler that was harvesting information from a minimal XML wrapper around MediaWiki markup. It works quite well, but it seems like a shame to miss all the data in wiki markup.

I say “miss all the data in wiki markup” and that’s not really fair. The markup is dumped into a single field for indexing. But that field loses the contextual distinctions between a note, an appendix, a bibliography, and the main text.

If you need distinctions that aren’t the defaults, you may be faced with rolling your own FSM. This post should help get you started.
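
As a starting point for the wiki markup case, here is a rough Python sketch of an FSM whose state is the current section context and whose transitions fire on MediaWiki-style headings (== Title ==). The heading-to-context mapping and the name tag_lines are assumptions for illustration, not part of any import handler.

```python
import re

# Matches headings such as "== Notes ==" or "=== Appendix ===".
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")

# Hypothetical mapping from heading text to a context state.
SECTION_STATES = {
    "notes": "note",
    "appendix": "appendix",
    "references": "bibliography",
    "bibliography": "bibliography",
}


def tag_lines(markup):
    """Label each non-blank content line with the context it appears in."""
    state = "main"  # text before any recognized heading is main text
    tagged = []
    for line in markup.splitlines():
        match = HEADING.match(line)
        if match:
            # A heading is the input that triggers a state change;
            # unrecognized headings fall back to the main-text state.
            state = SECTION_STATES.get(match.group(2).lower(), "main")
            continue
        if line.strip():
            tagged.append((state, line))
    return tagged


sample = "Intro text\n== Notes ==\nA note\n== Bibliography ==\nSome book"
for context, text in tag_lines(sample):
    print(context, "|", text)
```

Each tagged line could then be routed to its own index field instead of a single catch-all field.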
