Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 22, 2012

TXR: a Pattern Matching Language (Not Just)….

Filed under: Pattern Matching,Text Extraction — Patrick Durusau @ 7:34 pm

TXR: a Pattern Matching Language (Not Just) for Convenient Text Extraction

From the webpage:

TXR (“texer” or “tee ex ar”) is a new and growing language oriented toward processing text, packaged as a utility (the txr command) that runs in POSIX environments and on Microsoft Windows.

Working with TXR is different from most text processing programming languages. Constructs in TXR aren’t imperative statements, but rather pattern-matching directives: each construct terminates by matching, failing, or throwing an exception. Searching and backtracking behaviors are implicit.

The development of TXR began when I needed a utility to be used in shell programming which would reverse the action of a “here-document”. Here-documents are a programming language feature for generating multi-line text from a boiler-plate template which contains variables to be substituted, and possibly other constructs such as various functions that generate text. Here-documents appeared in the Unix shell decades ago, but most of today’s web is basically a form of here-document, because all non-trivial websites generate HTML dynamically, substituting variable material into templates on the fly. Well, in the given situation I was programming in, I didn’t want here documents as much as “there documents”: I wanted to write a template of text containing variables, but not to generate text but to do the reverse: match the template against existing text which is similar to it, and capture pieces of it into variables. So I developed a utility to do just that: capture these variables from a template, and then generate a set of variable assignments that could be eval-ed in a shell script.

That was sometime in the middle of 2009. Since then TXR has become a lot more powerful. It has features like structured named blocks with nonlocal exits, structured exception handling, pattern matching functions, and numerous other features. TXR is powerful enough to parse grammars, yet simple to use on trivial tasks.

For things that can’t be easily done in the pattern matching language, TXR has a built-in Lisp dialect, which supports goodies like first class functions with closures over lexical environments, I/O (including string and list streams), hash tables with optional weak semantics, and arithmetic with arbitrary precision (“bignum”) integers.

A powerful tool for text extraction/manipulation.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress