Archive for the ‘Pattern Matching’ Category

SnoPy – SNOBOL Pattern Matching for Python

Friday, March 22nd, 2013

SnoPy – SNOBOL Pattern Matching for Python

Description:

SnoPy – A Python alternative to regular expressions. Borrowed from SNOBOL this alternative is both easier to use and more powerful than regular expressions. NO backslashes to count.

See also: SnoPy – SNOBOL Pattern Matching for Python Web Site.

For cross-disciplinary data mining, what could be more useful than SNOBOL pattern matching?

I first saw this in a Facebook link posted by Sam Hunting.

TXR: a Pattern Matching Language (Not Just)….

Sunday, January 22nd, 2012

TXR: a Pattern Matching Language (Not Just) for Convenient Text Extraction

From the webpage:

TXR (“texer” or “tee ex ar”) is a new and growing language oriented toward processing text, packaged as a utility (the txr command) that runs in POSIX environments and on Microsoft Windows.

Working with TXR is different from most text processing programming languages. Constructs in TXR aren’t imperative statements, but rather pattern-matching directives: each construct terminates by matching, failing, or throwing an exception. Searching and backtracking behaviors are implicit.

The development of TXR began when I needed a utility to be used in shell programming which would reverse the action of a “here-document”. Here-documents are a programming language feature for generating multi-line text from a boiler-plate template which contains variables to be substituted, and possibly other constructs such as various functions that generate text. Here-documents appeared in the Unix shell decades ago, but most of today’s web is basically a form of here-document, because all non-trivial websites generate HTML dynamically, substituting variable material into templates on the fly. Well, in the given situation I was programming in, I didn’t want here documents as much as “there documents”: I wanted to write a template of text containing variables, but not to generate text but to do the reverse: match the template against existing text which is similar to it, and capture pieces of it into variables. So I developed a utility to do just that: capture these variables from a template, and then generate a set of variable assignments that could be eval-ed in a shell script.

That was sometime in the middle of 2009. Since then TXR has become a lot more powerful. It has features like structured named blocks with nonlocal exits, structured exception handling, pattern matching functions, and numerous other features. TXR is powerful enough to parse grammars, yet simple to use on trivial tasks.

For things that can’t be easily done in the pattern matching language, TXR has a built-in Lisp dialect, which supports goodies like first class functions with closures over lexical environments, I/O (including string and list streams), hash tables with optional weak semantics, and arithmetic with arbitrary precision (“bignum”) integers.

A powerful tool for text extraction/manipulation.

Tech Survivors: Geek Technologies…

Thursday, October 20th, 2011

Tech Survivors: Geek Technologies That Still Survive 25 to 50 Years Later

Simply awesome!

Useful review for a couple of reasons:

First: New languages, formats, etc., will emerge but legacy systems “….will be with you always.” (Or at least it will feel that way so being able to interface with legacy systems (understand their semantics) is going to be important for very long time.)

Second: What was it about these technologies that made them succeed? (I don’t have the answer or I would be at the USPTO filing every patent and variant of patent that I could think of. ;-) It is clearly non-obvious because no one else is there either.) Equally long-lived technologies are with us today, we just don’t know which ones.

Would not hurt to put this on your calendar to review every year or so. The more you know about new technologies, the more likely you are to spot a resemblance or pattern matching one of these technologies. Maybe.

Tying up the loose ends in fully LZW-compressed pattern matching

Tuesday, September 27th, 2011

Tying up the loose ends in fully LZW-compressed pattern matching by Pawel Gawrychowski.

Abstract:

We consider a natural generalization of the classical pattern matching problem: given compressed representations of a pattern p[1..M] and a text t[1..N] of sizes m and n, respectively, does p occur in t? We develop an optimal linear time solution for the case when both p and t are compressed using the LZW method. This improves the previously known O((n+m)log(n+m)) time solution of Gasieniec and Rytter, and essentially closes the line of research devoted to studying LZW-compressed exact pattern matching.

I don’t know of any topic maps that are yet of the size that they and queries against them need to take advantage of compressed queries against compress data but this paper outlines “an optimal linear time solution … for fully LZW-compressed pattern matching.” (page 2)

I suspect it may be more relevant to data mining prior to the construction of a topic map. But in either case, when needed it will be a welcome solution.

Faster Approximate Pattern Matching in Compressed Repetitive Texts

Saturday, September 17th, 2011

Faster Approximate Pattern Matching in Compressed Repetitive Texts by Travis Gagie, Pawel Gawrychowski, and Simon J. Puglisi.

Abstract:

Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with $r$ rules for a string $s$ of length $n$, we can build an \Oh\({r}\)-word data structure that allows us to extract any substring \(s [i..j]\) in $\Oh{\log n + j – i}$ time. They also showed how, given a pattern $p$ of length $m$ and an edit distance (k \leq m), their data structure supports finding all \occ approximate matches to $p$ in $s$ in $\Oh{r (\min (m k, k^4 + m) + \log n) + \occ}$ time. Rytter (2003) and Charikar et al. (2005) showed that $r$ is always at least the number $z$ of phrases in the LZ77 parse of $s$, and gave algorithms for building straight-line programs with $\Oh{z \log n}$ rules. In this paper we give a simple $\Oh{z \log n}$-word data structure that takes the same time for substring extraction but only $\Oh{z (\min (m k, k^4 + m)) + \occ}$ time for approximate pattern matching.

It occurs to me that this could be useful for bioinformatic applications of topic maps that map between data sets and literature.

Interesting that the authors mention redundancy in web crawls. I suspect that there are a number of domains that have highly repetitive data should we choose to look at them that way. Markup documents for example.

Mining Associations and Patterns from Semantic Data

Friday, September 2nd, 2011

The editors of a special issue of the International Journal on Semantic Web and Information Systems on Mining Associations and Patterns from Semantic Data have issued the following call for papers:

Guest editors: Kemafor Anyanwu, Ying Ding, Jie Tang, and Philip Yu

Large amounts of Semantic Data is being generated through semantic extractions from and annotation of traditional Web, social and sensor data. Linked Open Data has provided excellent vehicle for representation and sharing of such data. Primary vehicle to get semantics useful for better integration, search and decision making is to find interesting relationships or associations, expressed as meaningful paths, subgraphs and patterns. This special issue seeks theories, algorithms and applications of extracting such semantic relationships from large amount of semantic data. Example topics include:

  • Theories to ground associations and patterns with social, socioeconomic, biological semantics
  • Representation (e.g. language extensions) to express meaningful relationships and patterns
  • Algorithms to efficiently compute and mine semantic associations and patterns
  • Techniques for filtering, ranking and/or visualization of semantic associations and patterns
  • Application of semantic associations and patterns in a domain with significant social or society impact

IJSWIS is included in most major indices including CSI, with Thomson Scientific impact factor 2.345. We seek high quality manuscripts suitable for an archival journal based on original research. If the manuscript is based on a prior workshop or conference submission, submissions should reflect significant novel contribution/extension in conceptual terms and/or scale of implementation and evaluation (authors are highly encouraged to clarify new contributions in a cover letter or within the submission).

Important Dates:
Submission of full papers: Feb 29, 2012
Notification of paper acceptance: May 30, 2012
Publication target: 3Q 2012

Details of the journal, manuscript preparation, and recent articles are available on the website:
http://www.igi-global.com/bookstore/titledetails.aspx?titleid=1092 or http://ijswis.org

Guest Editors: Prof. Kemafor Anyanwu, North Carolina State University
Prof. Ying Ding, Indiana University
Prof. Jie Tang, Tsinghua University
Prof. Philip Yu, University of Illinois, Chicago
Contact Guest Editor: Ying Ding <dingying@indiana.edu>

Pattern Matching & Predicate Dispatch

Friday, August 19th, 2011

Pattern Matching & Predicate Dispatch by David Nolen, NYC Clojure Users Group 8/17/2011.

From the description:

From David:

“Pattern matching is a powerful tool for processing data based on type and structure. In this talk I’ll discuss a new library I’ve been working on that provides optimized, extensible pattern matching to Clojure. We’ll cover the theory, the implementation, and future directions for this library, in particular the true goal – predicate dispatch.”

David Nolen is the primary developer of the Clojure contrib library core.logic – an efficient Prolog-like logic engine. The pattern matching library continues his investigations into the relationships between object-oriented, functional, and logic programming.

Slides: http://www.scribd.com/doc/62571669/Patterns

Source: https://github.com/swannodette/match

Playing with Scala’s pattern matching – Post

Wednesday, February 16th, 2011

Playing with Scala’s pattern matching

François Sarradin writes:

How many times have you been stuck in your frustration because you were unable to use strings as entries in switch-case statements. Such an ability would be really useful for example to analyze the arguments of your application or to parse a file, or any content of a string. Meanwhile, you have to write a series of if-else-if statements (and this is annoying). Another solution is to use a hash map, where the keys are those strings and values are the associated reified processes, for example a Runnable or a Callable in Java (but this is not really natural, long to develop, and boring too).

If a switch-case statement that accepts strings as entries would be a revolution for you, the Scala’s pattern matching says that this is not enough! Indeed, there are other cases where a series of if-else-if statements would be generously transformed into a look-alike switch-case statement. For example, it would be really nice to simplify a series of instanceof and cast included in if-else-if to execute the good process according to the type of a parameter.

In this post, we see the power of the Scala’s pattern matching in different use cases.

What language you choose for topic map development is going to depend upon its pattern matching abilities.

Here’s a chance to evaluate Scala in that regard.