Archive for the ‘Regexes’ Category

RegexBuddy (Think Occur Mode for Emacs)

Saturday, March 18th, 2017


From the webpage:

RegexBuddy is your perfect companion for working with regular expressions. Easily create regular expressions that match exactly what you want. Clearly understand complex regexes written by others. Quickly test any regex on sample strings and files, preventing mistakes on actual data. Debug without guesswork by stepping through the actual matching process. Use the regex with source code snippets automatically adjusted to the particulars of your programming language. Collect and document libraries of regular expressions for future reuse. GREP (search-and-replace) through files and folders. Integrate RegexBuddy with your favorite searching and editing tools for instant access.

Learn all there is to know about regular expressions from RegexBuddy’s comprehensive documentation and regular expression tutorial.

I was reminded of RegexBuddy when I stumbled on the RegexBuddy Manual in a search result.

The XQuery/XPath regex treatment is far briefer than I would like but at 500+ pages, it’s an impressive bit of work. Even without a copy of RegexBuddy, working through the examples will make you a regex terrorist.

The only unfortunate aspect, for *nix users, is that you need to run RegexBuddy in a Windows VM. 🙁

If you are comfortable with Emacs, Windows or otherwise, then the Occur mode comes to mind. It doesn’t have the visuals of RegexBuddy but then you are accustomed to a power-user environment.

In terms of productivity, it’s hard to beat regexes. I passed along a one-liner awk regex tip today to extract content from a “…pile of nonstandard multiply redundant JavaScript infested pseudo html.”

I’ve seen the HTML in question. The description seems a bit generous to me. 😉

Try your hand at regexes and see if your productivity increases!

Regexper [JavaScript Regexes – Railroad Diagrams]

Monday, August 22nd, 2016


From the documentation page:

The images generated by Regexper are commonly referred to as “Railroad Diagrams”. These diagrams are a straight-forward way to illustrate what can sometimes become very complicated processing in a regular expression, with nested looping and optional elements. The easiest way to read these diagrams is to start at the left and follow the lines to the right. If you encounter a branch, then there is the option of following one of multiple paths (and those paths can loop back to earlier parts of the diagram). In order for a string to successfully match the regular expression in a diagram, you must be able to fulfill each part of the diagram as you move from left to right and proceed through the entire diagram to the end.

As an example, this expression will match “Lions and tigers and bears. Oh my!” or the more grammatically correct “Lions, tigers, and bears. Oh my!” (with or without an Oxford comma). The diagram first matches the string “Lions”; you cannot proceed without that in your input. Then there is a choice between a comma or the string ” and”. No matter what choice you make, the input string must then contain ” tigers” followed by an optional comma (your path can either go through the comma or around it). Finally the string must end with ” and bears. Oh my!”.


JavaScript-style regular expression input and railroad diagram output.
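The example in the quoted passage can be tried outside the browser. Below is a minimal Python sketch; the pattern is reconstructed from the prose description above rather than copied from the Regexper page, so treat its exact form as an assumption.

```python
import re

# Reconstructed from the description (an assumption, not the original pattern):
# "Lions", then "," or " and", then " tigers", an optional comma,
# and finally " and bears. Oh my!".
LIONS = re.compile(r"Lions(?:,| and) tigers,? and bears\. Oh my!")

def matches(text):
    """Return True when the whole string can follow the diagram from start to end."""
    return LIONS.fullmatch(text) is not None

print(matches("Lions and tigers and bears. Oh my!"))
print(matches("Lions, tigers, and bears. Oh my!"))
print(matches("Tigers and bears. Oh my!"))
```

Each piece of the pattern corresponds to one stop on the railroad diagram: the non-capturing group is the branch, and `,?` is the path that can go through the comma or around it.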

Can you think of a better visualization for teaching regexes? (Or analysis when they get hairy.)

Regular Expression Crossword Puzzle

Friday, December 25th, 2015

Regular Expression Crossword Puzzle by Greg Grothaus.

From the post:

If you know regular expressions, you might find this to be geek fun. A friend of mine posted this, without a solution, but once I started working it, it seemed put together well enough it was likely solvable. Eventually I did solve it, but not before coding up a web interface for verifying my solution and rotating the puzzle in the browser, which I recommend using if you are going to try this out. Or just print it out.

It’s actually quite impressive of a puzzle in its own right. It must have taken a lot of work to create.


The image is a link to the interactive version with the rules.

Other regex crossword puzzle resources:

RegHex – An alternative web interface to help solve the MIT hexagonal regular expression puzzle.

Regex Crossword – Starting with a tutorial, the site offers 9 levels/types of games, concluding with five (5) hexagonal ones (only a few blocks on the first one and increasingly complex).

Regex Crosswords by Nikola Terziev – Generates regex crosswords, only squares at the moment.

In case you need help with some of the regex puzzles, you can try: Awesome Regex – A collection of regex resources.

If you are really adventuresome, try Constraint Reasoning Over Strings (2003) by Keith Golden and Wanlin Pang.


This paper discusses an approach to representing and reasoning about constraints over strings. We discuss how many string domains can often be concisely represented using regular languages, and how constraints over strings, and domain operations on sets of strings, can be carried out using this representation.

Each regex clue you add is a constraint on all the intersecting cells. Your first regex clue is unbounded, but every clue after that has a constraint. Wait, that’s not right! Constraints arise only when cells governed by different regexes intersect.
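To see how intersecting clues constrain each other, here is a rough brute-force sketch in Python for a small square grid. The function name `solve`, the clue lists, and the three-letter alphabet are all made up for illustration; none of it comes from the puzzle sites above, and a serious solver would prune instead of enumerating every grid.

```python
import re
from itertools import product

def solve(row_clues, col_clues, alphabet):
    """Return every grid (a list of row strings) that satisfies all clues."""
    n_rows, n_cols = len(row_clues), len(col_clues)
    rows = [re.compile(c) for c in row_clues]
    cols = [re.compile(c) for c in col_clues]
    solutions = []
    for letters in product(alphabet, repeat=n_rows * n_cols):
        grid = ["".join(letters[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]
        # Each row must match its clue in full.
        if not all(p.fullmatch(row) for p, row in zip(rows, grid)):
            continue
        # Each column shares cells with every row, so it constrains them too.
        columns = ["".join(row[c] for row in grid) for c in range(n_cols)]
        if all(p.fullmatch(col) for p, col in zip(cols, columns)):
            solutions.append(grid)
    return solutions

# Hypothetical 2x2 puzzle: row clues alone leave four candidate grids,
# the intersecting column clues cut that down to one.
print(len(solve(["A[BC]", "[BC]A"], [".*", ".*"], "ABC")))
print(solve(["A[BC]", "[BC]A"], ["[AC]+", "B."], "ABC"))
```

With unconstrained column clues the row clues admit four grids; adding the two column clues leaves a single solution, which is the intersection effect described above.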

Anyone interested in going beyond hexagons and/or 2 dimensions?

I first saw this in a tweet by Alexis Lloyd.

Mastering Emacs (new book)

Saturday, May 23rd, 2015

Mastering Emacs by Mickey Petersen.

I can’t recommend Mastering Emacs as light beach reading but next to a computer, it has few if any equals.

I haven’t ordered a copy (yet) but based on the high quality of Mickey’s Emacs posts, I recommend it sight unseen.

You can look inside at the TOC.

If you still need convincing, browse Mickey’s Full list of Tips, Tutorials and Articles for a generous sampling of his writing.


Removing blank lines in a buffer (Emacs)

Saturday, April 18th, 2015

Removing blank lines in a buffer by Mickey Petersen.

I was mining Twitter addresses from a list embedded in HTML markup in Emacs (great way to practice regexes) and as a result, had lots of blank lines. Before running sort or uniq, I wanted to remove the blank lines.

All of Mickey’s posts are great resources but I found this one particularly helpful.
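Mickey’s post covers the Emacs way. For comparison, the same cleanup can be sketched outside Emacs; the Python below is my own illustration (the function names are made up), not something taken from his post.

```python
import re

# A "blank" line here means a line holding only spaces or tabs.
BLANK_LINE = re.compile(r"^[ \t]*\n", re.MULTILINE)

def remove_blank_lines(text):
    """Delete every blank line, keeping the other lines untouched."""
    return BLANK_LINE.sub("", text)

def sorted_unique_lines(text):
    """Rough equivalent of piping the cleaned text through sort and uniq."""
    return sorted(set(remove_blank_lines(text).splitlines()))

sample = "@beta\n\n@alpha\n  \n\n@beta\n"
print(remove_blank_lines(sample))
print(sorted_unique_lines(sample))
```

Note that the pattern only removes blank lines that end in a newline, so a trailing run of spaces at the very end of the text is left alone.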

Look-behind regex

Wednesday, December 10th, 2014

Look-behind regex by John D. Cook.

From the post:

Look-behind is one of those advanced/obscure regular expression features that I don’t use frequently enough to remember the syntax, but just frequently enough that I wish I could remember it.

Look-behind can be positive or negative. Look-behind says “match this position only if the preceding text matches (does not match) the following pattern.”

I wish I had read this post before writing regular expressions to clean up over 4K of scanning results recently. I can think of several cases where this could have been helpful.
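For anyone else who forgets the syntax, here is a short Python sketch of both forms. The sample string is invented; the syntax is `(?<=...)` for positive and `(?<!...)` for negative look-behind, and Python’s `re` module requires the look-behind pattern to have a fixed width.

```python
import re

text = "cost: $40, qty 12, refund $7"

# Positive look-behind: digits only when the preceding character is "$".
priced = re.findall(r"(?<=\$)\d+", text)

# Negative look-behind: digits NOT preceded by "$" (or by another digit,
# which stops the match from starting in the middle of "$40").
unpriced = re.findall(r"(?<![\$\d])\d+", text)

print(priced)
print(unpriced)
```

In both cases the look-behind only tests the position; the dollar sign itself is never part of the match.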

If you want to practice your regex writing skills, visit Stack Overflow and try your hand at recent regex questions. Or stroll through some of the older questions for tips/techniques.


Monday, June 9th, 2014


RegexTip is a Twitter account maintained by John D. Cook and it sends out one (1) regex tip per week.

Regexes or regular expressions are everywhere in computer science but especially in search.

I just saw a tweet by Scientific Python that the cycle of regex tips has restarted with the basics.

Good time to follow RegexTip.


Saturday, May 3rd, 2014


From the webpage:

RegExr is an online tool to learn, build, & test Regular Expressions (RegEx / RegExp).

  • Results update in real-time as you type.
  • Roll over a match or expression for details.
  • Save & share expressions with others.
  • Explore the Library for help & examples.
  • Undo & Redo with Ctrl-Z / Y.
  • Search for & rate Community patterns.

For fast text processing, very little can touch regexes and Unix command line utilities.

I first saw this at Nathan Yau’s Learn regular expressions with RegExr.

Regular expressions unleashed

Wednesday, April 16th, 2014

Regular expressions unleashed by Hans-Juergen Schoenig.

From the post:

When cleaning up some old paperwork this weekend I stumbled over a very old tutorial. In fact, I have received this little handout during a UNIX course I attended voluntarily during my first year at university. It seems that those two days have really changed my life – the price tag: 100 Austrian Schillings which translates to something like 7 Euros in today’s money.

When looking at this old thing I noticed a nice example showing how to test regular expression support in grep. Over the years I had almost forgotten this little test. Here is the idea: There is no single way to print the name of Libya’s former dictator. According to this example there are around 30 ways to do it:…

Thirty (30) sounds a bit low to me but it’s sufficient to point out that mining all thirty (30) is going to give you a number of false positives, when searching for news on the former dictator of Libya.
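To make the point concrete, here is a deliberately simplified Python sketch. The pattern is my own illustration covering a handful of surname spellings; it is not the regex from Hans-Juergen’s post and makes no claim to cover all thirty forms.

```python
import re

# Illustrative surname pattern (an assumption, not the regex from the quoted post):
# G/K/Q, optional "h", "a", one or two "d", optional "h", "af", then "i" or "y".
SURNAME = re.compile(r"\b[GKQ]h?ad{1,2}h?af[iy]\b")

def find_variants(text):
    """Return every surname spelling the pattern finds, in order of appearance."""
    return SURNAME.findall(text)

headlines = "Gaddafi, Qaddafi, Kadafi, Qadhafi, Khadafy and Gadhafi all appear in print."
print(find_variants(headlines))
```

The pattern matches spellings, nothing more. Anyone or anything else carrying the same name matches just as readily, which is where the false positives come from.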

The regex to capture all thirty (30) variant forms in a PostgreSQL database is great but once you have it, now what?

Particularly if you have sorted out the dictator from the non-dictators and/or placed them in other categories.

Do you pass that sorting and classifying onto the next user or do you flush the knowledge toilet and all that hard work just drains away?

Learn regex the hard way

Wednesday, April 16th, 2014

Learn regex the hard way by Zed A. Shaw.

From the preface:

This is a rough in-progress dump of the book. The grammar will probably be bad, there will be sections missing, but you get to watch me write the book and see how I do things.

Finally, don’t forget that I have Learn Python The Hard Way, 2nd Edition which you should read if you can’t code yet.

Exercises 1 – 16 have some content (out of 27) so it is incomplete but still a goodly amount of material.

Zed has other “hard way” titles on:

Regexes are useful in all contexts so you won’t regret learning or brushing up on them.


Saturday, August 17th, 2013


From the webpage:

frak transforms collections of strings into regular expressions for matching those strings. The primary goal of this library is to generate regular expressions from a known set of inputs which avoid backtracking as much as possible.
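frak itself is a Clojure library and I have not reproduced its algorithm. As a rough idea of how a set of strings can be turned into a regex with shared prefixes factored out (so the matcher has fewer alternatives to retry), here is a small trie-based sketch in Python. The function names are made up for this illustration.

```python
import re

def build_trie(words):
    """Build a nested-dict trie; the empty-string key marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn a trie into a regex string, sharing common prefixes."""
    ends_here = "" in node
    branches = [re.escape(ch) + trie_to_regex(node[ch])
                for ch in sorted(k for k in node if k != "")]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # A word may stop at this node, so everything after it is optional.
    return body + "?" if ends_here else body

words = ["foo", "bar", "baz", "quux"]
pattern = trie_to_regex(build_trie(words))
print(pattern)
```

For the four sample words the shared prefix "ba" is emitted once, followed by a two-way choice, instead of listing "bar" and "baz" as separate alternatives.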

This looks quite useful for text mining.

A large amount of which is on the near horizon.

I first saw this in Nat Torkington’s Four short links: 16 August 2013.

Debuggex [Emacs Alternative, Others?]

Saturday, April 13th, 2013

Debuggex: A visual regex helper

Regexes (regular expressions) are a mainstay of data mining/extraction.

Debuggex is a regex debugger with visual cues to help you with writing/debugging regular expressions.

The webpage reports full JS regexes are not yet supported.

If you need a fuller alternative, consider debugging regular expressions in Emacs.

M-x regexp-builder

which shows matches as you type.

Be aware that regex languages vary (no real surprise).

One helpful resource: Regular Expression Flavor Comparison

Working with Pig

Saturday, February 16th, 2013

Working with Pig by Dan Morrill. (video)

From the description:

Pig is a SQL like command language for use with Hadoop, we review a simple PIG script line by line to help you understand how pig works, and regular expressions to help parse data. If you want a copy of the slide presentation – they are over on slide share

Very good intro to PIG!

Mentions a couple of resources you need to bookmark:

Input Validation Cheat Sheet (The Open Web Application Security Project – OWASP) – regexes to re-use in Pig scripts. Lots of other regex cheat sheet pointers. (Being mindful that “\” must be escaped in PIG.) A more general resource on regexes.

I first saw this at: This Quick Pig Overview Brings You Up to Speed Line by Line.

C++11 regex cheatsheet

Sunday, July 22nd, 2012

C++11 regex cheatsheet

A one page C++11 regex cheatsheet that you may find useful.

Curious though, how useful do you find colors on cheatsheets?

Or are there cheatsheets where you find colors useful and others not?

If so, what seems to be the difference?

Not an entirely idle query. I want to author a cheatsheet or two, but want them to be useful to others.

At one level, I see cheatsheets as being extremely minimalistic, no commentary, just short reminders of the correct syntax.

A step up from that level, perhaps for rarely used commands, a bit more than bare syntax.

Suggestions? Pointers to cheatsheets you have found useful?

BADREX: In situ expansion and coreference of biomedical abbreviations using dynamic regular expressions

Friday, June 22nd, 2012

BADREX: In situ expansion and coreference of biomedical abbreviations using dynamic regular expressions by Phil Gooch.


BADREX uses dynamically generated regular expressions to annotate term definition-term abbreviation pairs, and corefers unpaired acronyms and abbreviations back to their initial definition in the text. Against the Medstract corpus BADREX achieves precision and recall of 98% and 97%, and against a much larger corpus, 90% and 85%, respectively. BADREX yields improved performance over previous approaches, requires no training data and allows runtime customisation of its input parameters. BADREX is freely available from as a plugin for the General Architecture for Text Engineering (GATE) framework and is licensed under the GPLv3.

From the conclusion:

The use of regular expressions dynamically generated from document content yields modestly improved performance over previous approaches to identifying term definition–term abbreviation pairs, with the benefit of providing in-place annotation, expansion and coreference in a single pass. BADREX requires no training data and allows runtime customisation of its input parameters.

Although not mentioned by the author, a reader can agree/disagree with an expansion as they are reading the text. Could provide for faster feedback/correction of the expansion.

Assuming you accept a correct/incorrect view of expansions. I prefer agree/disagree as the more general rule. Correct/incorrect is the result of the application of a specified rule.
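What “dynamically generated regular expressions” might look like in the small: the Python sketch below builds a pattern from each parenthesised abbreviation it finds and uses that pattern to look for the definition in front of it. This is a toy written for this post under the assumption of one word per abbreviation letter; it is not BADREX’s method, and the function names are made up.

```python
import re

def definition_pattern(abbrev):
    """Build a regex from the abbreviation: one word per letter, then "(ABBREV)"."""
    words = [re.escape(ch) + r"\w*" for ch in abbrev]
    return re.compile(
        r"\b(" + r"[\s-]+".join(words) + r")\s*\(" + re.escape(abbrev) + r"\)",
        re.IGNORECASE,
    )

def find_pairs(text):
    """Return (definition, abbreviation) pairs for each parenthesised abbreviation."""
    pairs = []
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbrev = m.group(1)
        found = definition_pattern(abbrev).search(text)
        if found:
            pairs.append((found.group(1), abbrev))
    return pairs

text = ("Patients with chronic obstructive pulmonary disease (COPD) were enrolled. "
        "COPD severity was graded.")
print(find_pairs(text))
```

The regex is created at run time from document content, so no abbreviation list or training data is needed, though this toy version will miss any definition that skips or adds words.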

Clojure and XNAT: Introduction

Saturday, February 4th, 2012

Clojure and XNAT: Introduction

Over the last two years, I’ve been using Clojure quite a bit for managing, testing, and exploratory development in XNAT. Clojure is a new member of the Lisp family of languages that runs in the Java Virtual Machine. Two features of Clojure that I’ve found particularly useful are seamless Java interoperability and good support for interactive development.

“Interactive development” is a term that may need some explanation: With many languages — Java, C, and C++ come to mind — you write your code, compile it, and then run your program to test. Most Lisps, including Clojure, have a different model: you start the environment, write some code, test a function, make changes, and rerun your test with the new code. Any state necessary for the test stays in memory, so each write/compile/test iteration is fast. Developing in Clojure feels a lot like running an interpreted environment like Matlab, Mathematica, or R, but Clojure is a general-purpose language that compiles to JVM bytecode, with performance comparable to plain old Java.

One problem that comes up again and again on the XNAT discussion group and in our local XNAT support is that received DICOM files land in the unassigned prearchive rather than the intended project. Usually when this happens, there’s a custom rule for project identification where the regular expression doesn’t quite match what’s in the DICOM headers. Regular expressions are a wonderfully concise way of representing text patterns, but this sentence is equally true if you replace “wonderfully concise” with “maddeningly cryptic.”

Interesting “introduction” that focuses on regular expressions.

If you don’t know XNAT (I didn’t):

XNAT is an open source imaging informatics platform, developed by the Neuroinformatics Research Group at Washington University. It facilitates common management, productivity, and quality assurance tasks for imaging and associated data. Thanks to its extensibility, XNAT can be used to support a wide range of imaging-based projects.

Important neuroinformatics project based at Washington University, which has a history of very successful public technology projects.

Never hurts to learn more about any informatics project, particularly one in the medical sciences. With an introduction to Clojure as well, what more could you want?

How Google Code Search Worked

Tuesday, January 24th, 2012

Regular Expression Matching with a Trigram Index or How Google Code Search Worked by Russ Cox.

In the summer of 2006, I was lucky enough to be an intern at Google. At the time, Google had an internal tool called gsearch that acted as if it ran grep over all the files in the Google source tree and printed the results. Of course, that implementation would be fairly slow, so what gsearch actually did was talk to a bunch of servers that kept different pieces of the source tree in memory: each machine did a grep through its memory and then gsearch merged the results and printed them. Jeff Dean, my intern host and one of the authors of gsearch, suggested that it would be cool to build a web interface that, in effect, let you run gsearch over the world’s public source code. I thought that sounded fun, so that’s what I did that summer. Due primarily to an excess of optimism in our original schedule, the launch slipped to October, but on October 5, 2006 we did launch (by then I was back at school but still a part-time intern).

I built the earliest demos using Ken Thompson’s Plan 9 grep, because I happened to have it lying around in library form. The plan had been to switch to a “real” regexp library, namely PCRE, probably behind a newly written, code reviewed parser, since PCRE’s parser was a well-known source of security bugs. The only problem was my then-recent discovery that none of the popular regexp implementations – not Perl, not Python, not PCRE – used real automata. This was a surprise to me, and even to Rob Pike, the author of the Plan 9 regular expression library. (Ken was not yet at Google to be consulted.) I had learned about regular expressions and automata from the Dragon Book, from theory classes in college, and from reading Rob’s and Ken’s code. The idea that you wouldn’t use the guaranteed linear time algorithm had never occurred to me. But it turned out that Rob’s code in particular used an algorithm only a few people had ever known, and the others had forgotten about it years earlier. We launched with the Plan 9 grep code; a few years later I did replace it, with RE2.

Russ covers inverted indexes, trigrams, and regexes, and gives pointers to working code along with examples of running the code searcher locally (on the Linux source code, for example).
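The core idea can be sketched in a few lines of Python. This is a toy of my own covering literal strings only, with invented function names and sample documents; Russ’s article describes how to derive trigram queries from full regular expressions, which this sketch does not attempt.

```python
import re
from collections import defaultdict

def trigrams(text):
    """Return the set of all three-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Inverted index: trigram -> set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for tri in trigrams(text):
            index[tri].add(name)
    return index

def search(index, docs, literal):
    """Use the index to pick candidates, then confirm each one with a real match."""
    tris = trigrams(literal)
    if tris:
        candidates = set.intersection(*(index.get(t, set()) for t in tris))
    else:
        candidates = set(docs)  # query too short to use the index
    pattern = re.compile(re.escape(literal))
    return sorted(name for name in candidates if pattern.search(docs[name]))

docs = {
    "a.c": "int main(void) { return 0; }",
    "b.py": "def main():\n    return 0",
    "c.txt": "nothing here",
}
index = build_index(docs)
print(search(index, docs, "return 0"))
print(search(index, docs, "void"))
```

The index only narrows the set of files; a document can contain every trigram of the query without containing the query itself, so the final regex pass over the candidates is still required.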

Extremely useful article as an introduction to indexes and regexes.