Archive for the ‘Regex’ Category

Readable Regexes In Python?

Friday, August 19th, 2016

Doug Mahugh retweeted Raymond Hettinger tweeting:

#python tip: Complicated regexes can be organized into readable, commented chunks.
https://docs.python.org/3/library/re.html#re.X

Twitter hasn’t gotten around to censoring Python-related tweets for accuracy, so I did check the reference:

re.X
re.VERBOSE

This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

This means that the two following regular expression objects that match a decimal number are functionally equal:
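The example that followed this sentence in the documentation did not survive here. A minimal sketch along the same lines, with variable names and sample strings of my own choosing, shows a verbose pattern and its compact twin agreeing on every input:

```python
import re

# Verbose form: whitespace and "#" comments inside the pattern are ignored.
verbose = re.compile(r"""
    \d +   # the integral part
    \.     # the decimal point
    \d *   # some fractional digits
""", re.VERBOSE)

# Compact form of the same pattern.
compact = re.compile(r"\d+\.\d*")

# Sample strings are illustrative only.
samples = ["3.14", "42.", "abc", ".5"]
results = [(bool(verbose.fullmatch(s)), bool(compact.fullmatch(s))) for s in samples]
for sample, (v, c) in zip(samples, results):
    print(sample, v, c)
```

Both patterns accept "3.14" and "42." and reject "abc" and ".5", since at least one digit is required before the decimal point.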

Which is the better question?

Why would anyone want to produce a readable regex in Python?

or,

Why would anyone NOT produce a readable regex given the opportunity?

Enjoy!

PS: It occurs to me that with a search expression you could address such strings as subjects in a topic map. A more robust form of documentation than # syntax.

Regular Expression Crossword Puzzle

Friday, December 25th, 2015

Regular Expression Crossword Puzzle by Greg Grothaus.

From the post:

If you know regular expressions, you might find this to be geek fun. A friend of mine posted this, without a solution, but once I started working it, it seemed put together well enough it was likely solvable. Eventually I did solve it, but not before coding up a web interface for verifying my solution and rotating the puzzle in the browser, which I recommend using if you are going to try this out. Or just print it out.

It’s actually quite impressive of a puzzle in its own right. It must have taken a lot of work to create.

[Image: regexpuzzle]

The image is a link to the interactive version with the rules.

Other regex crossword puzzle resources:

RegHex – An alternative web interface to help solve the MIT hexagonal regular expression puzzle.

Regex Crossword – Starting with a tutorial, the site offers 9 levels/types of games, concluding with five (5) hexagonal ones (only a few blocks on the first one and increasingly complex).

Regex Crosswords by Nikola Terziev – Generates regex crosswords, only squares at the moment.

In case you need help with some of the regex puzzles, you can try: Awesome Regex – A collection of regex resources.

If you are really adventuresome, try Constraint Reasoning Over Strings (2003) by Keith Golden and Wanlin Pang.

Abstract:

This paper discusses an approach to representing and reasoning about constraints over strings. We discuss how many string domains can often be concisely represented using regular languages, and how constraints over strings, and domain operations on sets of strings, can be carried out using this representation.

Each regex clue you add is a constraint on all the intersecting cells. Your first regex clue is unbounded, but every clue after that has a constraint. Wait, that’s not right! Constraints arise only when cells governed by different regexes intersect.
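To make the intersecting-constraints idea concrete, here is a brute-force sketch for a small square puzzle. The clues, the `solve` function, and the choice of alphabet are all hypothetical, picked for illustration; a real solver would prune far more aggressively than this.

```python
import re
import string
from itertools import product

def solve(row_clues, col_clues, alphabet=string.ascii_uppercase):
    """Try every grid; keep those where each row and column fully matches its clue."""
    n_rows, n_cols = len(row_clues), len(col_clues)
    rows = [re.compile(clue) for clue in row_clues]
    cols = [re.compile(clue) for clue in col_clues]
    solutions = []
    for letters in product(alphabet, repeat=n_rows * n_cols):
        grid = ["".join(letters[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]
        # Row clues alone leave many grids standing...
        if not all(p.fullmatch(row) for p, row in zip(rows, grid)):
            continue
        # ...the column clues cut them down through the shared cells.
        columns = ["".join(row[c] for row in grid) for c in range(n_cols)]
        if all(p.fullmatch(col) for p, col in zip(cols, columns)):
            solutions.append(grid)
    return solutions

# Hypothetical 2x2 puzzle: two row clues, two column clues.
answers = solve(["HE|LL|O+", "[PLEASE]+"], ["[^SPEAK]+", "EP|IP|EF"])
print(answers)  # [['HE', 'LP']]
```

Taken alone, the first row clue admits HE, LL or OO. Only the second column clue, which shares a cell with it, forces HE.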

Anyone interested in going beyond hexagons and/or 2 dimensions?

I first saw this in a tweet by Alexis Lloyd.

Desperately Seeking a Regex

Sunday, August 9th, 2015

A JavaScript regex to match a regex was posted to Stack Overflow (formerly computer programming) by Mike Samuel:

/\/((?![*+?])(?:[^\r\n\[/\\]|\\.|\[(?:[^\r\n\]\\]|\\.)*\])+)\/((?:g(?:im?|m)?|i(?:gm?|m)?|m(?:gi?|i)?)?)/

Along with a nifty explanation and caveats about its use.

Other candidates?

Thinking this could be very useful for mining regexes out of discussion groups, etc.
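As a rough sketch of that mining idea, the same pattern can be carried over to Python's `re` module (dropping the JavaScript delimiters) and run over a block of text. The names `JS_REGEX_LITERAL` and `mine_regexes` and the sample post are mine, not part of the Stack Overflow answer.

```python
import re

# Mike Samuel's pattern, minus the surrounding JavaScript "/.../" delimiters.
# Group 1 is the regex body, group 2 the flags (some ordering of g, i, m).
JS_REGEX_LITERAL = re.compile(
    r"/((?![*+?])(?:[^\r\n\[/\\]|\\.|\[(?:[^\r\n\]\\]|\\.)*\])+)"
    r"/((?:g(?:im?|m)?|i(?:gm?|m)?|m(?:gi?|i)?)?)"
)

def mine_regexes(text):
    """Return (body, flags) pairs for everything that looks like a JS regex literal."""
    return [(m.group(1), m.group(2)) for m in JS_REGEX_LITERAL.finditer(text)]

# Hypothetical forum text.
post = r"Try /\d+/g first, or /[a-z]+\/x/i if that fails."
found = mine_regexes(post)
print(found)  # [('\\d+', 'g'), ('[a-z]+\\/x', 'i')]
```

Expect false positives on ordinary prose: file paths and URLs contain slash-delimited runs that look like regex literals to this pattern, so some filtering afterwards would be needed.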

Look-behind regex

Wednesday, December 10th, 2014

Look-behind regex by John D. Cook.

From the post:

Look-behind is one of those advanced/obscure regular expression features that I don’t use frequently enough to remember the syntax, but just frequently enough that I wish I could remember it.

Look-behind can be positive or negative. Look-behind says “match this position only if the preceding text matches (does not match) the following pattern.”
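For my own memory as much as anyone's, a small Python sketch of both forms. The sample string is made up, and note that Python's `re` module requires the look-behind pattern to have a fixed width.

```python
import re

text = "cost $42 and 17 items"  # illustrative sample

# Positive look-behind (?<=...): digits only when preceded by a dollar sign.
priced = re.findall(r"(?<=\$)\d+", text)

# Negative look-behind (?<!...): whole numbers NOT preceded by a dollar sign.
unpriced = re.findall(r"(?<!\$)\b\d+\b", text)

print(priced)    # ['42']
print(unpriced)  # ['17']
```

In both cases the dollar sign is tested but never consumed, so it does not appear in the match.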

I wish I had read this post before writing regular expressions to clean up over 4K of scanning results recently. I can think of several cases where this could have been helpful.

If you want to practice your regex writing skills, visit Stack Overflow and try your hand at recent regex questions. Or stroll through some of the older questions for tips/techniques.

RegexTip

Monday, June 9th, 2014

RegexTip

RegexTip is a Twitter account maintained by John D. Cook and it sends out one (1) regex tip per week.

Regexes or regular expressions are everywhere in computer science but especially in search.

I just saw a tweet by Scientific Python that the cycle of regex tips has restarted with the basics.

Good time to follow RegexTip.

RegExr

Saturday, May 3rd, 2014

RegExr

From the webpage:

RegExr is an online tool to learn, build, & test Regular Expressions (RegEx / RegExp).

  • Results update in real-time as you type.
  • Roll over a match or expression for details.
  • Save & share expressions with others.
  • Explore the Library for help & examples.
  • Undo & Redo with Ctrl-Z / Y.
  • Search for & rate Community patterns.

For fast text processing, very little can touch regexes and Unix command line utilities.

I first saw this at Nathan Yau’s Learn regular expressions with RegExr.

Learn regex the hard way

Wednesday, April 16th, 2014

Learn regex the hard way by Zed A. Shaw.

From the preface:

This is a rough in-progress dump of the book. The grammar will probably be bad, there will be sections missing, but you get to watch me write the book and see how I do things.

Finally, don’t forget that I have Learn Python The Hard Way, 2nd Edition (http://learnpythonthehardway.org) which you should read if you can’t code yet.

Exercises 1 – 16 have some content (out of 27) so it is incomplete but still a goodly amount of material.

Zed has other “hard way” titles on:

Regexes are useful in all contexts so you won’t regret learning or brushing up on them.

xkcd 1313: Something is Wrong on the Internet!

Friday, January 10th, 2014

xkcd 1313: Something is Wrong on the Internet!

Serious geekdom here!

An xkcd comic inspires an algorithm that generates a regex to extract winners from U.S. presidential elections. (Applicable to other lists as well.)

Remembering that some U.S. presidents both won and lost races for the presidency.
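A much simplified sketch of the general idea follows. It is not the algorithm from the post: it greedily picks short substrings of winners that match no loser, and assumes the two lists do not overlap. The function names and the sample lists are my own.

```python
import re

def candidate_parts(winners, losers, max_len=4):
    """Substrings of winners (plus whole-word anchors) that match no loser."""
    parts = set()
    for word in winners:
        parts.add("^" + re.escape(word) + "$")
        for size in range(1, max_len + 1):
            for start in range(len(word) - size + 1):
                parts.add(re.escape(word[start:start + size]))
    return {p for p in parts if not any(re.search(p, loser) for loser in losers)}

def golf(winners, losers):
    """Greedily choose parts until every winner is covered, then join with '|'."""
    if set(winners) & set(losers):
        raise ValueError("a name appears in both lists; no regex can separate them")
    parts = candidate_parts(winners, losers)
    remaining = set(winners)
    chosen = []
    while remaining:
        # Prefer parts covering the most remaining winners, then shorter parts.
        def score(part):
            covered = sum(1 for w in remaining if re.search(part, w))
            return (covered, -len(part), part)
        best = max(parts, key=score)
        chosen.append(best)
        remaining -= {w for w in remaining if re.search(best, w)}
    return "|".join(chosen)

# Illustrative lists only, not the full election data.
winners = ["washington", "adams", "jefferson", "madison"]
losers = ["clinton", "pinckney", "king", "jay"]
pattern = golf(winners, losers)
print(pattern)
```

The overlap check is where candidates who both won and lost cause trouble: a name on both lists has to be removed from one of them before any regex can be generated.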

A very clever piece of work. At the same time, I must have the winner/loser lists in order to generate the regex.

So good exercise but I can’t apply it beyond the lists I used to generate the regex.

Yes?

BTW, do make a trip by Regex Golf to try your hand at writing regexes against different lists.

Automata [Starts 4 Nov. 2013]

Tuesday, October 8th, 2013

Automata by Jeff Ullman.

From the course description:

Why Study Automata Theory?

This subject is not just for those planning to enter the field of complexity theory, although it is a good place to start if that is your goal. Rather, the course will emphasize those aspects of the theory that people really use in practice. Finite automata, regular expressions, and context-free grammars are ideas that have stood the test of time. They are essential tools for compilers. But more importantly, they are used in many systems that require input that is less general than a full programming language yet more complex than “push this button.”

The concepts of undecidable problems and intractable problems serve a different purpose. Undecidable problems are those for which no computer solution can ever exist, while intractable problems are those for which there is strong evidence that, although they can be solved by a computer, they cannot be solved sufficiently fast that the solution is truly useful in practice. Understanding this theory, and in particular being able to prove that a problem you are facing belongs to one of these classes, allows you to justify taking another approach — simplifying the problem or writing code to approximate the solution, for example.

During the course, I’m going to prove a number of things. The purpose of these proofs is not to torture you or confuse you. Neither are the proofs there because I doubt you would believe me were I merely to state some well-known fact. Rather, understanding how these proofs, especially inductive proofs, work, lets you think more clearly about your own work. I do not advocate proofs that programs are correct, but whenever you attempt something a bit complex, it is good to have in mind the inductive proofs that would be needed to guarantee that what you are doing really works in all cases.

Recommended Background

You should have had a second course in Computer Science — one that covers basic data structures (e.g., lists, trees, hashing), and basic algorithms (e.g., tree traversals, recursive programming, big-oh running time). In addition, a course in discrete mathematics covering propositional logic, graphs, and inductive proofs is valuable background.

If you need to review or learn some of these topics, there is a free on-line textbook Foundations of Computer Science, written by Al Aho and me, available at http://i.stanford.edu/~ullman/focs.html. Recommended chapters include 2 (Recursion and Induction), 3 (Running Time of Programs), 5 (Trees), 6 (Lists), 7 (Sets), 9 (Graphs), and 12 (Propositional Logic). You will also find introductions to finite automata, regular expressions, and context-free grammars in Chapters 10 and 11. Reading Chapter 10 would be good preparation for the first week of the course.

The course includes two programming exercises for which a knowledge of Java is required. However, these exercises are optional. You will receive automated feedback, but the results will not be recorded or used to grade the course. So if you are not familiar with Java, you can still take the course without concern for prerequisites.

All of “Foundations of Computer Science” is worth reading but for this course:

Chapter 2 Iteration, Induction, and Recursion
Chapter 3 The Running Time of Programs
Chapter 5 The Tree Data Model
Chapter 6 The List Data Model
Chapter 7 The Set Data Model
Chapter 9 The Graph Data Model
Chapter 10 Patterns, Automata, and Regular Expressions
Chapter 11 Recursive Description of Patterns
Chapter 12 Propositional Logic

Six very intensive weeks but on the bright side, you will be done before the holiday season. 😉

frak

Saturday, August 17th, 2013

frak

From the webpage:

frak transforms collections of strings into regular expressions for matching those strings. The primary goal of this library is to generate regular expressions from a known set of inputs which avoid backtracking as much as possible.
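frak itself is a Clojure library. The Python sketch below only illustrates the general idea of turning a known set of strings into a regex by way of a trie, so that shared prefixes are factored out. It is not frak's implementation, and every name in it is hypothetical.

```python
import re

def build_trie(words):
    """Nested dicts keyed by character; the "" key marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Emit a regex for a trie node, one alternative per outgoing character."""
    ends_here = "" in node
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # If a word also ends at this node, everything after it is optional.
    return body + "?" if ends_here else body

def words_to_regex(words):
    return trie_to_regex(build_trie(words))

pattern = words_to_regex(["foo", "bar", "baz", "food"])
print(pattern)
```

Because each alternation branches on a distinct first character, the engine can commit to a branch after looking at one character, which is the sort of backtracking avoidance the frak description mentions.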

This looks quite useful for text mining.

A large amount of which is on the near horizon.

I first saw this in Nat Torkington’s Four short links: 16 August 2013.

libsregex

Tuesday, June 4th, 2013

libsregex by Yichun Zhang.

From the homepage:

libsregex – A non-backtracking regex engine library for large data streams

And see:

Streaming regex matching and substitution by the sregex library by Yichun Zhang.

This looks quite good!

I first saw this at Nat Torkington’s Four short links: 4 June 2013.

Debuggex [Emacs Alternative, Others?]

Saturday, April 13th, 2013

Debuggex: A visual regex helper

Regexes (regular expressions) are a mainstay of data mining/extraction.

Debuggex is a regex debugger with visual cues to help you with writing/debugging regular expressions.

The webpage reports full JS regexes are not yet supported.

If you need a fuller alternative, consider debugging regular expressions in Emacs.

M-x regexp-builder

which shows matches as you type.

Be aware that regex languages vary (no real surprise).

One helpful resource: Regular Expression Flavor Comparison

Working with Pig

Saturday, February 16th, 2013

Working with Pig by Dan Morrill. (video)

From the description:

Pig is a SQL like command language for use with Hadoop, we review a simple PIG script line by line to help you understand how pig works, and regular expressions to help parse data. If you want a copy of the slide presentation – they are over on slide share http://www.slideshare.net/rmorrill.

Very good intro to PIG!

Mentions a couple of resources you need to bookmark:

Input Validation Cheat Sheet (The Open Web Application Security Project – OWASP) – regexes to re-use in Pig scripts. Lots of other regex cheat sheet pointers. (Being mindful that “\” must be escaped in PIG.)

Regular-Expressions.info – A more general resource on regexes.

I first saw this at: This Quick Pig Overview Brings You Up to Speed Line by Line.

C++11 regex cheatsheet

Sunday, July 22nd, 2012

C++11 regex cheatsheet

A one page C++11 regex cheatsheet that you may find useful.

Curious though, how useful do you find colors on cheatsheets?

Or are there cheatsheets where you find colors useful and others not?

If so, what seems to be the difference?

Not an entirely idle query. I want to author a cheatsheet or two, but want them to be useful to others.

At one level, I see cheatsheets as being extremely minimalistic, no commentary, just short reminders of the correct syntax.

A step up from that level, perhaps for rarely used commands, a bit more than bare syntax.

Suggestions? Pointers to cheatsheets you have found useful?

ack

Tuesday, October 18th, 2011

ack

From the webpage:

ack is a tool like grep, designed for programmers with large trees of heterogeneous source code.

ack is written purely in Perl, and takes advantage of the power of Perl’s regular expressions.

It is said to be “pure Perl” so Robert shouldn’t have any problems running it on Windows. 😉

Seriously, the more I think about something Lars Marius said to me years ago, about it all being about string matching, the more that rings true.

Granting that we attach semantics to the results of that string matching but insofar as our machines are concerned, it’s just strings. We may have defined complex processing for strings, but they remain, so long as they are not viewed by us, simply strings.

(What I remember of conversations and remarks is always subject to correction by others who were present. I am sure their memories are better than mine.)