Archive for the ‘Parsers’ Category

ANTLR Parser Generator (4.7)

Wednesday, June 28th, 2017

ANTLR Parser Generator

From the about page:

ANTLR is a powerful parser generator that you can use to read, process, execute, or translate structured text or binary files. It’s widely used in academia and industry to build all sorts of languages, tools, and frameworks. Twitter search uses ANTLR for query parsing, with over 2 billion queries a day. The languages for Hive and Pig, the data warehouse and analysis systems for Hadoop, both use ANTLR. Lex Machina uses ANTLR for information extraction from legal texts. Oracle uses ANTLR within SQL Developer IDE and their migration tools. NetBeans IDE parses C++ with ANTLR. The HQL language in the Hibernate object-relational mapping framework is built with ANTLR.

Aside from these big-name, high-profile projects, you can build all sorts of useful tools like configuration file readers, legacy code converters, wiki markup renderers, and JSON parsers. I’ve built little tools for object-relational database mappings, describing 3D visualizations, injecting profiling code into Java source code, and have even done a simple DNA pattern matching example for a lecture.

From a formal language description called a grammar, ANTLR generates a parser for that language that can automatically build parse trees, which are data structures representing how a grammar matches the input. ANTLR also automatically generates tree walkers that you can use to visit the nodes of those trees to execute application-specific code.
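The parse-tree-plus-walker pattern described above can be sketched in a few lines of plain Python. This is not ANTLR's generated code or its runtime API, just a hand-rolled illustration of the enter/exit listener idea that ANTLR automates for you:

```python
# A hand-rolled sketch of the parse-tree + walker idea ANTLR automates.
# The Node class and listener protocol are illustrative stand-ins, not ANTLR's API.

class Node:
    def __init__(self, rule, children=(), text=None):
        self.rule, self.children, self.text = rule, list(children), text

def walk(node, listener):
    """Depth-first walk, firing enter/exit callbacks like ANTLR's tree walker."""
    listener.enter(node)
    for child in node.children:
        walk(child, listener)
    listener.exit(node)

class CollectTokens:
    """A listener that gathers the leaf tokens in order."""
    def __init__(self):
        self.tokens = []
    def enter(self, node):
        if node.text is not None:
            self.tokens.append(node.text)
    def exit(self, node):
        pass

# A parse tree for the input "1+2" under a toy expression grammar.
tree = Node("expr", [Node("term", text="1"), Node("op", text="+"), Node("term", text="2")])
listener = CollectTokens()
walk(tree, listener)
print(listener.tokens)  # ['1', '+', '2']
```

With ANTLR, the tree and the walker come for free from the grammar; only the listener's application-specific code is yours to write.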

There are thousands of ANTLR downloads a month and it is included on all Linux and OS X distributions. ANTLR is widely used because it’s easy to understand, powerful, flexible, generates human-readable output, comes with complete source under the BSD license, and is actively supported.
… (emphasis in original)

A friend wants to explore the OpenOffice schema by visualizing a parse from the Multi-Schema Validator.

ANTLR is probably more firepower than needed but the extra power may encourage creative thinking. Maybe.


Parsing JSON is a Minefield

Wednesday, October 26th, 2016

Parsing JSON is a Minefield by Nicolas Seriot.


JSON is the de facto standard when it comes to (un)serialising and exchanging data in web and mobile programming. But how well do you really know JSON? We’ll read the specifications and write test cases together. We’ll test common JSON libraries against our test cases. I’ll show that JSON is not the easy, idealised format as many do believe. Indeed, I did not find two libraries that exhibit the very same behaviour. Moreover, I found that edge cases and maliciously crafted payloads can cause bugs, crashes and denial of services, mainly because JSON libraries rely on specifications that have evolved over time and that let many details loosely specified or not specified at all.
(emphasis in original)
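You can reproduce a few of the minefield's edge cases with nothing but Python's standard `json` module. Other libraries answer these same questions differently, which is exactly the article's point:

```python
import json
import math

# Duplicate keys: RFC 8259 only says names "SHOULD" be unique.
# Python silently keeps the last one; other parsers keep the first, or error.
assert json.loads('{"a": 1, "a": 2}') == {"a": 2}

# NaN and Infinity are not legal JSON, yet Python accepts them by default.
assert math.isnan(json.loads('NaN'))
assert json.loads('Infinity') == float('inf')

# A trailing comma, on the other hand, is rejected.
try:
    json.loads('[1,]')
except json.JSONDecodeError:
    print("trailing comma rejected")
else:
    raise AssertionError("expected a parse error")
```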

Or the summary (tweet) that caught my attention:

I published: Parsing JSON is a Minefield … in which I could not find two parsers that exhibited the same behaviour

Or consider this graphic, which in truth needs a larger format than even the original:


Don’t worry, you can’t read the original at its default resolution. I had to enlarge the view several times to get a legible display.

More suitable for a poster-sized print.


Perhaps something to consider for Balisage 2017 as swag?

Excellent work and a warning against the current vogue of half-ass standardization in some circles.

“We know what we meant” is a sure sign of poor standards work.

Open Sourcing Duckling, our probabilistic (date) parser [Clojure]

Friday, October 3rd, 2014

Open Sourcing Duckling, our probabilistic (date) parser

From the post:

We’ve previously discussed ambiguity in natural language. What’s really fascinating is that even the simplest, seemingly most structured parts of natural language, like the way we humans describe dates and times, are actually so difficult to turn into structured data.

The wild world of temporal expressions in human language

All the following expressions describe the same point in time (at least in some contexts):

  • “December 30th, at 3 in the afternoon”
  • “The day before New Year’s Eve at 3pm”
  • “At 1500 three weeks from now”
  • “The last Tuesday of December at 3pm”

But wait… is it really equivalent to say 3pm and 1500? In the latter case, it seems that speaker meant to be more precise. Is it OK to drop this information?

And what about “next Tuesday”? If today is Monday, is that tomorrow or in 8 days? When I say “last month”, is it the last full month or the last 30 days?

A last example: “one month” looks like a well defined duration. That is, until you try to normalize durations in seconds, and you realize different months have anywhere between 28 and 31 days! Even “one day” is difficult. Yes, a day can last between 23 and 25 hours, because of daylight savings. Oh, and did I mention that at midnight at the end of 1927 in Shanghai, the clocks went back 5 minutes and 52 seconds? So “1927-12-31 23:54:08” actually happened twice there.

There are hundreds of hard things like these, and the more you dig into this, believe me, the more you’ll encounter. But that’s out of the scope of this post.
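Two of the post's ambiguities can be demonstrated with Python's standard library alone (the dates below are illustrative, chosen to make the Monday case concrete):

```python
import calendar
import datetime

# Normalizing "one month" to seconds is ill-defined: month lengths vary.
assert calendar.monthrange(2017, 2)[1] == 28   # Feb 2017
assert calendar.monthrange(2016, 2)[1] == 29   # Feb 2016, a leap year
assert calendar.monthrange(2017, 1)[1] == 31   # Jan 2017

# "Next Tuesday" said on a Monday has two defensible readings.
today = datetime.date(2014, 10, 6)             # a Monday; Tuesday is weekday 1
strict = today + datetime.timedelta(days=(1 - today.weekday()) % 7)
next_week = today + datetime.timedelta(days=((1 - today.weekday()) % 7) + 7)
print(strict, next_week)  # 2014-10-07 vs 2014-10-14: tomorrow, or in 8 days
```

A probabilistic parser like Duckling can return both readings with weights instead of silently picking one.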

An introduction to the vagaries of date statements in natural language, a probabilistic (date) parser in Clojure, and an opportunity to extend said parser to other data types.

Nice way to end the week!

Python-ZPar – Python Wrapper for ZPAR

Monday, September 8th, 2014

Python-ZPar – Python Wrapper for ZPAR by Nitin Madnani.

From the webpage:

python-zpar is a python wrapper around the ZPar parser. ZPar was written by Yue Zhang while he was at Oxford University. According to its home page: ZPar is a statistical natural language parser, which performs syntactic analysis tasks including word segmentation, part-of-speech tagging and parsing. ZPar supports multiple languages and multiple grammar formalisms. ZPar has been most heavily developed for Chinese and English, while it provides generic support for other languages. ZPar is fast, processing above 50 sentences per second using the standard Penn Treebank (Wall Street Journal) data.

I wrote python-zpar since I needed a fast and efficient parser for my NLP work which is primarily done in Python and not C++. I wanted to be able to use this parser directly from Python without having to create a bunch of files and run them through subprocesses. python-zpar not only provides a simple python wrapper but also provides an XML-RPC ZPar server to make batch-processing of large files easier.

python-zpar uses ctypes, a very cool foreign function library bundled with Python that allows calling functions in C DLLs or shared libraries directly.
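The ctypes mechanism python-zpar relies on is easy to see in miniature. The sketch below calls libc's `strlen` instead of the ZPar shared library, since the pattern (load library, declare argument and return types, call) is the same:

```python
import ctypes

# Load the symbols already visible in the running process; on Linux/macOS
# this includes libc, so no library path is needed for the demo.
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"segmentation"))  # 12
```

python-zpar does the same thing against `zpar.so`, wrapping the raw calls in Pythonic methods.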

Just in case you are looking for a language parser for Chinese or English.

It is only a matter of time before commercial opportunities are going to force greater attention on non-English languages. Forewarned is forearmed.

Applicative Parser [Clojure]

Sunday, June 15th, 2014

Applicative Parser by Jim Duey.

From the post:

In the next couple of posts, I’m going to show how to build a parser library based on the Free Applicative Functor and what you can do with it. To follow along, clone (or update) this repo. Then ‘lein repl’ and you should be able to copy the code to see what it does.

This post runs long (again). The code is really not that long and it’s mostly very small functions. But the amount of details hidden by the abstractions takes a lot of prose to explain. Which is actually one of the very real benefits of using these abstractions. They let you implement an enormous amount of functionality in very few lines of code with fewer bugs.

If you want to see what the point of all this prose is, skip to the “Other Interpretations” section at the bottom and then come back to see how it was done.
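For a taste of the underlying idea before diving into Jim's Clojure, here is a bare-bones applicative parser in Python (a stand-in sketch, not the post's free-applicative construction): a parser is a function from input to a list of (value, remaining input) pairs, and sequencing falls out of `pure` and `ap` alone.

```python
# Minimal applicative parser combinators.
# A parser: input string -> list of (value, remaining-input) pairs.

def pure(x):
    """Succeed without consuming input (Applicative's pure)."""
    return lambda s: [(x, s)]

def char(c):
    """Match one literal character."""
    return lambda s: [(c, s[1:])] if s[:1] == c else []

def ap(pf, px):
    """Applicative <*>: run a function-valued parser, then an argument parser."""
    def run(s):
        return [(f(x), rest2)
                for f, rest1 in pf(s)
                for x, rest2 in px(rest1)]
    return run

def fmap(f, p):
    """Map a function over a parser's result."""
    return ap(pure(f), p)

# Sequence two characters into a pair, purely via ap.
pair = ap(fmap(lambda a: lambda b: (a, b), char('a')), char('b'))
print(pair("abc"))  # [(('a', 'b'), 'c')]
print(pair("xyz"))  # []
```

The free applicative in the post goes further: it keeps this structure as data, so the same parser description can be run, pretty-printed, or analyzed without being executed.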

Warning: Heavy sledding!

You may want to start with: The basics of applicative functors, put to practical work [Haskell] by Bryan O’Sullivan, which parses “[an] application/x-www-form-urlencoded string.”

On the other hand, if you want the full explanation, consider Applicative Programming with Effects by Conor McBride and Ross Paterson, in Journal of Functional Programming 18:1 (2008), pages 1-13.


In this paper, we introduce Applicative functors–an abstract characterisation of an applicative style of effectful programming, weaker than Monads and hence more widespread. Indeed, it is the ubiquity of this programming pattern that drew us to the abstraction. We retrace our steps in this paper, introducing the applicative pattern by diverse examples, then abstracting it to define the Applicative type class and introducing a bracket notation which interprets the normal application syntax in the idiom of an Applicative functor. Further, we develop the properties of applicative functors and the generic operations they support. We close by identifying the categorical structure of applicative functors and examining their relationship both with Monads and with Arrows.

The page for McBride and Paterson’s paper points to later resources as well.


Puck, a high-speed GPU-powered parser

Monday, June 9th, 2014

Puck, a high-speed GPU-powered parser by David Hall.

From the post:

I’m pleased to announce yet another new parser called Puck. Puck is a lightning-fast parser that uses Nvidia GPUs to do most of its computation. It is capable of parsing over 400 sentences per second, or about half a million words per minute. (Most CPU constituency parsers of its quality are on the order of 10 sentences per second.)

Puck is based on the same grammars used in the Berkeley Parser, and produces nearly identical trees. Puck is only available for English right now.

For more information about Puck, please see the project github page or the accompanying paper.

Because some of its dependencies are not formally released yet (namely the wonderful JavaCL library), I can’t push artifacts to Maven Central. Instead I’ve uploaded a fat assembly jar. (See the readme on github for how to use it.) It’s better used as a command line tool, anyway.

Even more motivation for learning to use GPUs!

I first saw this in a tweet by Jason Baldridge.

A multi-Teraflop Constituency Parser using GPUs

Sunday, November 3rd, 2013

A multi-Teraflop Constituency Parser using GPUs by John Canny, David Hall and Dan Klein.


Constituency parsing with rich grammars remains a computational challenge. Graphics Processing Units (GPUs) have previously been used to accelerate CKY chart evaluation, but gains over CPU parsers were modest. In this paper, we describe a collection of new techniques that enable chart evaluation at close to the GPU’s practical maximum speed (a Teraflop), or around a half-trillion rule evaluations per second. Net parser performance on a 4-GPU system is over 1 thousand length-30 sentences/second (1 trillion rules/sec), and 400 general sentences/second for the Berkeley Parser Grammar. The techniques we introduce include grammar compilation, recursive symbol blocking, and cache-sharing.
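The CKY chart evaluation the abstract accelerates is, in its simplest serial form, a short dynamic program. The sketch below is a toy recognizer (the two-word grammar is made up for illustration); the paper's contribution is making the inner rule loop run at GPU speed, not changing this structure:

```python
from itertools import product

def cky_recognize(words, lexicon, rules, start="S"):
    """CKY recognition for a grammar in Chomsky normal form.
    lexicon: word -> set of nonterminals; rules: (B, C) -> set of A for A -> B C."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                 # fill the diagonal with lexical entries
        chart[i][i + 1] = set(lexicon.get(w, ()))
    for span in range(2, n + 1):                  # grow spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # try every split point
                for B, C in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= rules.get((B, C), set())
    return start in chart[0][n]

lexicon = {"dogs": {"NP"}, "bark": {"VP"}}
rules = {("NP", "VP"): {"S"}}                     # S -> NP VP
print(cky_recognize(["dogs", "bark"], lexicon, rules))  # True
```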

Just in case you are interested in parsing “unstructured” data, mostly what they also call “texts.”

I first saw the link: BIDParse: GPU-accelerated natural language parser. Then I started looking for the paper. 😉

Solr and available query parsers

Tuesday, August 20th, 2013

Solr and available query parsers

From the post:

Every now and then a question appears on the mailing list – what type of query parsers are available in Solr. So we decided to make such a list with a short description of each of the query parsers available. If you are interested to see what Solr has to offer, please read the rest of this post.

I count eighteen (18) query parsers available for Solr.

If you can’t name each one and give a brief description of its use, you need to read this post.

Parsing with Pictures

Wednesday, October 17th, 2012

Parsing with Pictures by Keshav Pingali and Gianfranco Bilardi. (PDF file)

From an email that Keshav sent to the email list:

Gianfranco Bilardi and I have developed a new approach to parsing context-free languages that we call “Parsing with pictures”. It provides an alternative (and, we believe, easier to understand) approach to context-free language parsing than the standard presentations using derivations or pushdown automata. It also unifies Earley, SLL, LL, SLR, and LR parsers among others.

Parsing problems are formulated as path problems in a graph called the grammar flow graph (GFG) that is easily constructed from a given grammar. Intuitively, the GFG is to context-free grammars what NFAs are to regular languages. Among other things, the paper has :

(i) an elementary derivation of Earley’s algorithm for parsing general context-free grammars, showing that it is an easy generalization of the well-known reachability-based NFA simulation algorithm,

(ii) a presentation of look-ahead that is independent of particular parsing strategies, and is based on a simple inter-procedural dataflow analysis,

(iii) GFG structural characterizations of LL and LR grammars that are simpler to understand than the standard definitions, and bring out a symmetry between these grammar classes,

(iv) derivations of recursive-descent and shift-reduce parsers for LL and LR grammars by optimizing the Earley parser to exploit this structure, and

(v) a connection between GFGs and NFAs for regular grammars based on the continuation-passing style (CPS) optimization.

Or if you prefer the more formal abstract:

The development of elegant and practical algorithms for parsing context-free languages is one of the major accomplishments of 20th century Computer Science. These algorithms are presented in the literature using string rewriting systems or abstract machines like pushdown automata, but the resulting descriptions are unsatisfactory for several reasons. First, even a basic understanding of parsing algorithms for some grammar classes such as LR(k) grammars requires mastering a formidable number of difficult concepts and terminology. Second, parsing algorithms for different grammar classes are often presented using entirely different formalisms, so the relationships between these grammar classes are obscured. Finally, these algorithms seem unrelated to algorithms for regular language recognition even though regular languages are a subset of context-free languages.

In this paper, we show that these problems are avoided if parsing is reformulated as the problem of finding certain kinds of paths in a graph called the Grammar Flow Graph (GFG) that is easily constructed from a context-free grammar. Intuitively, GFG’s permit parsing problems for context-free grammars to be formulated as path problems in graphs in the same way that non-deterministic finite-state automata do for regular grammars. We show that the GFG enables a unified treatment of Earley’s parser for general context-free grammars, recursive-descent parsers for LL(k) and SLL(k) grammars, and shift-reduce parsers for LR(k) and SLR(k) grammars. Computation of look-ahead sets becomes a simple interprocedural dataflow analysis. These results suggest that the GFG can be a new foundation for the study of context-free languages.
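The reachability-based NFA simulation that the paper's Earley derivation generalizes is itself short enough to show whole. A minimal version (the transition table and the toy automaton for the language a*b are illustrative):

```python
# NFA simulation by reachability: track the set of states reachable on the
# input so far. delta maps (state, symbol) -> set of next states; the symbol
# None marks an epsilon move.

def simulate(delta, start, accept, text):
    def eclose(states):
        """Epsilon-closure: add everything reachable by epsilon moves."""
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for r in delta.get((q, None), ()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    current = eclose({start})
    for ch in text:
        step = set()
        for q in current:
            step |= delta.get((q, ch), set())
        current = eclose(step)
    return bool(current & accept)

delta = {(0, 'a'): {0}, (0, 'b'): {1}}   # an NFA for a*b
print(simulate(delta, 0, {1}, "aab"))    # True
print(simulate(delta, 0, {1}, "aba"))    # False
```

Earley's algorithm, on the GFG view, runs the same reachability idea over the grammar flow graph, with extra bookkeeping for the call/return structure that context-free rules introduce.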

Odd as it may sound, some people want to be understood.

If you think being understood isn’t all that weird, do a slow read on this paper and provide feedback to the authors.

GraphChi parsers toolkit

Tuesday, September 11th, 2012

GraphChi parsers toolkit by Danny Bickson.

From the post:

At the request of Baoqiang Cao I have started a parsers toolkit in GraphChi for preparing data to be used in GraphLab/GraphChi. The parsers should be used as templates which can be easily customized to user-specific needs.

Danny starts us off with an LDA parser (with worked example of its use) and then adds a Twitter parser that creates a graph of retweets.