Archive for the ‘Parsing’ Category

Parsing with Pictures

Wednesday, October 17th, 2012

Parsing with Pictures by Keshav Pingali and Gianfranco Bilardi. (PDF file)

From an email that Keshav sent to the compilers@iecc.com email list:

Gianfranco Bilardi and I have developed a new approach to parsing context-free languages that we call “Parsing with pictures”. It provides an alternative (and, we believe, easier to understand) approach to context-free language parsing than the standard presentations using derivations or pushdown automata. It also unifies Earley, SLL, LL, SLR, and LR parsers among others.

Parsing problems are formulated as path problems in a graph called the grammar flow graph (GFG) that is easily constructed from a given grammar. Intuitively, the GFG is to context-free grammars what NFAs are to regular languages. Among other things, the paper has :

(i) an elementary derivation of Earley’s algorithm for parsing general context-free grammars, showing that it is an easy generalization of the well-known reachability-based NFA simulation algorithm,

(ii) a presentation of look-ahead that is independent of particular parsing strategies, and is based on a simple inter-procedural dataflow analysis,

(iii) GFG structural characterizations of LL and LR grammars that are simpler to understand than the standard definitions, and bring out a symmetry between these grammar classes,

(iv) derivations of recursive-descent and shift-reduce parsers for LL and LR grammars by optimizing the Earley parser to exploit this structure, and

(v) a connection between GFGs and NFAs for regular grammars based on the continuation-passing style (CPS) optimization.

Or if you prefer the more formal abstract:

The development of elegant and practical algorithms for parsing context-free languages is one of the major accomplishments of 20th century Computer Science. These algorithms are presented in the literature using string rewriting systems or abstract machines like pushdown automata, but the resulting descriptions are unsatisfactory for several reasons. First, even a basic understanding of parsing algorithms for some grammar classes such as LR(k) grammars requires mastering a formidable number of difficult concepts and terminology. Second, parsing algorithms for different grammar classes are often presented using entirely different formalisms, so the relationships between these grammar classes are obscured. Finally, these algorithms seem unrelated to algorithms for regular language recognition even though regular languages are a subset of context-free languages.

In this paper, we show that these problems are avoided if parsing is reformulated as the problem of finding certain kinds of paths in a graph called the Grammar Flow Graph (GFG) that is easily constructed from a context-free grammar. Intuitively, GFG’s permit parsing problems for context-free grammars to be formulated as path problems in graphs in the same way that non-deterministic finite-state automata do for regular grammars. We show that the GFG enables a unified treatment of Earley’s parser for general context-free grammars, recursive-descent parsers for LL(k) and SLL(k) grammars, and shift-reduce parsers for LR(k) and SLR(k) grammars. Computation of look-ahead sets becomes a simple interprocedural dataflow analysis. These results suggest that the GFG can be a new foundation for the study of context-free languages.

Odd as it may sound, some people want to be understood.

If you think being understood isn’t all that weird, do a slow read on this paper and provide feedback to the authors.

Importing public data with SAS instructions into R

Wednesday, July 11th, 2012

Importing public data with SAS instructions into R by David Smith.

From the post:

Many public agencies release data in a fixed-format ASCII (FWF) format. But with the data all packed together without separators, you need a “data dictionary” defining the column widths (and metadata about the variables) to make sense of them. Unfortunately, many agencies make such information available only as a SAS script, with the column information embedded in a PROC IMPORT statement.

David reports on the SAScii package from Anthony Damico.

You still have to parse the files but it gets you one step closer to having useful information.

SP-Sem-MRL 2012

Wednesday, December 7th, 2011

ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages (SP-Sem-MRL 2012)

Important dates:

Submission deadline: March 31, 2012 (PDT, GMT-8)
Notification to authors: April 21, 2012
Camera ready copy: May 5, 2012
Workshop: TBD, during the ACL 2012 workshop period (July 12-14, 2012)

From the website:

Morphologically Rich Languages (MRLs) are languages in which grammatical relations such as Subject, Predicate, Object, etc., are indicated morphologically (e.g. through inflection) instead of positionally (as in, e.g. English), and the position of words and phrases in the sentence may vary substantially. The tight connection between the morphology of words and the grammatical relations between them, and the looser connection between the position and grouping of words to their syntactic roles, pose serious challenges for syntactic and semantic processing. Furthermore, since grammatical relations provide the interface to compositional semantics, morpho-syntactic phenomena may significantly complicate processing the syntax–semantics interface. In statistical parsing, which has been a cornerstone of research in NLP and had seen great advances due to the widespread availability of syntactically annotated corpora, English parsing performance has reached a high plateau in certain genres, which is however not always indicative of parsing performance in MRLs, dependency-based and constituency-based alike . Semantic processing of natural language has similarly seen much progress in recent years. However, as in parsing, the bulk of the work has concentrated on English, and MRLs may present processing challenges that the community is as of yet unaware of, and which current semantic processing technologies may have difficulty coping with. These challenges may lurk in areas where parses may be used as input, such as semantic role labeling, distributional semantics, paraphrasing and textual entailments, or where inadequate pre-processing of morphological variation hurts parsing and semantic tasks alike.

This joint workshop aims to build upon the first and second SPMRL workshops (at NAACL-HLT 2010 and IWPT 2011, respectively) while extending the overall scope to include semantic processing where MRLs pose challenges for algorithms or models initially designed to process English. In particular, we seek to explore the use of newly available syntactically and/or semantically annotated corpora, or data sets for semantic evaluation that can contribute to our understanding of the difficulty that such phenomena pose. One goal of this workshop is to encourage cross-fertilization among researchers working on different languages and among those working on different levels of processing. Of particular interest is work addressing the lexical sparseness and out-of-vocabulary (OOV) issues that occur in both syntactic and semantic processing.

The exploration of non-English languages will replicate many of the outstanding entity recognition/data integration problems experienced in English. Considering that there are massive economic markets that speak non-English languages, the first to make progress on such issues will have a commercial advantage. How much of one I suspect depends on how well your software works in a non-English language.