Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 28, 2017

ANTLR Parser Generator (4.7)

Filed under: ANTLR,Parsers,Parsing — Patrick Durusau @ 8:38 pm

ANTLR Parser Generator

From the about page:

ANTLR is a powerful parser generator that you can use to read, process, execute, or translate structured text or binary files. It’s widely used in academia and industry to build all sorts of languages, tools, and frameworks. Twitter search uses ANTLR for query parsing, with over 2 billion queries a day. The languages for Hive and Pig, the data warehouse and analysis systems for Hadoop, both use ANTLR. Lex Machina uses ANTLR for information extraction from legal texts. Oracle uses ANTLR within SQL Developer IDE and their migration tools. NetBeans IDE parses C++ with ANTLR. The HQL language in the Hibernate object-relational mapping framework is built with ANTLR.

Aside from these big-name, high-profile projects, you can build all sorts of useful tools like configuration file readers, legacy code converters, wiki markup renderers, and JSON parsers. I’ve built little tools for object-relational database mappings, describing 3D visualizations, injecting profiling code into Java source code, and have even done a simple DNA pattern matching example for a lecture.

From a formal language description called a grammar, ANTLR generates a parser for that language that can automatically build parse trees, which are data structures representing how a grammar matches the input. ANTLR also automatically generates tree walkers that you can use to visit the nodes of those trees to execute application-specific code.

There are thousands of ANTLR downloads a month and it is included on all Linux and OS X distributions. ANTLR is widely used because it’s easy to understand, powerful, flexible, generates human-readable output, comes with complete source under the BSD license, and is actively supported.
… (emphasis in original)
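
If you want a feel for the workflow described above (grammar in, generated parser and tree walkers out), here is a rough sketch of driving an ANTLR-generated parser from Python. The Hello grammar, its greeting rule, and the generated class names are hypothetical; the sketch assumes you have already run antlr4 -Dlanguage=Python3 Hello.g4 and installed the antlr4-python3-runtime package.

```python
# Sketch only: assumes ANTLR has generated HelloLexer/HelloParser/HelloListener
# from a hypothetical Hello.g4 grammar via `antlr4 -Dlanguage=Python3 Hello.g4`.
from antlr4 import InputStream, CommonTokenStream, ParseTreeWalker

from HelloLexer import HelloLexer
from HelloParser import HelloParser
from HelloListener import HelloListener


class GreetingPrinter(HelloListener):
    # Called by the walker when a `greeting` rule node is exited.
    def exitGreeting(self, ctx):
        print("greeted:", ctx.ID().getText())


def parse(text):
    lexer = HelloLexer(InputStream(text))           # characters -> tokens
    parser = HelloParser(CommonTokenStream(lexer))  # tokens -> parse tree
    tree = parser.greeting()                        # start rule of the grammar
    ParseTreeWalker().walk(GreetingPrinter(), tree)


if __name__ == "__main__":
    parse("hello world")
```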

A friend wants to explore the OpenOffice schema by visualizing a parse from the Multi-Schema Validator.

ANTLR is probably more firepower than needed but the extra power may encourage creative thinking. Maybe.

Enjoy!

October 26, 2016

Parsing JSON is a Minefield

Filed under: JSON,Parsers,Parsing — Patrick Durusau @ 8:45 pm

Parsing JSON is a Minefield by Nicolas Seriot.

Description:

JSON is the de facto standard when it comes to (un)serialising and exchanging data in web and mobile programming. But how well do you really know JSON? We’ll read the specifications and write test cases together. We’ll test common JSON libraries against our test cases. I’ll show that JSON is not the easy, idealised format as many do believe. Indeed, I did not find two libraries that exhibit the very same behaviour. Moreover, I found that edge cases and maliciously crafted payloads can cause bugs, crashes and denial of services, mainly because JSON libraries rely on specifications that have evolved over time and that let many details loosely specified or not specified at all.
(emphasis in original)

Or the summary (tweet) that caught my attention:

I published: Parsing JSON is a Minefield  http://seriot.ch/parsing_json.php … in which I could not find two parsers that exhibited the same behaviour

Or consider this graphic, which in truth needs a larger format than even the original:

[Image: grid of JSON parser test results from the article]

Don’t worry, you can’t read the original at its default resolution. I had to enlarge the view several times to get a legible display.

More suitable for a poster-sized print.

Perhaps something to consider for Balisage 2017 as swag?

Excellent work and a warning against the current vogue of half-ass standardization in some circles.

“We know what we meant” is a sure sign of poor standards work.
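
To poke at the minefield yourself, a few of the kinds of edge cases the article probes can be run through Python’s standard json module in a couple of lines. This is a sketch only; other parsers accept, reject, or silently reinterpret several of these differently, which is exactly Seriot’s point.

```python
import json

# A few edge cases of the sort the article tests. Python's json module is
# deliberately lenient in places the RFC leaves open or forbids.
cases = [
    'NaN',                 # not legal JSON, but accepted by default
    '[1e400]',             # overflows to Infinity rather than erroring
    '{"a": 1, "a": 2}',    # duplicate keys: last value silently wins
    '"\\ud800"',           # lone surrogate escape is decoded, not rejected
    '[1,]',                # trailing comma: rejected
]

for text in cases:
    try:
        print(repr(text), '->', repr(json.loads(text)))
    except ValueError as exc:   # json.JSONDecodeError subclasses ValueError
        print(repr(text), '-> rejected:', exc)
```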

July 28, 2016

greek-accentuation 1.0.0 Released

Filed under: Greek,Language,Parsing,Python — Patrick Durusau @ 4:32 pm

greek-accentuation 1.0.0 Released by James Tauber.

From the post:

greek-accentuation has finally hit 1.0.0 with a couple more functions and a module layout change.

The library (which I’ve previously written about here) has been sitting on 0.9.9 for a while and I’ve been using it successfully in my inflectional morphology work for 18 months. There were, however, a couple of functions that lived in the inflectional morphology repos that really belonged in greek-accentuation. They have now been moved there.

If that sounds a tad obscure, some additional explanation from an earlier post by James:

It [greek-accentuation] consists of three modules:

  • characters
  • syllabify
  • accentuation

The characters module provides basic analysis and manipulation of Greek characters in terms of their Unicode diacritics as if decomposed. So you can use it to add, remove or test for breathing, accents, iota subscript or length diacritics.

The syllabify module provides basic analysis and manipulation of Greek syllables. It can syllabify words, give you the onset, nucleus, coda, rime or body of a syllable, judge syllable length or give you the accentuation class of a word.

The accentuation module uses the other two modules to accentuate Ancient Greek words. As well as listing possible_accentuations for a given unaccented word, it can produce recessive and (given another form with an accent) persistent accentuations.
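
A minimal sketch of how the pieces fit together, going by the module and function names in the post and the library’s README; treat the exact signatures and outputs as assumptions to check against the documentation.

```python
# Hedged sketch: module and function names follow the post and the library's
# README (syllabify, onset, nucleus, coda, recessive); verify signatures
# and outputs against the greek-accentuation docs before relying on them.
from greek_accentuation.syllabify import syllabify, onset, nucleus, coda
from greek_accentuation.accentuation import recessive

word = "λογος"                      # an unaccented Ancient Greek word
syllables = syllabify(word)
print(syllables)                    # e.g. a list like ['λο', 'γος']
print(onset(syllables[0]), nucleus(syllables[0]), coda(syllables[0]))
print(recessive(word))              # recessive accentuation of the word
```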

Another name from my past and a welcome reminder that not all of computer science is focused on recommending ephemera for our consumption.

November 22, 2015

Deep Learning and Parsing

Filed under: Deep Learning,Linguistics,Parsing — Patrick Durusau @ 2:03 pm

Jason Baldridge tweets that the work of James Henderson (Google Scholar) should get more cites for deep learning and parsing.

Jason points to the following two works (early 1990’s) in particular:

Description Based Parsing in a Connectionist Network by James B. Henderson.

Abstract:

Recent developments in connectionist architectures for symbolic computation have made it possible to investigate parsing in a connectionist network while still taking advantage of the large body of work on parsing in symbolic frameworks. This dissertation investigates syntactic parsing in the temporal synchrony variable binding model of symbolic computation in a connectionist network. This computational architecture solves the basic problem with previous connectionist architectures, while keeping their advantages. However, the architecture does have some limitations, which impose computational constraints on parsing in this architecture. This dissertation argues that, despite these constraints, the architecture is computationally adequate for syntactic parsing, and that these constraints make significant linguistic predictions. To make these arguments, the nature of the architecture’s limitations are first characterized as a set of constraints on symbolic computation. This allows the investigation of the feasibility and implications of parsing in the architecture to be investigated at the same level of abstraction as virtually all other investigations of syntactic parsing. Then a specific parsing model is developed and implemented in the architecture. The extensive use of partial descriptions of phrase structure trees is crucial to the ability of this model to recover the syntactic structure of sentences within the constraints. Finally, this parsing model is tested on those phenomena which are of particular concern given the constraints, and on an approximately unbiased sample of sentences to check for unforeseen difficulties. The results show that this connectionist architecture is powerful enough for syntactic parsing. They also show that some linguistic phenomena are predicted by the limitations of this architecture. In particular, explanations are given for many cases of unacceptable center embedding, and for several significant constraints on long distance dependencies. These results give evidence for the cognitive significance of this computational architecture and parsing model. This work also shows how the advantages of both connectionist and symbolic techniques can be unified in natural language processing applications. By analyzing how low level biological and computational considerations influence higher level processing, this work has furthered our understanding of the nature of language and how it can be efficiently and effectively processed.

Connectionist Syntactic Parsing Using Temporal Variable Binding by James Henderson.

Abstract:

Recent developments in connectionist architectures for symbolic computation have made it possible to investigate parsing in a connectionist network while still taking advantage of the large body of work on parsing in symbolic frameworks. The work discussed here investigates syntactic parsing in the temporal synchrony variable binding model of symbolic computation in a connectionist network. This computational architecture solves the basic problem with previous connectionist architectures, while keeping their advantages. However, the architecture does have some limitations, which impose constraints on parsing in this architecture. Despite these constraints, the architecture is computationally adequate for syntactic parsing. In addition, the constraints make some significant linguistic predictions. These arguments are made using a specific parsing model. The extensive use of partial descriptions of phrase structure trees is crucial to the ability of this model to recover the syntactic structure of sentences within the constraints imposed by the architecture.

Enjoy!

October 29, 2015

Parsing Academic Articles on Deadline

Filed under: Journalism,Natural Language Processing,News,Parsing — Patrick Durusau @ 8:10 pm

A group of researchers is trying to help science journalists parse academic articles on deadline by Joseph Lichterman.

From the post:

About 1.8 million new scientific papers are published each year, and most are of little consequence to the general public — or even read, really; one study estimates that up to half of all academic studies are only read by their authors, editors, and peer reviewers.

But the papers that are read can change our understanding of the universe — traces of water on Mars! — or impact our lives here on earth — sea levels rising! — and when journalists get called upon to cover these stories, they’re often thrown into complex topics without much background or understanding of the research that led to the breakthrough.

As a result, a group of researchers at Columbia and Stanford are in the process of developing Science Surveyor, a tool that algorithmically helps journalists get important context when reporting on scientific papers.

“The idea occurred to me that you could characterize the wealth of scientific literature around the topic of a new paper, and if you could do that in a way that showed the patterns in funding, or the temporal patterns of publishing in that field, or whether this new finding fit with the overall consensus with the field — or even if you could just generate images that show images very rapidly what the huge wealth, in millions of articles, in that field have shown — [journalists] could very quickly ask much better questions on deadline, and would be freed to see things in a larger context,” Columbia journalism professor Marguerite Holloway, who is leading the Science Surveyor effort, told me.

Science Surveyor is still being developed, but broadly the idea is that the tool takes the text of an academic paper and searches academic databases for other studies using similar terms. The algorithm will surface relevant articles and show how scientific thinking has changed through its use of language.

For example, look at the evolving research around neurogenesis, or the growth of new brain cells. Neurogenesis occurs primarily while babies are still in the womb, but it continues through adulthood in certain sections of the brain.

Up until a few decades ago, researchers generally thought that neurogenesis didn’t occur in humans — you had a set number of brain cells, and that’s it. But since then, research has shown that neurogenesis does in fact occur in humans.

“This tells you — aha! — this discovery is not an entirely new discovery,” Columbia professor Dennis Tenen, one of the researchers behind Science Surveyor, told me. “There was a period of activity in the ’70s, and now there is a second period of activity today. We hope to produce this interactive visualization, where given a paper on neurogenesis, you can kind of see other related papers on neurogenesis to give you the context for the story you’re telling.”

Given the number of papers published every year, an algorithmic approach like Science Surveyor is an absolute necessity.
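
The retrieval core of what is described, finding other studies that use similar terms, is plain term-based document similarity. A minimal sketch of that idea with scikit-learn follows; it is an illustration of the technique only, not Science Surveyor’s pipeline, and the titles are made up.

```python
# Illustration of term-based similarity search, not Science Surveyor's code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "adult neurogenesis in the hippocampus of mammals",
    "no new neurons: the fixed cell count hypothesis",
    "sea level rise projections under warming scenarios",
]
new_paper = "evidence for adult human neurogenesis in the dentate gyrus"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus)          # index the corpus
query_vector = vectorizer.transform([new_paper])        # vectorize the new paper

scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, title in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.2f}  {title}")
```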

But imagine how much richer the results would be if one of the three or four people who actually read the paper could easily link it to other research and context.

Or if a researcher who discovers the article could blaze a trail to non-obvious literature that is also relevant.

Search engines now capture the choices users make in the links they follow, but that’s a fairly crude approximation of a resource’s relevance: it never says why a particular resource is relevant.

Usage of literature should decide which articles merit greater attention from machine or human annotators. The vast majority of humanities literature is never cited by anyone. Why spend resources annotating content that no one is likely to read?

January 5, 2015

Shallow Discourse Parsing

Filed under: Discourse,Natural Language Processing,Parsing — Patrick Durusau @ 5:32 pm

Shallow Discourse Parsing

From the webpage:

A participant system is given a piece of newswire text as input and returns discourse relations in the form of a discourse connective (explicit or implicit) taking two arguments (which can be clauses, sentences, or multi-sentence segments). Specifically, the participant system needs to i) locate both explicit (e.g., “because”, “however”, “and”) and implicit discourse connectives (often signaled by periods) in the text, ii) identify the spans of text that serve as the two arguments for each discourse connective, and iii) predict the sense of the discourse connectives (e.g., “Cause”, “Condition”, “Contrast”). Understanding such discourse relations is clearly an important part of natural language understanding that benefits a wide range of natural language applications.
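
As a toy illustration of subtask (i), spotting explicit connectives is little more than span matching against a connective inventory; the hard parts of the task are the implicit relations, the argument spans, and the senses. The sketch below is an illustration, not a shared-task system.

```python
import re

# Toy illustration of subtask (i): locating explicit connectives.
# Real systems also handle implicit relations, argument spans, and senses.
CONNECTIVES = ["because", "however", "although", "and", "but", "as a result"]

def find_explicit_connectives(text):
    pattern = r"\b(" + "|".join(re.escape(c) for c in CONNECTIVES) + r")\b"
    return [(m.start(), m.end(), m.group(1))
            for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

sentence = "Profits fell because demand was weak. However, costs also dropped."
print(find_explicit_connectives(sentence))
# [(13, 20, 'because'), (38, 45, 'However')]
```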

Important Dates

  • January 26, 2015: Registration begins; release of training set and scorer.
  • March 1, 2015: Registration deadline.
  • April 20, 2015: Test set available.
  • April 24, 2015: Systems collected.
  • May 1, 2015: System results due to participants.
  • May 8, 2015: System papers due.
  • May 18, 2015: Reviews due.
  • May 21, 2015: Notification of acceptance.
  • May 28, 2015: Camera-ready version of system papers due.
  • July 30-31, 2015: CoNLL conference (Beijing, China).

You have to admire the ambiguity of the title.

Does it mean the parsing of shallow discourse (my first bet) or the shallow parsing of discourse (unlikely, in my view)?

What do you think?

With the recent advances in deep learning, I am curious whether the Turing test could be passed by an algorithm trained on the last two or three years of sitcom dialogue.

Would you use regular TV viewers as part of the test, or people who rarely watch TV? That choice could make a difference in the outcome.

I first saw this in a tweet by Jason Baldridge.

December 11, 2014

Semantic Parsing with Combinatory Categorial Grammars (Videos)

Filed under: Grammar,Linguistics,Parsing,Semantics — Patrick Durusau @ 10:45 am

Semantic Parsing with Combinatory Categorial Grammars by Yoav Artzi, Nicholas FitzGerald and Luke Zettlemoyer. (Tutorial)

Abstract:

Semantic parsers map natural language sentences to formal representations of their underlying meaning. Building accurate semantic parsers without prohibitive engineering costs is a long-standing, open research problem.

The tutorial will describe general principles for building semantic parsers. The presentation will be divided into two main parts: modeling and learning. The modeling section will include best practices for grammar design and choice of semantic representation. The discussion will be guided by examples from several domains. To illustrate the choices to be made and show how they can be approached within a real-life representation language, we will use λ-calculus meaning representations. In the learning part, we will describe a unified approach for learning Combinatory Categorial Grammar (CCG) semantic parsers, that induces both a CCG lexicon and the parameters of a parsing model. The approach learns from data with labeled meaning representations, as well as from more easily gathered weak supervision. It also enables grounded learning where the semantic parser is used in an interactive environment, for example to read and execute instructions.

The ideas we will discuss are widely applicable. The semantic modeling approach, while implemented in λ-calculus, could be applied to many other formal languages. Similarly, the algorithms for inducing CCGs focus on tasks that are formalism independent, learning the meaning of words and estimating parsing parameters. No prior knowledge of CCGs is required. The tutorial will be backed by implementation and experiments in the University of Washington Semantic Parsing Framework (UW SPF).
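
To give a concrete sense of what λ-calculus meaning representations look like, here is a toy lexicon and one application chain written with Python lambdas. It is purely illustrative; this is neither UW SPF nor a CCG parser.

```python
# Toy illustration of lambda-calculus meaning representations, not UW SPF.
# Each lexical entry pairs a word with (syntactic category, semantics).
lexicon = {
    "texas":    ("NP",         "texas"),
    "borders":  ("(S\\NP)/NP", lambda obj: lambda subj: ("borders", subj, obj)),
    "oklahoma": ("NP",         "oklahoma"),
}

# "Texas borders Oklahoma": apply the verb to its object, then its subject.
_, borders_sem = lexicon["borders"]
partial = borders_sem(lexicon["oklahoma"][1])   # (S\NP): still needs a subject
meaning = partial(lexicon["texas"][1])          # the full logical form
print(meaning)                                  # ('borders', 'texas', 'oklahoma')
```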

I previously linked to the complete slide set for this tutorial.

This page offers short videos (twelve (12) currently) and links into the slide set. More videos are forthcoming.

The goal of the project is to “recover complete meaning representation,” where “complete meaning is sufficient to complete the task” (from video 1).

That definition of “complete meaning” dodges a lot of philosophical as well as practical issues with semantic parsing.

Take the time to watch the videos; Yoav is a good presenter.

Enjoy!

March 30, 2014

Parsing Drug Dosages in text…

Filed under: Finite State Automata,Parsing,Pattern Recognition — Patrick Durusau @ 7:37 pm

Parsing Drug Dosages in text using Finite State Machines by Sujit Pal.

From the post:

Someone recently pointed out an issue with the Drug Dosage FSM in Apache cTakes on the cTakes mailing list. Looking at the code for it revealed a fairly complex implementation based on a hierarchy of Finite State Machines (FSM). The intuition behind the implementation is that Drug Dosage text in doctor’s notes tend to follow a standard-ish format, and FSMs can be used to exploit this structure and pull out relevant entities out of this text. The paper Extracting Structured Medication Event Information from Discharge Summaries has more information about this problem. The authors provide their own solution, called the Merki Medication Parser. Here is a link to their Online Demo and source code (Perl).

I’ve never used FSMs myself, although I have seen it used to model (more structured) systems. So the idea of using FSMs for parsing semi-structured text such as this seemed interesting and I decided to try it out myself. The implementation I describe here is nowhere nearly as complex as the one in cTakes, but on the flip side, is neither as accurate, nor broad nor bulletproof either.

My solution uses drug dosage phrase data provided in this Pattern Matching article by Erin Rhode (which also comes with a Perl based solution), as well as its dictionaries (with additions by me), to model the phrases with the state diagram below. I built the diagram by eyeballing the outputs from Erin Rhode’s program. I then implement the state diagram with a home-grown FSM implementation based on ideas from Electric Monk’s post on FSMs in Python and the documentation for the Java library Tungsten FSM. I initially tried to use Tungsten-FSM, but ended up with extremely verbose Scala code because of Scala’s stricter generics system.
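
To make the FSM idea concrete, here is a heavily simplified sketch of the pattern Sujit describes: states for the parts of a dosage phrase, a transition table keyed on token class, and small dictionaries of recognized words. It is nothing close to the cTakes or Merki parsers.

```python
# Heavily simplified FSM sketch for phrases like "take 2 tablets twice daily".
# Not the cTakes or Merki implementation, just the shape of the approach.
FORMS = {"tablet", "tablets", "capsule", "capsules", "ml"}
FREQS = {"daily", "twice", "nightly", "hourly"}

# (current state, token class) -> next state
TRANSITIONS = {
    ("START", "VERB"): "VERB",
    ("VERB", "NUM"):   "DOSE",
    ("DOSE", "FORM"):  "FORM",
    ("FORM", "FREQ"):  "FREQ",
    ("FREQ", "FREQ"):  "FREQ",
}

def classify(token):
    if token in {"take", "give"}:  return "VERB"
    if token.isdigit():            return "NUM"
    if token in FORMS:             return "FORM"
    if token in FREQS:             return "FREQ"
    return "OTHER"

def parse_dosage(text):
    state, slots = "START", {}
    for token in text.lower().split():
        cls = classify(token)
        state = TRANSITIONS.get((state, cls))
        if state is None:
            return None                      # phrase does not fit the machine
        if cls in {"NUM", "FORM", "FREQ"}:
            slots.setdefault(cls, []).append(token)
    return slots if state == "FREQ" else None

print(parse_dosage("take 2 tablets twice daily"))
# {'NUM': ['2'], 'FORM': ['tablets'], 'FREQ': ['twice', 'daily']}
```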

This caught my attention because I was looking at a data import handler recently that was harvesting information from a minimal XML wrapper around mediawiki markup. Works quite well but seems like a shame to miss all the data in wiki markup.

I say “miss all the data in wiki markup” and that’s not really fair. The markup is dumped into a single field for indexing, but that single field loses the distinctions between a note, an appendix, a bibliography, and the main text.

If you need distinctions that aren’t the defaults, you may be faced with rolling your own FSM. This post should help get you started.

January 8, 2014

Morpho project

Filed under: Language,Parsing — Patrick Durusau @ 2:02 pm

Morpho project

From the webpage:

The goal of the Morpho project is to develop unsupervised data-driven methods that discover the regularities behind word forming in natural languages. In particular, we are focusing on the discovery of morphemes, which are the primitive units of syntax, the smallest individually meaningful elements in the utterances of a language. Morphemes are important in automatic generation and recognition of a language, especially in languages in which words may have many different inflected forms.
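
As a crude illustration of what “data-driven” means here, even a naive split-and-count over unannotated word forms surfaces recurring suffix candidates. The Morpho project’s actual methods are probabilistic segmentation models, not this toy counter.

```python
from collections import Counter

# Toy illustration of data-driven suffix discovery; the Morpho project's
# methods are probabilistic segmentation models, not this counter.
words = ["walked", "walking", "walks", "talked", "talking", "talks", "jumped"]

suffix_counts = Counter()
for word in words:
    for i in range(1, len(word)):          # every split into stem + suffix
        stem, suffix = word[:i], word[i:]
        if len(stem) >= 3 and len(suffix) <= 4:
            suffix_counts[suffix] += 1

print(suffix_counts.most_common(5))        # 'ed', 'ing', 's' rank highly
```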

This may not be of general interest but I mention it as one aspect of data-driven linguistics.

Long-dead languages are often victims of well-meaning but highly imaginative work meant to explain them.

Grounding work in texts of a language introduces a much needed sanity check.

December 5, 2013

TextBlob: Simplified Text Processing

Filed under: Natural Language Processing,Parsing,Text Mining — Patrick Durusau @ 7:31 pm

TextBlob: Simplified Text Processing

From the webpage:

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

….

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

Features

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • JSON serialization
  • Add new models or languages through extensions
  • WordNet integration
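
A minimal taste of that API, after installing the package and running python -m textblob.download_corpora for the bundled corpora:

```python
from textblob import TextBlob

text = "TextBlob sits on top of NLTK and pattern. It makes common NLP tasks easy."
blob = TextBlob(text)

print(blob.sentences)      # sentence tokenization
print(blob.words)          # word tokenization
print(blob.tags)           # part-of-speech tags
print(blob.noun_phrases)   # noun phrase extraction
print(blob.sentiment)      # polarity and subjectivity
```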

Knowing that TextBlob plays well with NLTK is a big plus!

November 3, 2013

A multi-Teraflop Constituency Parser using GPUs

Filed under: GPU,Grammar,Language,Parsers,Parsing — Patrick Durusau @ 4:45 pm

A multi-Teraflop Constituency Parser using GPUs by John Canny, David Hall and Dan Klein.

Abstract:

Constituency parsing with rich grammars remains a computational challenge. Graphics Processing Units (GPUs) have previously been used to accelerate CKY chart evaluation, but gains over CPU parsers were modest. In this paper, we describe a collection of new techniques that enable chart evaluation at close to the GPU’s practical maximum speed (a Teraflop), or around a half-trillion rule evaluations per second. Net parser performance on a 4-GPU system is over 1 thousand length-30 sentences/second (1 trillion rules/sec), and 400 general sentences/second for the Berkeley Parser Grammar. The techniques we introduce include grammar compilation, recursive symbol blocking, and cache-sharing.
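
For readers who have not met CKY chart evaluation, the sequential version the paper accelerates is conceptually small. A toy recognizer for a grammar in Chomsky Normal Form looks like the sketch below; it has none of the paper’s grammar compilation, symbol blocking, or cache-sharing.

```python
from collections import defaultdict

# Toy CKY recognizer for a grammar in Chomsky Normal Form.
# Binary rules: (B, C) -> {lhs}; lexical rules: word -> {lhs}.
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

def cky_recognize(words, start="S"):
    n = len(words)
    chart = defaultdict(set)                       # (i, j) -> nonterminals
    for i, w in enumerate(words):
        chart[(i, i + 1)] = set(LEXICAL.get(w, ()))
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):              # split point
                for b in chart[(i, k)]:
                    for c in chart[(k, j)]:
                        chart[(i, j)] |= BINARY.get((b, c), set())
    return start in chart[(0, n)]

print(cky_recognize("the dog saw the cat".split()))   # True
```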

Just in case you are interested in parsing “unstructured” data, which is mostly what others call “texts.”

I first saw the link: BIDParse: GPU-accelerated natural language parser at hgpu.org. Then I started looking for the paper. 😉

August 31, 2013

textfsm

Filed under: Parsing,State Machine,Text Mining — Patrick Durusau @ 6:20 pm

textfsm

From the webpage:

Python module which implements a template based state machine for parsing semi-formatted text. Originally developed to allow programmatic access to information returned from the command line interface (CLI) of networking devices.

TextFSM was developed internally at Google and released under the Apache 2.0 licence for the benefit of the wider community.

See: TextFSMHowto for details.

TextFSM looks like a useful Python module for extracting data from “semi-structured” text.
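
Here is roughly what using it looks like: a template declares Values and states, and ParseText returns one row per record. This is a sketch; check the TextFSMHowto for the full template syntax.

```python
import io
import textfsm

# A minimal TextFSM template: declared Values, then a Start state whose rule
# records one row per matching line. (Sketch; see TextFSMHowto for details.)
TEMPLATE = """\
Value INTERFACE (\\S+)
Value STATUS (up|down)

Start
  ^${INTERFACE} is ${STATUS} -> Record
"""

CLI_OUTPUT = """\
eth0 is up
eth1 is down
lo is up
"""

fsm = textfsm.TextFSM(io.StringIO(TEMPLATE))
rows = fsm.ParseText(CLI_OUTPUT)
print(fsm.header)   # ['INTERFACE', 'STATUS']
print(rows)         # [['eth0', 'up'], ['eth1', 'down'], ['lo', 'up']]
```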

I first saw this in Nat Torkington’s Four short links: 29 August 2013.

August 15, 2013

RE|Parse

Filed under: Parsing,Python — Patrick Durusau @ 6:45 pm

RE|PARSE

From the webpage:

Python library/tools for combining and using Regular Expressions in a maintainable way

This library also allows you to:

  • Maintain a database of Regular Expressions
  • Combine them together using Patterns
  • Search, Parse and Output data matched by combined Regex using Python functions.

If you know Regular Expressions already, this library basically just gives you a way to combine them together and hook them up to some callback functions in Python.
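
The combination idea is easy to sketch with nothing but the standard re module; the code below shows the general pattern, not RE|parse’s own API.

```python
import re

# The general idea, sketched with the standard re module rather than
# RE|parse's own API: keep named regexes in one place, combine them,
# and dispatch matches to callback functions.
EXPRESSIONS = {
    "date":  r"(?P<date>\d{4}-\d{2}-\d{2})",
    "money": r"(?P<money>\$\d+(?:\.\d{2})?)",
}
COMBINED = re.compile("|".join(EXPRESSIONS.values()))

CALLBACKS = {
    "date":  lambda s: ("date", s),
    "money": lambda s: ("amount", float(s.lstrip("$"))),
}

def parse(text):
    results = []
    for match in COMBINED.finditer(text):
        for name, value in match.groupdict().items():
            if value is not None:
                results.append(CALLBACKS[name](value))
    return results

print(parse("Paid $19.99 on 2013-08-15 and $5 on 2013-08-20."))
# [('amount', 19.99), ('date', '2013-08-15'), ('amount', 5.0), ('date', '2013-08-20')]
```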

This looks like a very useful tool.

August 5, 2013

Semantic Parsing with Combinatory Categorial Grammars

Filed under: Parsing,Semantics — Patrick Durusau @ 10:19 am

Semantic Parsing with Combinatory Categorial Grammars by Yoav Artzi, Nicholas FitzGerald and Luke Zettlemoyer.

Slides from an ACL tutorial, 2013. Three hundred and fifty-one (351) slides.

You may want to also visit: The University of Washington Semantic Parsing Framework v1.3 site where you can download source or binary files.

The ACL wiki introduces combinatory categorial grammars with:

Combinatory Categorial Grammar (CCG) is an efficiently parseable, yet linguistically expressive grammar formalism. It has a completely transparent interface between surface syntax and underlying semantic representation, including predicate-argument structure, quantification and information structure. CCG relies on combinatory logic, which has the same expressive power as the lambda calculus, but builds its expressions differently.

The first linguistic and psycholinguistic arguments for basing the grammar on combinators were put forth by Mark Steedman and Anna Szabolcsi. More recent proponents of the approach are Jacobson and Baldridge. For example, the combinator B (the compositor) is useful in creating long-distance dependencies, as in “Who do you think Mary is talking about?” and the combinator W (the duplicator) is useful as the lexical interpretation of reflexive pronouns, as in “Mary talks about herself”. Together with I (the identity mapping) and C (the permutator) these form a set of primitive, non-interdefinable combinators. Jacobson interprets personal pronouns as the combinator I, and their binding is aided by a complex combinator Z, as in “Mary lost her way”. Z is definable using W and B.

CCG is known to define the same language class as tree-adjoining grammar, linear indexed grammar, and head grammar, and is said to be mildly context-sensitive.

One of the key publications of CCG is The Syntactic Process by Mark Steedman. There are various efficient parsers available for CCG.

The ACL wiki page also lists other software packages and references.
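
The combinators named in that description are easy to write down concretely. In Python lambdas, as an illustration of combinatory logic itself rather than of any CCG parser:

```python
# The primitive combinators mentioned above, written as Python lambdas.
B = lambda f: lambda g: lambda x: f(g(x))      # compositor
W = lambda f: lambda x: f(x)(x)                # duplicator
I = lambda x: x                                # identity
C = lambda f: lambda x: lambda y: f(y)(x)      # permutator

# A tiny check: B composes, W duplicates an argument, C swaps arguments.
add = lambda x: lambda y: x + y
double = lambda x: 2 * x
print(B(double)(add(3))(4))        # double(add(3)(4)) == 14
print(W(add)(5))                   # add(5)(5) == 10
print(C(add)(1)(2) == add(2)(1))   # True: arguments permuted
```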

Machine parsing/searching are absolute necessities if you want to create topic maps on a human scale. (Web Scale? Or do you want to try for human scale?)

To surpass current search results, build correction/interaction with users directly into your interface, so that search results “get smarter” the more your interface is used.

In contrast to the pagerank/lemming approach to document searching.

October 17, 2012

Parsing with Pictures

Filed under: Compilers,Graphs,Parsers,Parsing — Patrick Durusau @ 9:07 am

Parsing with Pictures by Keshav Pingali and Gianfranco Bilardi. (PDF file)

From an email that Keshav sent to the compilers@iecc.com email list:

Gianfranco Bilardi and I have developed a new approach to parsing context-free languages that we call “Parsing with pictures”. It provides an alternative (and, we believe, easier to understand) approach to context-free language parsing than the standard presentations using derivations or pushdown automata. It also unifies Earley, SLL, LL, SLR, and LR parsers among others.

Parsing problems are formulated as path problems in a graph called the grammar flow graph (GFG) that is easily constructed from a given grammar. Intuitively, the GFG is to context-free grammars what NFAs are to regular languages. Among other things, the paper has:

(i) an elementary derivation of Earley’s algorithm for parsing general context-free grammars, showing that it is an easy generalization of the well-known reachability-based NFA simulation algorithm,

(ii) a presentation of look-ahead that is independent of particular parsing strategies, and is based on a simple inter-procedural dataflow analysis,

(iii) GFG structural characterizations of LL and LR grammars that are simpler to understand than the standard definitions, and bring out a symmetry between these grammar classes,

(iv) derivations of recursive-descent and shift-reduce parsers for LL and LR grammars by optimizing the Earley parser to exploit this structure, and

(v) a connection between GFGs and NFAs for regular grammars based on the continuation-passing style (CPS) optimization.

Or if you prefer the more formal abstract:

The development of elegant and practical algorithms for parsing context-free languages is one of the major accomplishments of 20th century Computer Science. These algorithms are presented in the literature using string rewriting systems or abstract machines like pushdown automata, but the resulting descriptions are unsatisfactory for several reasons. First, even a basic understanding of parsing algorithms for some grammar classes such as LR(k) grammars requires mastering a formidable number of difficult concepts and terminology. Second, parsing algorithms for different grammar classes are often presented using entirely different formalisms, so the relationships between these grammar classes are obscured. Finally, these algorithms seem unrelated to algorithms for regular language recognition even though regular languages are a subset of context-free languages.

In this paper, we show that these problems are avoided if parsing is reformulated as the problem of finding certain kinds of paths in a graph called the Grammar Flow Graph (GFG) that is easily constructed from a context-free grammar. Intuitively, GFG’s permit parsing problems for context-free grammars to be formulated as path problems in graphs in the same way that non-deterministic finite-state automata do for regular grammars. We show that the GFG enables a unified treatment of Earley’s parser for general context-free grammars, recursive-descent parsers for LL(k) and SLL(k) grammars, and shift-reduce parsers for LR(k) and SLR(k) grammars. Computation of look-ahead sets becomes a simple interprocedural dataflow analysis. These results suggest that the GFG can be a new foundation for the study of context-free languages.
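
Point (i) in the email concerns Earley’s algorithm; for readers who have not seen it, a compact recognizer fits in a few lines. This is the classical chart formulation, with none of the GFG machinery or look-ahead the paper develops, and it assumes a grammar without epsilon rules.

```python
# Compact Earley recognizer (no epsilon rules), for context against the paper.
# This is the classical chart formulation, not the GFG path formulation.
def earley_recognize(grammar, start, tokens):
    # grammar: nonterminal -> list of right-hand sides (tuples of symbols)
    chart = [set() for _ in range(len(tokens) + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(len(tokens) + 1):
        changed = True
        while changed:
            changed = False
            for (lhs, rhs, dot, origin) in list(chart[i]):
                if dot < len(rhs) and rhs[dot] in grammar:      # predict
                    for prod in grammar[rhs[dot]]:
                        item = (rhs[dot], prod, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item); changed = True
                elif dot < len(rhs):                            # scan
                    if i < len(tokens) and tokens[i] == rhs[dot]:
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:                                           # complete
                    for (l2, r2, d2, o2) in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            item = (l2, r2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item); changed = True
    return any(lhs == start and dot == len(rhs) and origin == 0
               for (lhs, rhs, dot, origin) in chart[len(tokens)])

grammar = {"S": [("S", "+", "n"), ("n",)]}
print(earley_recognize(grammar, "S", ["n", "+", "n"]))   # True
```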

Odd as it may sound, some people want to be understood.

If you think being understood isn’t all that weird, do a slow read on this paper and provide feedback to the authors.

July 11, 2012

Importing public data with SAS instructions into R

Filed under: Data,Government Data,Parsing,Public Data,R — Patrick Durusau @ 2:28 pm

Importing public data with SAS instructions into R by David Smith.

From the post:

Many public agencies release data in a fixed-format ASCII (FWF) format. But with the data all packed together without separators, you need a “data dictionary” defining the column widths (and metadata about the variables) to make sense of them. Unfortunately, many agencies make such information available only as a SAS script, with the column information embedded in a PROC IMPORT statement.

David reports on the SAScii package from Anthony Damico.

You still have to parse the files but it gets you one step closer to having useful information.
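
For the concept itself: once the column widths and variable names have been pried out of the data dictionary, reading the fixed-width file is a single call. The sketch below does it in Python with pandas.read_fwf; SAScii does the equivalent in R and also parses the widths out of the SAS script for you. The widths and names here are made up.

```python
import io
import pandas as pd

# Toy fixed-width extract; in real use the widths and names come from the
# agency's data dictionary (or, per the post, from the SAS script).
raw = io.StringIO(
    "0101234M034\n"
    "0205678F029\n"
)
df = pd.read_fwf(
    raw,
    widths=[2, 5, 1, 3],                       # column widths, in characters
    names=["state", "case_id", "sex", "age"],  # hypothetical variable names
    dtype=str,
)
print(df)
```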

December 7, 2011

SP-Sem-MRL 2012

Filed under: Conferences,Parsing,Semantics,Statistics — Patrick Durusau @ 8:13 pm

ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages (SP-Sem-MRL 2012)

Important dates:

Submission deadline: March 31, 2012 (PDT, GMT-8)
Notification to authors: April 21, 2012
Camera ready copy: May 5, 2012
Workshop: TBD, during the ACL 2012 workshop period (July 12-14, 2012)

From the website:

Morphologically Rich Languages (MRLs) are languages in which grammatical relations such as Subject, Predicate, Object, etc., are indicated morphologically (e.g. through inflection) instead of positionally (as in, e.g. English), and the position of words and phrases in the sentence may vary substantially. The tight connection between the morphology of words and the grammatical relations between them, and the looser connection between the position and grouping of words to their syntactic roles, pose serious challenges for syntactic and semantic processing. Furthermore, since grammatical relations provide the interface to compositional semantics, morpho-syntactic phenomena may significantly complicate processing the syntax–semantics interface. In statistical parsing, which has been a cornerstone of research in NLP and had seen great advances due to the widespread availability of syntactically annotated corpora, English parsing performance has reached a high plateau in certain genres, which is however not always indicative of parsing performance in MRLs, dependency-based and constituency-based alike. Semantic processing of natural language has similarly seen much progress in recent years. However, as in parsing, the bulk of the work has concentrated on English, and MRLs may present processing challenges that the community is as of yet unaware of, and which current semantic processing technologies may have difficulty coping with. These challenges may lurk in areas where parses may be used as input, such as semantic role labeling, distributional semantics, paraphrasing and textual entailments, or where inadequate pre-processing of morphological variation hurts parsing and semantic tasks alike.

This joint workshop aims to build upon the first and second SPMRL workshops (at NAACL-HLT 2010 and IWPT 2011, respectively) while extending the overall scope to include semantic processing where MRLs pose challenges for algorithms or models initially designed to process English. In particular, we seek to explore the use of newly available syntactically and/or semantically annotated corpora, or data sets for semantic evaluation that can contribute to our understanding of the difficulty that such phenomena pose. One goal of this workshop is to encourage cross-fertilization among researchers working on different languages and among those working on different levels of processing. Of particular interest is work addressing the lexical sparseness and out-of-vocabulary (OOV) issues that occur in both syntactic and semantic processing.

The exploration of non-English languages will replicate many of the outstanding entity recognition/data integration problems experienced in English. Considering that there are massive economic markets that speak non-English languages, the first to make progress on such issues will have a commercial advantage. How much of an advantage, I suspect, depends on how well your software works in a non-English language.
