Archive for the ‘Stanford NLP’ Category

Python interface to Stanford Core NLP tools v1.3.3

Sunday, November 11th, 2012

From the README.md:

This is a Python wrapper for Stanford University’s NLP group’s Java-based CoreNLP tools. It can either be imported as a module or run as a JSON-RPC server. Because it uses many large trained models (requiring 3GB RAM on 64-bit machines and usually a few minutes loading time), most applications will probably want to run it as a server.

  • Python interface to Stanford CoreNLP tools: tagging, phrase-structure parsing, dependency parsing, named entity recognition, and coreference resolution.
  • Runs a JSON-RPC server that wraps the Java server and outputs JSON.
  • Outputs parse trees which can be used by nltk.
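
To make the last point concrete, here is a minimal sketch of reading a bracketed (Penn Treebank style) parse string into nested Python lists. It is not code from the wrapper: the function names `read_tree` and `leaves` and the sample sentence are made up for illustration, and it assumes every node carries a label. NLTK users would more likely hand the same kind of string to `nltk.Tree.fromstring`.

```python
import re


def read_tree(text):
    """Parse a bracketed tree such as "(S (NP I) (VP run))" into nested lists.

    Each node becomes [label, child, child, ...]; each word becomes a plain string.
    Assumes every opening bracket is followed by a node label.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token              # a leaf (a word)
        label = tokens[pos]           # node label, e.g. "NP"
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(parse())
        pos += 1                      # consume the closing ")"
        return [label] + children

    tree = parse()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return tree


def leaves(tree):
    """Return the words of a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


# Hypothetical sample input, not actual CoreNLP output.
sample = "(S (NP (PRP I)) (VP (VBP run)))"
tree = read_tree(sample)
words = leaves(tree)
```

Nested lists keep the sketch free of dependencies; a real application would probably prefer NLTK's tree class for traversal and drawing.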

It requires pexpect and (optionally) unidecode to handle non-ASCII text. This script includes and uses code from jsonrpc and python-progressbar.

It runs the Stanford CoreNLP jar in a separate process, communicates with the Java process using its command-line interface, and makes assumptions about the output of the parser in order to parse it into a Python dict object and transfer it using JSON. The parser will break if the output changes significantly, but it has been tested on CoreNLP tools version 1.3.3 released 2012-07-09.
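
The JSON side of that round trip might look roughly like the sketch below. It builds a JSON-RPC 2.0 request and decodes a response using only the standard library. The method name `parse`, the helper names `build_request` and `read_result`, and the shape of the sample result are assumptions for illustration, not the wrapper's documented API, and the network transport is left out entirely.

```python
import json
from itertools import count

_request_ids = count(1)


def build_request(method, *params):
    """Serialize a JSON-RPC 2.0 request for the given method and positional params."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": list(params),
        "id": next(_request_ids),
    })


def read_result(raw_response):
    """Decode a JSON-RPC 2.0 response and return its result as Python data."""
    response = json.loads(raw_response)
    if response.get("error") is not None:
        raise RuntimeError("server reported an error: %r" % (response["error"],))
    result = response["result"]
    # Assumption: the server may return the parse as a JSON-encoded string,
    # in which case it needs a second decoding step to become a dict.
    if isinstance(result, str):
        return json.loads(result)
    return result


# "parse" is a hypothetical method name used only for this example.
request = build_request("parse", "Hello world.")

# A hand-written stand-in for a server reply; the real output has more fields.
fake_reply = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": json.dumps({"sentences": [{"text": "Hello world."}]}),
})
parsed = read_result(fake_reply)
```

Because loading the models takes minutes, keeping them resident in one server process and exchanging small JSON messages like these is what makes the server mode attractive.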

If you have NLP requirements and work in Python, this may be of interest.

Stanford NLP

Monday, November 7th, 2011

“Stanford NLP” usually refers to the Stanford NLP parser, but here I have linked to “The Stanford Natural Language Processing Group.”

From its webpage:

The Natural Language Processing Group at Stanford University is a team of faculty, research scientists, postdocs, programmers and students who work together on algorithms that allow computers to process and understand human languages. Our work ranges from basic research in computational linguistics to key applications in human language technology, and covers areas such as sentence understanding, machine translation, probabilistic parsing and tagging, biomedical information extraction, grammar induction, word sense disambiguation, and automatic question answering.

A distinguishing feature of the Stanford NLP Group is our effective combination of sophisticated and deep linguistic modeling and data analysis with innovative probabilistic and machine learning approaches to NLP. Our research has resulted in state-of-the-art technology for robust, broad-coverage natural-language processing in many languages. These technologies include our part-of-speech tagger, which currently has the best published performance in the world; a high performance probabilistic parser; a competition-winning biological named entity recognition system; and algorithms for processing Arabic, Chinese, and German text.

The Stanford NLP Group includes members of both the Linguistics Department and the Computer Science Department, and is affiliated with the Stanford AI Lab and the Stanford InfoLab.

Quick link to Stanford NLP Software page.

Using Lucene and Cascalog for Fast Text Processing at Scale

Monday, November 7th, 2011

From the post:

Here at Yieldbot we do a lot of text processing of analytics data. In order to accomplish this in a reasonable amount of time, we use Cascalog, a data processing and querying library for Hadoop, written in Clojure. Since Cascalog is Clojure, you can develop and test queries right inside of the Clojure REPL. This allows you to iteratively develop processing workflows with extreme speed. Because Cascalog queries are just Clojure code, you can access everything Clojure has to offer, without having to implement any domain specific APIs or interfaces for custom processing functions. When combined with Clojure’s awesome Java Interop, you can do quite complex things very simply and succinctly.

Many great Java libraries already exist for text processing, e.g., Lucene, OpenNLP, LingPipe, Stanford NLP. Using Cascalog allows you to take advantage of these existing libraries with very little effort, leading to much shorter development cycles.

By way of example, I will show how easy it is to combine Lucene and Cascalog to do some (simple) text processing. You can find the entire code used in the examples over on GitHub.

The world of text exploration just gets better all the time!