## Archive for the ‘Stanford NLP’ Category

### pynlp – Pythonic Wrapper for Stanford CoreNLP [& Rand Paul]

Tuesday, November 14th, 2017

The example text for this wrapper:

text = (
'GOP Sen. Rand Paul was assaulted in his home in Bowling Green,
Kentucky, on Friday, ''according to Kentucky State Police. State
troopers responded to a call to the senator\'s ''residence at 3:21
p.m. Friday. Police arrested a man named Rene Albert Boucher, who
they ''allege "intentionally assaulted" Paul, causing him "minor
injury. Boucher, 59, of Bowling ''Green was charged with one count of
fourth-degree assault. As of Saturday afternoon, he ''was being held
in the Warren County Regional Jail on a \$5,000 bond.')


[Warning: Reformatted for readability. See the Github page for the text]

Nice to see examples using contemporary texts. Any of the recent sexual abuse apologies or non-apologies would work as well.

Enjoy!

### Building Data Science with JS – Lifting the Curtain on Game Reviews

Saturday, October 7th, 2017

Building Data Science with JS by Tim Ermilov.

Three videos thus far:

Building Data Science with JS – Part 1 – Introduction

Building Data Science with JS – Part 2 – Microservices

Building Data Science with JS – Part 3 – RabbitMQ and OpenCritic microservice

Tim starts with the observation that the percentage of users assigning a score to a game isn’t very helpful. It tells you nothing about the content of the game and/or the person rating it.

In subject identity terms, each level, mighty, strong, weak, fair, collapses information about the game and a particular reviewer into a single summary subject. OpenCritic then displays the percent of reviewers who are represented by that summary subject.

The problem with the summary subject is that one critic may have down rated the game for poor content, another for sexism and still another for bad graphics. But a user only knows for reasons unknown, a critic whose past behavior is unknown, evaluated unknown content and assigned it a rating.

A user could read all the reviews, study the history of each reviewer, along with the other movies they have evaluated, but Ermilov proposes a more efficient means to peak behind the curtain of game ratings. (part 1)

In part 2, Ermilov designs a microservice based application to extract, process and display game reviews.

If you thought the first two parts were slow, you should enjoy Part 3. 😉 Ermilov speeds through a number of resources, documents, JS libraries, not to mention his source code for the project. You are likely to hit pause during this video.

Some links you will find helpful for Part 3:

AMQP 0-9-1 library and client for Node.JS – Channel-oriented API reference

AMQP 0-9-1 library and client for Node.JS (Github)

https://github.com/BuildingXwithJS

https://github.com/BuildingXwithJS/building-data-science-with-js

Microwork – simple creation of distributed scalable microservices in node.js with RabbitMQ (simplifies use of AMQP)

node-unfluff – Automatically extract body content (and other cool stuff) from an html document

OpenCritic

RabbitMQ. (Recommends looking at the RabbitMQ tutorials.)

### Stanford CoreNLP – a suite of core NLP tools (3.7.0)

Thursday, January 12th, 2017

Stanford CoreNLP – a suite of core NLP tools

The beta is over and Stanford CoreNLP 3.7.0 is on the street!

From the webpage:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get quotes people said, etc.

Choose Stanford CoreNLP if you need:

• An integrated toolkit with a good range of grammatical analysis tools
• Fast, reliable analysis of arbitrary texts
• The overall highest quality text analytics
• Support for a number of major (human) languages
• Available interfaces for most major modern programming languages
• Ability to run as a simple web service

Stanford CoreNLP’s goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A tool pipeline can be run on a piece of plain text with just two lines of code. CoreNLP is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Moreover, an annotator pipeline can include additional custom or third-party annotators. CoreNLP’s analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

What stream of noise, sorry, news are you going to pipeling into the Stanford CoreNLP framework?

😉

Imagine a web service that offers levels of analysis alongside news text.

Or does the same with leaked emails and/or documents?

### Ulysses, Joyce and Stanford CoreNLP

Saturday, November 26th, 2016

Introduction to memory and time usage

From the webpage:

People not infrequently complain that Stanford CoreNLP is slow or takes a ton of memory. In some configurations this is true. In other configurations, this is not true. This section tries to help you understand what you can or can’t do about speed and memory usage. The advice applies regardless of whether you are running CoreNLP from the command-line, from the Java API, from the web service, or from other languages. We show command-line examples here, but the principles are true of all ways of invoking CoreNLP. You will just need to pass in the appropriate properties in different ways. For these examples we will work with chapter 13 of Ulysses by James Joyce. You can download it if you want to follow along.

You have to appreciate the use of a non-trivial text for advice on speed and memory usage of CoreNLP.

How does your text stack up against Chapter 13 of Ulysses?

I’m supposed to be reading Ulysses long distance with a friend. I’m afraid we have both fallen behind. Perhaps this will encourage me to have another go at it.

What favorite or “should read” text would you use to practice with CoreNLP?

Suggestions?

### Stanford CoreNLP v3.7.0 beta is out! [Time is short, comments, bug reports, now!]

Thursday, November 3rd, 2016

Stanford CoreNLP v3.7.0 beta

The tweets I saw from Stanford NLP Group read:

Stanford CoreNLP v3.7.0 beta is out—improved coreference, dep parsing—KBP relation annotator—Arabic pipeline #NLProc

We‘re doing an official CoreNLP beta release this time, so bugs, comments, and fixes especially appreciated over the next two weeks!

OK, so, what are you waiting for? 😉

Oh, the standard blurb for your boss on why Stanford CoreNLP should be taking up your time:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.

Choose Stanford CoreNLP if you need:

• An integrated toolkit with a good range of grammatical analysis tools
• Fast, reliable analysis of arbitrary texts
• The overall highest quality text analytics
• Support for a number of major (human) languages
• Interfaces available for various major modern programming languages
• Ability to run as a simple web service

Stanford CoreNLP is an integrated framework. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A CoreNLP tool pipeline can be run on a piece of plain text with just two lines of code. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Using the standard blurb about the Stanford CoreNLP has these advantages:

• It’s copy-n-paste, you didn’t have to write it
• It’s appeal to authority (Stanford)
• It’s truthful

The truthful point is a throw-away these days but thought I should mention it. 😉

### Tokenizing and Named Entity Recognition with Stanford CoreNLP

Friday, September 19th, 2014

From the post:

I got into NLP using Java, but I was already using Python at the time, and soon came across the Natural Language Tool Kit (NLTK), and just fell in love with the elegance of its API. So much so that when I started working with Scala, I figured it would be a good idea to build a NLP toolkit with an API similar to NLTKs, primarily as a way to learn NLP and Scala but also to build something that would be as enjoyable to work with as NLTK and have the benefit of Java’s rich ecosystem.

The project is perenially under construction, and serves as a test bed for my NLP experiments. In the past, I have used OpenNLP and LingPipe to build Tokenizer implementations that expose an API similar to NLTK’s. More recently, I have built an Named Entity Recognizer (NER) with OpenNLP’s NameFinder. At the recommendation of one of my readers, I decided to take a look at Stanford CoreNLP, with which I ended up building a Tokenizer and a NER implementation. This post describes that work.

Truly a hard core way to learn NLP and Scala!

Excellent!

### Use Cases for Taming Text, 2nd ed.

Saturday, January 25th, 2014

Use Cases for Taming Text, 2nd ed. by Grant Ingersoll.

From the post:

Drew Farris, Tom Morton and I are currently working on the 2nd Edition of Taming Text (http://www.manning.com/ingersoll for first ed.) and are soliciting interested parties who would be willing to contribute to a chapter on practical use cases (i.e. you have something in production and are willing to write about it) for search with Solr, NLP using OpenNLP or Stanford NLP and machine learning using Mahout, OpenNLP or MALLET — ideally you are using combinations of 2 or more of these to solve your problems. We are especially interested in large scale use cases in eCommerce, Advertising, social media analytics, fraud, etc.

The writing process is fairly straightforward. A section roughly equates to somewhere between 3 – 10 pages, including diagrams/pictures. After writing, there will be some feedback from editors and us, but otherwise the process is fairly simple.

In order to participate, you must have permission from your company to write on the topic. You would not need to divulge any proprietary information, but we would want enough information for our readers to gain a high-level understanding of your use case. In exchange for your participation, you will have your name and company published on that section of the book as well as in the acknowledgments section. If you have a copy of Lucene in Action or Mahout In Action, it would be similar to the use case sections in those books.

Cool!

I am guessing the second edition isn’t going to take as long as the first. 😉

Couldn’t be in better company as far as co-authors.

See the post for the contact details.

### Entity recognition with Scala and…

Wednesday, June 5th, 2013

From the post:

The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it’s fairly good at finding nouns, but not always at identifying the type of each noun.

In this example, the entities I’d like to see are different – companies, law firms, lawyers, etc, but this test is good enough. The default examples provided let you choose different sets of things that can be recognized: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. The process of extracting PDF data and processing it takes about five seconds.

For this text, selecting different options sometimes led to the classifier picking different options for a noun – one time it’s a person, another time it’s an organization, etc. One improvement might be to run several classifiers and to allow them to vote. This classifier also loses words sometimes – if a subject is listed with a first, middle, and last name, it sometimes picks just two words. I’ve noticed similar issues with company names.

(…)

The voting on entity recognition made me curious about interactive entity resolution where a user has a voice.

See the next post.

### Python interface to Stanford Core NLP tools v1.3.3

Sunday, November 11th, 2012

Python interface to Stanford Core NLP tools v1.3.3

This is a Python wrapper for Stanford University’s NLP group’s Java-based CoreNLP tools. It can either be imported as a module or run as a JSON-RPC server. Because it uses many large trained models (requiring 3GB RAM on 64-bit machines and usually a few minutes loading time), most applications will probably want to run it as a server.

• Python interface to Stanford CoreNLP tools: tagging, phrase-structure parsing, dependency parsing, named entity resolution, and coreference resolution.
• Runs an JSON-RPC server that wraps the Java server and outputs JSON.
• Outputs parse trees which can be used by nltk.

It requires pexpect and (optionally) unidecode to handle non-ASCII text. This script includes and uses code from jsonrpc and python-progressbar.

It runs the Stanford CoreNLP jar in a separate process, communicates with the java process using its command-line interface, and makes assumptions about the output of the parser in order to parse it into a Python dict object and transfer it using JSON. The parser will break if the output changes significantly, but it has been tested on Core NLP tools version 1.3.3 released 2012-07-09.

If you have NLP requirements and work in Python, this may be of interest.

### Stanford NLP

Monday, November 7th, 2011

Stanford NLP

Usually a reference to the Stanford NLP parser but I have put in the link to the “The Stanford Natural Language Processing Group.”

From its webpage:

The Natural Language Processing Group at Stanford University is a team of faculty, research scientists, postdocs, programmers and students who work together on algorithms that allow computers to process and understand human languages. Our work ranges from basic research in computational linguistics to key applications in human language technology, and covers areas such as sentence understanding, machine translation, probabilistic parsing and tagging, biomedical information extraction, grammar induction, word sense disambiguation, and automatic question answering.

A distinguishing feature of the Stanford NLP Group is our effective combination of sophisticated and deep linguistic modeling and data analysis with innovative probabilistic and machine learning approaches to NLP. Our research has resulted in state-of-the-art technology for robust, broad-coverage natural-language processing in many languages. These technologies include our part-of-speech tagger, which currently has the best published performance in the world; a high performance probabilistic parser; a competition-winning biological named entity recognition system; and algorithms for processing Arabic, Chinese, and German text.

The Stanford NLP Group includes members of both the Linguistics Department and the Computer Science Department, and is affiliated with the Stanford AI Lab and the Stanford InfoLab.

Quick link to Stanford NLP Software page.

### Using Lucene and Cascalog for Fast Text Processing at Scale

Monday, November 7th, 2011

Using Lucene and Cascalog for Fast Text Processing at Scale

From the post:

Here at Yieldbot we do a lot of text processing of analytics data. In order to accomplish this in a reasonable amount of time, we use Cascalog, a data processing and querying library for Hadoop; written in Clojure. Since Cascalog is Clojure, you can develop and test queries right inside of the Clojure REPL. This allows you to iteratively develop processing workflows with extreme speed. Because Cascalog queries are just Clojure code, you can access everything Clojure has to offer, without having to implement any domain specific APIs or interfaces for custom processing functions. When combined with Clojure’s awesome Java Interop, you can do quite complex things very simply and succinctly.

Many great Java libraries already exist for text processing, e.g., Lucene, OpenNLP, LingPipe, Stanford NLP. Using Cascalog allows you take advantage of these existing libraries with very little effort, leading to much shorter development cycles.

By way of example, I will show how easy it is to combine Lucene and Cascalog to do some (simple) text processing. You can find the entire code used in the examples over on Github.

The world of text exploration just gets better all the time!