Archive for the ‘Natural Language Processing’ Category

Understanding Natural Language with Deep Neural Networks Using Torch

Tuesday, March 3rd, 2015

Understanding Natural Language with Deep Neural Networks Using Torch by Soumith Chintala and Wojciech Zaremba.

This is a deeply impressive article and a good introduction to Torch (a scientific computing package with libraries for neural networks, optimization, and more).

In the preliminary materials, the authors illustrate one of the difficulties of natural language processing by machine:

For a machine to understand language, it first has to develop a mental map of words, their meanings and interactions with other words. It needs to build a dictionary of words, and understand where they stand semantically and contextually, compared to other words in their dictionary. To achieve this, each word is mapped to a set of numbers in a high-dimensional space, which are called “word embeddings”. Similar words are close to each other in this number space, and dissimilar words are far apart. Some word embeddings encode mathematical properties such as addition and subtraction (For some examples, see Table 1).

Word embeddings can either be learned in a general-purpose fashion before-hand by reading large amounts of text (like Wikipedia), or specially learned for a particular task (like sentiment analysis). We go into a little more detail on learning word embeddings in a later section.
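To make “close” and “far apart” concrete, here is a minimal sketch in Python. The three-dimensional vectors are invented by hand for illustration (real embeddings are learned and have hundreds of dimensions), and the helper names `cosine` and `nearest` are my own, not anything from Torch or the article.

```python
import math

# Toy "embeddings": invented 3-d vectors, NOT learned from any corpus.
# The dimensions loosely stand for (royalty, maleness, femaleness).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(target, exclude=()):
    """Return the word whose vector is most similar to the target vector."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# The "addition and subtraction" property: king - man + woman
analogy = [k - m + w for k, m, w in
           zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(analogy, exclude=("king", "man", "woman")))  # queen
```

With these made-up numbers the analogy lands on “queen”, but only because I built the vectors that way. Whether a learned embedding behaves as neatly depends entirely on the text it was trained on.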

You can already see the problem, but just to call it out: the language usage in Wikipedia, for example, may or may not match the domain of interest. You could certainly use it as a general case, but it will produce very odd results when the text to be “understood” is in a regional version of a language where common words have meanings other than those you will find in Wikipedia.

Slang is a good example. In the 17th century, “cab” was a term used for a brothel. A more recent example: to take a “hit” can mean something quite different from being struck by a boxer.

“Understanding” natural language with machines is a great leap forward but one should never leap without looking.

Using NLP to measure democracy

Tuesday, February 24th, 2015

Using NLP to measure democracy by Thiago Marzagão.

From the abstract:

This paper uses natural language processing to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). The ADS are based on 42 million news articles from 6,043 different sources and cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today the ADS are replicable and have standard errors small enough to actually distinguish between cases.

The ADS are produced with supervised learning. Three approaches are tried: a) a combination of Latent Semantic Analysis and tree-based regression methods; b) a combination of Latent Dirichlet Allocation and tree-based regression methods; and c) the Wordscores algorithm. The Wordscores algorithm outperforms the alternatives, so it is the one on which the ADS are based.

There is a web application where anyone can change the training set and see how the results change:

Automated Democracy Scores: part of the PhD work of Thiago Marzagão. An online interface that allows you to change democracy scores by year and country and run the analysis against 200 billion data points on an Amazon cluster.

Quite remarkable, although I suspect this level of PhD work and public access to it will grow rapidly in the near future.

Do read the paper and don’t jump straight to the data. 😉 Take a minute to see what results Thiago has reached thus far.

Personally I was expecting the United States and China to be running neck and neck. Mostly because the wealthy choose candidates for public office in the United States and in China the Party chooses them. Not all that different, perhaps a bit more formalized and less chaotic in China. Certainly less in the way of campaign costs. (humor)

I was seriously surprised to find that democracy was lowest in Africa and the Middle East. Evaluated on a national basis that may be correct, but Western definitions aren’t easy to apply to Africa and the Middle East. See Nation, Tribe and Ethnic Group in Africa and Democracy and Consensus in African Traditional Politics for one tip of the iceberg on decision making in Africa.

TextBlob: Simplified Text Processing

Tuesday, February 24th, 2015

TextBlob: Simplified Text Processing

From the webpage:

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.


  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • Add new models or languages through extensions
  • WordNet integration
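Two of the listed features, word frequencies and n-grams, are simple enough to sketch in plain Python. This is not TextBlob’s implementation or API; the function names `tokenize` and `ngrams` and the sample sentence are my own assumptions, meant only to show what those features compute.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
tokens = tokenize(text)

word_counts = Counter(tokens)   # word frequencies
bigrams = ngrams(tokens, 2)     # n-grams with n = 2

print(word_counts.most_common(2))
print(bigrams[:3])
```

A library earns its keep on the harder items in the list (tagging, parsing, sentiment), where a regular expression will not get you far.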

Has anyone compared this head to head with NLTK?

Neo4j: Building a topic graph with Prismatic Interest Graph API

Sunday, February 22nd, 2015

Neo4j: Building a topic graph with Prismatic Interest Graph API by Mark Needham.

From the post:

Over the last few weeks I’ve been using various NLP libraries to derive topics for my corpus of How I met your mother episodes without success and was therefore enthused to see the release of Prismatic’s Interest Graph API.

The Interest Graph API exposes a web service to which you feed a block of text and get back a set of topics and associated score.

It has been trained over the last few years with millions of articles that people share on their social media accounts and in my experience using Prismatic the topics have been very useful for finding new material to read.

A great walk through from accessing the Interest Graph API to loading the data into Neo4j and querying it with Cypher.

I can’t profess a lot of interest in How I Met Your Mother episodes but the techniques can be applied to other content. 😉

50 Shades Sex Scene detector

Sunday, February 15th, 2015

NLP-in-Python by Lynn Cherny.

No, the title is not “click-bait” because section 4 of Lynn’s tutorial is titled:

4. Naive Bayes Classification – the infamous 50 Shades Sex Scene Detection because spam is boring

Titles can be accurate and NLP can be interesting.
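For readers who have not met Naive Bayes before, here is a minimal sketch of the technique in plain Python. It is not Lynn’s code: the training sentences, the labels “romance” and “plot”, and the function names are all invented for illustration, and a real detector would need far more data and better tokenization.

```python
import math
from collections import Counter, defaultdict

# Invented toy training data; labels and sentences are illustrative only.
training = [
    ("romance", "she kissed him softly in the candlelight"),
    ("romance", "his hands trembled as they embraced"),
    ("plot", "the board meeting ran late into the evening"),
    ("plot", "she signed the contract and left the office"),
]

def train(examples):
    """Count documents per label and words per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in examples:
        label_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(text, model):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(training)
print(classify("they kissed in the candlelight", model))  # romance
print(classify("the contract meeting", model))            # plot
```

The “naive” part is treating every word as independent of its neighbors, which is plainly false for prose but works surprisingly often for classification.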

Imagine an ebook reader that accepts 3rd party navigation for ebooks. Running NLP on novels could provide navigation that isolates the sex or other scenes for rapid access.

An electronic abridging of the original. Not unlike CliffsNotes.

I suspect that could be a marketable information product separate from the original ebook.

As would the ability to overlay 3rd party content on original ebook publications.

Are any of the open source ebook readers working on such a feature? Easier to develop demand for that feature on open source ebook readers and then tackle the DRM/proprietary format stuff.

Natural Language Analytics made simple and visual with Neo4j

Friday, January 9th, 2015

Natural Language Analytics made simple and visual with Neo4j by Michael Hunger.

From the post:

I was really impressed by this blog post on Summarizing Opinions with a Graph from Max and always waited for Part 2 to show up 🙂

The blog post explains a really interesting approach by Kavita Ganesan which uses a graph representation of sentences of review content to extract the most significant statements about a product.

From later in the post:

The essence of creating the graph can be formulated as: “Each word of the sentence is represented by a shared node in the graph with order of words being reflected by relationships pointing to the next word”.

Michael goes on to create features with Cypher and admits near the end that “LOAD CSV” doesn’t really care if you have CSV files or not. You can split on a space and load text such as the “Lord of the Rings poem of the One Ring” into Neo4j.

Interesting work and a good way to play with text and Neo4j.

The single node per unique word presented here will be problematic if you need to capture the changing roles of words in a sentence.
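The graph construction quoted above can be sketched without Neo4j at all. The following is a minimal Python version under my own assumptions: a nested dictionary stands in for the graph, the review sentences are invented, and `build_word_graph` is a hypothetical helper, not anything from Michael’s post.

```python
from collections import defaultdict

def build_word_graph(sentences):
    """Return {word: {next_word: count}} with one shared node per unique word."""
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.lower().split()
        # Each adjacent pair becomes a relationship pointing to the next word.
        for current, following in zip(words, words[1:]):
            graph[current][following] += 1
    return graph

# Invented review snippets for illustration.
sentences = [
    "the battery life is great",
    "the battery life is short",
    "the screen is great",
]
graph = build_word_graph(sentences)

print(dict(graph["is"]))   # which words follow "is", and how often
print(dict(graph["the"]))
```

Notice that all three sentences pass through the single “is” node. That sharing is what makes frequent paths stand out, and it is also why a word playing different roles in different sentences gets flattened into one node.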

Special Issue on Arabic NLP

Thursday, January 8th, 2015

Special Issue on Arabic NLP, Editor-in-Chief M.M. Alsulaiman.

Including the introduction, twelve open access articles on Arabic NLP.

From the introduction:

Arabic natural language processing (NLP) is still in its initial stage compared to the work in English and other languages. NLP is made possible by the collaboration of many disciplines including computer science, linguistics, mathematics, psychology and artificial intelligence. The results of which is highly beneficial to many applications such as Machine Translation, Information Retrieval, Information Extraction, Text Summarization and Question Answering.

This special issue of the Journal of King Saud University – Computer and Information Sciences (CIS) synthesizes current research in the field of Arabic NLP. A total of 56 submissions was received, 11 of which were finally accepted for this special issue. Each accepted paper has gone through three rounds of reviews, each round with two to three reviewers. The content of this special issue covers different topics such as: Dialectal Arabic Morphology, Arabic Corpus, Transliteration, Annotation, Discourse Relations, Sentiment Lexicon, Arabic named entities, Arabic Treebank, Text Summarization, Ontological Relations and Authorship attribution. The following is a brief summary of each of the main articles in this issue.

If you are interested in doing original NLP work, this is not a bad place to start looking for projects.

I first saw this in a tweet by Tony McEnery.

Shallow Discourse Parsing

Monday, January 5th, 2015

Shallow Discourse Parsing

From the webpage:

A participant system is given a piece of newswire text as input and returns discourse relations in the form of a discourse connective (explicit or implicit) taking two arguments (which can be clauses, sentences, or multi-sentence segments). Specifically, the participant system needs to i) locate both explicit (e.g., “because”, “however”, “and”) and implicit discourse connectives (often signaled by periods) in the text, ii) identify the spans of text that serve as the two arguments for each discourse connective, and iii) predict the sense of the discourse connectives (e.g., “Cause”, “Condition”, “Contrast”). Understanding such discourse relations is clearly an important part of natural language understanding that benefits a wide range of natural language applications.
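To give a feel for step (i) only, here is a deliberately naive Python sketch that locates explicit connectives by whole-word matching. The connective list is just the three examples quoted above, the sample text is invented, and `find_explicit_connectives` is a hypothetical helper. A real system would also have to decide whether a given “and” is acting as a discourse connective at all, then find its arguments and sense, none of which is attempted here.

```python
import re

# A tiny illustrative list; the three examples come from the task description.
EXPLICIT_CONNECTIVES = ["because", "however", "and"]

def find_explicit_connectives(text):
    """Return (connective, start, end) for each whole-word match, in text order."""
    pattern = r"\b(" + "|".join(map(re.escape, EXPLICIT_CONNECTIVES)) + r")\b"
    return [(m.group(1).lower(), m.start(), m.end())
            for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

# Invented sample text.
text = "The match was cancelled because it rained. However, fans stayed and sang."
hits = find_explicit_connectives(text)

for connective, start, end in hits:
    print(connective, start, end)
```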

Important Dates

  • January 26, 2015: Registration begins; release of training set and scorer.
  • March 1, 2015: Registration deadline.
  • April 20, 2015: Test set available.
  • April 24, 2015: Systems collected.
  • May 1, 2015: System results due to participants.
  • May 8, 2015: System papers due.
  • May 18, 2015: Reviews due.
  • May 21, 2015: Notification of acceptance.
  • May 28, 2015: Camera-ready version of system papers due.
  • July 30-31, 2015: CoNLL conference (Beijing, China).

You have to admire the ambiguity of the title.

Does it mean the parsing of shallow discourse (my first bet) or does it mean shallow parsing of discourse (my unlikely bet)?

What do you think?

With the recent advances in deep learning, I am curious whether the Turing test could be passed by an algorithm trained on sitcom dialogue from the last two or three years.

Would you use regular TV viewers as part of the test or use people who rarely watch TV? Could make a difference in the outcome of the test.

I first saw this in a tweet by Jason Baldridge.

EMNLP 2014: Conference on Empirical Methods in Natural Language Processing

Thursday, December 11th, 2014

EMNLP 2014: Conference on Empirical Methods in Natural Language Processing

I rather quickly sorted these tutorials into order by the first author’s last name:

The links will take you to the conference site and descriptions with links to videos and other materials.

You can download the complete conference proceedings: EMNLP 2014 The 2014 Conference on Empirical Methods In Natural Language Processing Proceedings of the Conference, which, at two thousand one hundred and ninety-one (2191) pages, should keep you busy through the holiday season. 😉

Or if you are interested in a particular paper, see the Main Conference Program, which has links to individual papers and videos of the presentations in many cases.

A real wealth of materials here! I must say the conference servers are the most responsive I have ever seen.

I first saw this in a tweet by Jason Baldridge.

Tweet NLP

Tuesday, October 21st, 2014

Tweet NLP (Carnegie Mellon)

From the webpage:

We provide a tokenizer, a part-of-speech tagger, hierarchical word clusters, and a dependency parser for tweets, along with annotated corpora and web-based annotation tools.

See the website for further details.

I can understand vendors mining tweets and trying to react to every twitch in some social stream, but the U.S. military is interested as well.

“Customer targeting” in their case has a whole different meaning.

Assuming you can identify one or more classes of tweets, would it be possible to mimic those patterns, albeit with some deviation in the content of the tweets? That is, what tweet content is weighted more heavily than other tweet content?

I first saw this in a tweet by Peter Skomoroch.

ADW (Align, Disambiguate and Walk) [Semantic Similarity]

Tuesday, October 14th, 2014

ADW (Align, Disambiguate and Walk) version 1.0 by Mohammad Taher Pilehvar.

From the webpage:

This package provides a Java implementation of ADW, a state-of-the-art semantic similarity approach that enables the comparison of lexical items at different lexical levels: from senses to texts. For more details about the approach please refer to:

The abstract for the paper reads:

Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-of-the-art performance on three tasks: semantic textual similarity, word similarity, and word sense coarsening.

Online Demo.

The strength of this approach is the use of multiple levels of semantic similarity. It relies on WordNet but the authors promise to extend their approach to named entities and other tokens not appearing in WordNet (like your company or industry’s internal vocabulary).

The bibliography of the paper cites much of the recent work in this area so that will be an added bonus for perusing the paper.

I first saw this in a tweet by Gregory Piatetsky.

Scientists confess to sneaking Bob Dylan lyrics into their work for the past 17 years

Sunday, September 28th, 2014

Scientists confess to sneaking Bob Dylan lyrics into their work for the past 17 years by Rachel Feltman.

From the post:

While writing an article about intestinal gasses 17 years ago, Karolinska Institute researchers John Lundberg and Eddie Weitzberg couldn’t resist a punny title: “Nitric Oxide and inflammation: The answer is blowing in the wind”.

Thus began their descent down the slippery slope of Bob Dylan call-outs. While the two men never put lyrics into their peer-reviewed studies, The Local Sweden reports, they started a personal tradition of getting as many Dylan quotes as possible into everything else they wrote — articles about other peoples’ work, editorials, book introductions, and so on.

An amusing illustration of one difficulty in natural language processing: allusion.

The Wikipedia article on allusion summarizes one typology of allusion (R. F. Thomas, “Virgil’s Georgics and the art of reference” Harvard Studies in Classical Philology 90 (1986) pp 171–98) as:

  1. Casual Reference, “the use of language which recalls a specific antecedent, but only in a general sense” that is relatively unimportant to the new context;
  2. Single Reference, in which the hearer or reader is intended to “recall the context of the model and apply that context to the new situation”; such a specific single reference in Virgil, according to Thomas, is a means of “making connections or conveying ideas on a level of intense subtlety”;
  3. Self-Reference, where the locus is in the poet’s own work;
  4. Corrective Allusion, where the imitation is clearly in opposition to the original source’s intentions;
  5. Apparent Reference, “which seems clearly to recall a specific model but which on closer inspection frustrates that intention”; and
  6. Multiple Reference or Conflation, which refers in various ways simultaneously to several sources, fusing and transforming the cultural traditions.

(emphasis in original)

Allusion is a sub-part of the larger subject of intertextuality.

Think of the difficulties that allusions introduce into NLP. With “Dylan lyrics meaning” as a quoted search string, I get over 60,000 “hits” consisting of widely varying interpretations. Add to that the interpretation of a Dylan allusion in a different context and you have a truly worthy NLP problem.

Two questions:

The Dylan post is one example of allusion. Is there any literature or sense of how much allusion occurs in specific types of writing?

Any literature on NLP techniques for dealing with allusion in general?

I first saw this in a tweet by Carl Anderson.

Tokenizing and Named Entity Recognition with Stanford CoreNLP

Friday, September 19th, 2014

Tokenizing and Named Entity Recognition with Stanford CoreNLP by Sujit Pal.

From the post:

I got into NLP using Java, but I was already using Python at the time, and soon came across the Natural Language Tool Kit (NLTK), and just fell in love with the elegance of its API. So much so that when I started working with Scala, I figured it would be a good idea to build an NLP toolkit with an API similar to NLTK’s, primarily as a way to learn NLP and Scala but also to build something that would be as enjoyable to work with as NLTK and have the benefit of Java’s rich ecosystem.

The project is perennially under construction, and serves as a test bed for my NLP experiments. In the past, I have used OpenNLP and LingPipe to build Tokenizer implementations that expose an API similar to NLTK’s. More recently, I have built a Named Entity Recognizer (NER) with OpenNLP’s NameFinder. At the recommendation of one of my readers, I decided to take a look at Stanford CoreNLP, with which I ended up building a Tokenizer and a NER implementation. This post describes that work.

Truly a hard core way to learn NLP and Scala!


Looking forward to hearing more about this project.

Getting Started with S4, The Self-Service Semantic Suite

Tuesday, September 16th, 2014

Getting Started with S4, The Self-Service Semantic Suite by Marin Dimitrov.

From the post:

Here’s how S4 developers can get started with The Self-Service Semantic Suite. This post provides you with practical information on the following topics:

  • Registering a developer account and generating API keys
  • RESTful services & free tier quotas
  • Practical examples of using S4 for text analytics and Linked Data querying

Ontotext is up front about the limitations on the “free” service:

  • 250 MB of text processed monthly (via the text analytics services)
  • 5,000 SPARQL queries monthly (via the LOD SPARQL service)

The number of pages in a megabyte of text varies depending on the text content, but assuming a working average of one (1) megabyte = five hundred (500) pages of text, you can analyze up to one hundred and twenty-five thousand (125,000) pages of text a month. Chump change for serious NLP, but it is a free account.
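The page estimate is easy to redo with your own numbers. A small Python sketch follows; the 500 pages per megabyte figure is my working assumption from above, not anything Ontotext publishes, and `pages_per_month` is a hypothetical helper.

```python
# Free tier quota stated in the post: 250 MB of text per month.
FREE_TIER_MB_PER_MONTH = 250

def pages_per_month(quota_mb, pages_per_mb=500):
    """Estimate pages of text covered by a monthly quota.

    pages_per_mb is an assumed average; adjust it for your own documents.
    """
    return quota_mb * pages_per_mb

print(pages_per_month(FREE_TIER_MB_PER_MONTH))       # 125000
print(pages_per_month(FREE_TIER_MB_PER_MONTH, 300))  # denser pages: 75000
```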

The post goes on to detail two scenarios:

  • Annotate a news document via the News analytics service
  • Send a simple SPARQL query to the Linked Data service

Learn how effective entity recognition and SPARQL are with data of interest to you, at a minimum of investment.

I first saw this in a tweet by Tony Agresta.

New Directions in Vector Space Models of Meaning

Tuesday, September 16th, 2014

New Directions in Vector Space Models of Meaning by Edward Grefenstette, Karl Moritz Hermann, Georgiana Dinu, and Phil Blunsom. (video)

From the description:

This is the video footage, aligned with slides, of the ACL 2014 Tutorial on New Directions in Vector Space Models of Meaning, by Edward Grefenstette (Oxford), Karl Moritz Hermann (Oxford), Georgiana Dinu (Trento) and Phil Blunsom (Oxford).

This tutorial was presented at ACL 2014 in Baltimore by Ed, Karl and Phil.

The slides can be found at

Running time is 2:45:12 so you had better get a cup of coffee before you start.

Includes a review of distributional models of semantics.

The sound isn’t bad but the acoustics are, so you will have to listen closely. Having the slides in front of you helps as well.

The semantics part starts to echo topic map theory with the realization that having a single token isn’t going to help you with semantics. Tokens don’t stand alone but in a context of other tokens, each of which has some contribution to make to the meaning of the token in question.

Topic maps function in a similar way with the realization that identifying any subject of necessity involves other subjects, which have their own identifications. For some purposes, we may assume some subjects are sufficiently identified without specifying the subjects that in our view identify them, but that is merely a design choice that others may choose to make differently.

Working through this tutorial and the cited references (one advantage to the online version) will leave you with a background in vector space models and the contours of the latest research.

I first saw this in a tweet by Kevin Safford.

Deep dive into understanding human language with Python

Saturday, September 13th, 2014

Deep dive into understanding human language with Python by Alyona Medelyan.

From the description:

Whenever your data is text and you need to analyze it, you are likely to need Natural Language Processing algorithms that help make sense of human language. They will help you answer questions like: Who is the author of this text? What is his or her attitude? What is it about? What facts does it mention? Do I have similar texts like this one already? Where does it belong to?

This tutorial will cover several open-source Natural Language Processing Python libraries such as NLTK, Gensim and TextBlob, show you how they work and how you can use them effectively.

Level: Intermediate (knowledge of basic Python language features is assumed)

Pre-requisites: a Python environment with NLTK, Gensim and TextBlob already installed. Please make sure to run and install movie_reviews and stopwords (under Corpora), as well as POS model (under Models).

Code examples, data and slides from Alyona’s NLP tutorial at KiwiPyCon 2014.

Introduction to NLTK, Gensim and TextBlob.

Not enough to make you dangerous but enough to get you interested in natural language processing.

Recursive Deep Learning For Natural Language Processing And Computer Vision

Wednesday, September 10th, 2014

Recursive Deep Learning For Natural Language Processing And Computer Vision by Richard Socher.

From the abstract:

As the amount of unstructured text data that humanity produces overall and on the Internet grows, so does the need to intelligently process it and extract different types of knowledge from it. My research goal in this thesis is to develop learning models that can automatically induce representations of human language, in particular its structure and meaning in order to solve multiple higher level language tasks.

There has been great progress in delivering technologies in natural language processing such as extracting information, sentiment analysis or grammatical analysis. However, solutions are often based on different machine learning models. My goal is the development of general and scalable algorithms that can jointly solve such tasks and learn the necessary intermediate representations of the linguistic units involved. Furthermore, most standard approaches make strong simplifying language assumptions and require well designed feature representations. The models in this thesis address these two shortcomings. They provide effective and general representations for sentences without assuming word order independence. Furthermore, they provide state of the art performance with no, or few manually designed features.

The new model family introduced in this thesis is summarized under the term Recursive Deep Learning. The models in this family are variations and extensions of unsupervised and supervised recursive neural networks (RNNs) which generalize deep and feature learning ideas to hierarchical structures. The RNN models of this thesis obtain state of the art performance on paraphrase detection, sentiment analysis, relation classification, parsing, image-sentence mapping and knowledge base completion, among other tasks.

Socher’s models offer two significant advances:

  • No assumption of word order independence
  • No or few manually designed features

Of the two, I am more partial to elimination of the assumption of word order independence. I suppose in part because I see that leading to abandoning the assumption that words have some fixed meaning separate and apart from the other words used to define them.

Or in topic maps parlance, identifying a subject always involves the use of other subjects, which are themselves capable of being identified. Think about it. When was the last time you were called upon to identify a person, object or thing and you uttered an IRI? Never, right?

That certainly works, at least in closed domains, in some cases, but other than simply repeating the string, you have no basis on which to conclude that is the correct IRI. Nor does anyone else have a basis to accept or reject your IRI.

I suppose that is another one of those “simplifying” assumptions. Useful in some cases but not all.

BootCaT: Simple Utilities to Bootstrap Corpora And Terms from the Web

Tuesday, September 9th, 2014

BootCaT: Simple Utilities to Bootstrap Corpora And Terms from the Web

From the webpage:

Despite certain obvious drawbacks (e.g. lack of control, sampling, documentation etc.), there is no doubt that the World Wide Web is a mine of language data of unprecedented richness and ease of access.

It is also the only viable source of “disposable” corpora built ad hoc for a specific purpose (e.g. a translation or interpreting task, the compilation of a terminological database, domain-specific machine learning tasks). These corpora are essential resources for language professionals who routinely work with specialized languages, often in areas where neologisms and new terms are introduced at a fast pace and where standard reference corpora have to be complemented by easy-to-construct, focused, up-to-date text collections.

While it is possible to construct a web-based corpus through manual queries and downloads, this process is extremely time-consuming. The time investment is particularly unjustified if the final result is meant to be a single-use corpus.

The command-line scripts included in the BootCaT toolkit implement an iterative procedure to bootstrap specialized corpora and terms from the web, requiring only a list of “seeds” (terms that are expected to be typical of the domain of interest) as input.

In implementing the algorithm, we followed the old UNIX adage that each program should do only one thing, but do it well. Thus, we developed a small, independent tool for each separate subtask of the algorithm.

As a result, BootCaT is extremely modular: one can easily run a subset of the programs, look at intermediate output files, add new tools to the suite, or change one program without having to worry about the others.

Any application following “the old UNIX adage that each program should do only one thing, but do it well” merits serious consideration.

Occurs to me that BootCaT would also be useful for creating small text collections for comparison to each other.


I first saw this in a tweet by Alyona Medelyan.

Python-ZPar – Python Wrapper for ZPAR

Monday, September 8th, 2014

Python-ZPar – Python Wrapper for ZPAR by Nitin Madnani.

From the webpage:

python-zpar is a python wrapper around the ZPar parser. ZPar was written by Yue Zhang while he was at Oxford University. According to its home page: ZPar is a statistical natural language parser, which performs syntactic analysis tasks including word segmentation, part-of-speech tagging and parsing. ZPar supports multiple languages and multiple grammar formalisms. ZPar has been most heavily developed for Chinese and English, while it provides generic support for other languages. ZPar is fast, processing above 50 sentences per second using the standard Penn Treebank (Wall Street Journal) data.

I wrote python-zpar since I needed a fast and efficient parser for my NLP work which is primarily done in Python and not C++. I wanted to be able to use this parser directly from Python without having to create a bunch of files and running them through subprocesses. python-zpar not only provides a simple Python wrapper but also provides an XML-RPC ZPar server to make batch-processing of large files easier.

python-zpar uses ctypes, a very cool foreign function library bundled with Python that allows calling functions in C DLLs or shared libraries directly.
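If you have not used ctypes, here is a minimal sketch of the general idea: calling a function from a C shared library directly from Python. It is not python-zpar’s code. It calls `strlen` from the standard C library and assumes a POSIX system (Linux or macOS); on Windows the library would have to be loaded differently.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX system. find_library may return None, in which case
# CDLL(None) falls back to the symbols already loaded into this process,
# which include the C library on POSIX.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# C strings are bytes, so pass a bytes object rather than str.
length = libc.strlen(b"natural language")
print(length)  # 16
```

Declaring `argtypes` and `restype` is optional for simple cases, but it lets ctypes catch mismatched arguments instead of passing garbage to C.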

Just in case you are looking for a language parser for Chinese or English.

It is only a matter of time before commercial opportunities are going to force greater attention on non-English languages. Forewarned is forearmed.

New York Times Annotated Corpus Add-On

Wednesday, August 27th, 2014

New York Times corpus add-on annotations: MIDs and Entity Salience. (GitHub – Data)

From the webpage:

The data included in this release accompanies the paper, entitled “A New Entity Salience Task with Millions of Training Examples” by Jesse Dunietz and Dan Gillick (EACL 2014).

The training data includes 100,834 documents from 2003-2006, with 19,261,118 annotated entities. The evaluation data includes 9,706 documents from 2007, with 187,080 annotated entities.

An empty line separates each document annotation. The first line of a document’s annotation contains the NYT document id followed by the title. Each subsequent line refers to an entity, with the following tab-separated fields:

  • entity index
  • automatically inferred salience {0,1}
  • mention count (from our coreference system)
  • first mention’s text
  • byte offset start position for the first mention
  • byte offset end position for the first mention
  • MID (from our entity resolution system)
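A file in the layout just described is straightforward to read. Here is a minimal Python sketch under stated assumptions: the sample data is entirely made up (ids, titles, offsets and MIDs are placeholders), I assume the document id and title on the first line are tab-separated, and I assume every entity line carries all seven fields. Check those against the real files before relying on it.

```python
from typing import NamedTuple

class Entity(NamedTuple):
    index: int
    salient: bool        # automatically inferred salience, 0 or 1
    mention_count: int
    first_mention: str
    start: int           # byte offset where the first mention starts
    end: int             # byte offset where the first mention ends
    mid: str

def parse_annotations(text):
    """Parse blank-line-separated document blocks into a list of dicts."""
    documents = []
    for block in text.strip().split("\n\n"):
        lines = block.split("\n")
        # Assumption: document id and title are separated by a tab.
        doc_id, _, title = lines[0].partition("\t")
        entities = []
        for line in lines[1:]:
            idx, sal, count, mention, start, end, mid = line.split("\t")
            entities.append(Entity(int(idx), sal == "1", int(count),
                                   mention, int(start), int(end), mid))
        documents.append({"id": doc_id, "title": title, "entities": entities})
    return documents

# Made-up sample in the described layout; all values are placeholders.
sample = (
    "1000001\tAn Invented Headline\n"
    "0\t1\t3\tAcme Corp\t12\t21\t/m/0xxxx1\n"
    "1\t0\t1\tSpringfield\t40\t51\t/m/0xxxx2\n"
    "\n"
    "1000002\tAnother Invented Headline\n"
    "0\t1\t2\tJane Doe\t5\t13\t/m/0xxxx3\n"
)

docs = parse_annotations(sample)
salient = [e.first_mention for d in docs for e in d["entities"] if e.salient]
print(salient)  # ['Acme Corp', 'Jane Doe']
```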

The background in Teaching machines to read between the lines (and a new corpus with entity salience annotations) by Dan Gillick and Dave Orr will be useful.

From the post:

Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas.

Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.

We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people — we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.

Term ratios are a start, but we can do better. Search indexing these days is much more involved, using for example the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real word which we already already know quite a bit about. (emphasis added)

Truly an important data set but I’m rather partial to that last line. 😉

So the question is: if we “recognize” an entity as salient, do we annotate the entity and:

  • Present the reader with a list of links, each to a separate mention with or without ads?
  • Present the reader with what is known about the entity, with or without ads?

I see enough divided posts and other information that forces readers to endure more ads that I consciously avoid buying anything for which I see a web ad. I suggest you do the same. (If possible.) I buy books, for example, because someone known to me recommends them, not because some marketeer pushes them at me across many domains.

Deep Learning for NLP (without Magic)

Tuesday, August 19th, 2014

Deep Learning for NLP (without Magic) by Richard Socher and Christopher Manning.


Machine learning is everywhere in today’s NLP, but by and large machine learning amounts to numerical optimization of weights for human designed representations and features. The goal of deep learning is to explore how computers can take advantage of data to develop features and representations appropriate for complex interpretation tasks. This tutorial aims to cover the basic motivation, ideas, models and learning algorithms in deep learning for natural language processing. Recently, these methods have been shown to perform very well on various NLP tasks such as language modeling, POS tagging, named entity recognition, sentiment analysis and paraphrase detection, among others. The most attractive quality of these techniques is that they can perform well without any external hand-designed resources or time-intensive feature engineering. Despite these advantages, many researchers in NLP are not familiar with these methods. Our focus is on insight and understanding, using graphical illustrations and simple, intuitive derivations. The goal of the tutorial is to make the inner workings of these techniques transparent, intuitive and their results interpretable, rather than black boxes labeled “magic here”. The first part of the tutorial presents the basics of neural networks, neural word vectors, several simple models based on local windows and the math and algorithms of training via backpropagation. In this section applications include language modeling and POS tagging. In the second section we present recursive neural networks which can learn structured tree outputs as well as vector representations for phrases and sentences. We cover both equations as well as applications. We show how training can be achieved by a modified version of the backpropagation algorithm introduced before. These modifications allow the algorithm to work on tree structures. Applications include sentiment analysis and paraphrase detection. 
We also draw connections to recent work in semantic compositionality in vector spaces. The principle goal, again, is to make these methods appear intuitive and interpretable rather than mathematically confusing. By this point in the tutorial, the audience members should have a clear understanding of how to build a deep learning system for word-, sentence- and document-level tasks. The last part of the tutorial gives a general overview of the different applications of deep learning in NLP, including bag of words models. We will provide a discussion of NLP-oriented issues in modeling, interpretation, representational power, and optimization.

A tutorial on deep learning from NAACL 2013, Atlanta. The webpage offers links to the slides (205), video of the tutorial, and additional resources.

Definitely a place to take a dive into deep learning.

On page 35 of the slides the following caught my eye:

The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk.

In vector space terms, this is a vector with one 1 and a lot of zeroes.


Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)

We call this a “one-hot” representation. Its problem:

motel [000000000010000] AND
hotel [000000010000000] = 0
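
The slide's point can be checked directly. The sketch below copies the two bit strings from the slide and computes their dot product. The helper names and the three-word toy vocabulary at the end are illustrative assumptions, not part of the tutorial.

```python
def from_bits(bits: str) -> list:
    """Turn a string such as '00100' into a list of ints."""
    return [int(c) for c in bits]


def dot(u: list, v: list) -> int:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Vectors exactly as shown on the slide.
motel = from_bits("000000000010000")
hotel = from_bits("000000010000000")

# The single 1s sit at different positions, so every product term is 0.
motel_hotel = dot(motel, hotel)
# A word compared with itself scores 1, the only non-zero case.
motel_motel = dot(motel, motel)
print(motel_hotel, motel_motel)


def one_hot(index: int, size: int) -> list:
    """Vector of length `size` with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


# Hypothetical toy vocabulary: every pair of distinct words scores 0,
# whether or not the words are related in meaning.
vocab = ["hotel", "motel", "walk"]
vectors = {w: one_hot(i, len(vocab)) for i, w in enumerate(vocab)}
distinct_scores = {
    (a, b): dot(vectors[a], vectors[b])
    for a in vocab
    for b in vocab
    if a != b
}
```

Under a one-hot encoding, “motel” is exactly as far from “hotel” as it is from “walk”, which is the problem the slide is naming.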

Another aspect of topic maps comes to the fore!

You can have “one-hot” representations of subjects in a topic map, that is a single identifier, but that’s not required.

You can have multiple “one-hot” representations for a subject or you can have more complex collections of properties that represent a subject. Depends on your requirements, not a default of the technology.

If “one-hot” representations of subjects are insufficient for deep learning, shouldn’t they be insufficient for humans as well?

NLTK 3.0 Beta!

Wednesday, July 23rd, 2014

NLTK 3.0 Beta!

The official name is nltk 3.0.0b1 but I thought 3.0 beta rolls off the tongue better. 😉

Interface changes.

Grab the latest, contribute bug reports, etc.

Artificial Intelligence | Natural Language Processing

Friday, July 18th, 2014

Artificial Intelligence | Natural Language Processing by Christopher Manning.

From the webpage:

This course is designed to introduce students to the fundamental concepts and ideas in natural language processing (NLP), and to get them up to speed with current research in the area. It develops an in-depth understanding of both the algorithms available for the processing of linguistic information and the underlying computational properties of natural languages. Word-level, syntactic, and semantic processing from both a linguistic and an algorithmic perspective are considered. The focus is on modern quantitative techniques in NLP: using large corpora, statistical models for acquisition, disambiguation, and parsing. Also, it examines and constructs representative systems.

Lectures with notes.

If you are new to natural language processing, it would be hard to point at a better starting point.



Friday, June 20th, 2014

Book-NLP: Natural language processing pipeline for book-length documents.

From the webpage:

BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including:

  • Part-of-speech tagging (Stanford)
  • Dependency parsing (MaltParser)
  • Named entity recognition (Stanford)
  • Character name clustering (e.g., “Tom”, “Tom Sawyer”, “Mr. Sawyer”, “Thomas Sawyer” -> TOM_SAWYER)
  • Quotation speaker identification
  • Pronominal coreference resolution

I can think of several classes of documents where this would be useful. Congressional hearing documents for example. Agency reports would be another.

Not the final word for mapping but certainly an assist to an author.

How Arrogant Are You?

Wednesday, June 11th, 2014


From the webpage:

AnalyzeWords helps reveal your personality by looking at how you use words. It is based on good scientific research connecting word use to who people are. So go to town – enter your Twitter name or the handles of friends, lovers, or Hollywood celebrities to learn about their emotions, social styles, and the ways they think.

Even though it is “…based on good scientific research…”, I would not take the results too seriously.

Any more than I would take advice from a book called “I’m OK, You’re OK.” (I know the first isn’t likely and the second is untrue.) 😉

Play with it over a couple of days and try to guess the relationships between words and your ratings.

I first saw this in a tweet by Alyona Medelyan.

Puck, a high-speed GPU-powered parser

Monday, June 9th, 2014

Puck, a high-speed GPU-powered parser by David Hall.

From the post:

I’m pleased to announce yet another new parser called Puck. Puck is a lightning-fast parser that uses Nvidia GPUs to do most of its computation. It is capable of parsing over 400 sentences per second, or about half a million words per minute. (Most CPU constituency parsers of its quality are on the order of 10 sentences per second.)

Puck is based on the same grammars used in the Berkeley Parser, and produces nearly identical trees. Puck is only available for English right now.

For more information about Puck, please see the project github page, or the accompanying paper.

Because some of its dependencies are not formally released yet (namely the wonderful JavaCL library), I can’t push artifacts to Maven Central. Instead I’ve uploaded a fat assembly jar here: (See the readme on github for how to use it.) It’s better used as a command line tool, anyway.

Even more motivation for learning to use GPUs!
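
The throughput figures quoted from the post can be cross-checked with a little arithmetic. The sketch below uses only the numbers in the quote. The implied average sentence length is a derived value, not something the post states, and the speed-up is only as precise as the post's “on the order of 10 sentences per second”.

```python
# Figures quoted in the post.
puck_sentences_per_second = 400
puck_words_per_minute = 500_000       # "about half a million words per minute"
cpu_sentences_per_second = 10         # "on the order of 10 sentences per second"

# Convert the sentence rate to a per-minute figure.
puck_sentences_per_minute = puck_sentences_per_second * 60

# Average sentence length the two quoted rates imply (derived, not quoted).
implied_words_per_sentence = puck_words_per_minute / puck_sentences_per_minute

# Rough speed-up over a CPU constituency parser of similar quality.
speedup = puck_sentences_per_second / cpu_sentences_per_second

print(puck_sentences_per_minute)
print(round(implied_words_per_sentence, 1))
print(speedup)
```

The two rates agree if sentences average roughly twenty words, which is plausible for edited English text.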

I first saw this in a tweet by Jason Baldridge.

Erin McKean, founder, Reverb

Saturday, May 31st, 2014

10 Questions: Erin McKean, founder, Reverb by Chanelle Bessette.

From the introduction to the interview:

At OUP, McKean began to question how effective paper dictionaries were for the English language. Every word is a data point that has no meaning unless it is put in context, she believed, and a digital medium was the only way to link them together. If the printed dictionary is an atlas, she reasoned, the digital dictionary is a GPS device.

McKean’s idea was to create an online dictionary, dubbed Wordnik, that not only defined words but also showed how words related to each other, thereby increasing the chance that people can find the exact word that they are looking for. Today, the technology behind Wordnik is used to power McKean’s latest company, Reverb. Reverb’s namesake product is a news discovery mobile application that recommends stories based on contextual clues in the words of the article. (Even if that word is “lexicography.”)

Another case where I need a mobile phone to view a new technology. 🙁

I ran across DARLING, which promises it isn’t ready to emulate an iPhone on Ubuntu.

Do you know of another iPhone emulator for Ubuntu?


brat rapid annotation tool

Sunday, May 11th, 2014

brat rapid annotation tool

From the introduction:

brat is a web-based tool for text annotation; that is, for adding notes to existing text documents.

brat is designed in particular for structured annotation, where the notes are not free form text but have a fixed form that can be automatically processed and interpreted by a computer.
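
To make “a fixed form that can be automatically processed” concrete, here is a minimal sketch that reads text-bound annotations from brat's standoff (`.ann`) format, where a line looks like `T1<TAB>Organization 0 4<TAB>Sony`. It rests on assumptions: the sentence, labels and relation are invented, the function and class names are hypothetical, and discontinuous spans and every annotation type other than `T` lines are deliberately skipped.

```python
from typing import List, NamedTuple


class TextBound(NamedTuple):
    ann_id: str   # e.g. "T1"
    label: str    # annotation type, e.g. "Organization"
    start: int    # character offset where the span starts
    end: int      # character offset just past the end of the span
    text: str     # the annotated text itself


def parse_text_bound(ann: str) -> List[TextBound]:
    """Collect contiguous text-bound ("T") annotations from standoff text."""
    spans = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, attributes and notes are ignored here
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        if ";" in offsets:
            continue  # discontinuous span; not handled in this sketch
        start, end = offsets.split()
        spans.append(TextBound(ann_id, label, int(start), int(end), text))
    return spans


# Invented document and annotations for illustration.
DOC = "Sony will acquire Acme."
ANN = (
    "T1\tOrganization 0 4\tSony\n"
    "T2\tOrganization 18 22\tAcme\n"
    "R1\tAcquires Arg1:T1 Arg2:T2\n"
)

spans = parse_text_bound(ANN)
for span in spans:
    # The offsets should select exactly the annotated text.
    print(span.ann_id, span.label, DOC[span.start:span.end] == span.text)
```

Because the annotations live apart from the text and point into it by offset, the source document is never modified.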

The examples page has examples of:

  • Entity mention detection
  • Event extraction
  • Coreference resolution
  • Normalization
  • Chunking
  • Dependency syntax
  • Meta-knowledge
  • Information extraction
  • Bottom-up Metaphor annotation
  • Visualization
  • Information Extraction system evaluation

I haven’t installed the local version but it is on my to-do list.

I first saw this in a tweet by Steven Bird.

Parsing English with 500 lines of Python

Monday, April 28th, 2014

Parsing English with 500 lines of Python by Matthew Honnibal.

From the post:

A syntactic parser describes a sentence’s grammatical structure, to help another application reason about it. Natural languages introduce many unexpected ambiguities, which our world-knowledge immediately filters out. A favourite example:

Definitely a post to savor if you have any interest in natural language processing.

I first saw this in a tweet by Jim Salmons.

A New Entity Salience Task with Millions of Training Examples

Monday, March 10th, 2014

A New Entity Salience Task with Millions of Training Examples by Jesse Dunietz and Dan Gillick.


Although many NLP systems are moving toward entity-based processing, most still identify important phrases using classical keyword-based approaches. To bridge this gap, we introduce the task of entity salience: assigning a relevance score to each entity in a document. We demonstrate how a labeled corpus for the task can be automatically generated from a corpus of documents and accompanying abstracts. We then show how a classifier with features derived from a standard NLP pipeline outperforms a strong baseline by 34%. Finally, we outline initial experiments on further improving accuracy by leveraging background knowledge about the relationships between entities.
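
The abstract says the labels are generated automatically from documents and their accompanying abstracts. One way to picture that is the toy sketch below. It assumes that an entity counts as salient when it is mentioned in the abstract, and it replaces the paper's NLP pipeline (coreference, entity resolution) with case-insensitive substring matching. The function name and example data are invented.

```python
from typing import Dict, List


def label_salience(entities: List[str], abstract: str) -> Dict[str, int]:
    """Label each entity 1 if its name appears in the abstract, else 0.

    Assumption: substring matching is a crude stand-in for aligning
    document entities with abstract entities.
    """
    abstract_lower = abstract.lower()
    return {name: int(name.lower() in abstract_lower) for name in entities}


# Invented example: entities found in an article body, plus its abstract.
entities = ["Acme Corp", "Springfield", "Jane Doe"]
abstract = "Acme Corp names Jane Doe as chief executive."

labels = label_salience(entities, abstract)
print(labels)
```

The {0,1} labels are the kind of training target the abstract describes, which a classifier over document features could then learn to predict.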

The article concludes:

We believe entity salience is an important task with many applications. To facilitate further research, our automatically generated salience annotations, along with resolved entity ids, for the subset of the NYT corpus discussed in this paper are available here:

A classic approach to a CS article: new approach/idea, data + experiments, plus results and code. It doesn’t get any better.

The results won’t be perfect, but the question is: are they “acceptable” results?

Which presumes a working definition of “acceptable” that you have hammered out with your client.

I first saw this in a tweet by Stefano Bertolo.