## Archive for the ‘Natural Language Processing’ Category

### pynlp – Pythonic Wrapper for Stanford CoreNLP [& Rand Paul]

Tuesday, November 14th, 2017

The example text for this wrapper:

text = (
    'GOP Sen. Rand Paul was assaulted in his home in Bowling Green, Kentucky, on Friday, '
    'according to Kentucky State Police. State troopers responded to a call to the senator\'s '
    'residence at 3:21 p.m. Friday. Police arrested a man named Rene Albert Boucher, who they '
    'allege "intentionally assaulted" Paul, causing him "minor injury. Boucher, 59, of Bowling '
    'Green was charged with one count of fourth-degree assault. As of Saturday afternoon, he '
    'was being held in the Warren County Regional Jail on a $5,000 bond.'
)


[Warning: Reformatted for readability. See the GitHub page for the text]

Nice to see examples using contemporary texts. Any of the recent sexual abuse apologies or non-apologies would work as well.

Enjoy!

### CMU Neural Networks for NLP 2017 (16 Lectures)

Saturday, October 28th, 2017

Neural networks provide powerful new tools for modeling language, and have been used both to improve the state-of-the-art in a number of tasks and to tackle new problems that were not easy in the past. This class will start with a brief overview of neural networks, then spend the majority of the class demonstrating how to apply neural networks to natural language problems. Each section will introduce a particular problem or phenomenon in natural language, describe why it is difficult to model, and demonstrate several models that were designed to tackle this problem. In the process of doing so, the class will cover different techniques that are useful in creating neural network models, including handling variably sized and structured sentences, efficient handling of large data, semi-supervised and unsupervised learning, structured prediction, and multilingual modeling.

Suggested pre-requisite: 11-711 “Algorithms for NLP”.

I wasn’t able to find videos for the algorithms for NLP course but you can explore the following as supplemental materials:

Each of these courses can be found in two places: YouTube and Academic Torrents. The advantage of Academic Torrents is that you can also download the supplementary course materials, like transcripts, PDFs, or PPTs.

1. Natural Language Processing: Dan Jurafsky and Christopher Manning, Stanford University. YouTube | Academic Torrents
2. Natural Language Processing: Michael Collins, Columbia University. YouTube | Academic Torrents
3. Introduction to Natural Language Processing: Dragomir Radev, University of Michigan. YouTube | Academic Torrents

Enjoyable but not as suited to binge watching as Stranger Things. 😉

Enjoy!

### Building Data Science with JS – Lifting the Curtain on Game Reviews

Saturday, October 7th, 2017

Building Data Science with JS by Tim Ermilov.

Three videos thus far:

Building Data Science with JS – Part 1 – Introduction

Building Data Science with JS – Part 2 – Microservices

Building Data Science with JS – Part 3 – RabbitMQ and OpenCritic microservice

Tim starts with the observation that the percentage of users assigning a score to a game isn’t very helpful. It tells you nothing about the content of the game and/or the person rating it.

In subject identity terms, each level, mighty, strong, weak, fair, collapses information about the game and a particular reviewer into a single summary subject. OpenCritic then displays the percent of reviewers who are represented by that summary subject.

The problem with the summary subject is that one critic may have down rated the game for poor content, another for sexism and still another for bad graphics. But a user knows only that, for reasons unknown, a critic whose past behavior is unknown evaluated unknown content and assigned it a rating.
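A minimal sketch of that collapse, in Python rather than the JS of the series. The critics, levels and reasons below are invented for illustration and are not OpenCritic data or Ermilov's code.

```python
from collections import Counter

# Hypothetical reviews: (critic, rating level, reason for the rating).
reviews = [
    ("critic_a", "weak", "poor content"),
    ("critic_b", "weak", "sexism"),
    ("critic_c", "weak", "bad graphics"),
    ("critic_d", "strong", "great story"),
]


def percent_by_level(rows):
    """Collapse reviews into the percent of critics at each rating level."""
    counts = Counter(level for _critic, level, _reason in rows)
    return {level: 100 * n / len(rows) for level, n in counts.items()}


summary = percent_by_level(reviews)
print(summary)
```

Three different reasons for a "weak" rating end up as one number. Neither the critic nor the reason survives in the summary.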

A user could read all the reviews, study the history of each reviewer, along with the other games they have evaluated, but Ermilov proposes a more efficient means to peek behind the curtain of game ratings. (part 1)

In part 2, Ermilov designs a microservice based application to extract, process and display game reviews.

If you thought the first two parts were slow, you should enjoy Part 3. 😉 Ermilov speeds through a number of resources, documents, JS libraries, not to mention his source code for the project. You are likely to hit pause during this video.

Some links you will find helpful for Part 3:

AMQP 0-9-1 library and client for Node.JS – Channel-oriented API reference

AMQP 0-9-1 library and client for Node.JS (GitHub)

https://github.com/BuildingXwithJS

https://github.com/BuildingXwithJS/building-data-science-with-js

Microwork – simple creation of distributed scalable microservices in node.js with RabbitMQ (simplifies use of AMQP)

node-unfluff – Automatically extract body content (and other cool stuff) from an html document

OpenCritic

RabbitMQ. (Recommends looking at the RabbitMQ tutorials.)

### NLP tools for East Asian languages

Thursday, September 28th, 2017

NLP tools for East Asian languages

CLARIN is building a list of NLP tools for East Asian languages.

Oh, sorry:

CLARIN – European Research Infrastructure for Language Resources and Technology

CLARIN makes digital language resources available to scholars, researchers, students and citizen-scientists from all disciplines, especially in the humanities and social sciences, through single sign-on access. CLARIN offers long-term solutions and technology services for deploying, connecting, analyzing and sustaining digital language data and tools. CLARIN supports scholars who want to engage in cutting edge data-driven research, contributing to a truly multilingual European Research Area.

CLARIN stands for “Common Language Resources and Technology Infrastructure”.

Contribute to the spreadsheet of NLP tools and enjoy the CLARIN website.

### Deep Learning for NLP Best Practices

Wednesday, July 26th, 2017

From the introduction:

This post is a collection of best practices for using neural networks in Natural Language Processing. It will be updated periodically as new insights become available and in order to keep track of our evolving understanding of Deep Learning for NLP.

There has been a running joke in the NLP community that an LSTM with attention will yield state-of-the-art performance on any task. While this has been true over the course of the last two years, the NLP community is slowly moving away from this now standard baseline and towards more interesting models.

However, we as a community do not want to spend the next two years independently (re-)discovering the next LSTM with attention. We do not want to reinvent tricks or methods that have already been shown to work. While many existing Deep Learning libraries already encode best practices for working with neural networks in general, such as initialization schemes, many other details, particularly task or domain-specific considerations, are left to the practitioner.

This post is not meant to keep track of the state-of-the-art, but rather to collect best practices that are relevant for a wide range of tasks. In other words, rather than describing one particular architecture, this post aims to collect the features that underly successful architectures. While many of these features will be most useful for pushing the state-of-the-art, I hope that wider knowledge of them will lead to stronger evaluations, more meaningful comparison to baselines, and inspiration by shaping our intuition of what works.

I assume you are familiar with neural networks as applied to NLP (if not, I recommend Yoav Goldberg’s excellent primer [43]) and are interested in NLP in general or in a particular task. The main goal of this article is to get you up to speed with the relevant best practices so you can make meaningful contributions as soon as possible.

I will first give an overview of best practices that are relevant for most tasks. I will then outline practices that are relevant for the most common tasks, in particular classification, sequence labelling, natural language generation, and neural machine translation.

Certainly a resource to bookmark while you read A Primer on Neural Network Models for Natural Language Processing by Yoav Goldberg (76 pages), and to consult frequently as you move beyond the primer stage.

Enjoy and pass it on!

### The Classical Language Toolkit

Tuesday, July 11th, 2017

The Classical Language Toolkit

From the webpage:

The Classical Language Toolkit (CLTK) offers natural language processing (NLP) support for the languages of Ancient, Classical, and Medieval Eurasia. Greek and Latin functionality are currently most complete.

Goals

• compile analysis-friendly corpora;
• collect and generate linguistic data;
• act as a free and open platform for generating scientific research.

You are sure to find one or more languages of interest:

Collecting, analyzing and mapping Tweets can be profitable and entertaining, but tomorrow or perhaps by next week, almost no one will read them again.

The texts in this project survived by hand preservation for thousands of years. People are still reading them.

### Dive Into NLTK – Update – No NLTK Book 2nd Edition

Wednesday, April 19th, 2017

Dive Into NLTK, Part I: Getting Started with NLTK

From the webpage:

NLTK is the most famous Python Natural Language Processing Toolkit, here I will give a detail tutorial about NLTK. This is the first article in a series where I will write everything about NLTK with Python, especially about text mining and text analysis online.

This is the first article in the series “Dive Into NLTK”, here is an index of all the articles in the series that have been published to date:

My first post on this series listed only the first seven lessons.

There’s another reason for this update.

It appears that no second edition of Natural Language Processing with Python is likely to appear.

Sounds like an opportunity for the NLTK community to continue the work already started.

I don’t have the chops to contribute high quality code but would be willing to work with others on proofing/editing (that’s the part of book production readers rarely see).

### Stanford CoreNLP – a suite of core NLP tools (3.7.0)

Thursday, January 12th, 2017

Stanford CoreNLP – a suite of core NLP tools

The beta is over and Stanford CoreNLP 3.7.0 is on the street!

From the webpage:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get quotes people said, etc.

Choose Stanford CoreNLP if you need:

• An integrated toolkit with a good range of grammatical analysis tools
• Fast, reliable analysis of arbitrary texts
• The overall highest quality text analytics
• Support for a number of major (human) languages
• Available interfaces for most major modern programming languages
• Ability to run as a simple web service

Stanford CoreNLP’s goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A tool pipeline can be run on a piece of plain text with just two lines of code. CoreNLP is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Moreover, an annotator pipeline can include additional custom or third-party annotators. CoreNLP’s analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.
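The "simple web service" point can be sketched with nothing but the Python standard library. This assumes a CoreNLP server is already running locally on port 9000 and accepts a JSON `properties` query parameter, as the CoreNLP server documentation describes. The function name and the sample sentence are my own. The sketch only builds the request, and the sending step is left commented out.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_request(text, annotators="tokenize,ssplit,pos",
                          server="http://localhost:9000"):
    """Build (but do not send) a POST request for a CoreNLP server.

    The server address is an assumption; change it to match your setup.
    """
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    return urllib.request.Request(
        url=f"{server}/?{query}",
        data=text.encode("utf-8"),
        method="POST",
    )


request = build_corenlp_request("Stanford CoreNLP 3.7.0 is on the street!")
print(request.full_url)

# With a server running, the annotations could be fetched like this:
# with urllib.request.urlopen(request) as response:
#     annotations = json.load(response)
```

Changing the `annotators` string is the "single option" that turns tools on and off.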

What stream of noise, sorry, news are you going to pipe into the Stanford CoreNLP framework?

😉

Imagine a web service that offers levels of analysis alongside news text.

Or does the same with leaked emails and/or documents?

### Ulysses, Joyce and Stanford CoreNLP

Saturday, November 26th, 2016

Introduction to memory and time usage

From the webpage:

People not infrequently complain that Stanford CoreNLP is slow or takes a ton of memory. In some configurations this is true. In other configurations, this is not true. This section tries to help you understand what you can or can’t do about speed and memory usage. The advice applies regardless of whether you are running CoreNLP from the command-line, from the Java API, from the web service, or from other languages. We show command-line examples here, but the principles are true of all ways of invoking CoreNLP. You will just need to pass in the appropriate properties in different ways. For these examples we will work with chapter 13 of Ulysses by James Joyce. You can download it if you want to follow along.

You have to appreciate the use of a non-trivial text for advice on speed and memory usage of CoreNLP.

How does your text stack up against Chapter 13 of Ulysses?

I’m supposed to be reading Ulysses long distance with a friend. I’m afraid we have both fallen behind. Perhaps this will encourage me to have another go at it.

What favorite or “should read” text would you use to practice with CoreNLP?

Suggestions?

### Stanford CoreNLP v3.7.0 beta is out! [Time is short, comments, bug reports, now!]

Thursday, November 3rd, 2016

Stanford CoreNLP v3.7.0 beta

The tweets I saw from Stanford NLP Group read:

Stanford CoreNLP v3.7.0 beta is out—improved coreference, dep parsing—KBP relation annotator—Arabic pipeline #NLProc

We’re doing an official CoreNLP beta release this time, so bugs, comments, and fixes especially appreciated over the next two weeks!

OK, so, what are you waiting for? 😉

Oh, the standard blurb for your boss on why Stanford CoreNLP should be taking up your time:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.

Choose Stanford CoreNLP if you need:

• An integrated toolkit with a good range of grammatical analysis tools
• Fast, reliable analysis of arbitrary texts
• The overall highest quality text analytics
• Support for a number of major (human) languages
• Interfaces available for various major modern programming languages
• Ability to run as a simple web service

Stanford CoreNLP is an integrated framework. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A CoreNLP tool pipeline can be run on a piece of plain text with just two lines of code. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Using the standard blurb about the Stanford CoreNLP has these advantages:

• It’s copy-n-paste, you didn’t have to write it
• It’s an appeal to authority (Stanford)
• It’s truthful

The truthful point is a throw-away these days but I thought I should mention it. 😉

### Mentioning Nazis or Hitler

Thursday, May 5th, 2016

78% of Reddit Threads With 1,000+ Comments Mention Nazis

From the post:

Let me start this post by noting that I will not attempt to test Godwin’s Law, which states that:

As an online discussion grows longer, the probability of a comparison involving Nazis or Hitler approaches 1.

In this post, I’ll only try to find out how many Reddit comments mention Nazis or Hitler and ignore the context in which they are made. The data source for this analysis is the Reddit dataset which is publicly available on Google BigQuery. The following graph is based on 4.6 million comments and shows the share of comments mentioning Nazis or Hitler by subreddit.
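The counting step is simple enough to sketch locally. The rows and the regular expression below are my assumptions, not the author's BigQuery query: a handful of invented (subreddit, comment) pairs and a case-insensitive match on "nazi", "nazis" or "hitler" as whole words.

```python
import re
from collections import Counter

# Invented sample rows; the original analysis ran over 4.6 million comments.
comments = [
    ("history", "The Nazis invaded Poland in 1939."),
    ("history", "Rome fell for many reasons."),
    ("politics", "Stop comparing everyone to Hitler."),
    ("politics", "Taxes again?"),
    ("politics", "That is literally what a nazi would say."),
    ("aww", "What a cute puppy."),
]

# Whole-word match only, so "Nazism" is deliberately not counted here.
MENTION = re.compile(r"\b(nazis?|hitler)\b", re.IGNORECASE)


def mention_share(rows):
    """Return the share of comments per subreddit that match MENTION."""
    totals, hits = Counter(), Counter()
    for subreddit, body in rows:
        totals[subreddit] += 1
        if MENTION.search(body):
            hits[subreddit] += 1
    return {sub: hits[sub] / totals[sub] for sub in totals}


shares = mention_share(comments)
for subreddit, share in sorted(shares.items()):
    print(f"{subreddit}: {share:.0%}")
```

As in the post, context is ignored: a history lesson and an insult count the same.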

Left for a later post:

The next step would be to implement sophisticated text mining techniques to identify comments which use Nazi analogies in a way as described by Godwin. Unfortunately due to time constraints and the complexity of this problem, I was not able to try for this blog post.

Since Godwin’s law applies to inappropriate invocations of Nazis or Hitler, that implies there are legitimate uses of those terms.

What captures my curiosity is what characteristics must a subject have to be a legitimate comparison to Nazis and/or Hitler?

Or more broadly, what characteristics must a subject have to be classified as a genocidal ideology or a person who advocates genocide?

I’m thinking it isn’t Nazism (historically speaking) that needs to be avoided but the more general impulse that leads to genocidal rhetoric and policies.

### WordsEye [Subject Identity Properties]

Tuesday, March 29th, 2016

WordsEye

A site that enables you to “type a picture.” What? To illustrate:

A [mod] ox is a couple of feet in front of the [hay] wall. It is cloudy. The ground is shiny grass. The huge hamburger is on the ox. An enormous gold chicken is behind the wall…

Results in:

The site is in a closed beta test but you can apply for an account.

I mention “subject identity properties” in the title because the words we use to identify subjects are properties of subjects, just like any other properties we attribute to them.

Unfortunately, different people view the same words as identifying different subjects, and different words as identifying the same subject.

The WordsEye technology can illustrate the fragility of using a single word to identify a subject of conversation.

Or that multiple identifications have the same subject, with side by side images that converge on a common image.

Imagine that in conjunction with 3-D molecular images for example.

I first saw this in a tweet by Alyona Medelyan.

### Patent Sickness Spreads [Open Source Projects on Prior Art?]

Tuesday, March 8th, 2016

James Cook reports a new occurrence of patent sickness in “Facebook has an idea for software that detects cool new slang before it goes mainstream.”

The most helpful part of James’ post is the graphic outline of the “process” patented by Facebook:

I sure do hope James has not patented that presentation because it makes the Facebook patent, err, clear.

Quick show of hands on originality?

While researching this post, I ran across Open Source as Prior Art at the Linux Foundation. Are there other public projects that research and post prior art with regard to particular patents?

An armory of weapons for opposing ill-advised patents.

The Facebook patent is: 9,280,534 Hauser, et al. March 8, 2016, Generating a social glossary:

Its abstract:

Particular embodiments determine that a textual term is not associated with a known meaning. The textual term may be related to one or more users of the social-networking system. A determination is made as to whether the textual term should be added to a glossary. If so, then the textual term is added to the glossary. Information related to one or more textual terms in the glossary is provided to enhance auto-correction, provide predictive text input suggestions, or augment social graph data. Particular embodiments discover new textual terms by mining information, wherein the information was received from one or more users of the social-networking system, was generated for one or more users of the social-networking system, is marked as being associated with one or more users of the social-networking system, or includes an identifier for each of one or more users of the social-networking system. (emphasis in original)

### Automating Family/Party Feud

Monday, February 15th, 2016

Semantic Analysis of the Reddit Hivemind

From the webpage:

Our neural network read every comment posted to Reddit in 2015, and built a semantic map using word2vec and spaCy.

Try searching for a phrase that’s more than the sum of its parts to see what the model thinks it means. Try your favourite band, slang words, technical things, or something totally random.

Lynn Cherny suggested in a tweet to use “actually.”

If you are interested in the background on this tool, see: Sense2vec with spaCy and Gensim by Matthew Honnibal.

From the post:

If you were doing text analytics in 2015, you were probably using word2vec. Sense2vec (Trask et al., 2015) is a new twist on word2vec that lets you learn more interesting, detailed and context-sensitive word vectors. This post motivates the idea, explains our implementation, and comes with an interactive demo that we’ve found surprisingly addictive.

Polysemy: the problem with word2vec

When humans write dictionaries and thesauruses, we define concepts in relation to other concepts. For automatic natural language processing, it’s often more effective to use dictionaries that define concepts in terms of their usage statistics. The word2vec family of models are the most popular way of creating these dictionaries. Given a large sample of text, word2vec gives you a dictionary where each definition is just a row of, say, 300 floating-point numbers. To find out whether two entries in the dictionary are similar, you ask how similar their definitions are – a well-defined mathematical operation.
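The post does not name the "well-defined mathematical operation" in this excerpt. Cosine similarity is the usual choice for word vectors, so that is what this sketch assumes. The three vectors are made up and only four numbers long instead of 300.

```python
import math

# Made-up "definitions"; a real word2vec model would supply these rows.
vectors = {
    "king": [0.8, 0.6, 0.1, 0.0],
    "queen": [0.7, 0.7, 0.2, 0.0],
    "sgml": [0.0, 0.1, 0.9, 0.4],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


royal = cosine_similarity(vectors["king"], vectors["queen"])
markup = cosine_similarity(vectors["king"], vectors["sgml"])
print(f"king vs queen: {royal:.3f}")
print(f"king vs sgml:  {markup:.3f}")
```

A score near 1 means two entries are used in similar contexts. A score near 0 means they have little in common.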

Certain to be a hit at technical conferences and parties.

SGML wasn’t mentioned even once during 2015 in Reddit Comments.

Try some of your favorite words and phrases.

Enjoy!

### Toneapi helps your writing pack an emotional punch [Not For The Ethically Sensitive]

Thursday, February 4th, 2016

From the post:

Language analysis is a rapidly developing field and there are some interesting startups working on products that help you write better.

Take Toneapi, for example. This product from Northern Irish firm Adoreboard is a Web-based app that analyzes (and potentially improves) the emotional impact of your writing.

Paste in some text, and it will offer a detailed visualization of your writing.

If you aren’t overly concerned about manipulating, sorry, persuading your readers to your point of view, you might want to give Toneapi a spin. Martin reports that IBM’s Watson has Tone Analyzer and you should also consider Textio and Relative Insight.

Before this casts an Orwellian pall over your evening/day, remember that focus groups and message testing have been staples of advertising for decades.

What these software services do is make a crude form of that capability available to the average citizen.

Some people have a knack for emotional language, like Donald Trump, but I can’t force myself to write in incomplete sentences or with one syllable words. Maybe there’s an app for that? Suggestions?

### Stanford NLP Blog – First Post

Monday, January 25th, 2016

Sam Bowman posted The Stanford NLI Corpus Revisited today at the Stanford NLP blog.

From the post:

Last September at EMNLP 2015, we released the Stanford Natural Language Inference (SNLI) Corpus. We’re still excitedly working to build bigger and better machine learning models to use it to its full potential, and we sense that we’re not alone, so we’re using the launch of the lab’s new website to share a bit of what we’ve learned about the corpus over the last few months.

What is SNLI?

SNLI is a collection of about half a million natural language inference (NLI) problems. Each problem is a pair of sentences, a premise and a hypothesis, labeled (by hand) with one of three labels: entailment, contradiction, or neutral. An NLI model is a model that attempts to infer the correct label based on the two sentences.

A high level overview of the SNLI corpus.
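The shape of an NLI problem as the post describes it (a premise, a hypothesis and one of three labels) could be sketched like this. The class name and the example sentences are mine, not taken from the corpus.

```python
from collections import Counter
from dataclasses import dataclass

LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class NLIProblem:
    """One premise/hypothesis pair with its hand-assigned label."""
    premise: str
    hypothesis: str
    label: str

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


# Invented examples, one per label.
PREMISE = "A man is playing a guitar on stage."
problems = [
    NLIProblem(PREMISE, "A person is making music.", "entailment"),
    NLIProblem(PREMISE, "The stage is empty.", "contradiction"),
    NLIProblem(PREMISE, "The man is famous.", "neutral"),
]

label_counts = Counter(problem.label for problem in problems)
print(label_counts)
```

An NLI model is then anything that takes the two sentences and guesses the third field.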

The news of Marvin Minsky’s death, today, must have arrived too late for inclusion in the post.

### There’s More Than One Kind Of Reddit Comment?

Friday, December 18th, 2015

‘Sarcasm detection on Reddit comments’

Contest ends: 15th of February, 2016.

From the webpage:

Sentiment analysis is a fairly well-developed field, but on the Internet, people often don’t say exactly what they mean. One of the toughest modes of communication for both people and machines to identify is sarcasm. Sarcastic statements often sound positive if interpreted literally, but through context and other cues the speaker indicates that they mean the opposite of what they say. In English, sarcasm is primarily communicated through verbal cues, meaning that it is difficult, even for native speakers, to determine it in text.

Sarcasm detection is a subtask of opinion mining. It aims at correctly identifying the user opinions expressed in the written text. Sarcasm detection plays a critical role in sentiment analysis by correctly identifying sarcastic sentences which can incorrectly flip the polarity of the sentence otherwise. Understanding sarcasm, which is often a difficult task even for humans, is a challenging task for machines. Common approaches for sarcasm detection are based on machine learning classifiers trained on simple lexical or dictionary based features. To date, some research in sarcasm detection has been done on collections of tweets from Twitter, and reviews on Amazon.com. For this task, we are interested in looking at a more conversational medium—comments on Reddit—in order to develop an algorithm that can use the context of the surrounding text to help determine whether a specific comment is sarcastic or not.

The premise of this competition is there is more than one kind of comment on Reddit, aside from sarcasm.

A surprising assumption I know but there you have it.

I wonder if participants will have to separate sarcastic + sexist, sarcastic + misogynistic, sarcastic + racist, sarcastic + abusive, into separate categories or will all sarcastic comments be classified as sarcasm?

I suppose the default case would be to assume all Reddit comments are some form of sarcasm and see how accurate that model proves to be when judged against the results of the competition.
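That default case takes only a few lines to score. The gold labels below are hypothetical stand-ins for the competition's judged results, which I don't have.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that agree with the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels differ in length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical hand labels: True means the comment was judged sarcastic.
gold_labels = [True, False, False, True, False, False, False, True]

# The "everything on Reddit is sarcasm" model.
always_sarcastic = [True] * len(gold_labels)

baseline_accuracy = accuracy(always_sarcastic, gold_labels)
print(f"always-sarcastic baseline: {baseline_accuracy:.3f}")
```

The baseline's accuracy is simply the share of comments judged sarcastic, which gives any real entry a number to beat.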

Training data for sarcasm? Pointers anyone?

Thursday, October 29th, 2015

From the post:

About 1.8 million new scientific papers are published each year, and most are of little consequence to the general public — or even read, really; one study estimates that up to half of all academic studies are only read by their authors, editors, and peer reviewers.

But the papers that are read can change our understanding of the universe — traces of water on Mars! — or impact our lives here on earth — sea levels rising! — and when journalists get called upon to cover these stories, they’re often thrown into complex topics without much background or understanding of the research that led to the breakthrough.

As a result, a group of researchers at Columbia and Stanford are in the process of developing Science Surveyor, a tool that algorithmically helps journalists get important context when reporting on scientific papers.

“The idea occurred to me that you could characterize the wealth of scientific literature around the topic of a new paper, and if you could do that in a way that showed the patterns in funding, or the temporal patterns of publishing in that field, or whether this new finding fit with the overall consensus with the field — or even if you could just generate images that show images very rapidly what the huge wealth, in millions of articles, in that field have shown — [journalists] could very quickly ask much better questions on deadline, and would be freed to see things in a larger context,” Columbia journalism professor Marguerite Holloway, who is leading the Science Surveyor effort, told me.

Science Surveyor is still being developed, but broadly the idea is that the tool takes the text of an academic paper and searches academic databases for other studies using similar terms. The algorithm will surface relevant articles and show how scientific thinking has changed through its use of language.

For example, look at the evolving research around neurogenesis, or the growth of new brain cells. Neurogenesis occurs primarily while babies are still in the womb, but it continues through adulthood in certain sections of the brain.

Up until a few decades ago, researchers generally thought that neurogenesis didn’t occur in humans — you had a set number of brain cells, and that’s it. But since then, research has shown that neurogenesis does in fact occur in humans.

“This tells you — aha! — this discovery is not an entirely new discovery,” Columbia professor Dennis Tenen, one of the researchers behind Science Surveyor, told me. “There was a period of activity in the ’70s, and now there is a second period of activity today. We hope to produce this interactive visualization, where given a paper on neurogenesis, you can kind of see other related papers on neurogenesis to give you the context for the story you’re telling.”

Given the number of papers published every year, an algorithmic approach like Science Surveyor is an absolute necessity.

But imagine how much richer the results would be if one of the three or four people who actually read the paper could easily link it to other research and context?

Or perhaps being a researcher who discovers the article and then blazes a trail to non-obvious literature that is also relevant?

Search engines now capture which links users choose to follow, but that is a fairly crude approximation of the relevance of a particular resource. It does not, for example, record why a particular resource is relevant.

Usage of literature should decide which articles merit greater attention from machine or human annotators. The vast amount of humanities literature is never cited by anyone. Why spend resources annotating content that no one is likely to read?

### NLP and Scala Resources

Wednesday, October 7th, 2015

Natural Language Processing and Scala Tutorials by Jason Baldridge.

An impressive collection of resources but in particular, the seventeen (17) Scala tutorials.

Unfortunately, given the state of search and indexing it isn’t possible to easily dedupe the content of these materials against others you may have already found.

### Corpus of American Tract Society Publications

Friday, September 11th, 2015

Corpus of American Tract Society Publications by Lincoln Mullen.

From the post:

I’ve created a small to mid-sized corpus of publications by the American Tract Society up to the year 1900 in plain text. This corpus has been gathered from the Internet Archive. It includes 641 documents containing just under sixty million words, along with a CSV file containing metadata for each of the files. I don’t make any claims that this includes all of the ATS publications from that time period, and it is pretty obvious that the metadata from the Internet Archive is not much good. The titles are mostly correct; the dates are pretty far off in cases.

This corpus was created for the purpose of testing document similarity and text reuse algorithms. I need a corpus for testing the textreuse, which is in very early stages of development. From reading many, many of these tracts, I already know the patterns of text reuse. (And of course, the documents are historically interesting in their own right, and might be a good candidate for text mining.) The ATS frequently republished tracts under the same title. Furthermore, they published volumes containing the entire series of tracts that they had published individually. So there are examples of entire documents which are reprinted, but also some documents which are reprinted inside others. Then as a extra wrinkle, the corpus contains the editions of the Bible published by the ATS, plus their edition of Cruden’s concordance and a Bible dictionary. Likely all of the tracts quote the Bible, some at great length, so there are many examples of borrowing there.

Here is the corpus and its repository:

With the described repetition, the corpus must compress well. 😉

Makes me wonder how much near-repetition occurs in CS papers?

Graph papers that repeat graph fundamentals, in nearly the same order, in paper after paper.

At what level would you measure re-use? Sentence? Paragraph? Larger divisions?
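One way to make the granularity question concrete is to compare word n-gram ("shingle") overlap for whole documents against overlap sentence by sentence. The following is a minimal sketch, not the method used by the textreuse package: the function names, the naive sentence splitter, the trigram size and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def split_sentences(text):
    # Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def reuse_by_sentence(doc_a, doc_b, n=3, threshold=0.5):
    """Fraction of sentences in doc_a that have a near-duplicate in doc_b."""
    sents_a = [shingles(s, n) for s in split_sentences(doc_a)]
    sents_b = [shingles(s, n) for s in split_sentences(doc_b)]
    if not sents_a:
        return 0.0
    hits = sum(
        1 for sa in sents_a
        if any(jaccard(sa, sb) >= threshold for sb in sents_b)
    )
    return hits / len(sents_a)


# Two made-up "papers" that share a boilerplate definition.
paper_a = "A graph is a set of nodes and a set of edges. We propose a new index."
paper_b = "A graph is a set of nodes and a set of edges. Our results differ."

document_level = jaccard(shingles(paper_a), shingles(paper_b))
sentence_level = reuse_by_sentence(paper_a, paper_b)
```

The document-level score blends the repeated definition with the new material, while the sentence-level score reports that half of the sentences in `paper_a` were reused. The same functions could be applied to paragraphs by swapping the splitter.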

### spaCy: Industrial-strength NLP

Wednesday, June 10th, 2015

spaCy: Industrial-strength NLP by Matthew Honnibal.

From the post:

spaCy is a new library for text processing in Python and Cython. I wrote it because I think small companies are terrible at natural language processing (NLP). Or rather: small companies are using terrible NLP technology.

To do great NLP, you have to know a little about linguistics, a lot about machine learning, and almost everything about the latest research. The people who fit this description seldom join small companies. Most are broke — they’ve just finished grad school. If they don’t want to stay in academia, they join Google, IBM, etc.

The net result is that outside of the tech giants, commercial NLP has changed little in the last ten years. In academia, it’s changed entirely. Amazing improvements in quality. Orders of magnitude faster. But the academic code is always GPL, undocumented, unuseable, or all three. You could implement the ideas yourself, but the papers are hard to read, and training data is exorbitantly expensive. So what are you left with? A common answer is NLTK, which was written primarily as an educational resource. Nothing past the tokenizer is suitable for production use.

I used to think that the NLP community just needed to do more to communicate its findings to software engineers. So I wrote two blog posts, explaining how to write a part-of-speech tagger and parser. Both were well received, and there’s been a bit of interest in my research software — even though it’s entirely undocumented, and mostly unuseable to anyone but me.

So six months ago I quit my post-doc, and I’ve been working day and night on spaCy since. I’m now pleased to announce an alpha release.

If you’re a small company doing NLP, I think spaCy will seem like a minor miracle. It’s by far the fastest NLP software ever released. The full processing pipeline completes in 7ms per document, including accurate tagging and parsing. All strings are mapped to integer IDs, tokens are linked to embedded word representations, and a range of useful features are pre-calculated and cached.

Matthew uses an example based on Stephen King’s admonition “the adverb is not your friend”, which immediately brought to mind the utility of tagging all adverbs and adjectives in a standards draft and then generating comments that identify each offending phrase and its parent <p> element.

I haven’t verified the performance comparisons, but as you know, the real question is how well spaCy works on your data, workflow, etc.

Thanks to Matthew for the reminder of: On Writing: A Memoir of the Craft by Stephen King. Documentation will never be as gripping as a King novel, but it shouldn’t be painful to read.

I first saw this in a tweet by Jason Baldridge.

### NLP4L

Saturday, May 30th, 2015

NLP4L

From the webpage:

NLP4L is a natural language processing tool for Apache Lucene written in Scala. The main purpose of NLP4L is to use the NLP technology to improve Lucene users’ search experience. Lucene/Solr, for example, already provides its users with auto-complete and suggestion functions for search keywords. Using NLP technology, NLP4L development members may be able to present better keywords. In addition, NLP4L provides functions to collaborate with existing machine learning tools, including one to directly create document vector from a Lucene index and write it to a LIBSVM format file.

As NLP4L processes document data registered in the Lucene index, you can directly access a word database normalized by powerful Lucene Analyzer and use handy search functions. Being written in Scala, NLP4L excels at trying ad hoc interactive processing as well.
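For readers who have not met the LIBSVM file format mentioned above: each line holds a label followed by sparse `index:value` pairs in ascending index order, conventionally starting at index 1. The sketch below is not NLP4L's API. It assumes documents are already tokenized (standing in for terms produced by a Lucene Analyzer), uses raw term counts as feature values, and all function names are hypothetical.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct term to a 1-based feature index, in sorted order."""
    terms = sorted({term for tokens in docs for term in tokens})
    return {term: i for i, term in enumerate(terms, start=1)}


def to_libsvm_line(label, tokens, vocab):
    """Format one document as 'label index:value ...' with ascending indices."""
    counts = Counter(tokens)
    feats = sorted((vocab[t], c) for t, c in counts.items() if t in vocab)
    return " ".join([str(label)] + [f"{i}:{c}" for i, c in feats])


# Toy data: token lists and made-up class labels.
docs = [["lucene", "search", "search"], ["scala", "lucene"]]
labels = [1, 0]

vocab = build_vocab(docs)
lines = [to_libsvm_line(y, d, vocab) for y, d in zip(labels, docs)]
# lines[0] is "1 1:1 3:2" and lines[1] is "0 1:1 2:1"
```

Writing `"\n".join(lines)` to a file gives input that tools reading the LIBSVM format should accept; a real pipeline would more likely use TF-IDF weights than raw counts.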

The documentation is currently in Japanese with a TOC for the English version. Could be interesting if you want to try your hand at translation or at working from the API Docs.

Enjoy!

### Political Futures Tracker

Wednesday, May 20th, 2015

From the webpage:

The Political Futures Tracker tells us the top political themes, how positive or negative people feel about them, and how far parties and politicians are looking to the future.

This software will use ground breaking language analysis methods to examine data from Twitter, party websites and speeches. We will also be conducting live analysis on the TV debates running over the next month, seeing how the public respond to what politicians are saying in real time. Leading up to the 2015 UK General Election we will be looking across the political spectrum for emerging trends and innovation insights.

If that sounds interesting, consider the following from: Introducing… the Political Futures Tracker:

We are exploring new ways to analyse a large amount of data from various sources. It is expected that both the amount of data and the speed that it is produced will increase dramatically the closer we get to election date. Using a semi-automatic approach, text analytics technology will sift through content and extract the relevant information. This will then be examined and analysed by the team at Nesta to enable delivery of key insights into hotly debated issues and the polarisation of political opinion around them.

The team at the University of Sheffield has extensive experience in the area of social media analytics and Natural Language Processing (NLP). Technical implementation has started already, firstly with data collection which includes following the Twitter accounts of existing MPs and political parties. Once party candidate lists become available, data harvesting will be expanded accordingly.

In parallel, we are customising the University of Sheffield’s General Architecture for Text Engineering (GATE); an open source text analytics tool, in order to identify sentiment-bearing and future thinking tweets, as well as key target topics within these.

One thing we’re particularly interested in is future thinking. We describe this as making statements concerning events or issues in the future. Given these measures and the views expressed by a certain person, we can model how forward thinking that person is in general, and on particular issues, also comparing this with other people. Sentiment, topics, and opinions will then be aggregated and tracked over time.

Personally I suspect that “future thinking” is used in different senses by the general population and political candidates. For a political candidate, however the rhetoric is worded, the “future” consists of reaching election day with 50% plus 1 vote. For the general population, the “future” probably includes a longer time span.

I mention this in case you can sell someone on the notion that what political candidates say today has some relevance to what they will do after the election. President Obama has been in office for six (6) years, yet the Guantanamo Bay detention camp remains open, no one has been held accountable for years of illegal spying on U.S. citizens, and banks and other corporate interests have all but been granted keys to the U.S. Treasury, to name a few items inconsistent with his previous “future thinking.”

Unless you accept my suggestion that “future thinking” for a politician means election day and no further.

### Analysis of named entity recognition and linking for tweets

Wednesday, May 20th, 2015

Abstract:

Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

The questions addressed by the paper are:

RQ1 How robust are state-of-the-art named entity recognition and linking methods on short and noisy microblog texts?

RQ2 What problem areas are there in recognising named entities in microblog posts, and what are the major causes of false negatives and false positives?

RQ3 Which problems need to be solved in order to further the state-of-the-art in NER and NEL on this difficult text genre?

The ultimate conclusion is that entity recognition in microblog posts falls short of what has been achieved for newswire text, but if you need results now, or at least by tomorrow, this is a good guide to what is possible and where improvements can be made.

### Detecting Deception Strategies [Godsend for the 2016 Election Cycle]

Wednesday, May 20th, 2015

Discriminative Models for Predicting Deception Strategies by Scott Appling, Erica Briscoe, C.J. Hutto.

Abstract:

Although a large body of work has previously investigated various cues predicting deceptive communications, especially as demonstrated through written and spoken language (e.g., [30]), little has been done to explore predicting kinds of deception. We present novel work to evaluate the use of textual cues to discriminate between deception strategies (such as exaggeration or falsification), concentrating on intentionally untruthful statements meant to persuade in a social media context. We conduct human subjects experimentation wherein subjects were engaged in a conversational task and then asked to label the kind(s) of deception they employed for each deceptive statement made. We then develop discriminative models to understand the difficulty between choosing between one and several strategies. We evaluate the models using precision and recall for strategy prediction among 4 deception strategies based on the most relevant psycholinguistic, structural, and data-driven cues. Our single strategy model results demonstrate as much as a 58% increase over baseline (random chance) accuracy and we also find that it is more difficult to predict certain kinds of deception than others.

The deception strategies studied in this paper:

• Falsification
• Exaggeration
• Omission

especially omission, will form the bulk of the content in the 2016 election cycle in the United States. Only deceptive statements were included in the test data, so the models were tested on correctly recognizing the deception strategy in a known deceptive statement.

The test data is remarkably similar to political content, which, aside from the candidates’ names and the names of their opponents (mostly), is composed entirely of deceptive statements, albeit not marked for the strategy used in each one.

A web interface for loading pointers to video, audio or text with political content that emits tagged deception with pointers to additional information would be a real hit for the next U.S. election cycle. Monetize with ads, the sources of additional information, etc.

I first saw this in a tweet by Leon Derczynski.

### New Natural Language Processing and NLTK Videos

Saturday, May 2nd, 2015

From part 1:

Natural Language Processing is the task we give computers to read and understand (process) written text (natural language). By far, the most popular toolkit or API to do natural language processing is the Natural Language Toolkit for the Python programming language.

The NLTK module comes packed full of everything from trained algorithms to identify parts of speech to unsupervised machine learning algorithms to help you train your own machine to understand a specific bit of text.

NLTK also comes with a large corpora of data sets containing things like chat logs, movie reviews, journals, and much more!

Bottom line, if you’re going to be doing natural language processing, you should definitely look into NLTK!

Use the Playlist link, https://www.youtube.com/watch?v=FLZvO…, as I am sure more videos will be appearing in the near future.

Enjoy!

### Practical Text Analysis using Deep Learning

Friday, May 1st, 2015

From the post:

Deep Learning has become a household buzzword these days, and I have not stopped hearing about it. In the beginning, I thought it was another rebranding of Neural Network algorithms or a fad that will fade away in a year. But then I read Piotr Teterwak’s blog post on how Deep Learning can be easily utilized for various image analysis tasks. A powerful algorithm that is easy to use? Sounds intriguing. So I decided to give it a closer look. Maybe it will be a new hammer in my toolbox that can later assist me to tackle new sets of interesting problems.

After getting up to speed on Deep Learning (see my recommended reading list at the end of this post), I decided to try Deep Learning on NLP problems. Several years ago, Professor Moshe Koppel gave a talk about how he and his colleagues succeeded in determining an author’s gender by analyzing his or her written texts. They also released a dataset containing 681,288 blog posts. I found it remarkable that one can infer various attributes about an author by analyzing the text, and I’ve been wanting to try it myself. Deep Learning sounded very versatile. So I decided to use it to infer a blogger’s personal attributes, such as age and gender, based on the blog posts.

If you haven’t gotten into deep learning, here’s another opportunity focused on natural language processing. You can follow Michael’s general directions to learn on your own or follow more detailed instructions in his IPython notebook.

Enjoy!

### Modern Methods for Sentiment Analysis

Monday, April 13th, 2015

From the post:

Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment quantification has enjoyed many useful implementations, such as businesses gaining understanding about consumer reactions to a product, or detecting hateful speech in online comments.

The simplest form of sentiment analysis is to use a dictionary of good and bad words. Each word in a sentence has a score, typically +1 for positive sentiment and -1 for negative. Then, we simply add up the scores of all the words in the sentence to get a final sentiment total. Clearly, this has many limitations, the most important being that it neglects context and surrounding words. For example, in our simple model the phrase “not good” may be classified as 0 sentiment, given “not” has a score of -1 and “good” a score of +1. A human would likely classify “not good” as negative, despite the presence of “good”….
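The dictionary approach described in the quote fits in a few lines. This is a minimal sketch under stated assumptions: the five-word lexicon and the function name are invented for illustration, and tokenization is a bare whitespace split with punctuation stripped.

```python
# Hypothetical toy lexicon: +1 for positive words, -1 for negative ones.
LEXICON = {"good": 1, "great": 1, "bad": -1, "terrible": -1, "not": -1}


def lexicon_score(sentence, lexicon=LEXICON):
    """Sum per-word scores; words missing from the lexicon count as 0."""
    words = sentence.lower().split()
    return sum(lexicon.get(w.strip(".,!?\"'"), 0) for w in words)


positive = lexicon_score("This is good.")       # 1
negated = lexicon_score("This is not good.")    # 0: -1 for "not", +1 for "good"
negative = lexicon_score("Terrible, just bad")  # -2
```

The `negated` case reproduces the limitation from the quote: the scores for “not” and “good” cancel, so a phrase a human would read as negative comes out neutral.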

Great discussion of Word2Vec and Doc2Vec, along with worked examples of both as well as analyzing sentiment in Emoji tweets.

Another limitation of the +1 / -1 approach is that human sentiments are rarely that sharply defined. Moreover, however strong or weak the “likes” or “dislikes” of a group of users, they are all collapsed into one score.

Be mindful that modeling is a lossy process.

### Deep Learning for Natural Language Processing (March – June, 2015)

Saturday, March 7th, 2015

Description:

Natural language processing (NLP) is one of the most important technologies of the information age. Understanding complex language utterances is also a crucial part of artificial intelligence. Applications of NLP are everywhere because people communicate most everything in language: web search, advertisement, emails, customer service, language translation, radiology reports, etc. There are a large variety of underlying tasks and machine learning models powering NLP applications. Recently, deep learning approaches have obtained very high performance across many different NLP tasks. These models can often be trained with a single end-to-end model and do not require traditional, task-specific feature engineering. In this spring quarter course students will learn to implement, train, debug, visualize and invent their own neural network models. The course provides a deep excursion into cutting-edge research in deep learning applied to NLP. The final project will involve training a complex recurrent neural network and applying it to a large scale NLP problem. On the model side we will cover word vector representations, window-based neural networks, recurrent neural networks, long-short-term-memory models, recursive neural networks, convolutional neural networks as well as some very novel models involving a memory component. Through lectures and programming assignments students will learn the necessary engineering tricks for making neural networks work on practical problems.

Assignments, course notes and slides will all be posted online. You are free to “follow along” but no credit.

Are you ready for the cutting-edge?

I first saw this in a tweet by Randall Olson.

### Understanding Natural Language with Deep Neural Networks Using Torch

Tuesday, March 3rd, 2015

Understanding Natural Language with Deep Neural Networks Using Torch by Soumith Chintala and Wojciech Zaremba.

This is a deeply impressive article and a good introduction to Torch (a scientific computing package with neural networks, optimization, etc.).

In the preliminary materials, the authors illustrate one of the difficulties of natural language processing by machine:

For a machine to understand language, it first has to develop a mental map of words, their meanings and interactions with other words. It needs to build a dictionary of words, and understand where they stand semantically and contextually, compared to other words in their dictionary. To achieve this, each word is mapped to a set of numbers in a high-dimensional space, which are called “word embeddings”. Similar words are close to each other in this number space, and dissimilar words are far apart. Some word embeddings encode mathematical properties such as addition and subtraction (For some examples, see Table 1).

Word embeddings can either be learned in a general-purpose fashion before-hand by reading large amounts of text (like Wikipedia), or specially learned for a particular task (like sentiment analysis). We go into a little more detail on learning word embeddings in a later section.
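The “close together” and “addition and subtraction” properties in the quote can be illustrated with hand-made vectors. The sketch below uses Python rather than Torch, and the three-dimensional vectors are invented for the example (real embeddings are learned and have hundreds of dimensions); the function names are likewise assumptions.

```python
import math

# Hand-made toy vectors; dimensions loosely mean (royalty, maleness, fruit).
EMBEDDINGS = {
    "king":  [1.0, 0.5, 0.0],
    "queen": [1.0, -0.5, 0.0],
    "man":   [0.0, 0.5, 0.0],
    "woman": [0.0, -0.5, 0.0],
    "apple": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(target, embeddings, exclude=()):
    """Word whose vector has the highest cosine similarity to target."""
    candidates = {w: vec for w, vec in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


# Related words score higher than unrelated ones.
royal_pair = cosine(EMBEDDINGS["king"], EMBEDDINGS["queen"])
unrelated_pair = cosine(EMBEDDINGS["king"], EMBEDDINGS["apple"])

# Vector arithmetic: king - man + woman, then look up the nearest other word.
analogy = [
    k - m + w
    for k, m, w in zip(EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])
]
answer = nearest(analogy, EMBEDDINGS, exclude=("king", "man", "woman"))
```

With these toy values `answer` is `"queen"`. The result depends entirely on the vectors, so embeddings learned from a different corpus can give a different nearest neighbor for the same words.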

You can already see the problem but just to call it out, the language usage in Wikipedia, for example, may or may not match the domain of interest. You could certainly use it as a general case but it will produce very odd results when the text to be “understood” is in a regional version of a language where common words have meanings other than those you will find in Wikipedia.

Slang is a good example. In the 17th century, “cab” was a term used for a brothel. A more recent example: to take a “hit” can mean something other than being struck by a boxer.

“Understanding” natural language with machines is a great leap forward but one should never leap without looking.