Natural Language Processing « Another Word For It

October 5, 2019

January 2, 2019

Constructing Stoplists for Historical Languages [Hackers?]

Filed under: Classics,Cybersecurity,Hacking,Natural Language Processing — Patrick Durusau @ 9:50 am

Constructing Stoplists for Historical Languages by Patrick J. Burns.

Abstract

Stoplists are lists of words that have been filtered from documents prior to text analysis tasks, usually words that are either high frequency or that have low semantic value. This paper describes the development of a generalizable method for building stoplists in the Classical Language Toolkit (CLTK), an open-source Python platform for natural language processing research on historical languages. Stoplists are not readily available for many historical languages, and those that are available often offer little documentation about their sources or method of construction. The development of a generalizable method for building historical-language stoplists offers the following benefits: 1. better support for well-documented, data-driven, and replicable results in the use of CLTK resources; 2. reduction of arbitrary decision-making in building stoplists; 3. increased consistency in how stopwords are extracted from documents across multiple languages; and 4. clearer guidelines and standards for CLTK developers and contributors, a helpful step forward in managing the complexity of a multi-language open-source project.

I post this in part to spread the word about these stoplists for humanists.

At the same time, I’m curious about the use of stoplists by hackers to filter cruft from disassembled files. Disassembled files are “texts” of a sort and it seems to me that many of the tools used by humanists could, emphasis on could, be relevant.

Suggestions/pointers?

Comments Off

May 6, 2018

Natural Language Toolkit (NLTK) 3.3 Drops!

Filed under: Linguistics,Natural Language Processing,NLTK,Python — Patrick Durusau @ 7:52 pm

Natural Language Toolkit (NLTK) 3.3 has arrived!

From NLTK News:

NLTK 3.3 release: May 2018

Support Python 3.6, New interface to CoreNLP, Support synset retrieval by sense key, Minor fixes to CoNLL Corpus Reader, AlignedSent, Fixed minor inconsistencies in APIs and API documentation, Better conformance to PEP8, Drop Moses Tokenizer (incompatible license)

Whether you have fantasies about propaganda turning voters into robots, believe “persuasion” is a matter of “facts,” or other pre-Derrida illusions, or not, the NLTK is a must have weapon in such debates.

Enjoy!

Comments Off

December 24, 2017

Deep Learning for NLP, advancements and trends in 2017

Filed under: Artificial Intelligence,Deep Learning,Natural Language Processing — Patrick Durusau @ 5:57 pm

Deep Learning for NLP, advancements and trends in 2017 by Javier Couto.

If you didn’t get enough books as presents, Couto solves your reading shortage rather nicely:

Over the past few years, Deep Learning (DL) architectures and algorithms have made impressive advances in fields such as image recognition and speech processing.

Their application to Natural Language Processing (NLP) was less impressive at first, but has now proven to make significant contributions, yielding state-of-the-art results for some common NLP tasks. Named entity recognition (NER), part of speech (POS) tagging or sentiment analysis are some of the problems where neural network models have outperformed traditional approaches. The progress in machine translation is perhaps the most remarkable among all.

In this article I will go through some advancements for NLP in 2017 that rely on DL techniques. I do not pretend to be exhaustive: it would simply be impossible given the vast amount of scientific papers, frameworks and tools available. I just want to share with you some of the works that I liked the most this year. I think 2017 has been a great year for our field. The use of DL in NLP keeps widening, yielding amazing results in some cases, and all signs point to the fact that this trend will not stop.
…

After skimming this post, I suggest you make a fresh pot of coffee before starting to read and chase the references. It will take several days/pots to finish so it’s best to begin now.

Comments Off

November 14, 2017

pynlp – Pythonic Wrapper for Stanford CoreNLP [& Rand Paul]

Filed under: Natural Language Processing,Python,Stanford NLP — Patrick Durusau @ 4:36 pm

pynlp – Pythonic Wrapper for Stanford CoreNLP by Sina.

The example text for this wrapper:

text = (
'GOP Sen. Rand Paul was assaulted in his home in Bowling Green, 
Kentucky, on Friday, ''according to Kentucky State Police. State 
troopers responded to a call to the senator\'s ''residence at 3:21 
p.m. Friday. Police arrested a man named Rene Albert Boucher, who 
they ''allege "intentionally assaulted" Paul, causing him "minor 
injury. Boucher, 59, of Bowling ''Green was charged with one count of 
fourth-degree assault. As of Saturday afternoon, he ''was being held 
in the Warren County Regional Jail on a $5,000 bond.')

[Warning: Reformatted for readability. See the Github page for the text]

Nice to see examples using contemporary texts. Any of the recent sexual abuse apologies or non-apologies would work as well.

Enjoy!

Comments Off

October 28, 2017

CMU Neural Networks for NLP 2017 (16 Lectures)

Filed under: Natural Language Processing,Neural Networks — Patrick Durusau @ 8:31 pm

Course Description:

Neural networks provide powerful new tools for modeling language, and have been used both to improve the state-of-the-art in a number of tasks and to tackle new problems that were not easy in the past. This class will start with a brief overview of neural networks, then spend the majority of the class demonstrating how to apply neural networks to natural language problems. Each section will introduce a particular problem or phenomenon in natural language, describe why it is difficult to model, and demonstrate several models that were designed to tackle this problem. In the process of doing so, the class will cover different techniques that are useful in creating neural network models, including handling variably sized and structured sentences, efficient handling of large data, semi-supervised and unsupervised learning, structured prediction, and multilingual modeling.

Suggested pre-requisite: 11-711 “Algorithms for NLP”.

I wasn’t able to find videos for the algorithms for NLP course but you can explore the following as supplemental materials:

…
Each of these courses can be found in two places: YouTube and Academic Torrents. The advantage of Academic Torrents is that you can also download the supplementary course materials, like transcripts, PDFs, or PPTs.

Natural Language Processing: Dan Jurafsky and Christopher Manning, Stanford University. YouTube | Academic Torrents

Natural Language Processing: Michael Collins, Columbia University. YouTube | Academic Torrents

Introduction to Natural Language Processing: Dragomir Radev, University of Michigan. YouTube | Academic Torrents

… (From 9 popular online courses that are gone forever… and how you can still find them)

Enjoyable but not as suited to binge watching as Stranger Things.

Enjoy!

Comments Off

October 7, 2017

Building Data Science with JS – Lifting the Curtain on Game Reviews

Filed under: Data Science,Javascript,Natural Language Processing,Programming,Stanford NLP — Patrick Durusau @ 4:52 pm

Building Data Science with JS by Tim Ermilov.

Three videos thus far:

Building Data Science with JS – Part 1 – Introduction

Building Data Science with JS – Part 2 – Microservices

Building Data Science with JS – Part 3 – RabbitMQ and OpenCritic microservice

Tim starts with the observation that the percentage of users assigning a score to a game isn’t very helpful. It tells you nothing about the content of the game and/or the person rating it.

In subject identity terms, each level, mighty, strong, weak, fair, collapses information about the game and a particular reviewer into a single summary subject. OpenCritic then displays the percent of reviewers who are represented by that summary subject.

The problem with the summary subject is that one critic may have down rated the game for poor content, another for sexism and still another for bad graphics. But a user only knows for reasons unknown, a critic whose past behavior is unknown, evaluated unknown content and assigned it a rating.

A user could read all the reviews, study the history of each reviewer, along with the other movies they have evaluated, but Ermilov proposes a more efficient means to peak behind the curtain of game ratings. (part 1)

In part 2, Ermilov designs a microservice based application to extract, process and display game reviews.

If you thought the first two parts were slow, you should enjoy Part 3. Ermilov speeds through a number of resources, documents, JS libraries, not to mention his source code for the project. You are likely to hit pause during this video.

Some links you will find helpful for Part 3:

AMQP 0-9-1 library and client for Node.JS – Channel-oriented API reference

AMQP 0-9-1 library and client for Node.JS (Github)

https://github.com/BuildingXwithJS

https://github.com/BuildingXwithJS/building-data-science-with-js

Microwork – simple creation of distributed scalable microservices in node.js with RabbitMQ (simplifies use of AMQP)

node-unfluff – Automatically extract body content (and other cool stuff) from an html document

OpenCritic

RabbitMQ. (Recommends looking at the RabbitMQ tutorials.)

Comments Off

September 28, 2017

NLP tools for East Asian languages

Filed under: Language,Natural Language Processing — Patrick Durusau @ 8:59 pm

NLP tools for East Asian languages

CLARIN is building a list of NLP tools for East Asian languages.

Oh, sorry:

CLARIN – European Research Infrastructure for Language Resources and Technology

CLARIN makes digital language resources available to scholars, researchers, students and citizen-scientists from all disciplines, especially in the humanities and social sciences, through single sign-on access. CLARIN offers long-term solutions and technology services for deploying, connecting, analyzing and sustaining digital language data and tools. CLARIN supports scholars who want to engage in cutting edge data-driven research, contributing to a truly multilingual European Research Area.

…

CLARIN stands for “Common Language Resources and Technology Infrastructure”.

Contribute to the spreadsheet of NLP tools and enjoy the CLARIN website.

Comments Off

July 26, 2017

Deep Learning for NLP Best Practices

Filed under: Deep Learning,Natural Language Processing,Neural Networks — Patrick Durusau @ 3:18 pm

Deep Learning for NLP Best Practices by Sebastian Ruder.

From the introduction:

This post is a collection of best practices for using neural networks in Natural Language Processing. It will be updated periodically as new insights become available and in order to keep track of our evolving understanding of Deep Learning for NLP.

There has been a running joke in the NLP community that an LSTM with attention will yield state-of-the-art performance on any task. While this has been true over the course of the last two years, the NLP community is slowly moving away from this now standard baseline and towards more interesting models.

However, we as a community do not want to spend the next two years independently (re-)discovering the next LSTM with attention. We do not want to reinvent tricks or methods that have already been shown to work. While many existing Deep Learning libraries already encode best practices for working with neural networks in general, such as initialization schemes, many other details, particularly task or domain-specific considerations, are left to the practitioner.

This post is not meant to keep track of the state-of-the-art, but rather to collect best practices that are relevant for a wide range of tasks. In other words, rather than describing one particular architecture, this post aims to collect the features that underly successful architectures. While many of these features will be most useful for pushing the state-of-the-art, I hope that wider knowledge of them will lead to stronger evaluations, more meaningful comparison to baselines, and inspiration by shaping our intuition of what works.

I assume you are familiar with neural networks as applied to NLP (if not, I recommend Yoav Goldberg’s excellent primer [43]) and are interested in NLP in general or in a particular task. The main goal of this article is to get you up to speed with the relevant best practices so you can make meaningful contributions as soon as possible.

I will first give an overview of best practices that are relevant for most tasks. I will then outline practices that are relevant for the most common tasks, in particular classification, sequence labelling, natural language generation, and neural machine translation.
…

Certainly a resource to bookmark while you read A Primer on Neural Network Models for Natural Language Processing by Yoav Goldberg, at 76 pages and to consult frequently as you move beyond the primer stage.

Enjoy and pass it on!

Comments Off

July 11, 2017

The Classical Language Toolkit

Filed under: Classics,History,Humanities,Natural Language Processing — Patrick Durusau @ 4:26 pm

The Classical Language Toolkit

From the webpage:

The Classical Language Toolkit (CLTK) offers natural language processing (NLP) support for the languages of Ancient, Classical, and Medieval Eurasia. Greek and Latin functionality are currently most complete.

Goals

compile analysis-friendly corpora;

collect and generate linguistic data;

act as a free and open platform for generating scientific research.

You are sure to find one or more languages of interest:

Collecting, analyzing and mapping Tweets can be profitable and entertaining, but tomorrow or perhaps by next week, almost no one will read them again.

The texts in this project survived by hand preservation for thousands of years. People are still reading them.

How about you?

Comments Off

April 19, 2017

Dive Into NLTK – Update – No NLTK Book 2nd Edition

Filed under: Natural Language Processing,NLTK — Patrick Durusau @ 8:46 pm

Dive Into NLTK, Part I: Getting Started with NLTK

From the webpage:

NLTK is the most famous Python Natural Language Processing Toolkit, here I will give a detail tutorial about NLTK. This is the first article in a series where I will write everything about NLTK with Python, especially about text mining and text analysis online.

This is the first article in the series “Dive Into NLTK”, here is an index of all the articles in the series that have been published to date:

Part I: Getting Started with NLTK (this article)
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus

My first post on this series, had only the first seven lessons listed.

There’s another reason for this update.

It appears that no second edition of Natural Language Processing with Python is likely to appear.

Sounds like an opportunity for the NLTK community to continue the work already started.

I don’t have the chops to contribute high quality code but would be willing to work with others on proofing/editing (that’s the part of book production readers rarely see).

Comments Off

January 12, 2017

Stanford CoreNLP – a suite of core NLP tools (3.7.0)

Filed under: Natural Language Processing,Stanford NLP,Text Analytics,Text Mining — Patrick Durusau @ 9:16 pm

Stanford CoreNLP – a suite of core NLP tools

The beta is over and Stanford CoreNLP 3.7.0 is on the street!

From the webpage:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get quotes people said, etc.

Choose Stanford CoreNLP if you need:

An integrated toolkit with a good range of grammatical analysis tools

Fast, reliable analysis of arbitrary texts

The overall highest quality text analytics

Support for a number of major (human) languages

Available interfaces for most major modern programming languages

Ability to run as a simple web service

Stanford CoreNLP’s goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A tool pipeline can be run on a piece of plain text with just two lines of code. CoreNLP is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Moreover, an annotator pipeline can include additional custom or third-party annotators. CoreNLP’s analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

…

What stream of noise, sorry, news are you going to pipeling into the Stanford CoreNLP framework?

Imagine a web service that offers levels of analysis alongside news text.

Or does the same with leaked emails and/or documents?

Comments Off

November 26, 2016

Ulysses, Joyce and Stanford CoreNLP

Filed under: Literature,Natural Language Processing,Stanford NLP — Patrick Durusau @ 8:37 pm

Introduction to memory and time usage

From the webpage:

People not infrequently complain that Stanford CoreNLP is slow or takes a ton of memory. In some configurations this is true. In other configurations, this is not true. This section tries to help you understand what you can or can’t do about speed and memory usage. The advice applies regardless of whether you are running CoreNLP from the command-line, from the Java API, from the web service, or from other languages. We show command-line examples here, but the principles are true of all ways of invoking CoreNLP. You will just need to pass in the appropriate properties in different ways. For these examples we will work with chapter 13 of Ulysses by James Joyce. You can download it if you want to follow along.
…

You have to appreciate the use of a non-trivial text for advice on speed and memory usage of CoreNLP.

How does your text stack up against Chapter 13 of Ulysses?

I’m supposed to be reading Ulysses long distance with a friend. I’m afraid we have both fallen behind. Perhaps this will encourage me to have another go at it.

What favorite or “should read” text would you use to practice with CoreNLP?

Suggestions?

Comments Off

November 3, 2016

Stanford CoreNLP v3.7.0 beta is out! [Time is short, comments, bug reports, now!]

Filed under: Coreference Resolution,Entity Resolution,Information Retrieval,Natural Language Processing,Sentiment Analysis,Stanford NLP,Text Mining — Patrick Durusau @ 10:12 am

Stanford CoreNLP v3.7.0 beta

The tweets I saw from Stanford NLP Group read:

Stanford CoreNLP v3.7.0 beta is out—improved coreference, dep parsing—KBP relation annotator—Arabic pipeline #NLProc

We‘re doing an official CoreNLP beta release this time, so bugs, comments, and fixes especially appreciated over the next two weeks!

OK, so, what are you waiting for?

Oh, the standard blurb for your boss on why Stanford CoreNLP should be taking up your time:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.

Choose Stanford CoreNLP if you need:

An integrated toolkit with a good range of grammatical analysis tools

Fast, reliable analysis of arbitrary texts

The overall highest quality text analytics

Support for a number of major (human) languages

Interfaces available for various major modern programming languages

Ability to run as a simple web service

Stanford CoreNLP is an integrated framework. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A CoreNLP tool pipeline can be run on a piece of plain text with just two lines of code. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Using the standard blurb about the Stanford CoreNLP has these advantages:

It’s copy-n-paste, you didn’t have to write it
It’s appeal to authority (Stanford)
It’s truthful

The truthful point is a throw-away these days but thought I should mention it.

Comments Off

May 5, 2016

Mentioning Nazis or Hitler

Filed under: Natural Language Processing,Reddit — Patrick Durusau @ 9:51 am

78% of Reddit Threads With 1,000+ Comments Mention Nazis

From the post:

Let me start this post by noting that I will not attempt to test Godwin’s Law, which states that:

As an online discussion grows longer, the probability of a comparison involving Nazis or Hitler approaches 1.

In this post, I’ll only try to find out how many Reddit comments mention Nazis or Hitler and ignore the context in which they are made. The data source for this analysis is the Reddit dataset which is publicly available on Google BigQuery. The following graph is based on 4.6 million comments and shows the share of comments mentioning Nazis or Hitler by subreddit.
…

Left for a later post:

The next step would be to implement sophisticated text mining techniques to identify comments which use Nazi analogies in a way as described by Godwin. Unfortunately due to time constraints and the complexity of this problem, I was not able to try for this blog post.

Since Godwin’s law applies to inappropriate invocations of Nazis or Hitler, that implies there are legitimate uses of those terms.

What captures my curiosity is what characteristics must a subject have to be a legitimate comparison to Nazis and/or Hitler?

Or more broadly, what characteristics must a subject have to be classified as a genocidal ideology or a person who advocates genocide?

Thinking it isn’t Nazism (historically speaking) that needs to be avoided but the more general impulse that leads to genocidal rhetoric and policies.

Comments Off

March 29, 2016

WordsEye [Subject Identity Properties]

Filed under: Graphics,Natural Language Processing,Visualization — Patrick Durusau @ 8:50 am

WordsEye

A site that enables you to “type a picture.” What? To illustrate:

A [mod] ox is a couple of feet in front of the [hay] wall. It is cloudy. The ground is shiny grass. The huge hamburger is on the ox. An enormous gold chicken is behind the wall…

Results in:

The site is in a close beta test but you can apply for an account.

I mention “subject identity properties” in the title because the words we use to identify subjects, are properties of subjects, just like any other properties we attribute to them.

Unfortunately, words are viewed by different people as identifying different subjects and the different words as identifying the same subjects.

The WordsEye technology can illustrates the fragility of using a single word to identify a subject of conversation.

Or that multiple identifications have the same subject, with side by side images that converge on a common image.

Imagine that in conjunction with 3-D molecular images for example.

I first saw this in a tweet by Alyona Medelyan.

Comments Off

March 8, 2016

Patent Sickness Spreads [Open Source Projects on Prior Art?]

Filed under: Intellectual Property (IP),Natural Language Processing,Patents,Searching — Patrick Durusau @ 7:31 pm

James Cook reports a new occurrence of patent sickness in Facebook has an idea for software that detects cool new slang before it goes mainstream.

The most helpful part of James’ post is the graphic outline of the “process” patented by Facebook:

I sure do hope James has not patented that presentation because it make the Facebook patent, err, clear.

Quick show of hands on originality?

While researching this post, I ran across Open Source as Prior Art at the Linux Foundation. Are there other public projects that research and post prior art with regard to particular patents?

An armory of weapons for opposing ill-advised patents.

The Facebook patent is: 9,280,534 Hauser, et al. March 8, 2016, Generating a social glossary:

Its abstract:

Particular embodiments determine that a textual term is not associated with a known meaning. The textual term may be related to one or more users of the social-networking system. A determination is made as to whether the textual term should be added to a glossary. If so, then the textual term is added to the glossary. Information related to one or more textual terms in the glossary is provided to enhance auto-correction, provide predictive text input suggestions, or augment social graph data. Particular embodiments discover new textual terms by mining information, wherein the information was received from one or more users of the social-networking system, was generated for one or more users of the social-networking system, is marked as being associated with one or more users of the social-networking system, or includes an identifier for each of one or more users of the social-networking system. (emphasis in original)

Comments Off

February 15, 2016

Automating Family/Party Feud

Filed under: Natural Language Processing,Reddit,Vectors — Patrick Durusau @ 11:19 am

Semantic Analysis of the Reddit Hivemind

From the webpage:

Our neural network read every comment posted to Reddit in 2015, and built a semantic map using word2vec and spaCy.

Try searching for a phrase that’s more than the sum of its parts to see what the model thinks it means. Try your favourite band, slang words, technical things, or something totally random.

Lynn Cherny suggested in a tweet to use “actually.”

If you are interested in the background on this tool, see: Sense2vec with spaCy and Gensim by Matthew Honnibal.

From the post:

If you were doing text analytics in 2015, you were probably using word2vec. Sense2vec (Trask et al., 2015) is a new twist on word2vec that lets you learn more interesting, detailed and context-sensitive word vectors. This post motivates the idea, explains our implementation, and comes with an interactive demo that we’ve found surprisingly addictive.

Polysemy: the problem with word2vec

When humans write dictionaries and thesauruses, we define concepts in relation to other concepts. For automatic natural language processing, it’s often more effective to use dictionaries that define concepts in terms of their usage statistics. The word2vec family of models are the most popular way of creating these dictionaries. Given a large sample of text, word2vec gives you a dictionary where each definition is just a row of, say, 300 floating-point numbers. To find out whether two entries in the dictionary are similar, you ask how similar their definitions are – a well-defined mathematical operation.

Certain to be a hit at technical conferences and parties.

SGML wasn’t mentioned even once during 2015 in Reddit Comments.

Try some your favorites words and phrases.

Enjoy!

Comments Off

February 4, 2016

Toneapi helps your writing pack an emotional punch [Not For The Ethically Sensitive]

Filed under: Natural Language Processing,Sentiment Analysis — Patrick Durusau @ 8:04 pm

Toneapi helps your writing pack an emotional punch by Martin Bryant.

From the post:

Language analysis is a rapidly developing field and there are some interesting startups working on products that help you write better.

Take Toneapi, for example. This product from Northern Irish firm Adoreboard is a Web-based app that analyzes (and potentially improves) the emotional impact of your writing.

Paste in some text, and it will offer a detailed visualization of your writing.
…

If you aren’t overly concerned about manipulating, sorry, persuading your readers to your point of view, you might want to give Toneapi a spin. Martin reports that IBM’s Watson has Tone Analyzer and you should also consider Textio and Relative Insight.

Before this casts an Orwellian pale over your evening/day, remember that focus groups and testing messages have been the staple of advertising for decades.

What these software services do is make a crude form of that capability available to the average citizen.

Some people have a knack for emotional language, like Donald Trump, but I can’t force myself to write in incomplete sentences or with one syllable words. Maybe there’s an app for that? Suggestions?

Comments Off

January 25, 2016

Stanford NLP Blog – First Post

Filed under: Machine Learning,Natural Language Processing — Patrick Durusau @ 8:37 pm

Sam Bowman posted The Stanford NLI Corpus Revisited today at the Stanford NLP blog.

From the post:

Last September at EMNLP 2015, we released the Stanford Natural Language Inference (SNLI) Corpus. We’re still excitedly working to build bigger and better machine learning models to use it to its full potential, and we sense that we’re not alone, so we’re using the launch of the lab’s new website to share a bit of what we’ve learned about the corpus over the last few months.

What is SNLI?

SNLI is a collection of about half a million natural language inference (NLI) problems. Each problem is a pair of sentences, a premise and a hypothesis, labeled (by hand) with one of three labels: entailment, contradiction, or neutral. An NLI model is a model that attempts to infer the correct label based on the two sentences.

A high level overview of the SNLI corpus.

The news of Marvin Minsky‘s death, today, much have arrived too late for inclusion in the post.

Comments Off

December 18, 2015

There’s More Than One Kind Of Reddit Comment?

Filed under: Natural Language Processing,Sentiment Analysis — Patrick Durusau @ 11:52 am

‘Sarcasm detection on Reddit comments’

Contest ends: 15th of February, 2016.

From the webpage:

Sentiment analysis is a fairly well-developed field, but on the Internet, people often don’t say exactly what they mean. One of the toughest modes of communication for both people and machines to identify is sarcasm. Sarcastic statements often sound positive if interpreted literally, but through context and other cues the speaker indicates that they mean the opposite of what they say. In English, sarcasm is primarily communicated through verbal cues, meaning that it is difficult, even for native speakers, to determine it in text.

Sarcasm detection is a subtask of opinion mining. It aims at correctly identifying the user opinions expressed in the written text. Sarcasm detection plays a critical role in sentiment analysis by correctly identifying sarcastic sentences which can incorrectly flip the polarity of the sentence otherwise. Understanding sarcasm, which is often a difficult task even for humans, is a challenging task for machines. Common approaches for sarcasm detection are based on machine learning classifiers trained on simple lexical or dictionary based features. To date, some research in sarcasm detection has been done on collections of tweets from Twitter, and reviews on Amazon.com. For this task, we are interested in looking at a more conversational medium—comments on Reddit—in order to develop an algorithm that can use the context of the surrounding text to help determine whether a specific comment is sarcastic or not.
…

The premise of this competition is there is more than one kind of comment on Reddit, aside from sarcasm.

A surprising assumption I know but there you have it.

I wonder if participants will have to separate sarcastic + sexist, sarcastic + misogynistic, sarcastic + racist, sarcastic + abusive, into separate categories or will all sarcastic comments be classified as sarcasm?

I suppose the default case would be to assume all Reddit comments are some form of sarcasm and see how accurate that model proves to be when judged against the results of the competition.

Training data for sarcasm? Pointers anyone?

Comments Off

October 29, 2015

Parsing Academic Articles on Deadline

Filed under: Journalism,Natural Language Processing,News,Parsing — Patrick Durusau @ 8:10 pm

A group of researchers is trying to help science journalists parse academic articles on deadline by Joseph Lichterman.

From the post:

About 1.8 million new scientific papers are published each year, and most are of little consequence to the general public — or even read, really; one study estimates that up to half of all academic studies are only read by their authors, editors, and peer reviewers.

But the papers that are read can change our understanding of the universe — traces of water on Mars! — or impact our lives here on earth — sea levels rising! — and when journalists get called upon to cover these stories, they’re often thrown into complex topics without much background or understanding of the research that led to the breakthrough.

As a result, a group of researchers at Columbia and Stanford are in the process of developing Science Surveyor, a tool that algorithmically helps journalists get important context when reporting on scientific papers.

“The idea occurred to me that you could characterize the wealth of scientific literature around the topic of a new paper, and if you could do that in a way that showed the patterns in funding, or the temporal patterns of publishing in that field, or whether this new finding fit with the overall consensus with the field — or even if you could just generate images that show images very rapidly what the huge wealth, in millions of articles, in that field have shown — [journalists] could very quickly ask much better questions on deadline, and would be freed to see things in a larger context,” Columbia journalism professor Marguerite Holloway, who is leading the Science Surveyor effort, told me.

Science Surveyor is still being developed, but broadly the idea is that the tool takes the text of an academic paper and searches academic databases for other studies using similar terms. The algorithm will surface relevant articles and show how scientific thinking has changed through its use of language.

For example, look at the evolving research around neurogenesis, or the growth of new brain cells. Neurogenesis occurs primarily while babies are still in the womb, but it continues through adulthood in certain sections of the brain.

Up until a few decades ago, researchers generally thought that neurogenesis didn’t occur in humans — you had a set number of brain cells, and that’s it. But since then, research has shown that neurogenesis does in fact occur in humans.

“This tells you — aha! — this discovery is not an entirely new discovery,” Columbia professor Dennis Tenen, one of the researchers behind Science Surveyor, told me. “There was a period of activity in the ’70s, and now there is a second period of activity today. We hope to produce this interactive visualization, where given a paper on neurogenesis, you can kind of see other related papers on neurogenesis to give you the context for the story you’re telling.”

…

Given the number of papers published every year, an algorithmic approach like Science Surveyor is an absolute necessity.

But imagine how much richer the results would be if one of the three or four people who actually read the paper could easily link it to other research and context?

Or perhaps being a researcher who discovers the article and then blazes a trail to non-obvious literature that is also relevant?

Search engines now capture what choices users make in the links they follow but that’s a fairly crude approximation of relevancy of a particular resource. Such as not specifying why a particular resource is relevant.

Usage of literature should decide which articles merit greater attention from machine or human annotators. The last amount of humanities literature is never cited by anyone. Why spend resources annotating content that no one is likely to read?

Comments Off

October 7, 2015

NLP and Scala Resources

Filed under: Functional Programming,Natural Language Processing,Scala,ScalaNLP — Patrick Durusau @ 9:33 pm

Natural Language Processing and Scala Tutorials by Jason Baldridge.

An impressive collection of resources but in particular, the seventeen (17) Scala tutorials.

Unfortunately, given the state of search and indexing it isn’t possible to easily dedupe the content of these materials against others you may have already found.

Comments Off

September 11, 2015

Corpus of American Tract Society Publications

Filed under: Corpora,Corpus Linguistics,Natural Language Processing — Patrick Durusau @ 1:10 pm

Corpus of American Tract Society Publications by Lincoln Mullen.

From the post:

I’ve created a small to mid-sized corpus of publications by the American Tract Society up to the year 1900 in plain text. This corpus has been gathered from the Internet Archive. It includes 641 documents containing just under sixty million words, along with a CSV file containing metadata for each of the files. I don’t make any claims that this includes all of the ATS publications from that time period, and it is pretty obvious that the metadata from the Internet Archive is not much good. The titles are mostly correct; the dates are pretty far off in cases.

This corpus was created for the purpose of testing document similarity and text reuse algorithms. I need a corpus for testing the textreuse, which is in very early stages of development. From reading many, many of these tracts, I already know the patterns of text reuse. (And of course, the documents are historically interesting in their own right, and might be a good candidate for text mining.) The ATS frequently republished tracts under the same title. Furthermore, they published volumes containing the entire series of tracts that they had published individually. So there are examples of entire documents which are reprinted, but also some documents which are reprinted inside others. Then as a extra wrinkle, the corpus contains the editions of the Bible published by the ATS, plus their edition of Cruden’s concordance and a Bible dictionary. Likely all of the tracts quote the Bible, some at great length, so there are many examples of borrowing there.

Here is the corpus and its repository:

ats_corpus.zip (128 MB)

GitHub repository to reproduce the corpus

With the described repetition, the corpus must compress well.

Makes me wonder how much near-repetition occurs in CS papers?

Graph papers than repeat graph fundamentals, in nearly the same order, in paper after paper.

At what level would you measure re-use? Sentence? Paragraph? Larger divisions?

Comments Off

June 10, 2015

spaCy: Industrial-strength NLP

Filed under: Natural Language Processing — Patrick Durusau @ 9:55 am

spaCy: Industrial-strength NLP by Matthew Honnibal.

From the post:

spaCy is a new library for text processing in Python and Cython. I wrote it because I think small companies are terrible at natural language processing (NLP). Or rather: small companies are using terrible NLP technology.

To do great NLP, you have to know a little about linguistics, a lot about machine learning, and almost everything about the latest research. The people who fit this description seldom join small companies. Most are broke — they’ve just finished grad school. If they don’t want to stay in academia, they join Google, IBM, etc.

The net result is that outside of the tech giants, commercial NLP has changed little in the last ten years. In academia, it’s changed entirely. Amazing improvements in quality. Orders of magnitude faster. But the academic code is always GPL, undocumented, unuseable, or all three. You could implement the ideas yourself, but the papers are hard to read, and training data is exorbitantly expensive. So what are you left with? A common answer is NLTK, which was written primarily as an educational resource. Nothing past the tokenizer is suitable for production use.

I used to think that the NLP community just needed to do more to communicate its findings to software engineers. So I wrote two blog posts, explaining how to write a part-of-speech tagger and parser. Both were well received, and there’s been a bit of interest in my research software — even though it’s entirely undocumented, and mostly unuseable to anyone but me.

So six months ago I quit my post-doc, and I’ve been working day and night on spaCy since. I’m now pleased to announce an alpha release.

If you’re a small company doing NLP, I think spaCy will seem like a minor miracle. It’s by far the fastest NLP software ever released. The full processing pipeline completes in 7ms per document, including accurate tagging and parsing. All strings are mapped to integer IDs, tokens are linked to embedded word representations, and a range of useful features are pre-calculated and cached.

…

Matthew uses an example based on Stephen King’s admonition “the adverb is not your friend“, which immediately brought to mind the utility of tagging all adverbs and adjectives in a standards draft and then generating comments that identify its parent <p> element and the offending phrase.

I haven’t verified the performance comparisons, but as you know, the real question is how well spaCy works on your data, work flow, etc.?

Thanks to Matthew for the reminder of: On writing : a memoir of the craft by Stephen King. Documentation will never be as gripping as a King novel, but it shouldn’t be painful to read.

I first saw this in a tweet by Jason Baldridge.

Comments Off

May 30, 2015

NLP4L

Filed under: Lucene,Natural Language Processing — Patrick Durusau @ 7:16 pm

NLP4L

From the webpage:

NLP4L is a natural language processing tool for Apache Lucene written in Scala. The main purpose of NLP4L is to use the NLP technology to improve Lucene users’ search experience. Lucene/Solr, for example, already provides its users with auto-complete and suggestion functions for search keywords. Using NLP technology, NLP4L development members may be able to present better keywords. In addition, NLP4L provides functions to collaborate with existing machine learning tools, including one to directly create document vector from a Lucene index and write it to a LIBSVM format file.

As NLP4L processes document data registered in the Lucene index, you can directly access a word database normalized by powerful Lucene Analyzer and use handy search functions. Being written in Scala, NLP4L excels at trying ad hoc interactive processing as well.
…

The documentation is currently in Japanese with a TOC for the English version. Could be interesting if you want to try your hand either at translation and/or working from the API Docs.

Enjoy!

Comments Off

May 20, 2015

Political Futures Tracker

Filed under: Government,Natural Language Processing,Politics,Sentiment Analysis — Patrick Durusau @ 2:01 pm

Political Futures Tracker.

From the webpage:

The Political Futures Tracker tells us the top political themes, how positive or negative people feel about them, and how far parties and politicians are looking to the future.

This software will use ground breaking language analysis methods to examine data from Twitter, party websites and speeches. We will also be conducting live analysis on the TV debates running over the next month, seeing how the public respond to what politicians are saying in real time. Leading up to the 2015 UK General Election we will be looking across the political spectrum for emerging trends and innovation insights.

If that sounds interesting, consider the following from: Introducing… the Political Futures Tracker:

We are exploring new ways to analyse a large amount of data from various sources. It is expected that both the amount of data and the speed that it is produced will increase dramatically the closer we get to election date. Using a semi-automatic approach, text analytics technology will sift through content and extract the relevant information. This will then be examined and analysed by the team at Nesta to enable delivery of key insights into hotly debated issues and the polarisation of political opinion around them.

The team at the University of Sheffield has extensive experience in the area of social media analytics and Natural Language Processing (NLP). Technical implementation has started already, firstly with data collection which includes following the Twitter accounts of existing MPs and political parties. Once party candidate lists become available, data harvesting will be expanded accordingly.

In parallel, we are customising the University of Sheffield’s General Architecture for Text Engineering (GATE); an open source text analytics tool, in order to identify sentiment-bearing and future thinking tweets, as well as key target topics within these.

One thing we’re particularly interested in is future thinking. We describe this as making statements concerning events or issues in the future. Given these measures and the views expressed by a certain person, we can model how forward thinking that person is in general, and on particular issues, also comparing this with other people. Sentiment, topics, and opinions will then be aggregated and tracked over time.

Personally I suspect that “future thinking” is used in difference senses by the general population and political candidates. For a political candidate, however the rhetoric is worded, the “future” consists of reaching election day with 50% plus 1 vote. For the general population, the “future” probably includes a longer time span.

I mention this in case you can sell someone on the notion that what political candidates say today has some relevance to what they will do after election. President Obmana has been in office for six (6) years on office, the Guantanamo Bay detention camp remains open, no one has been held accountable for years of illegal spying on U.S. citizens, banks and other corporate interests have all but been granted keys to the U.S. Treasury, to name a few items inconsistent with his previous “future thinking.”

Unless you accept my suggestion that “future thinking” for a politician means election day and no further.

Comments Off

Analysis of named entity recognition and linking for tweets

Filed under: Entity Resolution,Natural Language Processing,Tweets — Patrick Durusau @ 1:29 pm

Analysis of named entity recognition and linking for tweets by Leon Derczynski, et al.

Abstract:

Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

The questions addressed by the paper are:

…
RQ1 How robust are state-of-the-art named entity recognition and linking methods on short and noisy microblog texts?

RQ2 What problem areas are there in recognising named entities in microblog posts, and what are the major causes of false negatives and false positives?

RQ3 Which problems need to be solved in order to further the state-of-the-art in NER and NEL on this difficult text genre?
…

The ultimate conclusion is that entity recognition in microblog posts falls short of what has been achieved for newswire text but if you need results now or at least by tomorrow, this is a good guide to what is possible and where improvements can be made.

Comments Off

Detecting Deception Strategies [Godsend for the 2016 Election Cycle]

Filed under: Language,Natural Language Processing,Politics — Patrick Durusau @ 10:20 am

Discriminative Models for Predicting Deception Strategies by Scott Appling, Erica Briscoe, C.J. Hutto.

Abstract:

Although a large body of work has previously investigated various cues predicting deceptive communications, especially as demonstrated through written and spoken language (e.g., [30]), little has been done to explore predicting kinds of deception. We present novel work to evaluate the use of textual cues to discriminate between deception strategies (such as exaggeration or falsification), concentrating on intentionally untruthful statements meant to persuade in a social media context. We conduct human subjects experimentation wherein subjects were engaged in a conversational task and then asked to label the kind(s) of deception they employed for each deceptive statement made. We then develop discriminative models to understand the difficulty between choosing between one and several strategies. We evaluate the models using precision and recall for strategy prediction among 4 deception strategies based on the most relevant psycholinguistic, structural, and data-driven cues. Our single strategy model results demonstrate as much as a 58% increase over baseline (random chance) accuracy and we also find that it is more difficult to predict certain kinds of deception than others.

The deception strategies studied in this paper:

Falsification
Exaggeration
Omission
Misleading

especially omission, will form the bulk of the content in the 2016 election cycle in the United States. Only deceptive statements were included in the test data, so the models were tested on correctly recognizing the deception strategy in a known deceptive statement.

The test data is remarkably similar to political content, which aside from their names and names of their opponents (mostly), is composed entirely of deceptive statements, albeit not marked for the strategy used in each one.

A web interface for loading pointers to video, audio or text with political content that emits tagged deception with pointers to additional information would be a real hit for the next U.S. election cycle. Monetize with ads, the sources of additional information, etc.

I first saw this in a tweet by Leon Derczynski.

Comments Off

May 2, 2015

New Natural Language Processing and NLTK Videos

Filed under: Natural Language Processing,NLTK,Python — Patrick Durusau @ 3:59 pm

Natural Language Processing With Python and NLTK p.1 Tokenizing words and Sentences and Stop Words – Natural Language Processing With Python and NLTK p.2 by Harrison Kinsley.

From part 1:

Natural Language Processing is the task we give computers to read and understand (process) written text (natural language). By far, the most popular toolkit or API to do natural language processing is the Natural Language Toolkit for the Python programming language.

The NLTK module comes packed full of everything from trained algorithms to identify parts of speech to unsupervised machine learning algorithms to help you train your own machine to understand a specific bit of text.

NLTK also comes with a large corpora of data sets containing things like chat logs, movie reviews, journals, and much more!

Bottom line, if you’re going to be doing natural language processing, you should definitely look into NLTK!

Playlist link: https://www.youtube.com/watch?v=FLZvO…

sample code: http://pythonprogramming.net
http://hkinsley.com
https://twitter.com/sentdex
http://sentdex.com
http://seaofbtc.com

Use the Playlist link: https://www.youtube.com/watch?v=FLZvO… link as I am sure more videos will be appearing in the near future.

Enjoy!

Comments Off

Older Posts »

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 5, 2019

January 2, 2019

May 6, 2018

December 24, 2017

November 14, 2017

October 28, 2017

October 7, 2017

September 28, 2017

July 26, 2017

July 11, 2017

April 19, 2017

January 12, 2017

November 26, 2016

November 3, 2016

May 5, 2016

March 29, 2016

March 8, 2016

February 15, 2016

February 4, 2016

January 25, 2016

December 18, 2015

October 29, 2015

October 7, 2015

September 11, 2015

June 10, 2015

May 30, 2015

May 20, 2015

May 2, 2015