## Archive for the ‘Natural Language Processing’ Category

### Linguists Circle the Wagons, or Disagreement != Danger

Thursday, May 16th, 2013

Pullum’s NLP Lament: More Sleight of Hand Than Fact by Christopher Phipps.

From the post:

My first reading of both of Pullum’s recent NLP posts (one and two) interpreted them to be hostile, an attack on a whole field (see my first response here). Upon closer reading, I see Pullum chooses his words carefully and it is less of an attack and more of a lament. He laments that the high-minded goals of early NLP (to create machines that process language like humans do) has not been reached, and more to the point, that commercial pressures have distracted the field from pursuing those original goals, hence they are now neglected. And he’s right about this to some extent.

But, he’s also taking the commonly used term “natural language processing” and insisting that it NOT refer to what 99% of people who use the term use it for, but rather only a very narrow interpretation consisting of something like “computer systems that mimic human language processing.” This is fundamentally unfair.

In the 1980s I was convinced that computers would soon be able to simulate the basics of what (I hope) you are doing right now: processing sentences and determining their meanings.

I feel Pullum is moving the goal posts on us when he says “there is, to my knowledge, no available system for unaided machine answering of free-form questions via general syntactic and semantic analysis” [my emphasis]. Pullum’s agenda appears to be to create a straw-man NLP world where NLP techniques are only admirable if they mimic human processing. And this is unfair for two reasons.

If there is unfairness in this discussion, it is the insistence by Christopher Phipps (and others) that Pullum has invented “…a straw-man NLP world where NLP techniques are only admirable if they mimic human processing.”

On the contrary, it was 1949 when Warren Weaver first proposed computers as the solution to world-wide translation problems. Weaver’s was not the only optimistic projection of language processing by computers. Those have continued up to and including the Semantic Web.

Yes, NLP practitioners such as Christopher Phipps use NLP in a more precise sense than Pullum. And NLP as defined by Phipps has too many achievements to easily list.

Neither one of those statements takes anything away from Pullum’s point that Google found a “sweet spot” between machine processing and human intelligence for search purposes.

What other insights Pullum has to offer may be obscured by the “…circle the wagons…” attitude from linguists.

Disagreement != Danger.

### Finding Significant Phrases in Tweets with NLTK

Sunday, May 12th, 2013

Finding Significant Phrases in Tweets with NLTK by Sujit Pal.

From the post:

Earlier this week, there was a question about finding significant phrases in text on the Natural Language Processing People (login required) group on LinkedIn. I suggested looking at this LingPipe tutorial. The idea is to find statistically significant word collocations, ie, those that occur more frequently than we can explain away as due to chance. I first became aware of this approach from the LLG Book, where two approaches are described – one based on Log-Likelihood Ratios (LLR) and one based on the Chi-Squared test of independence – the latter is used by LingPipe.

I had originally set out to actually provide an implementation for my suggestion (to answer a followup question). However, the Scipy Pydoc notes that the chi-squared test may be invalid when the number of observed or expected frequencies in each category are too small. Our algorithm compares just two observed and expected frequencies, so it probably qualifies. Hence I went with the LLR approach, even though it is slightly more involved.

The idea is to find, for each bigram pair, the likelihood that the components are dependent on each other versus the likelihood that they are not. For bigrams which have a positive LLR, we repeat the analysis by adding its neighbor word, and arrive at a list of trigrams with positive LLR, and so on, until we reach the N-gram level we think makes sense for the corpus. You can find an explanation of the math in one of my earlier posts, but you will probably find a better explanation in the LLG book.

For input data, I decided to use Twitter. I’m not that familiar with the Twitter API, but I’m taking the Introduction to Data Science course on Coursera, and the first assignment provided some code to pull data from the Twitter 1% feed, so I just reused that. I preprocess the feed so I am left with about 65k English tweets using the following code:

An interesting look “behind the glass” on n-grams.

I am using AntConc to generate n-grams for proofing standards prose.

But as a finished tool, AntConc doesn’t give you insight into the technical side of the process.
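For readers who want to see the technical side that a finished tool hides, here is a minimal sketch of bigram scoring with a log-likelihood ratio. It assumes Dunning's G² form of the statistic computed from a 2×2 contingency table, which is always non-negative, so a threshold would stand in for the "positive LLR" test in the quoted post. The function names `g2` and `llr_bigrams` are made up for this illustration, and the code is not taken from Sujit Pal's post.

```python
import math
from collections import Counter


def g2(k11, k12, k21, k22):
    """Dunning's G^2 for a 2x2 table of observed bigram counts."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = (
        (k11, rows[0], cols[0]),
        (k12, rows[0], cols[1]),
        (k21, rows[1], cols[0]),
        (k22, rows[1], cols[1]),
    )
    total = 0.0
    for observed, row, col in cells:
        if observed > 0:  # empty cells contribute nothing to the sum
            expected = row * col / n
            total += observed * math.log(observed / expected)
    return 2.0 * total


def llr_bigrams(tokens):
    """Return {(w1, w2): G^2 score} for every adjacent word pair."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (w1, w2), count in bigrams.items():
        first[w1] += count   # how often w1 starts a bigram
        second[w2] += count  # how often w2 ends a bigram
    scores = {}
    for (w1, w2), k11 in bigrams.items():
        k12 = first[w1] - k11    # w1 followed by something else
        k21 = second[w2] - k11   # something else followed by w2
        k22 = n - k11 - k12 - k21
        scores[(w1, w2)] = g2(k11, k12, k21, k22)
    return scores


tokens = "new york is big new york is old the big apple".split()
ranked = sorted(llr_bigrams(tokens).items(), key=lambda item: -item[1])
```

On this toy input the pair `("new", "york")` ranks well above pairs such as `("is", "big")`. Extending to trigrams, as the post describes, would mean repeating the test on each surviving bigram plus its neighbor word.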

### Enigma

Friday, May 10th, 2013

Enigma

I suppose it had to happen. With all the noise about public data sets, someone was bound to create a startup to search them.

Not a lot of detail at the site but you can sign up for a free trial.

Features:

100,000+ Public Data Sources: Access everything from import bills of lading, to aircraft ownership, lobbying activity, real estate assessments, spectrum licenses, financial filings, liens, government spending contracts and much, much more.

Augment Your Data: Get a more complete picture of investments, customers, partners, and suppliers. Discover unseen correlations between events, geographies and transactions.

API Access: Get direct access to the data sets, relational engine and NLP technologies that power Enigma.

Request Custom Data: Can’t find a data set anywhere else? Need to synthesize data from disparate sources? We are here to help.

Discover While You Work: Never miss a critical piece of information. Enigma uncovers entities in context, adding intelligence and insight to your daily workflow.

Powerful Context Filters: Our vast collection of public data sits atop a proprietary data ontology. Filter results by topics, tags and source to quickly refine and scope your query.

Focus on the Data: Immerse yourself in the details. Data is presented in its raw form, full screen and without distraction.

Curated Metadata: Source data is often unorganized and poorly documented. Our domain experts focus on sanitizing, organizing and annotating the data.

Easy Filtering: Rapidly prototype hypotheses by refining and shaping data sets in context. Filter tools allow the sorting, refining, and mathematical manipulation of data sets.

The “proprietary data ontology” jumps out at me as an obvious question. Do users get to know what the ontology is?

Not to mention “our domain experts focus on sanitizing,….” That works in some cases; take legal research, for example. I am not sure that “your” experts work as well as “my” experts in less focused areas.

Looking forward to learning more about Enigma!

### Why Are We Still Waiting for Natural Language Processing?

Friday, May 10th, 2013

Why Are We Still Waiting for Natural Language Processing? by Geoffrey Pullum.

From the post:

Try typing this, or any question with roughly the same meaning, into the Google search box:

Which UK papers are not part of the Murdoch empire?

Your results (and you could get identical ones by typing the same words in the reverse order) will contain an estimated two million or more pages about Rupert Murdoch and the newspapers owned by his News Corporation. Exactly what you did not ask for.

Putting quotes round the search string freezes the word order, but makes things worse: It calls not for the answer (which would be a list including The Daily Telegraph, the Daily Mail, the Daily Mirror, etc.) but for pages where the exact wording of the question can be found, and there probably aren’t any (except this post).

Machine answering of such a question calls for not just a database of information about newspapers but also natural language processing (NLP). I’ve been waiting for NLP to arrive for 30 years. Whatever happened?
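The claim that reversing the words gives identical results is what you would expect if a query were treated as an unordered bag of words. The sketch below assumes exactly that simplified model; it is not a description of how Google actually processes queries, and `bag_of_words` is a name invented for the illustration.

```python
from collections import Counter


def bag_of_words(query):
    """Lowercase, strip edge punctuation, and count terms; order is discarded."""
    terms = (word.strip("?.,!\"'").lower() for word in query.split())
    return Counter(term for term in terms if term)


question = "Which UK papers are not part of the Murdoch empire?"
reversed_question = " ".join(reversed(question.split()))

same_bag = bag_of_words(question) == bag_of_words(reversed_question)
```

Under this model `same_bag` is true, and “not” is just one more term to match rather than a negation, which is why the results are about Murdoch's papers instead of everyone else's.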

This is a series you need to follow.

Geoffrey promises to report on three “unexpected developments” that relate to natural language processing.

The next installment is due to appear Monday, May 13, 2013.

Of course, with curated content, as with a topic map, you get a find-once/read-many result (FORM?).

### Natural Language Processing and Big Data…

Wednesday, May 8th, 2013

Natural Language Processing and Big Data: Using NLTK and Hadoop – Talk Overview by Benjamin Bengfort.

From the post:

My previous startup, Unbound Concepts, created a machine learning algorithm that determined the textual complexity (e.g. reading level) of children’s literature. Our approach started as a natural language processing problem — designed to pull out language features to train our algorithms, and then quickly became a big data problem when we realized how much literature we had to go through in order to come up with meaningful representations. We chose to combine NLTK and Hadoop to create our Big Data NLP architecture, and we learned some useful lessons along the way. This series of posts is based on a talk done at the April Data Science DC meetup.

Think of this post as the Cliff Notes of the talk and the upcoming series of posts so you don’t have to read every word … but trust me, it’s worth it.

If you can’t wait for the future posts, Benjamin’s presentation from April is here. Amusing but fairly sparse slides.

Big Data and Natural Language Processing – Part 1

The “Foo” of Big Data – Part 2

Python’s Natural Language Toolkit (NLTK) and Hadoop – Part 3

Hadoop for Preprocessing Language – Part 4

Beyond Preprocessing – Weakly Inferred Meanings – Part 5

### Inter-Document Similarity with Scikit-Learn and NLTK

Saturday, May 4th, 2013

Inter-Document Similarity with Scikit-Learn and NLTK by Sujit Pal.

From the post:

Someone recently asked me about using Python to calculate document similarity across text documents. The application had to do with cheating detection, ie, compare student transcripts and flag documents with (abnormally) high similarity for further investigation. For security reasons, I could not get access to actual student transcripts. But the basic idea was to convince ourselves that this approach is valid, and come up with a code template for doing this.

I have been playing quite a bit with NLTK lately, but for this work, I decided to use the Python ML Toolkit Scikit-Learn, which has pretty powerful text processing facilities. I did end up using NLTK for its cosine similarity function, but that was about it.

I decided to use the coffee-sugar-cocoa mini-corpus of 53 documents to test out the code – I first found this in Dr Manu Konchady’s TextMine project, and I have used it off and on. For convenience I have made it available at the github location for the sub-project.
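To make the idea concrete, here is a minimal standard-library sketch of TF-IDF weighting followed by cosine similarity. It does not use Scikit-Learn or NLTK and is not the code from the quoted post; the whitespace tokenizer, the `log(N/df) + 1` weighting, and the function names are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn raw strings into sparse {term: weight} TF-IDF dictionaries."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {term: count * (math.log(n_docs / df[term]) + 1.0)
             for term, count in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["coffee prices rose", "coffee prices rose", "sugar exports fell"]
vectors = tfidf_vectors(docs)
```

Identical documents score (approximately) 1.0 and documents with no shared terms score 0.0; a cheating detector along the lines described would flag pairs above some threshold picked by inspection.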

Similarity measures are fairly well understood.

But they lack interesting data sets for testing code.

Here are some random suggestions:

• Speeches by Republicans on Benghazi
• Speeches by Democrats on Gun Control
• TV reports on any particular disaster
• News reports of sporting events
• Dialogue from popular TV shows

With a five to ten second lag, perhaps streams of speech could be monitored for plagiarism or repetition and simply dropped.

### NLPCS 2013

Saturday, May 4th, 2013

NLPCS 2013: 10th International Workshop on Natural Language Processing and Cognitive Science

When Oct 15, 2013 – Oct 16, 2013
Where Marseille, France
Final Version Due Sep 15, 2013

From the webpage:

The aim of this workshop is to foster interactions among researchers and practitioners in Natural Language Processing (NLP) by taking a Cognitive Science perspective. What characterises this kind of approach is the fact that NLP is considered from various viewpoints (linguistics, psychology, neurosciences, artificial intelligence,…), and that a deliberate effort is made to reconcile or integrate them into a coherent whole.

We believe that this is necessary, as the modelling of the process is simply too complex to be addressed by a single discipline. No matter whether we deal with a natural or artificial system (people or computers) or a combination of both (interactive NLP), systems rely on many types of very different knowledge sources. Hence, strategies vary considerably depending on the person (novice, expert), on the available knowledge (internal and external), and on the nature of the information processor: human, machines or both (human-machine communication).

The problem we are confronted with and the spirit of the workshop can fairly well be captured via the following two quotations :

• “Build models that illustrate somehow the way people use language” (slightly adapted comment taken from Martin Kay’s talk given when receiving the Lifetime Achievement Award, http://acl.ldc.upenn.edu/J/J05/J05-4001.pdf)
• “Make machines behave more like humans, rather than make people behave like machines” (“Humaniser la machine, ne pas mécaniser l’utilisateur”, O. Nérot)

This kind of workshop provides an excellent opportunity to get closer to these goals. Encouraging cross-fertilization it may possibly lead to the creation of true semiotic extensions, i.e. the development of brain inspired (or brain compatible) cognitive systems.

I rather like the focus of this NLP workshop.

You?

### Open Sentiment Analysis

Thursday, May 2nd, 2013

Open Sentiment Analysis by Pete Warden.

From the post:

Sentiment analysis is fiendishly hard to solve well, but easy to solve to a first approximation. I’ve been frustrated that there have been no easy free libraries that make the technology available to non-specialists like me. The problem isn’t with the code, there are some amazing libraries like NLTK out there, but everyone guards their training sets of word weights jealously. I was pleased to discover that SentiWordNet is now CC-BY-SA, but even better I found that Finn Årup has made a drop-dead simple list of words available under an Open Database License!

With that in hand, I added some basic tokenizing code and was able to implement a new text2sentiment API endpoint for the Data Science Toolkit:

http://www.datasciencetoolkit.org/developerdocs#text2sentiment

BTW, while you are there, take a look at the Data Science Toolkit more generally.

Sentiment analysis with undisclosed word weights sounds iffy to me.

It’s like getting a list of rounded numbers but you don’t know the rounding factor.

Even worse with sentiment analysis because every rounding factor may be different.
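By contrast, here is what word-list scoring looks like when the weights are out in the open. This is a minimal sketch only: the four weights below are hypothetical values made up for the example (they are not taken from Finn Årup's list), and averaging the matched weights is an assumption, not a description of how the text2sentiment endpoint works.

```python
import re

# Hypothetical weights for illustration; a real list has thousands of entries.
WORD_WEIGHTS = {"great": 3, "good": 2, "bad": -2, "awful": -3}


def sentiment_score(text, weights=WORD_WEIGHTS):
    """Average the weights of the known words in text; 0.0 if none match."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [weights[token] for token in tokens if token in weights]
    return sum(hits) / len(hits) if hits else 0.0
```

With the weights visible, anyone can see why `sentiment_score("good but awful")` comes out at -0.5 and can argue with the numbers.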

### Resources and Readings for Big Data Week DC Events

Tuesday, April 23rd, 2013

Resources and Readings for Big Data Week DC Events

This is Big Data Week in DC, and Data Community DC has put together a list of books, articles, and posts to keep you busy all week.

Very cool!

### TSDW:… [Enterprise Disambiguation]

Monday, April 22nd, 2013

TSDW: Two-stage word sense disambiguation using Wikipedia by Chenliang Li, Aixin Sun, Anwitaman Datta. (Li, C., Sun, A. and Datta, A. (2013), TSDW: Two-stage word sense disambiguation using Wikipedia. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22829)

Abstract:

The semantic knowledge of Wikipedia has proved to be useful for many tasks, for example, named entity disambiguation. Among these applications, the task of identifying the word sense based on Wikipedia is a crucial component because the output of this component is often used in subsequent tasks. In this article, we present a two-stage framework (called TSDW) for word sense disambiguation using knowledge latent in Wikipedia. The disambiguation of a given phrase is applied through a two-stage disambiguation process: (a) The first-stage disambiguation explores the contextual semantic information, where the noisy information is pruned for better effectiveness and efficiency; and (b) the second-stage disambiguation explores the disambiguated phrases of high confidence from the first stage to achieve better redisambiguation decisions for the phrases that are difficult to disambiguate in the first stage. Moreover, existing studies have addressed the disambiguation problem for English text only. Considering the popular usage of Wikipedia in different languages, we study the performance of TSDW and the existing state-of-the-art approaches over both English and Traditional Chinese articles. The experimental results show that TSDW generalizes well to different semantic relatedness measures and text in different languages. More important, TSDW significantly outperforms the state-of-the-art approaches with both better effectiveness and efficiency.

TSDW works because Wikipedia is a source of unambiguous phrases that can also be used to disambiguate phrases that on first pass are not unambiguous.

But Wikipedia did not always exist and was built out of the collaboration of thousands of users over time.

Does that offer a clue as to building better search tools for enterprise data?

What if statistically improbable phrases are mined from new enterprise documents and links created to definitions for those phrases?

I’m thinking that picking a current starting point avoids a “…boil the ocean…” scenario before benefits can be shown.

Current content is also more likely to be a search target.

Domain expertise and literacy required.

Expertise in logic or ontologies not.

### NLP Programming Tutorial

Friday, April 19th, 2013

NLP Programming Tutorial by Graham Neubig.

From the webpage:

This is a tutorial I did at NAIST for people to start learning how to program basic algorithms for natural language processing.

You should need very little programming experience to start out, but each of the tutorials builds on the stuff from the previous tutorials, so it is highly recommended that you do them in order. You can also download the data for the practice exercises.

These are slides, so you will need to supply reading materials, references, local data sets of interest, etc.

### A Survey of Stochastic and Gazetteer Based Approaches for Named Entity Recognition

Thursday, April 18th, 2013

From the post:

Generally speaking, the most effective named entity recognition systems can be categorized as rule-based, gazetteer and machine learning approaches. Within each of these approaches are a myriad of sub-approaches that combine to varying degrees each of these top-level categorizations. However, because of the research challenge posed by each approach, typically one or the other is focused on in the literature.

Rule-based systems utilize pattern-matching techniques in text as well as heuristics derived either from the morphology or the semantics of the input sequence. They are generally used as classifiers in machine-learning approaches, or as candidate taggers in gazetteers. Some applications can also make effective use of stand-alone rule-based systems, but they are prone to both overreach and skipping over named entities. Rule-based approaches are discussed in (10), (12), (13), and (14).

Gazetteer approaches make use of some external knowledge source to match chunks of the text via some dynamically constructed lexicon or gazette to the names and entities. Gazetteers also further provide a non-local model for resolving multiple names to the same entity. This approach requires either the hand crafting of name lexicons or some dynamic approach to obtaining a gazette from the corpus or another external source. However, gazette based approaches achieve better results for specific domains. Most of the research on this topic focuses on the expansion of the gazetteer to more dynamic lexicons, e.g. the use of Wikipedia or Twitter to construct the gazette. Gazette based approaches are discussed in (15), (16), and (17).

Stochastic approaches fare better across domains, and can perform predictive analysis on entities that are unknown in a gazette. These systems use statistical models and some form of feature identification to make predictions about named entities in text. They can further be supplemented with smoothing for universal coverage. Unfortunately these approaches require large amounts of annotated training data in order to be effective, and they don’t naturally provide a non-local model for entity resolution. Systems implemented with this approach are discussed in (7), (8), (4), (9), and (6).
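The gazetteer approach described above can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-built lexicon and greedy longest-match lookup; the entries in `GAZETTEER` and the function name are invented for the example and do not come from the survey.

```python
# Hypothetical hand-crafted lexicon: lowercased token tuples -> entity type.
GAZETTEER = {
    ("securities", "and", "exchange", "commission"): "ORG",
    ("washington",): "LOC",
    ("new", "york"): "LOC",
}


def tag_with_gazetteer(tokens, gazetteer=GAZETTEER):
    """Greedy longest-match lookup of token spans in the gazetteer."""
    max_len = max((len(key) for key in gazetteer), default=0)
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shrink.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(token.lower() for token in tokens[i:i + size])
            if key in gazetteer:
                entities.append((" ".join(tokens[i:i + size]), gazetteer[key]))
                i += size
                break
        else:
            i += 1  # no entry starts here; move on one token
    return entities
```

The weakness noted in the quoted text shows up immediately: `tag_with_gazetteer("Paris is lovely".split())` returns nothing, because “Paris” is not in the lexicon. That gap is what the stochastic approaches are meant to cover.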

Benjamin continues his excellent survey of named entity recognition techniques.

All of these techniques may prove to be useful in constructing topic maps from source materials.

### An Introduction to Named Entity Recognition…

Wednesday, April 17th, 2013

An Introduction to Named Entity Recognition in Natural Language Processing – Part 1 by Benjamin Bengfort.

From the post:

Abstract:

The task of identifying proper names of people, organizations, locations, or other entities is a subtask of information extraction from natural language documents. This paper presents a survey of techniques and methodologies that are currently being explored to solve this difficult subtask. After a brief review of the challenges of the task, as well as a look at previous conventional approaches, the focus will shift to a comparison of stochastic and gazetteer based approaches. Several machine-learning approaches are identified and explored, as well as a discussion of knowledge acquisition relevant to recognition. This two-part white paper will show that applications that require named entity recognition will be served best by some combination of knowledge-based and non-deterministic approaches.

Introduction:

In school we were taught that a proper noun was “a specific person, place, or thing,” thus extending our definition from a concrete noun. Unfortunately, this seemingly simple mnemonic masks an extremely complex computational linguistic task—the extraction of named entities, e.g. persons, organizations, or locations from corpora (1). More formally, the task of Named Entity Recognition and Classification can be described as the identification of named entities in computer readable text via annotation with categorization tags for information extraction.

Not only is named entity recognition a subtask of information extraction, but it also plays a vital role in reference resolution, other types of disambiguation, and meaning representation in other natural language processing applications. Semantic parsers, part of speech taggers, and thematic meaning representations could all be extended with this type of tagging to provide better results. Other, NER-specific, applications abound including question and answer systems, automatic forwarding, textual entailment, and document and news searching. Even at a surface level, an understanding of the named entities involved in a document provides much richer analytical frameworks and cross-referencing.

Named entities have three top-level categorizations according to DARPA’s Message Understanding Conference: entity names, temporal expressions, and number expressions (2). Because the entity names category describes the unique identifiers of people, locations, geopolitical bodies, events, and organizations, these are usually referred to as named entities and as such, much of the literature discussed in this paper focuses solely on this categorization, although it is easy to imagine extending the proposed systems to cover the full MUC-7 task. Further, the CoNLL-2003 Shared Task, upon which the standard of evaluation for such systems is based, only evaluates the categorization of organizations, persons, locations, and miscellaneous named entities. For example:

(ORG S.E.C.) chief (PER Mary Shapiro) to leave (LOC Washington) in December.

This sentence contains three named entities that demonstrate many of the complications associated with named entity recognition. First, S.E.C. is an acronym for the Securities and Exchange Commission, which is an organization. The two words “Mary Shapiro” indicate a single person, and Washington, in this case, is a location and not a name. Note also that the token “chief” is not included in the person tag, although it very well could be. In this scenario, it is ambiguous if “S.E.C. chief Mary Shapiro” is a single named entity, or if multiple, nested tags would be required.
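The bracketed annotation in the example sentence is easy to read back into (text, type) pairs. The sketch below assumes the flat `(TYPE text)` notation shown above with the four CoNLL-2003 labels; the pattern and function name are illustrative, not part of any standard tool.

```python
import re

# Matches flat annotations such as "(PER Mary Shapiro)"; nesting is not handled.
ENTITY_PATTERN = re.compile(r"\((ORG|PER|LOC|MISC) ([^()]+)\)")


def extract_entities(tagged):
    """Return (entity text, entity type) pairs in order of appearance."""
    return [(m.group(2), m.group(1)) for m in ENTITY_PATTERN.finditer(tagged)]


sentence = ("(ORG S.E.C.) chief (PER Mary Shapiro) to leave "
            "(LOC Washington) in December.")
entities = extract_entities(sentence)
```

Because the pattern forbids parentheses inside an entity, the nested reading of “S.E.C. chief Mary Shapiro” mentioned in the quoted text would need a different representation, such as spans with offsets.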

Nice introduction to the area and ends with a great set of references.

Looking forward to part 2!

### NLTK 2.3 – Working with Wordnet

Friday, April 12th, 2013

NLTK 2.3 – Working with Wordnet by Vsevolod Dyomkin.

From the post:

I’m a little bit behind my schedule of implementing NLTK examples in Lisp with no posts on topic in March. It doesn’t mean that work on CL-NLP has stopped – I’ve just had an unexpected vacation and also worked on parts, related to writing programs for the excellent Natural Language Processing by Michael Collins Coursera course.

Today we’ll start looking at Chapter 2, but we’ll do it from the end, first exploring the topic of Wordnet.

Vsevolod more than makes up for his absence with his post on Wordnet.

As a sample, consider this graphic of the potential of Wordnet:

Pay particular attention to the coverage of similarity measures.
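As one example of such a measure, path similarity scores two concepts as 1 / (1 + length of the shortest path between them in the hypernym hierarchy). The sketch below applies that formula to a tiny hand-made taxonomy; the `HYPERNYM` table is invented for the illustration and is not WordNet data, and the code is in Python rather than the Lisp used in Vsevolod's post.

```python
# Toy single-parent taxonomy: child -> hypernym (parent).
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}


def ancestors(concept):
    """Map the concept and each of its ancestors to the steps needed to reach it."""
    steps_to = {}
    steps = 0
    while concept is not None:
        steps_to[concept] = steps
        concept = HYPERNYM.get(concept)
        steps += 1
    return steps_to


def path_similarity(a, b):
    """1 / (1 + shortest path length), or None if the concepts never meet."""
    up_a, up_b = ancestors(a), ancestors(b)
    shared = set(up_a) & set(up_b)
    if not shared:
        return None
    distance = min(up_a[c] + up_b[c] for c in shared)
    return 1.0 / (1.0 + distance)
```

In this toy tree “dog” is closer to “wolf” than to “cat”, since the first pair meets at “canine” and the second only at “carnivore”.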

Enjoy!

### 50,000 Lessons on How to Read:…

Friday, April 12th, 2013

50,000 Lessons on How to Read: a Relation Extraction Corpus by Dave Orr, Product Manager, Google Research.

From the post:

One of the most difficult tasks in NLP is called relation extraction. It’s an example of information extraction, one of the goals of natural language understanding. A relation is a semantic connection between (at least) two entities. For instance, you could say that Jim Henson was in a spouse relation with Jane Henson (and in a creator relation with many beloved characters and shows).

The goal of relation extraction is to learn relations from unstructured natural language text. The relations can be used to answer questions (“Who created Kermit?”), learn which proteins interact in the biomedical literature, or to build a database of hundreds of millions of entities and billions of relations to try and help people explore the world’s information.

To help researchers investigate relation extraction, we’re releasing a human-judged dataset of two relations about public figures on Wikipedia: nearly 10,000 examples of “place of birth”, and over 40,000 examples of “attended or graduated from an institution”. Each of these was judged by at least 5 raters, and can be used to train or evaluate relation extraction systems. We also plan to release more relations of new types in the coming months.
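Since each example carries several human judgments, a consumer of such a data set needs some rule for collapsing them into one label. Here is a minimal sketch using a simple majority vote; the `"yes"`/`"no"` labels and the function name are assumptions for illustration, not the actual format of the released corpus.

```python
from collections import Counter


def majority_label(judgments, min_raters=5):
    """Return (winning label, share of raters who chose it)."""
    if len(judgments) < min_raters:
        raise ValueError("not enough rater judgments")
    # On a tie, most_common keeps the label that was seen first.
    (label, votes), = Counter(judgments).most_common(1)
    return label, votes / len(judgments)
```

Keeping the agreement share alongside the label lets you drop or down-weight examples where the raters were split.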

Another step in the “right” direction.

This is a human-curated set of relation semantics.

Rather than trying to apply this as a universal “standard,” what if you were to create a similar data set for your domain/enterprise?

Using human curators to create and maintain a set of relation semantics?

Being a topic mappish sort of person, I suggest the basis for their identification of the relationship be explicit, for robust re-use.

But you can repeat the same analysis over and over again if you prefer.

### Apache cTAKES

Wednesday, April 10th, 2013

Apache cTAKES

From the webpage:

Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text. It processes clinical notes, identifying types of clinical named entities from various dictionaries including the Unified Medical Language System (UMLS) – medications, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, subject (patient, family member, etc.) and context (negated/not negated, conditional, generic, degree of certainty). Some of the attributes are expressed as relations, for example the location of a clinical condition (locationOf relation) or the severity of a clinical condition (degreeOf relation).
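To picture what one such annotated entity carries, here is a minimal sketch of a record with the attributes listed above. The class and field names are hypothetical and chosen for readability; they are not cTAKES's actual type system, which lives in Java and UIMA, and the ontology code is a placeholder rather than a real UMLS identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClinicalEntity:
    begin: int                # text span start (character offset)
    end: int                  # text span end (exclusive)
    semantic_type: str        # e.g. "sign/symptom"
    ontology_code: str        # placeholder for an ontology mapping code
    subject: str = "patient"  # patient, family member, etc.
    negated: bool = False     # context: negated / not negated
    conditional: bool = False
    generic: bool = False

    def covered_text(self, note):
        """Return the stretch of the note this entity annotates."""
        return note[self.begin:self.end]


note = "Patient denies chest pain."
start = note.index("chest pain")
entity = ClinicalEntity(
    begin=start,
    end=start + len("chest pain"),
    semantic_type="sign/symptom",
    ontology_code="CODE-PLACEHOLDER",
    negated=True,  # "denies" negates the finding
)
```

The span offsets keep the annotation tied to the original text, while the context flags record that this particular symptom was reported as absent.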

Apache cTAKES was built using the Apache UIMA Unstructured Information Management Architecture engineering framework and Apache OpenNLP natural language processing toolkit. Its components are specifically trained for the clinical domain out of diverse manually annotated datasets, and create rich linguistic and semantic annotations that can be utilized by clinical decision support systems and clinical research. cTAKES has been used in a variety of use cases in the domain of biomedicine such as phenotype discovery, translational science, pharmacogenomics and pharmacogenetics.

Apache cTAKES employs a number of rule-based and machine learning methods. Apache cTAKES components include:

1. Sentence boundary detection
2. Tokenization (rule-based)
3. Morphologic normalization
4. POS tagging
5. Shallow parsing
6. Named Entity Recognition
   • Dictionary mapping
   • Semantic typing is based on these UMLS semantic types: diseases/disorders, signs/symptoms, anatomical sites, procedures, medications
7. Assertion module
8. Dependency parser
9. Constituency parser
10. Semantic Role Labeler
11. Coreference resolver
12. Relation extractor
13. Drug Profile module
14. Smoking status classifier

The goal of cTAKES is to be a world-class natural language processing system in the healthcare domain. cTAKES can be used in a great variety of retrievals and use cases. It is intended to be modular and expandable at the information model and method level.
The cTAKES community is committed to best practices and R&D (research and development) by using cutting edge technologies and novel research. The idea is to quickly translate the best performing methods into cTAKES code.

Processing a text with cTAKES is a process of adding semantic information to the text.

As you can imagine, the better the semantics that are added, the better searching and other functions become.

In order to make added semantic information interoperable, well, that’s a topic map question.

I first saw this in a tweet by Tim O’Reilly.

### Implementing the RAKE Algorithm with NLTK

Monday, March 25th, 2013

Implementing the RAKE Algorithm with NLTK by Sujit Pal.

From the post:

The Rapid Automatic Keyword Extraction (RAKE) algorithm extracts keywords from text, by identifying runs of non-stopwords and then scoring these phrases across the document. It requires no training, the only input is a list of stop words for a given language, and a tokenizer that splits the text into sentences and sentences into words.

The RAKE algorithm is described in the book Text Mining Applications and Theory by Michael W Berry (free PDF). There is a (relatively) well-known Python implementation and somewhat less well-known Java implementation.

I started looking for something along these lines because I needed to parse a block of text before vectorizing it and using the resulting features as input to a predictive model. Vectorizing text is quite easy with Scikit-Learn as shown in its Text Processing Tutorial. What I was trying to do was to cut down the noise by extracting keywords from the input text and passing a concatenation of the keywords into the vectorizer. It didn’t improve results by much in my cross-validation tests, however, so I ended up not using it. But keyword extraction can have other uses, so I decided to explore it a bit more.

I had started off using the Python implementation directly from my application code (by importing it as a module). I soon noticed that it was doing a lot of extra work because it was implemented in pure Python. I was using NLTK anyway for other stuff in this application, so it made sense to convert it to also use NLTK so I could hand off some of the work to NLTK’s built-in functions. So here is another RAKE implementation, this time using Python and NLTK.
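The description above is compact enough to sketch with the standard library alone. This is a minimal illustration, not Sujit Pal's NLTK version: it assumes a tiny hand-picked stop list, splits on punctuation with a regular expression, and scores each word as degree/frequency before summing per phrase.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list; a real run would use a full list for the language.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "is", "are",
             "in", "on", "to", "with", "by"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of non-stopwords, never crossing punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text, stopwords=STOPWORDS):
    """Return [(phrase, score)] sorted from highest to lowest score."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scores = {
        phrase: sum(degree[word] / freq[word] for word in phrase)
        for phrase in set(phrases)
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


text = ("Keyword extraction is easy. Rapid automatic keyword extraction "
        "needs a list of stop words.")
keywords = rake(text)
```

Because scores are summed over the words in a phrase, long runs of non-stopwords tend to rise to the top, which is worth keeping in mind when using the output as candidate topics.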

Reminds me of the “statistically insignificant phrases” at Amazon. Or was that “statistically improbable phrases?”

If you search on “statistically improbable phrases,” you get twenty (20) “hits” under books at Amazon.com.

Could be a handy tool to quickly extract candidates for topics in a topic map.
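
For a feel of what the quoted description means by "runs of non-stopwords" and scoring, here is a minimal sketch in plain Python. It is not Sujit's NLTK implementation. The tiny `STOPWORDS` set, the regular-expression tokenizer and the function names are all assumptions made for illustration. Word scores use the degree/frequency ratio commonly associated with RAKE.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list (an assumption); real use needs a full list.
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "from", "in",
             "is", "it", "of", "on", "the", "then", "these", "to", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of consecutive non-stopwords."""
    phrases = []
    # Punctuation is treated as a hard phrase boundary.
    for fragment in re.split(r"[.!?,;:()\"]+", text.lower()):
        run = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in stopwords:
                if run:
                    phrases.append(tuple(run))
                run = []
            else:
                run.append(word)
        if run:
            phrases.append(tuple(run))
    return phrases


def rake(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # summed length of the phrases it occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = rake("Keyword extraction is easy. Rapid keyword extraction scores phrases.")
```

Longer runs tend to score higher under this scheme, which is why a good stop list matters so much: every missed stopword glues two candidate phrases together.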

### Applied Natural Language Processing

Wednesday, March 20th, 2013

Applied Natural Language Processing by Jason Baldridge.

Description:

This class will provide instruction on applying algorithms in natural language processing and machine learning for experimentation and for real world tasks, including clustering, classification, part-of-speech tagging, named entity recognition, topic modeling, and more. The approach will be practical and hands-on: for example, students will program common classifiers from the ground up, use existing toolkits such as OpenNLP, Chalk, StanfordNLP, Mallet, and Breeze, construct NLP pipelines with UIMA, and get some initial experience with distributed computation with Hadoop and Spark. Guidance will also be given on software engineering, including build tools, git, and testing. It is assumed that students are already familiar with machine learning and/or computational linguistics and that they already are competent programmers. The programming language used in the course will be Scala; no explicit instruction will be given in Scala programming, but resources and assistance will be provided for those new to the language.

From the syllabus:

The foremost goal of this course is to provide practical exposure to the core techniques and applications of natural language processing. By the end, students will understand the motivations for and capabilities of several core natural language processing and machine learning algorithms and techniques used in text analysis, including:

• regular expressions
• vector space models
• clustering
• classification
• deduplication
• n-gram language models
• topic models
• part-of-speech tagging
• named entity recognition
• PageRank
• label propagation
• dependency parsing

We will show, on a few chosen topics, how natural language processing builds on and uses the fundamental data structures and algorithms presented in this course. In particular, we will discuss:

• language identification
• spam detection
• sentiment analysis
• influence
• information extraction
• geolocation

Students will learn to write non-trivial programs for natural language processing that take advantage of existing open source toolkits. The course will involve significant guidance and instruction in software engineering practices and principles, including:

• functional programming
• distributed version control systems (git)
• build systems
• unit testing

The course will help prepare students both for jobs in the industry and for doing original research that involves natural language processing.

A great start to one aspect of being a “data scientist.”

I encountered this course via the Nak (Scala library for NLP) project. Version 1.1.1 was just released and I saw a tweet from Jason Baldridge on the same.

The course materials have exercises and a rich set of links to other resources.

You may also enjoy:

Bcomposes (Jason’s blog).

### Elasticsearch OpenNLP Plugin

Saturday, March 9th, 2013

Elasticsearch OpenNLP Plugin

From the webpage:

This plugin uses the opennlp project to extract named entities from an indexed field. This means, when a certain field of a document is indexed, you can extract entities like persons, dates and locations from it automatically and store them in additional fields.

Extracting entities into roles perhaps?

### NLTK 1.3 – Computing with Language: Simple Statistics

Wednesday, March 6th, 2013

NLTK 1.3 – Computing with Language: Simple Statistics by Vsevolod Dyomkin.

From the post:

Most of the remaining parts of the first chapter of NLTK book serve as an introduction to Python in the context of text processing. I won’t translate that to Lisp, because there’re much better resources explaining how to use Lisp properly. First and foremost I’d refer anyone interested to the appropriate chapters of Practical Common Lisp:

It’s only worth noting that Lisp has a different notion of lists, than Python. Lisp’s lists are linked lists, while Python’s are essentially vectors. Lisp also has vectors as a separate data-structure, and it also has multidimensional arrays (something Python mostly lacks). And the set of Lisp’s list operations is somewhat different from Python’s. List is the default sequence data-structure, but you should understand its limitations and know, when to switch to vectors (when you will have a lot of elements and often access them at random). Also Lisp doesn’t provide Python-style syntactic sugar for slicing and dicing lists, although all the operations are there in the form of functions. The only thing which isn’t easily reproducible in Lisp is assigning to a slice:

Vsevolod continues his journey through chapter 1 of NLTK 1.3 focusing on the statistics (with examples).
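
If "assigning to a slice" is unfamiliar, this is the Python side of that comparison. It is a small sketch of my own with made-up list contents, not the example from Vsevolod's post.

```python
words = ["natural", "language", "processing", "with", "lisp"]

# Reading a slice copies elements 1 and 2 into a new list.
middle = words[1:3]

# Assigning to a slice replaces that stretch of the list in place.
# The replacement may be longer or shorter than the slice it replaces.
words[3:] = ["in", "common", "lisp"]
```

After the assignment `words` has six elements, while `middle` still holds its own copy of the two it was given.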

### NLTK 1.1 – Computing with Language: …

Monday, March 4th, 2013

NLTK 1.1 – Computing with Language: Texts and Words by Vsevolod Dyomkin.

From the post:

OK, let’s get started with the NLTK book. Its first chapter tries to impress the reader with how simple it is to accomplish some neat things with texts using it. Actually, the underlying algorithms that allow to achieve these results are mostly quite basic. We’ll discuss them in this post and the code for the first part of the chapter can be found in nltk/ch1-1.lisp.

A continuation of Natural Language Meta Processing with Lisp.

Who knows? You might decide that Lisp is a natural language.

### Lincoln Logarithms: Finding Meaning in Sermons

Thursday, February 28th, 2013

Lincoln Logarithms: Finding Meaning in Sermons

From the webpage:

Just after his death, Abraham Lincoln was hailed as a luminary, martyr, and divine messenger. We wondered if using digital tools to analyze a digitized collection of elegiac sermons might uncover patterns or new insights about his memorialization.

We explored the power and possibility of four digital tools—MALLET, Voyant, Paper Machines, and Viewshare. MALLET, Paper Machines, and Voyant all examine text. They show how words are arranged in texts, their frequency, and their proximity. Voyant and Paper Machines also allow users to make visualizations of word patterns. Viewshare allows users to create timelines, maps, and charts of bodies of material. In this project, we wanted to experiment with understanding what these tools, which are in part created to reveal, could and could not show us in a small, but rich corpus. What we have produced is an exploration of the possibilities and the constraints of these tools as applied to this collection.

The resulting digital collection: The Martyred President: Sermons Given on the Assassination of President Lincoln.

Let’s say this is not an “ahistorical” view.

Good example of exploring “unstructured” data.

A first step before authoring a topic map.

### Named entity extraction

Thursday, February 28th, 2013

Named entity extraction

From the webpage:

The techniques we discussed in the Cleanup and Reconciliation parts come in very handy when your data is already in a structured format. However, many fields (notoriously description) contain unstructured text, yet they usually convey a high amount of interesting information. To capture this in machine-processable format, named entity recognition can be used.

A Google Refine / OpenRefine extension developed by Multimedia Lab (ELIS, Ghent University / iMinds) and MasTIC (Université Libre de Bruxelles).

Abstract:

Unstructured metadata fields such as ‘description’ offer tremendous value for users to understand cultural heritage objects. However, this type of narrative information is of little direct use within a machine-readable context due to its unstructured nature. This paper explores the possibilities and limitations of Named-Entity Recognition (NER) to mine such unstructured metadata for meaningful concepts. These concepts can be used to leverage otherwise limited searching and browsing operations, but they can also play an important role to foster Digital Humanities research. In order to catalyze experimentation with NER, the paper proposes an evaluation of the performance of three third-party NER APIs through a comprehensive case study, based on the descriptive fields of the Smithsonian Cooper-Hewitt National Design Museum in New York. A manual analysis is performed of the precision, recall, and F-score of the concepts identified by the third party NER APIs. Based on the outcomes of the analysis, the conclusions present the added value of NER services, but also point out to the dangers of uncritically using NER, and by extension Linked Data principles, within the Digital Humanities. All metadata and tools used within the paper are freely available, making it possible for researchers and practitioners to repeat the methodology. By doing so, the paper offers a significant contribution towards understanding the value of NER for the Digital Humanities.
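
For anyone rusty on the precision, recall and F-score the abstract mentions, a minimal sketch follows. The entity sets are invented for illustration and are not data from the paper. The function name is also an assumption, and the F-score shown is the balanced F1.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted entities against a hand-checked gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: what share of the returned entities are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what share of the correct entities were returned.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: entities a reviewer marked vs. what a service returned.
gold = {"Smithsonian", "New York", "Cooper-Hewitt"}
predicted = {"Smithsonian", "New York", "Design", "Museum"}
p, r, f = precision_recall_f1(predicted, gold)
```

Two of the four returned entities are right (precision 0.5) and two of the three marked entities were found (recall about 0.67). A single score hides that difference, which is one reason to read the paper's analysis closely.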

I commend the paper to you for a very close reading, particularly those of you in the humanities.

To conclude, the Digital Humanities need to launch a broader debate on how we can incorporate within our work the probabilistic character of tools such as NER services. Drucker eloquently states that ‘we use tools from disciplines whose epistemological foundations are at odds with, or even hostile to, the humanities. Positivistic, quantitative and reductive, these techniques preclude humanistic methods because of the very assumptions on which they are designed: that objects of knowledge can be understood as ahistorical and autonomous.’

Drucker, J. (2012), Debates in the Digital Humanities, Minnesota Press, chapter Humanistic Theory and Digital Scholarship, pp. 85–95.

…that objects of knowledge can be understood as ahistorical and autonomous.

Certainly possible, but lossy, very lossy, in my view.

You?

### Natural Language Meta Processing with Lisp

Sunday, February 24th, 2013

Natural Language Meta Processing with Lisp by Vsevolod Dyomkin.

From the post:

Recently I’ve started work on gathering and assembling a comprehensive suite of NLP tools for Lisp — CL-NLP. Something along the lines of OpenNLP or NLTK. There’s actually quite a lot of NLP tools in Lisp accumulated over the years, but they are scattered over various libraries, internet sites and books. I’d like to have them in one place with a clean and concise API which would provide easy startup point for anyone willing to do some NLP experiments or real work in Lisp. There’s already a couple of NLP libraries, most notably, langutils, but I don’t find them very well structured and also their development isn’t very active. So, I see real value in creating CL-NLP.

Besides, I’m currently reading the NLTK book. I thought that implementing the examples from the book in Lisp could be likewise a great introduction to NLP and to Lisp as it is an introduction to Python. So I’m going to work through them using CL-NLP toolset. I plan to cover 1 or 2 chapters per month. The goal is to implement pretty much everything meaningful, including the graphs — for them I’m going to use gnuplot driven by cgn of which I’ve learned answering questions on StackOverflow. I’ll try to realize the examples just from the description — not looking at NLTK code — although, I reckon it will be necessary sometimes if the results won’t match. Also in the process I’m going to discuss different stuff re NLP, Lisp, Python, and NLTK — that’s why there’s “meta” in the title.

Just in case you haven’t found a self-improvement project for 2013!

Seriously, this could be a real learning experience.

I first saw this at Christophe Lalanne’s A bag of tweets / February 2013.

### …O’Reilly Book on NLP with Java?

Friday, February 22nd, 2013

Anyone Want to Write an O’Reilly Book on NLP with Java? by Bob Carpenter.

From the post:

Mitzi and I pitched O’Reilly books a revision of the Text Processing in Java book that she’s been finishing off.

The response from their editor was that they’d love to have an NLP book based on Java, but what we provided looked like everything-but-the-NLP you’d need for such a book. Insightful, these editors. That’s exactly how the book came about, when the non-proprietary content was stripped out of the LingPipe Book.

I happen to still think that part of the book is incredibly useful. It covers all of Unicode, ICU for normalization and detection, all of the streaming I/O interfaces, codings in HTML, XML and JSON, as well as in-depth coverage of reg-exes, Lucene, and Solr. All of the stuff that is continually misunderstood and misconfigured so that I have to spend way too much of my time sorting it out. (Mitzi finished the HTML, XML and JSON chapter, and is working on Solr; she tuned Solr extensively on her last consulting gig, by the way, if anyone’s looking for a Lucene/Solr developer).

Read Bob’s post and give him a shout if you are interested.

Would be a good exercise in learning how choices influence the “objective” outcomes.

### PyPLN: a Distributed Platform for Natural Language Processing

Friday, February 8th, 2013

PyPLN: a Distributed Platform for Natural Language Processing by Flávio Codeço Coelho, Renato Rocha Souza, Álvaro Justen, Flávio Amieiro, Heliana Mello.

Abstract:

This paper presents a distributed platform for Natural Language Processing called PyPLN. PyPLN leverages a vast array of NLP and text processing open source tools, managing the distribution of the workload on a variety of configurations: from a single server to a cluster of linux servers. PyPLN is developed using Python 2.7.3 but makes it very easy to incorporate other softwares for specific tasks as long as a linux version is available. PyPLN facilitates analyses both at document and corpus level, simplifying management and publication of corpora and analytical results through an easy to use web interface. In the current (beta) release, it supports English and Portuguese languages with support to other languages planned for future releases. To support the Portuguese language PyPLN uses the PALAVRAS parser (Bick, 2000). Currently PyPLN offers the following features: Text extraction with encoding normalization (to UTF-8), part-of-speech tagging, token frequency, semantic annotation, n-gram extraction, word and sentence repertoire, and full-text search across corpora. The platform is licensed as GPL-v3.

Source code: http://pypln.org.
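
Two of the listed features, token frequency and n-gram extraction, are simple enough to sketch with the Python standard library. This is not PyPLN's API. The `tokenize` and `ngrams` helpers and the sample sentence are assumptions for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "the cat sat on the mat and the cat slept"
tokens = tokenize(text)
token_freq = Counter(tokens)              # token frequency
bigram_freq = Counter(ngrams(tokens, 2))  # frequency of 2-grams
```

`token_freq.most_common(3)` and `bigram_freq.most_common(3)` give the usual "top terms" views. What a platform like PyPLN adds is running this sort of analysis across whole corpora and many machines.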

Have you noticed that tools for analysis are getting easier, not harder to use?

Is there a lesson there for tools to create topic map content?

### History SPOT

Friday, February 8th, 2013

History SPOT

I discovered this site via a post entitled: Text Mining for Historians: Natural Language Processing.

From the webpage:

Welcome to History SPOT. This is a subsite of the IHR [Institute of Historical Research] website dedicated to our online research training provision. On this page you will find the latest updates regarding our seminar podcasts, online training courses and History SPOT blog posts.

Currently offered online training courses (free registration required):

• Designing Databases for Historians
• Podcasting for Historians
• Sources for British History on the Internet
• Data Preservation
• Digital Tools
• InScribe Palaeography

Not to mention over 300 podcasts!

Two thoughts:

First, a good way to learn which digital tools historians use and what they expect of them. That should help you prepare an answer to: “What do topic maps have to offer over X technology?”

Second, I rather like the site and its module orientation. A possible template for topic map training online?

### *SEM 2013 [...Independence to be Semantically Diverse]

Saturday, January 26th, 2013

*SEM 2013 : The 2nd Joint Conference on Lexical and Computational Semantics

Dates:

When: Jun 13, 2013 – Jun 14, 2013
Where: Atlanta GA, USA
Final Version Due: Apr 21, 2013

From the call:

The main goal of *SEM is to provide a stable forum for the growing number of NLP researchers working on different aspects of semantic processing, which has been scattered over a large array of small workshops and conferences.

Topics of interest include, but are not limited to:

• Formal and linguistic semantics
• Cognitive aspects of semantics
• Lexical semantics
• Semantic aspects of morphology and semantic processing of morphologically rich languages
• Semantic processing at the sentence level
• Semantic processing at the discourse level
• Semantic processing of non-propositional aspects of meaning
• Textual entailment
• Multiword expressions
• Multilingual semantic processing
• Social media and linguistic semantics

*SEM 2013 will feature a distinguished panel on Deep Language Understanding.

*SEM 2013 hosts the shared task on Semantic Textual Similarity.

Another workshop to join the array of “…small workshops and conferences.”

Not a bad thing. Communities grow up around conferences and people you will see at one are rarely at others.

Diversity of communities (dare I say semantics?) isn’t a bad thing. It is a reflection of our diversity and we should stop beating ourselves up over it.

Our machines are capable of being uniformly monotonous. But that is because they lack the independence to be diverse on their own.

Why would anyone want to emulate being a machine?

### Natural Language Processing-(NLP) Tools

Sunday, December 30th, 2012

Natural Language Processing-(NLP) Tools

A very good collection of NLP tools, running from the general to taggers and pointers to other NLP resource pages.

### Semantic Assistants Wiki-NLP Integration

Wednesday, December 26th, 2012

Natural Language Processing for MediaWiki: First major release of the Semantic Assistants Wiki-NLP Integration

From the post:

We are happy to announce the first major release of our Semantic Assistants Wiki-NLP integration. This is the first comprehensive open source solution for bringing Natural Language Processing (NLP) to wiki users, in particular for wikis based on the well-known MediaWiki engine and its Semantic MediaWiki (SMW) extension. It can run any NLP pipeline deployed in the General Architecture for Text Engineering (GATE), brokered as web services through the Semantic Assistants server. This allows you to bring novel text mining assistants to wiki users, e.g., for automatically structuring wiki pages, answering questions in natural language, quality assurance, entity detection, summarization, among others. The results of the NLP analysis are written back to the wiki, allowing humans and AI to work collaboratively on wiki content. Additionally, semantic markup understood by the SMW extension can be automatically generated from NLP output, providing semantic search and query functionalities.

Features:

• Light-weight MediaWiki Extension
• NLP Pipeline Independent Architecture
• Flexible Wiki Input Handling
• Flexible NLP Result Handling
• Semantic Markup Generation
• Wiki-independent Architecture

A promising direction for creation of author-curated text!