Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

May 11, 2014

brat rapid annotation tool

Filed under: Annotation,Natural Language Processing,Visualization — Patrick Durusau @ 3:20 pm

brat rapid annotation tool

From the introduction:

brat is a web-based tool for text annotation; that is, for adding notes to existing text documents.

brat is designed in particular for structured annotation, where the notes are not free form text but have a fixed form that can be automatically processed and interpreted by a computer.

The examples page has examples of:

  • Entity mention detection
  • Event extraction
  • Coreference resolution
  • Normalization
  • Chunking
  • Dependency syntax
  • Meta-knowledge
  • Information extraction
  • Bottom-up Metaphor annotation
  • Visualization
  • Information Extraction system evaluation

I haven’t installed the local version but it is on my to-do list.
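
To get a feel for what “fixed form” annotation means in practice, here is a small sketch of reading brat’s standoff format (.ann files that sit next to the annotated .txt); the sample annotation lines are invented for illustration.

# Minimal sketch: parse brat standoff annotations (.ann) into Python dicts.
# The two sample lines are invented; real files accompany a .txt document.
sample_ann = """T1\tPerson 0 15\tPatrick Durusau
T2\tOrganization 24 29\tOASIS"""

def parse_standoff(ann_text):
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):      # only text-bound annotations here
            continue
        ann_id, type_span, surface = line.split("\t")
        ann_type, start, end = type_span.split(" ")
        entities.append({"id": ann_id, "type": ann_type,
                         "start": int(start), "end": int(end),
                         "text": surface})
    return entities

for e in parse_standoff(sample_ann):
    print(e["id"], e["type"], e["text"])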

I first saw this in a tweet by Steven Bird.

April 28, 2014

Parsing English with 500 lines of Python

Filed under: Natural Language Processing,Python — Patrick Durusau @ 4:19 pm

Parsing English with 500 lines of Python by Matthew Honnibal.

From the post:

A syntactic parser describes a sentence’s grammatical structure, to help another application reason about it. Natural languages introduce many unexpected ambiguities, which our world-knowledge immediately filters out. A favourite example:

Definitely a post to savor if you have any interest in natural language processing.
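
Honnibal’s parser is a greedy transition-based dependency parser. The sketch below is not his code; it only shows the transition system (a stack, a buffer, and SHIFT/LEFT/RIGHT actions), with the action sequence hard-coded instead of predicted by a trained model.

# Sketch of an arc-standard-style transition system: SHIFT moves a word from
# the buffer to the stack; LEFT/RIGHT add a dependency arc. A real parser
# (like Honnibal's) chooses actions with a learned classifier; here the
# action sequence is hard-coded for the toy sentence.
def parse(words, actions):
    stack, buffer, arcs = [], list(range(len(words))), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "LEFT":                    # stack[-1] becomes head of stack[-2]
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif act == "RIGHT":                   # stack[-2] becomes head of stack[-1]
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

words = ["They", "ate", "pizza"]
actions = ["SHIFT", "SHIFT", "LEFT", "SHIFT", "RIGHT"]
for head, dep in parse(words, actions):
    print(words[dep], "<-", words[head])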

I first saw this in a tweet by Jim Salmons.

March 10, 2014

A New Entity Salience Task with Millions of Training Examples

A New Entity Salience Task with Millions of Training Examples by Dan Gillick and Jesse Dunietz.

Abstract:

Although many NLP systems are moving toward entity-based processing, most still identify important phrases using classical keyword-based approaches. To bridge this gap, we introduce the task of entity salience: assigning a relevance score to each entity in a document. We demonstrate how a labeled corpus for the task can be automatically generated from a corpus of documents and accompanying abstracts. We then show how a classifier with features derived from a standard NLP pipeline outperforms a strong baseline by 34%. Finally, we outline initial experiments on further improving accuracy by leveraging background knowledge about the relationships between entities.

The article concludes:

We believe entity salience is an important task with many applications. To facilitate further research, our automatically generated salience annotations, along with resolved entity ids, for the subset of the NYT corpus discussed in this paper are available here: https://code.google.com/p/nyt-salience

A classic approach to a CS article: new approach/idea, data + experiments, plus results and code. It doesn’t get any better.

The results won’t be perfect, but the question is: Are they “acceptable results?”

Which presumes a working definition of “acceptable” that you have hammered out with your client.
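
To make the task concrete, here is a toy sketch of the kind of features the paper talks about (mention count, position of first mention); the weights are invented, not learned from their corpus.

# Toy entity-salience sketch: score each entity from simple document-level
# features. The feature weights are invented for illustration, not learned
# as in the paper.
from collections import defaultdict

def salience_scores(mentions, doc_length):
    """mentions: list of (entity_id, token_offset) pairs."""
    count = defaultdict(int)
    first = {}
    for ent, offset in mentions:
        count[ent] += 1
        first.setdefault(ent, offset)
    scores = {}
    for ent in count:
        early = 1.0 - first[ent] / doc_length     # earlier mention -> higher
        scores[ent] = 1.5 * count[ent] + 2.0 * early
    return scores

mentions = [("Google", 0), ("NYT", 12), ("Google", 40), ("Google", 95)]
print(salience_scores(mentions, doc_length=200))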

I first saw this in a tweet by Stefano Bertolo.

March 4, 2014

Part-of-Speech Tagging from 97% to 100%:…

Filed under: Linguistics,Natural Language Processing — Patrick Durusau @ 8:21 pm

Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? by Christopher D. Manning.

Abstract:

I examine what would be necessary to move part-of-speech tagging performance from its current level of about 97.3% token accuracy (56% sentence accuracy) to close to 100% accuracy. I suggest that it must still be possible to greatly increase tagging performance and examine some useful improvements that have recently been made to the Stanford Part-of-Speech Tagger. However, an error analysis of some of the remaining errors suggests that there is limited further mileage to be had either from better machine learning or better features in a discriminative sequence classifier. The prospects for further gains from semi-supervised learning also seem quite limited. Rather, I suggest and begin to demonstrate that the largest opportunity for further progress comes from improving the taxonomic basis of the linguistic resources from which taggers are trained. That is, from improved descriptive linguistics. However, I conclude by suggesting that there are also limits to this process. The status of some words may not be able to be adequately captured by assigning them to one of a small number of categories. While conventions can be used in such cases to improve tagging consistency, they lack a strong linguistic basis.

I was struck by Christopher’s observation:

The status of some words may not be able to be adequately captured by assigning them to one of a small number of categories. While conventions can be used in such cases to improve tagging consistency, they lack a strong linguistic basis.

which comes up again in his final sentence:

But in such cases, we must accept that we are assigning parts of speech by convention for engineering convenience rather than achieving taxonomic truth, and there are still very interesting issues for linguistics to continue to investigate, along the lines of [27].

I suppose the observation stood out for me because on what basis, other than “convenience,” would we assign properties?

When I construct a topic, I assign properties that I hope are useful to others when they view that particular topic. I don’t assign it properties unknown to me. I don’t necessarily assign it all the properties I may know for a given topic.

I may even assign it properties that I know will cause a topic to merge with other topics.
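
For a quick look at what “assigning parts of speech by convention” feels like in practice, here is a hedged NLTK example; the tagger must commit to exactly one Penn Treebank tag per token, gray areas or not.

# NLTK's default tagger assigns exactly one Penn Treebank tag per token,
# whatever the linguistic gray areas. Requires the 'punkt' and
# 'averaged_perceptron_tagger' data packages (nltk.download(...)).
import nltk

sentence = "Reading near the fire, she found the account worth reading."
tokens = nltk.word_tokenize(sentence)
for word, tag in nltk.pos_tag(tokens):
    print(word, tag)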

BTW, footnote [27] refers to:

Aarts, B.: Syntactic gradience: the nature of grammatical indeterminacy. Oxford University Press, Oxford (2007)

Sounds like an interesting work. I did search for “semantic indeterminacy” while at Amazon but it marked out “semantic” and returned results for indeterminacy. 😉

I first saw this in a tweet by the Stanford NLP Group.

February 8, 2014

Arabic Natural Language Processing

Filed under: Language,Natural Language Processing — Patrick Durusau @ 3:14 pm

Arabic Natural Language Processing

From the webpage:

Arabic is the largest member of the Semitic language family and is spoken by nearly 500 million people worldwide. It is one of the six official UN languages. Despite its cultural, religious, and political significance, Arabic has received comparatively little attention by modern computational linguistics. We are remedying this oversight by developing tools and techniques that deliver state-of-the-art performance in a variety of language processing tasks. Machine translation is our most active area of research, but we have also worked on statistical parsing and part-of-speech tagging. This page provides links to our freely available software along with a list of relevant publications.

Software and papers from the Stanford NLP group.

An important capability to add to your toolkit, especially if you are dealing with the U.S. security complex.

I first saw this at: Stanford NLP Group Tackles Arabic Machine Translation.

January 31, 2014

…only the information that they can ‘see’…

Filed under: Natural Language Processing,Semantics,Topic Maps — Patrick Durusau @ 1:31 pm

Jumping NLP Curves: A Review of Natural Language Processing Research by Erik Cambria and Bebo White.

From the post:

Natural language processing (NLP) is a theory-motivated range of computational techniques for the automatic analysis and representation of human language. NLP research has evolved from the era of punch cards and batch processing (in which the analysis of a sentence could take up to 7 minutes) to the era of Google and the likes of it (in which millions of webpages can be processed in less than a second). This review paper draws on recent developments in NLP research to look at the past, present, and future of NLP technology in a new light. Borrowing the paradigm of “jumping curves” from the field of business management and marketing prediction, this survey article reinterprets the evolution of NLP research as the intersection of three overlapping curves – namely Syntactics, Semantics, and Pragmatics Curves – which will eventually lead NLP research to evolve into natural language understanding.

This is not your average review of the literature as the authors point out:

…this review paper focuses on the evolution of NLP research according to three different paradigms, namely: the bag-of-words, bag-of-concepts, and bag-of-narratives models.

But what caught my eye was:

All such capabilities are required to shift from mere NLP to what is usually referred to as natural language understanding (Allen, 1987). Today, most of the existing approaches are still based on the syntactic representation of text, a method which mainly relies on word co-occurrence frequencies. Such algorithms are limited by the fact that they can process only the information that they can “see”. As human text processors, we do not have such limitations as every word we see activates a cascade of semantically related concepts, relevant episodes, and sensory experiences, all of which enable the completion of complex NLP tasks – such as word-sense disambiguation, textual entailment, and semantic role labeling – in a quick and effortless way. (emphasis added)

The phrase “only the information that they can ‘see’” captures the essence of the problem that topic maps address. A program can only see the surface of a text, nothing more.

The next phrase summarizes the promise of topic maps, to capture “…a cascade of semantically related concepts, relevant episodes, and sensory experiences…” related to a particular subject.

Not that any topic map could capture the full extent of information related to a subject, but it can capture information to the extent that is plausible and useful.

I first saw this in a tweet by Marin Dimitrov.

December 5, 2013

TextBlob: Simplified Text Processing

Filed under: Natural Language Processing,Parsing,Text Mining — Patrick Durusau @ 7:31 pm

TextBlob: Simplified Text Processing

From the webpage:

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

….

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

Features

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • JSON serialization
  • Add new models or languages through extensions
  • WordNet integration

Knowing that TextBlob plays well with NLTK is a big plus!
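
A short sketch of the API, assuming TextBlob and its corpora are installed (pip install textblob, then python -m textblob.download_corpora):

# Minimal TextBlob sketch: tags, noun phrases and sentiment on one sentence.
# Assumes the NLTK corpora TextBlob needs have been downloaded.
from textblob import TextBlob

blob = TextBlob("TextBlob stands on the giant shoulders of NLTK and pattern.")
print(blob.tags)          # part-of-speech tags, e.g. [('TextBlob', 'NNP'), ...]
print(blob.noun_phrases)  # WordList of noun phrases
print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)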

December 4, 2013

Free Language Lessons for Computers

Filed under: Data,Language,Natural Language Processing — Patrick Durusau @ 4:58 pm

Free Language Lessons for Computers by Dave Orr.

From the post:

50,000 relations from Wikipedia. 100,000 feature vectors from YouTube videos. 1.8 million historical infoboxes. 40 million entities derived from webpages. 11 billion Freebase entities in 800 million web documents. 350 billion words’ worth from books analyzed for syntax.

These are all datasets that we’ve shared with researchers around the world over the last year from Google Research.

A great summary of the major data drops by Google Research over the past year. In many cases including pointers to additional information on the datasets.

One that I have seen before and that strikes me as particularly relevant to topic maps is:

Dictionaries for linking Text, Entities, and Ideas

What is it: We created a large database of pairs of 175 million strings associated with 7.5 million concepts, annotated with counts, which were mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor text spans that link to the concepts in question.

Where can I find it: http://nlp.stanford.edu/pubs/crosswikis-data.tar.bz2

I want to know more: A description of the data, several examples, and ideas for uses for it can be found in a blog post or in the associated paper.

For most purposes, you would need far less than the full set of 7.5 million concepts. Imagine the relevant concepts for a domain being automatically “tagged” as you composed prose about it.
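
A toy sketch of that idea, with a hand-made string-to-concept dictionary standing in for the Wikipedia anchor-text data:

# Toy concept tagger: look up longest-matching strings in a string->concept
# dictionary. The dictionary here is hand-made; the real data set maps
# millions of anchor-text strings to Wikipedia articles.
concepts = {
    "topic maps": "Topic_Maps",
    "natural language processing": "Natural_language_processing",
    "nlp": "Natural_language_processing",
}

def tag(text, dictionary, max_len=4):
    words = text.lower().split()
    i, found = 0, []
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in dictionary:
                found.append((phrase, dictionary[phrase]))
                i += n
                break
        else:
            i += 1
    return found

print(tag("Topic maps meet natural language processing", concepts))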

Certainly less error-prone than marking concepts by hand!

November 27, 2013

How to spot first stories on Twitter using Storm

Filed under: Natural Language Processing,Storage,Tweets — Patrick Durusau @ 5:37 pm

How to spot first stories on Twitter using Storm by Michael Vogiatzis.

From the post:

As a first blog post, I decided to describe a way to detect first stories (a.k.a. new events) on Twitter as they happen. This work is part of the Thesis I wrote last year for my MSc in Computer Science in the University of Edinburgh. You can find the document here.

Every day, thousands of posts share information about news, events, automatic updates (weather, songs) and personal information. The information published can be retrieved and analyzed in a news detection approach. The immediate spread of events on Twitter combined with the large number of Twitter users proves it suitable for first-story extraction. Towards this direction, this project deals with distributed real-time first story detection (FSD) using Twitter on top of Storm. Specifically, I try to identify the first document in a stream of documents which discusses a specific event. Let’s have a look at the implementation of the methods used.

Other resources of interest:

Slide deck by the same name.

Code on Github.

The slides were interesting and were what prompted me to search for and find the blog and Github materials.

An interesting extension to this technique would be to discover “new” ideas in papers.

Or particular classes of “new” ideas in news streams.
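
Stripped of Storm and the hashing tricks, the core FSD idea fits in a few lines: flag a document as a first story when its nearest neighbour among everything seen so far is not similar enough. A hedged sketch:

# Toy first-story detection: a tweet is a "first story" if its cosine
# similarity to every previously seen tweet falls below a threshold.
# Real systems (like the Storm topology in the post) use hashing tricks
# to avoid comparing against the whole history.
import math
from collections import Counter

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def detect(stream, threshold=0.4):
    seen, first_stories = [], []
    for text in stream:
        vec = Counter(text.lower().split())
        nearest = max((cosine(vec, s) for s in seen), default=0.0)
        if nearest < threshold:
            first_stories.append(text)
        seen.append(vec)
    return first_stories

tweets = ["earthquake hits the city centre",
          "big earthquake in the city centre right now",
          "new phone released today"]
print(detect(tweets))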

November 15, 2013

Shrinking the Haystack with Solr and NLP

Filed under: BigData,Natural Language Processing,Solr — Patrick Durusau @ 8:39 pm

Shrinking the Haystack with Solr and NLP by Wes Caldwell.

A very high-level view of using Solr and NLP to shrink a data haystack, but a useful one nonetheless.

If you think of this in the context of Chuck Hollis’ “modest data,” you begin to realize that the inputs may be “big data” but to be useful to a human analyst, it needs to be pared down to “modest data.”

Or even further to “actionable data.”

There’s an interesting contrast: Big data vs. Actionable data.

Ask your analyst if they prefer five terabytes of raw data or five pages of actionable data?

Adjust your deliverables accordingly.

November 11, 2013

Day 14: Stanford NER…

Day 14: Stanford NER–How To Setup Your Own Name, Entity, and Recognition Server in the Cloud by Shekhar Gulati.

From the post:

I am not a huge fan of machine learning or natural text processing (NLP) but I always have ideas in mind which require them. The idea that I will explore during this post is the ability to build a real-time job search engine using Twitter data. Tweets will contain the name of the company which is offering a job, the location of the job, and the name of the contact person at the company. This requires us to parse the tweet for Person, Location, and Organisation. This type of problem falls under Named Entity Recognition.

A continuation of Shekhar’s Learning 30 Technologies in 30 Days… but one that merits a special shout out.

In part because you can consume the entities that others “recognize,” or you can be in control of the recognition process.

It isn’t easy but on the other hand, it isn’t free from hidden choices and selection biases.

I would prefer those were my hidden choices and selection biases, if you don’t mind. 😉
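
If you want to experiment before standing up the Stanford server, NLTK’s built-in chunker gives a rough version of the same Person/Location/Organisation extraction. A sketch, not Shekhar’s setup:

# Rough named-entity pass over a job-style tweet with NLTK's built-in
# chunker (requires the 'punkt', 'averaged_perceptron_tagger',
# 'maxent_ne_chunker' and 'words' data packages). Not the Stanford NER
# server the post sets up, but the same kind of output.
import nltk

tweet = "Acme Corp is hiring Python developers in Berlin, contact Jane Smith"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(tweet)))
for subtree in tree:
    if hasattr(subtree, "label"):
        entity = " ".join(tok for tok, tag in subtree.leaves())
        print(subtree.label(), "->", entity)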

October 26, 2013

Center for Language and Speech Processing Archives

Filed under: Linguistics,Natural Language Processing — Patrick Durusau @ 4:52 pm

Center for Language and Speech Processing Archives

Archived seminars from the Center for Language and Speech Processing (CLSP) at Johns Hopkins University.

I mentioned recently that Chris Callison-Burch is digitizing these videos and posting them to Vimeo. (Say Good-Bye to iTunes: > 400 NLP Videos)

Unfortunately, Vimeo offers primitive sorting (by upload date), etc.

Works if you are a Kim Kardashian fan. One tweet, photo or video is as meaningful (sic) as another.

Works less well if you are looking for specific and useful content.

CLSP offers searching “by speaker, year, or keyword from title, abstract, bio.”

Enjoy!

October 20, 2013

Say Good-Bye to iTunes: > 400 NLP Videos

Filed under: Linguistics,Natural Language Processing — Patrick Durusau @ 2:59 pm

Chris Callison-Burch’s Videos

Chris tweeted today that he is less than twenty-five videos away from digitizing the entire CLSP video archive.

Currently there are four hundred and twenty-five NLP videos at Chris’ Vimeo page.

Way to go Chris!

Spread the word about this remarkable resource!

October 12, 2013

Sixth International Joint Conference on Natural Language Processing (Papers)

Filed under: Natural Language Processing — Patrick Durusau @ 7:32 pm

Proceedings of the Sixth International Joint Conference on Natural Language Processing

Not counting the system demonstration papers, one hundred and ninety-seven (197) papers on natural language processing!

Any attempt to summarize the collection would be unfair.

I will be going over the proceedings for papers that look particularly useful for topic maps.

I would appreciate your suggesting your favorites or even better yet, writing/blogging about your favorites and sending me a link.

Happy reading!

September 30, 2013

Classifying Non-Patent Literature…

Filed under: Classification,Natural Language Processing,Patents,Solr — Patrick Durusau @ 6:29 pm

Classifying Non-Patent Literature To Aid In Prior Art Searches by John Berryman.

From the post:

Before a patent can be granted, it must be proven beyond a reasonable doubt that the innovation outlined by the patent application is indeed novel. Similarly, when defending one’s own intellectual property against a non-practicing entity (NPE – also known as a patent troll), one often attempts to prove that the patent held by the accuser is invalid by showing that relevant prior art already exists and that their patent is actually not that novel.

Finding Prior Art

So where does one get ahold of pertinent prior art? The most obvious place to look is in the text of earlier patents grants. If you can identify a set of reasonably related grants that covers the claims of the patent in question, then the patent may not be valid. In fact, if you are considering the validity of a patent application, then reviewing existing patents is certainly the first approach you should take. However, if you’re using this route to identify prior art for a patent held by an NPE, then you may be fighting an uphill battle. Consider that a very bright patent examiner has already taken this approach, and after an in-depth examination process, having found no relevant prior art, the patent office granted the very patent that you seek to invalidate.

But there is hope. For a patent to be granted, it must not only be novel among the roughly 10 million US patents that currently exist, but it must also be novel among all published media prior to the application date – so-called non-patent literature (NPL). This includes conference proceedings, academic articles, weblogs, or even YouTube videos. And if anyone – including the applicant themselves – publicly discloses information critical to their patent’s claims, then the patent may be rendered invalid. As a corollary, if you are looking to invalidate a patent, then looking for prior art in non-patent literature is a good idea! While tools are available to systematically search through patent grants, it is much more difficult to search through NPL. And if the patent in question truly is not novel, then evidence must surely exist – if only you knew where to look.

More suggestions than solutions but good suggestions, such as these, are hard to come by.

John suggests using existing patents and their classifications as a learning set to classify non-patent literature.

Interesting but patent language is highly stylized and quite unlike the descriptions you encounter in non-patent literature.

It would be an interesting experiment to take some subset of patents and their classifications along with a set of non-patent literature, known to describe the same “inventions” covered by the patents.

Suggestions for subject areas?
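
In the meantime, here is a hedged sketch of John’s suggestion with invented toy documents: train on patent text labelled with classification codes, then predict codes for non-patent literature.

# Sketch of the suggested experiment: learn patent classifications from
# patent text, then apply the classifier to non-patent literature.
# The documents and class codes below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

patent_texts = [
    "a method and apparatus for wireless data transmission",
    "a pharmaceutical composition for treating inflammation",
]
patent_classes = ["H04W", "A61K"]   # placeholder classification codes

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(patent_texts, patent_classes)

npl = ["We present a new protocol for transmitting data over wireless links."]
print(model.predict(npl))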

August 16, 2013

Finding Parties Named in U.S. Law…

Filed under: Law,Natural Language Processing,NLTK,Python — Patrick Durusau @ 4:59 pm

Finding Parties Named in U.S. Law using Python and NLTK by Gary Sieling.

From the post:

U.S. Law periodically names specific institutions; historically it is possible for Congress to write a law naming an individual, although I think that has become less common. I expect the most common entities named in Federal Law to be groups like Congress. It turns out this is true, but the other most common entities are the law itself and bureaucratic functions like archivists.

To get at this information, we need to read the Code XML, and use a natural language processing library to get at the named groups.

NLTK is such an NLP library. It provides interesting features like sentence parsing, part-of-speech tagging, and named entity recognition. (If interested in the subject, see my review of “Natural Language Processing with Python”, a book which covers this library in detail.)

I would rather know who paid for particular laws but that requires information external to the Code XML data set. 😉

A very good exercise to become familiar with both NLTK and the Code XML data set.
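
A hedged sketch of the approach; the XML element names below are invented, so expect the real Code XML schema to differ.

# Sketch: extract text from a (simplified, invented) slice of Code XML and
# count ORGANIZATION entities with NLTK. Needs the usual NLTK data packages
# (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words).
import xml.etree.ElementTree as ET
from collections import Counter
import nltk

xml_snippet = """<section>
  <text>The Librarian of Congress shall report to the Committee on the Judiciary.</text>
</section>"""

root = ET.fromstring(xml_snippet)
counts = Counter()
for node in root.iter("text"):
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(node.text)))
    for subtree in tree:
        if hasattr(subtree, "label") and subtree.label() == "ORGANIZATION":
            counts[" ".join(t for t, _ in subtree.leaves())] += 1

print(counts.most_common())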

August 3, 2013

in the dark heart of a language model lies madness….

Filed under: Corpora,Language,Natural Language Processing — Patrick Durusau @ 4:32 pm

in the dark heart of a language model lies madness…. by Chris.

From the post:

This is the second in a series of posts detailing experiments with the Java Graphical Authorship Attribution Program. The first post is here.

screenshot

In my first run (seen above), I asked JGAAP to normalize for white space, strip punctuation, turn everything into lowercase. Then I had it run a Naive Bayes classifier on the top 50 tri-grams from the three known authors (Shakespeare, Marlowe, Bacon) and one unknown author (Shakespeare’s sonnets).

Based on that sample, JGAAP came to the conclusion that Francis Bacon wrote the sonnets. We know that because it lists its guesses in order from best to worst in the left window in the above image. Bacon is on top. This alone is cause to start tinkering with the model, but the results didn’t look flat-out weird until I looked at the image again today. It lists the probability that the sonnets were written by Bacon as 1. A probability of 1 typically means absolute certainty. So this model, given the top 50 trigrams, is absolutely certain that Francis Bacon wrote those sonnets … Bullshit. A probabilistic model is never absolutely certain of anything. That’s what makes it probabilistic, right?

So where’s the bug? Turns out, it might have been poor data management on my part. I didn’t bother to sample in any kind of fair and reasonable way. Here are my corpora:

(…)

You may not be a stakeholder in the Shakespeare vs. Bacon debate, but you are likely to encounter questions about the authorship of data. Particularly text data.

The tool that Chris describes is a great introduction to that type of analysis.
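
Here is a hedged sketch of the same kind of model Chris ran in JGAAP (character trigrams plus Naive Bayes), with tiny invented training snippets. With data this small and unbalanced, the “probabilities” are close to worthless, which is part of Chris’s point.

# Character-trigram Naive Bayes authorship sketch (the JGAAP-style setup the
# post describes). The training snippets are tiny placeholders; real corpora
# and careful sampling are what the post says actually matter.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

known = [
    "shall i compare thee to a summers day",              # "Shakespeare"
    "come live with me and be my love",                   # "Marlowe"
    "knowledge is power and reading maketh a full man",   # "Bacon"
]
authors = ["Shakespeare", "Marlowe", "Bacon"]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 3)),
    MultinomialNB(),
)
model.fit(known, authors)

unknown = ["when in disgrace with fortune and mens eyes"]
print(model.predict(unknown))
print(model.predict_proba(unknown))   # don't trust these on data this small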

August 2, 2013

Interestingly: the sentence adverbs of PubMed Central

Filed under: Natural Language Processing — Patrick Durusau @ 4:20 pm

Interestingly: the sentence adverbs of PubMed Central by Neil Saunders.

From the post:

Scientific writing – by which I mean journal articles – is a strange business, full of arcane rules and conventions with origins that no-one remembers but to which everyone adheres.

I’ve always been amused by one particular convention: the sentence adverb. Used with a comma to make a point at the start of a sentence, as in these examples:

Surprisingly, we find that the execution of karyokinesis and cytokinesis is timely…
Grossly, the tumor is well circumscribed with fibrous capsule…
Correspondingly, the short-term Smad7 gene expression is graded…

The example that always makes me smile is interestingly. “This is interesting. You may not have realised that. So I said interestingly, just to make it clear.”

With that in mind, let’s go looking for sentence adverbs in article abstracts.

Great example of parsing PubMed abstracts (~47 GB uncompressed) for adverbs with Ruby and analyzing the results with R.
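
The underlying pattern is simple enough to sketch in a few lines of Python (Neil uses Ruby and R); the sample abstracts are invented.

# Count sentence adverbs: an "-ly" word at the start of a sentence,
# followed by a comma. Sample abstracts are invented; Neil runs this over
# ~47 GB of PubMed Central text with Ruby and analyses the counts in R.
import re
from collections import Counter

abstracts = [
    "Interestingly, the effect disappeared after treatment. "
    "Surprisingly, the control group improved.",
    "Grossly, the tumor is well circumscribed.",
]

pattern = re.compile(r"(?:^|(?<=[.!?])\s+)([A-Z][a-z]+ly),")
counts = Counter()
for abstract in abstracts:
    counts.update(m.lower() for m in pattern.findall(abstract))

print(counts.most_common())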

I have something similar coming up this weekend. Searching a standards corpus for keyword usage consistency.

I think I know the answer but having a file with every instance will be a lot more convincing. 😉

What tools do you use to explore texts?

July 23, 2013

FreeLing 3.0 – Demo

Filed under: Linguistics,Natural Language Processing — Patrick Durusau @ 2:29 pm

FreeLing 3.0 – Demo

I have posted about FreeLing before but this web-based demo merits separate mention.

If you are not familiar with natural language processing (NLP), visit the FreeLing 3.0 demo and type in some sample text.

Not suitable for making judgements on proposed NLP solutions but it will give you a rough idea of what is or is not possible.

July 22, 2013

PPDB: The Paraphrase Database

Filed under: Computational Linguistics,Linguistics,Natural Language Processing — Patrick Durusau @ 2:46 pm

PPDB: The Paraphrase Database by Juri Ganitkevitch, Benjamin Van Durme and Chris Callison-Burch.

Abstract:

We present the 1.0 release of our paraphrase database, PPDB. Its English portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations. The paraphrases are extracted from bilingual parallel corpora totaling over 100 million sentence pairs and over 2 billion English words. We also release PPDB:Spa, a collection of 196 million Spanish paraphrases. Each paraphrase pair in PPDB contains a set of associated scores, including paraphrase probabilities derived from the bitext data and a variety of monolingual distributional similarity scores computed from the Google n-grams and the Annotated Gigaword corpus. Our release includes pruning tools that allow users to determine their own precision/recall tradeoff.

A resource that should improve your subject identification!

PPDB data sets range from 424 MB (6.8M rules) to 5.7 GB (86.4M rules). Download the PPDB data sets.
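
A hedged loader sketch; I am assuming a “ ||| ”-separated layout (label, phrase, paraphrase, features, alignment), so check the downloaded files for the actual field order.

# Sketch of reading PPDB-style rules, assuming fields separated by ' ||| '
# in the order: label, phrase, paraphrase, features, alignment. Verify the
# field layout against the files you actually download.
sample_lines = [
    "[NP] ||| the planet earth ||| the world ||| p(e|f)=0.3 ||| 0-0 1-1 2-1",
    "[VP] ||| is able to ||| can ||| p(e|f)=0.5 ||| 0-0 1-0 2-0",
]

def load_paraphrases(lines):
    table = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 3:
            continue
        _, phrase, paraphrase = fields[:3]
        table.setdefault(phrase, []).append(paraphrase)
    return table

table = load_paraphrases(sample_lines)
print(table.get("is able to"))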

NAACL 2013 – Videos!

NAACL 2013

Videos of the presentations at the 2013 Conference of the North American Chapter of the Association for Computational Linguistics.

Along with the papers, you should not lack for something to do over the summer!

Named Entities in Law & Order Episodes

Filed under: Entity Extraction,Named Entity Mining,Natural Language Processing — Patrick Durusau @ 2:19 pm

Named Entities in Law & Order Episodes by Yhat.

A worked example of using natural language processing on a corpus of viewer summaries of episodes of Law & Order and Law & Order: Special Victims Unit.

The data is here.

Makes me wonder if there is an archive of the soap operas that have been on for decades?

They survived because they have supporting audiences. I suspect a resource about them would survive as well.

July 15, 2013

Natural Language Processing (NLP) Demos

Filed under: Natural Language Processing — Patrick Durusau @ 3:53 pm

Natural Language Processing (NLP) Demos (Cognitive Computation Group, University of Illinois at Urbana-Champaign)

From the webpage:

Most of the information available today is in free form text. Current technologies (google, yahoo) allow us to access text only via key-word search.

We would like to facilitate content-based access to information. Examples include:

  • Topical and Functional categorization of documents: Find documents that deal with stem cell research, but only Call for Proposals.
  • Semantic categorization: Find documents about Columbus (the City, not the Person).
  • Retrieval of concepts and entities rather than strings in text: Find documents about JFK, the president; include those documents that mention him as “John F. Kennedy,” “John Kennedy,” “Congressman Kennedy,” or any other possible writing; but not those that mention the baseball player John Kennedy, nor any of JFK’s relatives.
  • Extraction of information based on semantic categorization: Find a list of all companies that participated in mergers in the last year. List all professors in Illinois that do research in Machine Learning.

I count twenty (20) separate demos.

Gives you a good sense of the current state of NLP.

I first saw this at: Demos of NLP by Ryan Swanstrom.

July 9, 2013

…Recursive Neural Networks

Filed under: Natural Language Processing,Neural Networks — Patrick Durusau @ 1:37 pm

Parsing Natural Scenes and Natural Language with Recursive Neural Networks by Richard Socher; Cliff Chiung-Yu Lin; Andrew Ng; and Chris Manning.

Description:

Recursive structure is commonly found in the inputs of different modalities such as natural scene images or natural language sentences. Discovering this recursive structure helps us to not only identify the units that an image or sentence contains but also how they interact to form a whole. We introduce a max-margin structure prediction architecture based on recursive neural networks that can successfully recover such structure both in complex scene images as well as sentences. The same algorithm can be used both to provide a competitive syntactic parser for natural language sentences from the Penn Treebank and to outperform alternative approaches for semantic scene segmentation, annotation and classification. For segmentation and annotation our algorithm obtains a new level of state-of-the-art performance on the Stanford background dataset (78.1%). The features from the image parse tree outperform Gist descriptors for scene classification by 4%.

Video of Richard Socher’s presentation at ICML 2011.

PDF of the paper: http://nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf

According to one popular search engine the paper has 51 citations (as of today).

What caught my attention was the mapping of phrases into vector spaces which resulted in the ability to calculate nearest neighbors on phrases.

Both for syntactic and semantic similarity.

If you need more than a Boolean test for similarity (Yes/No), then you are likely to be interested in this work.
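
A minimal numpy sketch of the piece that caught my eye: composing two child vectors into a parent vector and comparing phrases by cosine similarity. The weights are random here, not trained with the paper’s max-margin objective.

# Sketch of recursive composition: a parent phrase vector is a nonlinear
# function of its two children, p = tanh(W [c1; c2] + b). Weights are
# random here, not trained with the paper's max-margin objective.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
W = rng.standard_normal((dim, 2 * dim)) * 0.1
b = np.zeros(dim)

def compose(c1, c2):
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pretend word vectors; in the paper these come from a learned embedding.
words = {w: rng.standard_normal(dim) for w in ["the", "cat", "a", "dog"]}

the_cat = compose(words["the"], words["cat"])
a_dog = compose(words["a"], words["dog"])
print("similarity('the cat', 'a dog') =", cosine(the_cat, a_dog))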

Later work by Socher at his homepage.

June 14, 2013

NAACL ATL 2013

2013 Conference of the North American Chapter of the Association for Computational Linguistics

The NAACL conference wraps up tomorrow in Atlanta but in case you are running low on summer reading materials:

Proceedings for the 2013 NAACL and *SEM conferences. Not quite 180MB but close.

Scanning the accepted papers will give you an inkling of what awaits.

Enjoy!

June 12, 2013

How does name analysis work?

Filed under: Names,Natural Language Processing — Patrick Durusau @ 2:51 pm

How does name analysis work? by Pete Warden.

From the post:

Over the last few months, I’ve been doing a lot more work with name analysis, and I’ve made some of the tools I use available as open-source software. Name analysis takes a list of names, and outputs guesses for the gender, age, and ethnicity of each person. This makes it incredibly useful for answering questions about the demographics of people in public data sets. Fundamentally though, the outputs are still guesses, and end-users need to understand how reliable the results are, so I want to talk about the strengths and weaknesses of this approach.

The short answer is that it can never work any better than a human looking at somebody else’s name and guessing their age, gender, and race. If you saw Mildred Hermann on a list of names, I bet you’d picture an older white woman, whereas Juan Hernandez brings to mind an Hispanic man, with no obvious age. It should be obvious that this is not always reliable for individuals (I bet there are some young Mildreds out there) but as the sample size grows, the errors tend to cancel each other out.

The algorithms themselves work by looking at data that’s been released by the US Census and the Social Security agency. These data sets list the popularity of 90,000 first names by gender and year of birth, and 150,000 family names by ethnicity. I then use these frequencies as the basis for all of the estimates. Crucially, all the guesses depend on how strong a correlation there is between a particular name and a person’s characteristics, which varies for each property. I’ll give some estimates of how strong these relationships are below, and I link to some papers with more rigorous quantitative evaluations below.

Not 100% as Pete points out but an interesting starting point. Plus links to more formal analysis.
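
The frequency-lookup idea is easy to sketch; the tiny table below is invented, standing in for the Census/SSA counts Pete describes.

# Toy name-analysis sketch: guess gender from first-name frequency counts.
# The counts are invented stand-ins for the SSA/Census data Pete uses;
# the output is a guess with a confidence, never a certainty.
name_counts = {              # first name -> (female count, male count)
    "mildred": (95_000, 1_000),
    "juan": (800, 120_000),
    "taylor": (60_000, 55_000),
}

def guess_gender(first_name):
    counts = name_counts.get(first_name.lower())
    if counts is None:
        return ("unknown", 0.0)
    female, male = counts
    total = female + male
    return ("female", female / total) if female >= male else ("male", male / total)

for name in ["Mildred", "Juan", "Taylor", "Zorblax"]:
    print(name, guess_gender(name))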

June 9, 2013

NLTK 2.1 – Working with Text Corpora

Filed under: Natural Language Processing,NLTK,Text Corpus — Patrick Durusau @ 5:46 pm

NLTK 2.1 – Working with Text Corpora by Vsevolod Dyomkin.

From the post:

Let’s return to start of chapter 2 and explore the tools needed to easily and efficiently work with various linguistic resources.

What are the most used and useful corpora? This is a difficult question to answer because different problems will likely require specific annotations and often a specific corpus. There are even special conferences dedicated to corpus linguistics.

Here’s a list of the most well-known general-purpose corpora:

  • Brown Corpus – one of the first big corpora and the only one in the list really easily accessible – we’ve already worked with it in the first chapter
  • Penn Treebank – Treebank is a corpus of sentences annotated with their constituency parse trees so that they can be used to train and evaluate parsers
  • Reuters Corpus (not to be confused with the ApteMod version provided with NLTK)
  • British National Corpus (BNC) – a really huge corpus, but, unfortunately, not freely available

Another very useful resource which isn’t structured specifically as the academic corpora mentioned above, but at the same time has other dimensions of useful connections and annotations, is Wikipedia. And there’s been a lot of interesting linguistic research performed with it.

Besides there are two additional valuable language resources that can’t be classified as text corpora at all, but rather as language databases: WordNet and Wiktionary. We have already discussed CL-NLP interface to Wordnet. And we’ll touch working with Wiktionary in this part.

Vsevolod continues to recast the NLTK into Lisp.

Learning corpus processing along with Lisp. How can you lose?
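
For comparison, the plain-Python NLTK side of the same exercise is short (after nltk.download('brown')):

# Quick look at the Brown corpus through NLTK's corpus readers.
# Requires the corpus data: nltk.download('brown').
import nltk
from nltk.corpus import brown

print(brown.categories()[:5])              # e.g. ['adventure', 'belles_lettres', ...]
news_words = brown.words(categories="news")
print(len(news_words), news_words[:8])

fd = nltk.FreqDist(w.lower() for w in news_words)
print(fd.most_common(5))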

June 5, 2013

Entity recognition with Scala and…

Filed under: Entity Resolution,Natural Language Processing,Scala,Stanford NLP — Patrick Durusau @ 4:05 pm

Entity recognition with Scala and Stanford NLP Named Entity Recognizer by Gary Sieling.

From the post:

The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it’s fairly good at finding nouns, but not always at identifying the type of each noun.

In this example, the entities I’d like to see are different – companies, law firms, lawyers, etc, but this test is good enough. The default examples provided let you choose different sets of things that can be recognized: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. The process of extracting PDF data and processing it takes about five seconds.

For this text, selecting different options sometimes led to the classifier picking different options for a noun – one time it’s a person, another time it’s an organization, etc. One improvement might be to run several classifiers and to allow them to vote. This classifier also loses words sometimes – if a subject is listed with a first, middle, and last name, it sometimes picks just two words. I’ve noticed similar issues with company names.

(…)

The voting on entity recognition made me curious about interactive entity resolution where a user has a voice.
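
Before that, a toy sketch of the voting idea itself; the classifier outputs are invented.

# Majority voting over per-token entity labels from several classifiers.
# The predictions below are invented; in practice they would come from
# NER models run with different option sets, as in the post.
from collections import Counter

predictions = {
    "three_class":  ["ORGANIZATION", "PERSON", "LOCATION"],
    "four_class":   ["ORGANIZATION", "PERSON", "PERSON"],
    "seven_class":  ["PERSON", "PERSON", "ORGANIZATION"],
}
tokens = ["Acme", "Smith", "Boston"]

for i, token in enumerate(tokens):
    votes = Counter(labels[i] for labels in predictions.values())
    label, count = votes.most_common(1)[0]
    decision = label if count > len(predictions) / 2 else "UNDECIDED (ask a human)"
    print(token, "->", decision)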

See the next post.

June 4, 2013

NLP Weather: High Pressure or Low?

Filed under: Natural Language Processing,Translation — Patrick Durusau @ 12:43 pm

Machine Translation Without the Translation by Geoffrey Pullum.

From the post:

I have been ruminating this month on why natural language processing (NLP) still hasn’t arrived, and I have pointed to three developments elsewhere that seem to be discouraging its development. First, enhanced keyword search via Google’s influentiality-ranking of results. Second, the dramatic enhancement in applicability of speech recognition that dialog design facilitates. I now turn to a third, which has to do with the sheer power of number-crunching.

Machine translation is the unclimbed Everest of computational linguistics. It calls for syntactic and semantic analysis of the source language, mapping source-language meanings to target-language meanings, and generating acceptable output from the latter. If computational linguists could do all those things, they could hang up the “mission accomplished” banner.

What has emerged instead, courtesy of Google Translate, is something utterly different: pseudotranslation without analysis of grammar or meaning, developed by people who do not know (or need to know) either source or target language.

The trick: huge quantities of parallel texts combined with massive amounts of rapid statistical computation. The catch: low quality, and output inevitably peppered with howlers.

Of course, if I may purloin Dr Johnson’s remark about a dog walking on his hind legs, although it is not done well you are surprised to find it done at all. For Google Translate’s pseudotranslation is based on zero linguistic understanding. Not even word meanings are looked up: The program couldn’t care less about the meaning of anything. Here, roughly, is how it works.

(…)

My conjecture is that it is useful enough to constitute one more reason for not investing much in trying to get real NLP industrially developed and deployed.

NLP will come, I think; but when you take into account the ready availability of (1) Google search, and (2) speech-driven applications aided by dialog design, and (3) the statistical pseudotranslation briefly discussed above, the cumulative effect is enough to reduce the pressure to develop NLP, and will probably delay its arrival for another decade or so.

Surprised to find that Geoffrey thinks more pressure will result in “real NLP,” albeit delayed by a decade or so for the reasons outlined in his post.

If you recall, machine translation of texts was the hot topic at the end of the 1950’s and early 1960’s.

With an emphasis on automatic translation of Russian. Height of the cold war so there was lots of pressure for a solution.

Lots of pressure then did not result in a solution.

There’s a rather practical reason for not investing in “real NLP.”

There is no evidence that how humans “understand” language is known well enough to program a computer to mimic that “understanding.”

If Geoffrey has evidence to the contrary, I am sure everyone would be glad to hear about it.

Speech Recognition vs. Language Processing [noise-burst classification]

Filed under: Linguistics,Natural Language Processing — Patrick Durusau @ 12:06 pm

Speech Recognition vs. Language Processing by Geoffrey Pullum.

From the post:

I have stressed that we are still waiting for natural language processing (NLP). One thing that might lead you to believe otherwise is that some companies run systems that enable you to hold a conversation with a machine. But that doesn’t involve NLP, i.e. syntactic and semantic analysis of sentences. It involves automatic speech recognition (ASR), which is very different.

ASR systems deal with words and phrases rather as the song “Rawhide” recommends for cattle: “Don’t try to understand ’em; just rope and throw and brand ’em.”

Labeling noise bursts is the goal, not linguistically based understanding.

(…)

Prompting a bank customer with “Do you want to pay a bill or transfer funds between accounts?” considerably improves the chances of getting something with either “pay a bill” or “transfer funds” in it; and they sound very different.

In the latter case, no use is made by the system of the verb + object structure of the two phrases. Only the fact that the customer appears to have uttered one of them rather than the other is significant. What’s relevant about pay is not that it means “pay” but that it doesn’t sound like tran-. As I said, this isn’t about language processing; it’s about noise-burst classification.

I can see why the NLP engineers dislike Pullum so intensely.

Characterizing “speech recognition” as “noise-burst classification,” while entirely accurate, is also offensive.

😉

“Speech recognition” fools a layperson into thinking NLP is more sophisticated than it is in fact.

The question for NLP engineers is: Why the pretense at sophistication?
