Named Entity Tutorial (LingPipe)
While looking for something else I ran across this named entity tutorial at LingPipe.
Other named entity tutorials that I should collect?
From the post:
Recently, I was tasked with evaluating LingPipe for use in our NLP processing pipeline. I have looked at LingPipe before, but have generally kept away from it because of its licensing – while it is quite friendly to individual developers such as myself (as long as I share the results of my work, I can use LingPipe without any royalties), a lot of the stuff I do is motivated by problems at work, and LingPipe based solutions are only practical when the company is open to the licensing costs involved.
So anyway, in an attempt to kill two birds with one stone, I decided to work with the LingPipe tutorial, but with Scala. I figured that would allow me to pick up the LingPipe API as well as give me some additional experience in Scala coding. I looked around to see if anybody had done something similar and I came upon the scalingpipe project on GitHub where Alexy Khrabrov had started with porting the Interesting Phrases tutorial example.
Now there’s a clever idea!
It builds a working understanding of the LingPipe API and Scala experience at the same time.
Not to mention producing results that are useful to other users.
Bob Carpenter writes:
I have a question about using the chunking evaluation class for inter-annotator agreement: how can you use it when the annotators might have missing chunks, i.e., if one of the files contains more chunks than the other?
The answer’s not immediately obvious because the usual application of interannotator agreement statistics is to classification tasks (including things like part-of-speech tagging) that have a fixed number of items being annotated.
An issue that is likely to come up in crowdsourced analysis/annotation of text as well.
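One common way around the "different number of chunks" problem is to treat one annotator's chunks as the reference and the other's as the response, then report precision, recall, and F-measure over exact span matches. Below is a minimal Python sketch under that assumption. The function name and the (start, end, type) tuple layout are hypothetical and are not LingPipe's API.

```python
def chunk_agreement(reference_chunks, response_chunks):
    """Score two annotators' chunks by exact (start, end, type) match.

    Returns (precision, recall, f1), treating the first argument as the
    reference and the second as the response.
    """
    reference, response = set(reference_chunks), set(response_chunks)
    matched = len(reference & response)
    precision = matched / len(response) if response else 0.0
    recall = matched / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Hypothetical annotations: annotator A found three chunks, annotator B two.
annotator_a = [(0, 5, "PER"), (10, 18, "ORG"), (25, 31, "LOC")]
annotator_b = [(0, 5, "PER"), (10, 18, "LOC")]

precision, recall, f1 = chunk_agreement(annotator_a, annotator_b)
swapped = chunk_agreement(annotator_b, annotator_a)
```

Swapping the two annotators exchanges precision and recall but leaves F-measure unchanged, so F-measure can serve as a symmetric agreement score even when neither annotator is the gold standard.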
Cross Validation vs. Inter-Annotator Agreement by Bob Carpenter.
From the post:
Time, Negation, and Clinical Events
Mitzi’s been annotating clinical notes for time expressions, negations, and a couple other classes of clinically relevant phrases like diagnoses and treatments (I just can’t remember exactly which!). This is part of the project she’s working on with Noemie Elhadad, a professor in the Department of Biomedical Informatics at Columbia.
LingPipe Chunk Annotation GUI
Mitzi’s doing the phrase annotation with a LingPipe tool which can be found in
- LingPipe Sandbox: citationEntities project
She even brought it up to date with the current release of LingPipe and generalized the layout for documents with subsections.
Lessons in the use of LingPipe tools!
If you are annotating texts or anticipate annotating texts, read this post.
Twitter POS Tagging with LingPipe and ARK Tweet Data by Bob Carpenter.
From the post:
We will train and test on anything that’s easy to parse. Up today is a basic English part-of-speech tagging for Twitter developed by Kevin Gimpel et al. (and when I say “et al.”, there are ten co-authors!) in Noah Smith’s group at Carnegie Mellon.
We will train and test on anything that’s easy to parse.
How’s that for a motto!
Social media may be more important than I thought it was several years ago. It may just be the serialization in digital form of all the banter in bars, at block parties, and around the water cooler. If that is true, then governments would be well advised to encourage and assist with access to social media, if only to give themselves an even chance of leaving ahead of the widow maker.
Think of mining Twitter data like the NSA and phone traffic, but you aren’t doing anything illegal.
From the post:
Here at Yieldbot we do a lot of text processing of analytics data. In order to accomplish this in a reasonable amount of time, we use Cascalog, a data processing and querying library for Hadoop, written in Clojure. Since Cascalog is Clojure, you can develop and test queries right inside of the Clojure REPL. This allows you to iteratively develop processing workflows with extreme speed. Because Cascalog queries are just Clojure code, you can access everything Clojure has to offer, without having to implement any domain specific APIs or interfaces for custom processing functions. When combined with Clojure’s awesome Java Interop, you can do quite complex things very simply and succinctly.
Many great Java libraries already exist for text processing, e.g., Lucene, OpenNLP, LingPipe, Stanford NLP. Using Cascalog allows you to take advantage of these existing libraries with very little effort, leading to much shorter development cycles.
By way of example, I will show how easy it is to combine Lucene and Cascalog to do some (simple) text processing. You can find the entire code used in the examples over on Github.
The world of text exploration just gets better all the time!
From the website:
We’ve decided to split what used to be the monolithic LingPipe book in two. As they’re written, we’ll be putting up drafts here.
NLP with LingPipe
You can download the PDF of the LingPipe book here:
Carpenter, Bob and Breck Baldwin. 2011. Natural Language Processing with LingPipe 4. Draft 0.5. June 2011. [Download: lingpipe-book-0.5.pdf]
Text Processing with Java
The PDF of the book on text in Java is here:
Carpenter, Bob, Mitzi Morris, and Breck Baldwin. 2011. Text Processing with Java 6. Draft 0.5. June 2011. [Download: java-text-book-0.5.pdf]
The pages are 7 inches by 10 inches, so if you print, you have the choice of large margins (no scaling) or large print (print fit to page).
Source code is also available.
Bob Carpenter continues his series on domain adaptation:
Last post, I explained how to build hierarchical naive Bayes models for domain adaptation. That post covered the basic problem setup and motivation for hierarchical models.
Hierarchical Logistic Regression
Today, we’ll look at the so-called (in NLP) “discriminative” version of the domain adaptation problem. Specifically, using logistic regression. For simplicity, we’ll stick to the binary case, though this could all be generalized to K-way classifiers.
Logistic regression is more flexible than naive Bayes in allowing other features (aka predictors) to be brought in along with the words themselves. We’ll start with just the words, so the basic setup looks more like naive Bayes.
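To make the "hierarchical" part concrete, here is a rough Python sketch of the quantity such a model would optimize. It assumes each domain gets its own coefficient vector and that every coefficient is drawn from a normal prior with a shared per-feature mean and a single scale. Those modeling details, and all names below (log_joint, beta, mu, sigma, the domain labels), are illustrative assumptions rather than anything taken from the post.

```python
import math


def log_sigmoid(z):
    """log(1 / (1 + exp(-z))), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_joint(data, beta, mu, sigma):
    """Log posterior, up to an additive constant, of a hierarchical
    binary logistic regression.

    data:  list of (domain, features, label) with label in {0, 1}
    beta:  dict mapping each domain to its own list of coefficients
    mu:    list of shared prior means, one per feature (assumed layout)
    sigma: shared prior scale for all coefficients (assumed single value)
    """
    total = 0.0
    # Likelihood: each example is scored with its own domain's coefficients.
    for domain, features, label in data:
        z = sum(b * x for b, x in zip(beta[domain], features))
        total += log_sigmoid(z) if label == 1 else log_sigmoid(-z)
    # Prior: domain coefficients are pulled toward the shared means.
    for coefficients in beta.values():
        for b, m in zip(coefficients, mu):
            total += -0.5 * ((b - m) / sigma) ** 2 - math.log(sigma)
    return total


# Hypothetical two-domain data set with two word-indicator features.
data = [("news", [1.0, 0.0], 1), ("blogs", [0.0, 1.0], 0)]
mu = [0.0, 0.0]
at_prior_mean = log_joint(data, {"news": [0.0, 0.0], "blogs": [0.0, 0.0]}, mu, 1.0)
fit_to_data = log_joint(data, {"news": [1.0, 0.0], "blogs": [0.0, -1.0]}, mu, 1.0)
```

The prior term is what ties the domains together: a domain with little data stays close to the shared means, while a domain with plenty of data can move away from them.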
From the post:
We all know from annotating data that some items are harder to annotate than others. We know from the epidemiology literature that the same holds true for medical tests applied to subjects, e.g., some cancers are easier to find than others.
But how do we model item difficulty? I’ll review how I’ve done this before using an IRT-like regression, then move on to Paul Mineiro’s suggestion for flattening multinomials, then consider a generalization of both these approaches.
For your convenience, links for the “…tutorial for LREC with Massimo Poesio” can be found at: LREC 2010 Tutorial: Modeling Data Annotation.
From the post:
I came across this paper, which, among other things, describes the data collection being used for the 2011 TREC Crowdsourcing Track:
- Tang, Wei and Matthew Lease. 2011. Semi-supervised consensus labeling for crowdsourcing. SIGIR Workshop on Crowdsourcing for Information Retrieval.
But that’s not why we’re here today. I want to talk about their modeling decisions.
Tang and Lease apply a Dawid-and-Skene-style model to crowdsourced binary relevance judgments for highly-ranked system responses from a previous TREC information retrieval evaluation. The workers judge document/query pairs as highly relevant, relevant, or irrelevant (though highly relevant and relevant are collapsed in the paper).
The Dawid and Skene model was relatively unsupervised, imputing all of the categories for items being classified as well as the response distribution for each annotator for each category of input (thus characterizing both bias and accuracy of each annotator).
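For readers who have not met the Dawid and Skene model, the following is a small Python sketch of an EM fit in that style. It estimates a posterior over each item's category together with a confusion matrix per annotator. The function names, the (item, annotator, label) triple layout, the vote-based initialisation, and the additive smoothing constant are my assumptions for illustration. This is not the code used by Tang and Lease or by Carpenter.

```python
import math


def normalize(row):
    """Scale non-negative values to sum to 1 (uniform if they sum to 0)."""
    total = sum(row)
    if total <= 0:
        return [1.0 / len(row)] * len(row)
    return [v / total for v in row]


def dawid_skene(labels, n_classes, n_iter=50, smooth=0.01):
    """EM for a Dawid-and-Skene-style annotation model.

    labels: list of (item, annotator, label) triples using 0-based integer ids.
    Returns (posteriors, prevalence, confusion) where confusion[j][k][y] is
    the estimated probability that annotator j answers y when the truth is k.
    The smoothing constant is an assumption added to keep logs finite.
    """
    n_items = 1 + max(i for i, _, _ in labels)
    n_annotators = 1 + max(j for _, j, _ in labels)

    # Initialise item posteriors from per-item vote proportions.
    post = [[0.0] * n_classes for _ in range(n_items)]
    for i, _, y in labels:
        post[i][y] += 1.0
    post = [normalize(row) for row in post]

    for _ in range(n_iter):
        # M-step: category prevalence and per-annotator response distributions.
        prevalence = normalize(
            [sum(p[k] for p in post) + smooth for k in range(n_classes)])
        counts = [[[smooth] * n_classes for _ in range(n_classes)]
                  for _ in range(n_annotators)]
        for i, j, y in labels:
            for k in range(n_classes):
                counts[j][k][y] += post[i][k]
        confusion = [[normalize(row) for row in matrix] for matrix in counts]

        # E-step: recompute each item's category posterior in log space.
        log_post = [[math.log(prevalence[k]) for k in range(n_classes)]
                    for _ in range(n_items)]
        for i, j, y in labels:
            for k in range(n_classes):
                log_post[i][k] += math.log(confusion[j][k][y])
        post = [normalize([math.exp(v - max(row)) for v in row])
                for row in log_post]

    return post, prevalence, confusion


# Hypothetical data: annotators 0 and 1 agree with each other; annotator 2 is noisy.
votes = {0: [1, 1, 0, 0, 1], 1: [1, 1, 0, 0, 1], 2: [1, 0, 0, 1, 0]}
labels = [(item, annotator, y)
          for annotator, ys in votes.items() for item, y in enumerate(ys)]
post, prevalence, confusion = dawid_skene(labels, n_classes=2)
best = [max(range(2), key=lambda k: row[k]) for row in post]
```

The per-annotator confusion matrices are what let the model separate bias from accuracy: an annotator who systematically answers one category shows up as a skewed row rather than as random noise.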
I post this in part for the review of the model in question and also as a warning that competent people really do read research papers in their areas. Yes, on the WWW you can publish anything you want, of whatever quality. But, others in your field will notice. Is that what you want?
Domain Adaptation with Hierarchical Naive Bayes Classifiers by Bob Carpenter.
From the post:
This will be the first of two posts exploring hierarchical and multilevel classifiers. In this post, I’ll describe a hierarchical generalization of naive Bayes (what the NLP world calls a “generative” model). The next post will explore hierarchical logistic regression (called a “discriminative” or “log linear” or “max ent” model in NLP land).
Very entertaining and useful if you use NLP at all in your pre-topic map phase.
A report that version 0.4 of the LingPipe book has appeared.
Bob Carpenter walks through the use of LingPipe in connection with the MITRE Name Matching Challenge.
There are many complex issues in data mining but doing well on basic tasks is always a good starting place.