A Consumer Electronics Named Entity Recognizer using NLTK by Sujit Pal.
From the post:
Some time back, I came across a question someone asked about possible approaches to building a Named Entity Recognizer (NER) for the Consumer Electronics (CE) industry on LinkedIn’s Natural Language Processing People group. I had just finished reading the NLTK Book and had some ideas, but I wanted to test my understanding, so I decided to build one. This post describes this effort.
The approach is actually quite portable and not tied to NLTK and Python, you could, for example, build a Java/Scala based NER using components from OpenNLP and Weka using this approach. But NLTK provides all the components you need in one single package, and I wanted to get familiar with it, so I ended up using NLTK and Python.
The idea is that you take some Consumer Electronics text, mark the chunks (words/phrases) you think should be Named Entities, then train a (binary) classifier on it. Each word in the training set, along with some features such as its Part of Speech (POS), Shape, etc is a training input to the classifier. If the word is part of a CE Named Entity (NE) chunk, then its trained class is True otherwise it is False. You then use this classifier to predict the class (CE NE or not) of words in (previously unseen) text from the Consumer Electronics domain.
Should help with mining data for “entities” (read “subjects” in the topic map sense) for addition to your topic map.
I did puzzle over the suggestion for improvement that reads:
Another idea is to not do reference resolution during tagging, but instead postponing this to a second stage following entity recognition. That way, the references will be localized to the text under analysis, thus reducing false positives.
Post-authoring reference resolution might benefit from that approach.
But, if references were resolved by authors during the creation of a text, such as the insertion of Wikipedia references for entities, a different result would be obtained.
In those cases, assuming the author of a text is identified, they can be associated with a particular set of reference resolutions.