A New Entity Salience Task with Millions of Training Examples by Dan Gillick and Jesse Dunietz.
Abstract:
Although many NLP systems are moving toward entity-based processing, most still identify important phrases using classical keyword-based approaches. To bridge this gap, we introduce the task of entity salience: assigning a relevance score to each entity in a document. We demonstrate how a labeled corpus for the task can be automatically generated from a corpus of documents and accompanying abstracts. We then show how a classifier with features derived from a standard NLP pipeline outperforms a strong baseline by 34%. Finally, we outline initial experiments on further improving accuracy by leveraging background knowledge about the relationships between entities.
The article concludes:
We believe entity salience is an important task with many applications. To facilitate further research, our automatically generated salience annotations, along with resolved entity ids, for the subset of the NYT corpus discussed in this paper are available here: https://code.google.com/p/nyt-salience
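To make the setup concrete, here is a rough sketch of the two ideas in the abstract: deriving salience labels automatically from a document/abstract pair, and computing a couple of simple positional features for a classifier. The helper names, toy data, and string-matching heuristic are my own illustration, not the authors' pipeline (which resolves entity ids and draws features from a full NLP stack).

```python
# Illustrative sketch only: assumes entities arrive as plain strings from some
# upstream NER/coreference step, and labels an entity "salient" if the
# document's abstract mentions it.

def label_salience(doc_entities, abstract_text):
    """Heuristic labels: an entity is salient if the abstract mentions it."""
    abstract_lower = abstract_text.lower()
    return {ent: ent.lower() in abstract_lower for ent in set(doc_entities)}

def simple_features(doc_entities, doc_text):
    """Two classic features per entity: mention count and the normalized
    offset of its first mention in the document."""
    doc_lower = doc_text.lower()
    features = {}
    for ent in set(doc_entities):
        count = doc_lower.count(ent.lower())
        first = doc_lower.find(ent.lower())
        features[ent] = {
            "mention_count": count,
            "first_mention_frac": first / max(len(doc_text), 1) if first >= 0 else 1.0,
        }
    return features

if __name__ == "__main__":
    document = ("Dan Gillick and Jesse Dunietz propose an entity salience task. "
                "The task assigns a relevance score to each entity in a document. "
                "Experiments use a subset of the NYT corpus.")
    abstract = "We introduce entity salience, demonstrated on the NYT corpus."
    entities = ["Dan Gillick", "Jesse Dunietz", "NYT corpus", "entity salience"]

    print(label_salience(entities, abstract))   # which entities the abstract mentions
    print(simple_features(entities, document))  # toy classifier features
```

A real system would, as the paper describes, resolve entity mentions to ids and train a classifier over millions of such automatically labeled examples rather than rely on exact string matching.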
A classic shape for a CS article: a new approach/idea, data + experiments, plus results and code. It doesn’t get any better.
The results won’t be perfect, but the question is: are they “acceptable results”?
That presumes a working definition of “acceptable” that you have hammered out with your client.
I first saw this in a tweet by Stefano Bertolo.