New York Times Annotated Corpus Add-On

New York Times corpus add-on annotations: MIDs and Entity Salience. (GitHub – Data)

From the webpage:

The data included in this release accompanies the paper, entitled “A New Entity Salience Task with Millions of Training Examples” by Jesse Dunietz and Dan Gillick (EACL 2014).

The training data includes 100,834 documents from 2003-2006, with 19,261,118 annotated entities. The evaluation data includes 9,706 documents from 2007, with 187,080 annotated entities.

An empty line separates each document annotation. The first line of a document’s annotation contains the NYT document id followed by the title. Each subsequent line refers to an entity, with the following tab-separated fields:

  • entity index
  • automatically inferred salience {0,1}
  • mention count (from our coreference system)
  • first mention’s text
  • byte offset start position for the first mention
  • byte offset end position for the first mention
  • MID (from our entity resolution system)
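
The layout described above is simple enough to read with a short script. What follows is a minimal sketch, not the authors' tooling. It assumes the document id and title on the header line are also tab-separated (the description does not say), and that the MID field may be empty when no entity was resolved. The names `Entity`, `Document` and `parse_documents` are mine, and the sample values in the usage note are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Entity:
    index: int          # entity index within the document
    salient: bool       # automatically inferred salience, 0 or 1
    mention_count: int  # number of mentions found by coreference
    first_mention: str  # text of the first mention
    start: int          # byte offset where the first mention starts
    end: int            # byte offset where the first mention ends
    mid: str            # MID from entity resolution; may be empty (assumption)


@dataclass
class Document:
    doc_id: str
    title: str
    entities: List[Entity]


def _parse_block(block: List[str]) -> Document:
    # Header line: document id, then title. Tab is assumed as the separator,
    # with a fallback to the first run of whitespace.
    header = block[0].split("\t", 1)
    if len(header) == 1:
        header = block[0].split(None, 1)
    doc_id = header[0]
    title = header[1] if len(header) > 1 else ""

    entities = []
    for line in block[1:]:
        fields = line.split("\t")
        # Pad a missing trailing MID so unresolved entities still parse.
        fields += [""] * (7 - len(fields))
        entities.append(Entity(
            index=int(fields[0]),
            salient=fields[1] == "1",
            mention_count=int(fields[2]),
            first_mention=fields[3],
            start=int(fields[4]),
            end=int(fields[5]),
            mid=fields[6],
        ))
    return Document(doc_id, title, entities)


def parse_documents(lines: Iterable[str]) -> Iterator[Document]:
    """Yield one Document per block of lines; blank lines separate blocks."""
    block: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")  # keep trailing tabs, drop only the newline
        if line.strip():
            block.append(line)
        elif block:
            yield _parse_block(block)
            block = []
    if block:
        yield _parse_block(block)
```

To try it on a small made-up sample:

```python
sample = (
    "1453100\tA Sample Title\n"
    "0\t1\t3\tJohn Smith\t10\t20\t/m/0abc\n"
    "1\t0\t1\tParis\t40\t45\t/m/05qtj\n"
    "\n"
    "1453101\tAnother Title\n"
    "0\t0\t2\tIBM\t0\t3\t\n"
)
for doc in parse_documents(sample.splitlines()):
    salient = [e.first_mention for e in doc.entities if e.salient]
    print(doc.doc_id, doc.title, salient)
```

With a real file, passing the open file handle to `parse_documents` streams it one document at a time, which matters for the training file.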

The background in “Teaching machines to read between the lines (and a new corpus with entity salience annotations)” by Dan Gillick and Dave Orr will be useful.

From the post:

Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas.

Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.

We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people — we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.

Term ratios are a start, but we can do better. Search indexing these days is much more involved, using for example the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real world which we already know quite a bit about. (emphasis added)

Truly an important data set but I’m rather partial to that last line. 😉

So the question is: if we “recognize” an entity as salient, do we annotate the entity and:

  • Present the reader with a list of links, each to a separate mention with or without ads?
  • Present the reader with what is known about the entity, with or without ads?

I see so many divided posts and other content arranged to force readers to endure more ads that I consciously avoid buying anything for which I see a web ad. I suggest you do the same (if possible). I buy books, for example, because someone known to me recommends them, not because some marketeer pushes them at me across many domains.
