Analysis of Named Entity Recognition and Linking for Tweets

Analysis of Named Entity Recognition and Linking for Tweets by Leon Derczynski, et al.


Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identi cation, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

A detailed review of existing solutions for mining tweets, where they fail along and why.

A comparison to spur tweet research:

Tweets Per Day > 500,000,000 Derczynski, p. 2
Annotated Tweets < 10,000 Derczynski, p. 27

Let’s see: 500,000,000 / 10,000 = 50,000.

The number of tweet per day is more than 50,000 times the number of tweets annotated with named entity types.

It may just be me but that sounds like the sort of statement you would see in a grant proposal to increase the number of annotated tweets.


I first saw this in a tweet by Diana Maynard.

Comments are closed.