Archive for the ‘Disambiguation’ Category

Evaluating Entity Linking with Wikipedia

Monday, April 28th, 2014

Evaluating Entity Linking with Wikipedia by Ben Hachey, et al.


Named Entity Linking (NEL) grounds entity mentions to their corresponding node in a Knowledge Base (KB). Recently, a number of systems have been proposed for linking entity mentions in text to Wikipedia pages. Such systems typically search for candidate entities and then disambiguate them, returning either the best candidate or NIL. However, comparison has focused on disambiguation accuracy, making it difficult to determine how search impacts performance. Furthermore, important approaches from the literature have not been systematically compared on standard data sets.

We reimplement three seminal NEL systems and present a detailed evaluation of search strategies. Our experiments find that coreference and acronym handling lead to substantial improvement, and search strategies account for much of the variation between systems. This is an interesting finding, because these aspects of the problem have often been neglected in the literature, which has focused largely on complex candidate ranking algorithms.

A very deep survey of entity linking literature (including record linkage) and implementation of three complete entity linking systems for comparison.

At forty-eight (48) pages it isn’t a quick read but should be your starting point for pushing the boundaries on entity linking research.

I first saw this in a tweet by Alyona Medelyan.

Entity Recognition and Disambiguation Challenge

Wednesday, March 5th, 2014

Entity Recognition and Disambiguation Challenge by Evgeniy Gabrilovich.

Important Dates
March 10: Leaderboard and trial submission system online (tentative)
June 10: Trial runs end at 11:59AM PDT; Test begins at noon PDT
June 20: Team results announced
June 27: Workshop paper due
July 11: Workshop at SIGIR-2014, Gold Coast, Australia

From the post:

We are happy to announce the 2014 Entity Recognition and Disambiguation (ERD) Challenge! Participating teams will have the opportunity not only to win cash prizes in the total amount of US$1,500 but also to publish and present their results at a SIGIR 2014 workshop in Gold Coast, Australia, co-sponsored by Google and Microsoft.

The objective of an ERD system is to recognize mentions of entities in a given text, disambiguate them, and map them to the known entities in a given collection or knowledge base. Building a good ERD system is challenging because:

* Entities may appear in different surface forms
* The context in which a surface form appears often constrains valid entity interpretations
* An ambiguous surface form may match multiple entity interpretations, especially in short text
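The ambiguity in that last bullet is easy to make concrete. A toy sketch, in which the alias table, entities and cue words are all invented for illustration, not taken from the Challenge:

```python
# Toy illustration of the ERD difficulty described above: one surface form,
# several candidate entities, with context words narrowing the choice.
# ALIASES and CONTEXT_CUES are invented for this sketch.
ALIASES = {
    "mercury": ["Mercury (planet)", "Mercury (element)",
                "Mercury (mythology)", "Freddie Mercury"],
}

CONTEXT_CUES = {
    "Mercury (planet)": {"orbit", "planet", "nasa"},
    "Mercury (element)": {"metal", "thermometer", "toxic"},
    "Mercury (mythology)": {"god", "roman", "myth"},
    "Freddie Mercury": {"queen", "singer", "band"},
}

def disambiguate(surface_form, context_words):
    """Pick the candidate whose cue words overlap the context most."""
    candidates = ALIASES.get(surface_form.lower(), [])
    if not candidates:
        return None  # the NIL case: no matching entity in the collection
    ctx = {w.lower() for w in context_words}
    return max(candidates, key=lambda e: len(CONTEXT_CUES[e] & ctx))

print(disambiguate("Mercury", ["the", "singer", "of", "Queen"]))
# → Freddie Mercury
```

Real systems replace the hand-made cue sets with statistics over large corpora, but the shape of the problem is the same: many candidates, one context.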

The ERD Challenge will have two tracks, one focusing on ERD for long texts (i.e., web documents) and the other on short texts (i.e., web search queries). Each team can elect to participate in one or both tracks.

The Challenge is open to the general public. Participants are asked to build their systems as publicly accessible web services, using whatever resources they have at their disposal. Entries to the Challenge are submitted in the form of URLs to the participants’ web services.

Participants will have a period of 3 months to test run their systems using development datasets hosted by the ERD Challenge website. The final evaluations and the determination of winners will be performed on held-out datasets that have similar properties to the development sets.

From the Microsoft version of the announcement:

  • The call for participation is now available here.
  • Please express your intent to join the competition using the signup sheet.
  • A Google Group has been created for discussion; subscribe to it for news and announcements.
  • ERD 2014 will be a SIGIR 2014 workshop. See you in Australia!

The challenge website: Entity Recognition and Disambiguation Challenge

Beta versions of the data sets.

I would like to be on a team but someone else would have to make the trip to Australia. 😉

[Humans Only]

Monday, October 21st, 2013

From the About page:

Life is full of choices to make, so are the differences. Differentiation is the identity of a person or any item.

Throughout our life we have to make a number of choices. To make the right choice we need to know what makes one different from the other.

We know that making the right choice is the hardest task we face in our life and we will never be satisfied with what we chose, we tend to think the other one would have been better. We spend a lot of time on making decision between A and B.

And the information that guide us to make the right choice should be unbiased, easily accessible, freely available, no hidden agendas and have to be simple and self explanatory, while adequately informative. Information is everything in decision making. That’s where comes in. We make your life easy by guiding you to distinguish the differences between anything and everything, so that you can make the right choices.

Whatever the differences you want to know, be it about two people, two places, two items, two concepts, two technologies or whatever it is, we have the answer. We have not confined ourselves to limits. We have a very wide collection of information that is diverse, unbiased and freely available. In our analysis we try to cover all the areas, such as what the difference is, why the difference exists and how the difference affects you.

What we do at, we team up with selected academics, subject matter experts and script writers across the world to give you the best possible information in differentiating any two items.

Easy Search: We have added search engine for viewers to go direct to the topic they are searching for, without browsing page by page.

Sam Hunting forwarded this to my attention.

I listed it under dictionary and disambiguation but I am not sure either of those is correct.

Just a sampling:

And my current favorite:

Difference Between Lucid Dreaming and Astral Projection

Never has occurred to me to confuse those two. 😉

There are over five hundred and twenty (520) pages, and assuming an average of sixteen (16) entries per page, over eight thousand (8,000) entries today.

Unstructured prose is used to distinguish one subject from another, rather than formal properties.

Being human really helps with the distinctions given in the articles.

Disambiguating Hilarys

Monday, April 15th, 2013

Hilary Mason (live, data scientist) writes about Google confusing her with Hilary Mason (deceased, actress) in Et tu, Google?

To be fair, Hilary Mason (live, data scientist), notes Bing has made the same mistake in the past.

Hilary Mason (live, data scientist) goes on to say:

I know that entity disambiguation is a hard problem. I’ve worked on it, though never with the kind of resources that I imagine Google can bring to it. And yet, this is absurd!

Is entity disambiguation a hard problem?

Or is entity disambiguation a hard problem after the act of authorship?

Authors (in general) know what entities they meant.

The hard part is inferring what entity they meant when they forgot to disambiguate between possible entities.

Rather than focusing on mining low-grade ore (content where entities are not disambiguated), wouldn’t a better solution be authoring with automatic entity disambiguation?

We have auto-correction in word processing software now, why not auto-entity software that tags entities in content?

Presenting the author of content with disambiguated entities for them to accept, reject or change.

Won’t solve the problem of prior content with undistinguished entities but can keep the problem from worsening.
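A minimal sketch of what such an authoring-time “auto-entity” checker might look like. The entity table is invented, and the `choose` callback stands in for the author’s accept/reject decision:

```python
import re

# Invented lookup table of ambiguous names and their candidate entities.
ENTITY_TABLE = {
    "Hilary Mason": ["Hilary Mason (data scientist)", "Hilary Mason (actress)"],
}

def propose_entities(text):
    """Yield (mention, offset, candidates) for each known name in text."""
    for name, candidates in ENTITY_TABLE.items():
        for m in re.finditer(re.escape(name), text):
            yield m.group(0), m.start(), candidates

def tag_draft(text, choose):
    """Ask the author, via `choose`, which candidate each mention means."""
    return {(mention, pos): choose(mention, candidates)
            for mention, pos, candidates in propose_entities(text)}

# Simulated author who always accepts the first candidate.
draft = "Hilary Mason spoke about entity disambiguation."
print(tag_draft(draft, lambda mention, candidates: candidates[0]))
# → {('Hilary Mason', 0): 'Hilary Mason (data scientist)'}
```

The point is that the ambiguity is resolved while the author, who knows which Hilary Mason they meant, is still in the loop.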

Learning from Big Data: 40 Million Entities in Context

Saturday, March 9th, 2013

Learning from Big Data: 40 Million Entities in Context by Dave Orr, Amar Subramanya, and Fernando Pereira, Google Research.

A fuller explanation of the Wikilinks Corpus from Google:

When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? This problem is called disambiguation (a word that is itself ambiguous), and while it’s necessary for communication, and humans are amazingly good at it (when was the last time you confused a fruit with a giant tech company?), computers need help.

To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages — over 100 times bigger than the next largest corpus (about 100,000 documents, see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we’ve discussed before), then the anchor text can be thought of as a mention of the corresponding entity.
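The extraction rule described in the quote, keeping a link as a mention when its anchor text closely matches the title of the target Wikipedia page, might be sketched like this. The similarity threshold and sample HTML are my assumptions, not Google’s actual pipeline:

```python
# Sketch of anchor-text mention extraction: a link counts as a mention
# when its anchor text closely matches the Wikipedia page title it links
# to. The 0.8 threshold is an invented stand-in.
from difflib import SequenceMatcher
from html.parser import HTMLParser
from urllib.parse import unquote

class MentionExtractor(HTMLParser):
    def __init__(self, threshold=0.8):
        super().__init__()
        self.threshold = threshold
        self.mentions = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "/wiki/" in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = "".join(self._text).strip()
            title = unquote(self._href.split("/wiki/")[-1]).replace("_", " ")
            if SequenceMatcher(None, anchor.lower(), title.lower()).ratio() >= self.threshold:
                self.mentions.append((anchor, title))
            self._href = None

page = ('He sang with <a href="https://en.wikipedia.org/wiki/'
        'Freddie_Mercury">Freddie Mercury</a> once.')
p = MentionExtractor()
p.feed(page)
print(p.mentions)
# → [('Freddie Mercury', 'Freddie Mercury')]
```

Each surviving (anchor, title) pair is a disambiguated mention: the Wikipedia page plays the role of the entity.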

Suggestions for using the data? The authors have those as well:

What might you do with this data? Well, we’ve already written one ACL paper on cross-document co-reference (and received lots of requests for the underlying data, which partly motivates this release). And really, we look forward to seeing what you are going to do with it! But here are a few ideas:

  • Look into coreference — when different mentions mention the same entity — or entity resolution — matching a mention to the underlying entity
  • Work on the bigger problem of cross-document coreference, which is how to find out if different web pages are talking about the same person or other entity
  • Learn things about entities by aggregating information across all the documents they’re mentioned in
  • Type tagging tries to assign types (they could be broad, like person, location, or specific, like amusement park ride) to entities. To the extent that the Wikipedia pages contain the type information you’re interested in, it would be easy to construct a training set that annotates the Wikilinks entities with types from Wikipedia.
  • Work on any of the above, or more, on subsets of the data. With existing datasets, it wasn’t possible to work on just musicians or chefs or train stations, because the sample sizes would be too small. But with 10 million Web pages, you can find a decent sampling of almost anything.
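The aggregation and cross-document coreference ideas in that list reduce, at their simplest, to grouping mentions by the entity they resolve to. The mention records below are invented:

```python
from collections import defaultdict

# Invented mention records: (source document, surface text, resolved entity).
mentions = [
    ("doc1.html", "Mercury", "Mercury (planet)"),
    ("doc2.html", "the innermost planet", "Mercury (planet)"),
    ("doc3.html", "Freddie", "Freddie Mercury"),
]

# Group every mention under the entity it was disambiguated to.
by_entity = defaultdict(list)
for doc, surface, entity in mentions:
    by_entity[entity].append((doc, surface))

# Two different pages turn out to be talking about the same entity.
print(sorted(doc for doc, _ in by_entity["Mercury (planet)"]))
# → ['doc1.html', 'doc2.html']
```

At 40 million mentions the grouping step needs real machinery, but the shape of the result is the same: an entity with everything said about it, anywhere.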

Those all sound like topic map tasks to me, especially if you capture your coreference results for merging with other coreference results.

…Wikilinks Corpus With 40M Mentions And 3M Entities

Saturday, March 9th, 2013

Google Research Releases Wikilinks Corpus With 40M Mentions And 3M Entities by Frederic Lardinois.

From the post:

Google Research just launched its Wikilinks corpus, a massive new data set for developers and researchers that could make it easier to add smart disambiguation and cross-referencing to their applications. The data could, for example, make it easier to find out if two web sites are talking about the same person or concept, Google says. In total, the corpus features 40 million disambiguated mentions found within 10 million web pages. This, Google notes, makes it “over 100 times bigger than the next largest corpus,” which features fewer than 100,000 mentions.

For Google, of course, disambiguation is something that is a core feature of the Knowledge Graph project, which allows you to tell Google whether you are looking for links related to the planet, car or chemical element when you search for ‘mercury,’ for example. It takes a large corpus like this one and the ability to understand what each web page is really about to make this happen.

Details follow on how to create this data set.

Very cool!

The only caution is that your entities, those specific to your enterprise, are unlikely to appear, even in 40M mentions.

But the Wikilinks Corpus + your entities, now that is something with immediate ROI for your enterprise.

Kwong – … Word Sense Disambiguation

Tuesday, January 29th, 2013

New Perspectives on Computational and Cognitive Strategies for Word Sense Disambiguation by Oi Yee Kwong.

From the description:

Cognitive and Computational Strategies for Word Sense Disambiguation examines cognitive strategies by humans and computational strategies by machines, for WSD in parallel.

Focusing on a psychologically valid property of words and senses, author Oi Yee Kwong discusses their concreteness or abstractness and draws on psycholinguistic data to examine the extent to which existing lexical resources resemble the mental lexicon as far as the concreteness distinction is concerned. The text also investigates the contribution of different knowledge sources to WSD in relation to this very intrinsic nature of words and senses.

I wasn’t aware that the “mental lexicon” of words had been fully described.

Shows what you can learn from reading marketing summaries of research.

Open Source Natural Language Spell-Checker [Disambiguation at the point of origin.]

Friday, October 26th, 2012

Automattic Open Sources Natural Language Spell-Checker After the Deadline by Jolie O’Dell.

I am sure the original headline made sense to its author, but I wonder how a natural language processor would react to it?

My reaction, being innocent of any prior knowledge of the actors or the software was: What deadline? Reading it as a report of a missed deadline.

It is almost a “who’s on first” type headline. The software’s name is “After the Deadline.”

That confusion resolved, I read:

Matt Mullenweg has just announced on his blog that WordPress parent company Automattic is open sourcing After the Deadline, a natural-language spell-checking plugin for WordPress and TinyMCE that was only recently ushered into the Automattic fold.

Scarcely seven weeks after its acquisition was announced, After the Deadline’s core technology is being released under the GPL. Moreover, writes Mullenweg, “There’s also a new jQuery API that makes it easy to integrate with any text area.”

Interested parties can check out this demo or read the tech overview and grab the source code here.

I can use spelling/grammar suggestions. Particularly since I make the same mistakes over and over again.

Does that also mean I talk about the same subjects/entities over and over again? Or at least a limited range of subjects/entities?

Imagine a user configurable subject/entity “checker” that annotated recognized subjects/entities with an <a> element. Enabling the user to accept/reject the annotation.

Disambiguation at the point of origin.

The title of the original article could become:

“<a href=””>Automattic</a> Open Sources Natural Language Spell-Checker <a href=””>After the Deadline</a>”

Seems less ambiguous to me.

Certainly less ambiguous to a search engine.
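A sketch of that annotation step, with an invented entity-to-URL table standing in for whatever knowledge base the checker would consult. Only exact matches are handled, and the author would still confirm each link:

```python
import re

# Invented entity-to-URL table; a real checker would consult a knowledge base.
KNOWN_ENTITIES = {
    "Automattic": "https://en.wikipedia.org/wiki/Automattic",
    "After the Deadline": "https://en.wikipedia.org/wiki/After_the_Deadline",
}

def annotate(text):
    """Wrap each recognized entity in an <a> element for author review."""
    # Match longer names first so multi-word entities win over substrings.
    pattern = "|".join(re.escape(k)
                       for k in sorted(KNOWN_ENTITIES, key=len, reverse=True))
    return re.sub(pattern,
                  lambda m: f'<a href="{KNOWN_ENTITIES[m.group(0)]}">{m.group(0)}</a>',
                  text)

title = "Automattic Open Sources Natural Language Spell-Checker After the Deadline"
print(annotate(title))
```

The output is the disambiguated headline from above, with the hrefs filled in, ready for the author to accept, reject or change.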


Web query disambiguation using PageRank

Sunday, July 1st, 2012

Web query disambiguation using PageRank by Christos Makris, Yannis Plegas, and Sofia Stamou. (Makris, C., Plegas, Y. and Stamou, S. (2012), Web query disambiguation using PageRank. J. Am. Soc. Inf. Sci., doi: 10.1002/asi.22685)


In this article, we propose new word sense disambiguation strategies for resolving the senses of polysemous query terms issued to Web search engines, and we explore the application of those strategies when used in a query expansion framework. The novelty of our approach lies in the exploitation of the Web page PageRank values as indicators of the significance the different senses of a term carry when employed in search queries. We also aim at scalable query sense resolution techniques that can be applied without loss of efficiency to large data sets such as those on the Web. Our experimental findings validate that the proposed techniques perform more accurately than do the traditional disambiguation strategies and improve the quality of the search results, when involved in query expansion.

A better summary of the authors’ approach lies within the article:

The intuition behind our method is that we could improve the Web users’ search experience if we could correlate the importance that the sense of a term has when employed in a query (i.e., the importance of the sense as perceived by the information seeker) with the importance the same sense has when contained in a Web page (i.e., the importance of the sense as perceived by the information provider). We rely on the exploitation of PageRank because of its effectiveness in capturing the importance of every page on the Web graph based on their links’ connectivity, and from which we may infer the importance of every page in the “collective mind” of the Web content providers/creators. To account for that, we explore whether the PageRank value of a page may serve as an indicator of how significant the dominant senses of a query-matching term in the page are and, based on that, disambiguate the query.
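The intuition in the quote can be reduced to a toy calculation: sum the PageRank mass of the pages where each sense of the query term dominates, and rank the senses by that mass. All the values and the sense index below are invented, not taken from the paper:

```python
# Invented PageRank values for three pages matching the query term "mercury".
page_rank = {"p1": 0.45, "p2": 0.30, "p3": 0.25}

# Which sense dominates each page (assumed precomputed by a WSD step).
dominant_sense = {"p1": "planet", "p2": "element", "p3": "planet"}

def rank_senses(pages):
    """Score each sense by the PageRank mass of the pages it dominates."""
    scores = {}
    for page in pages:
        sense = dominant_sense[page]
        scores[sense] = scores.get(sense, 0.0) + page_rank[page]
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(rank_senses(["p1", "p2", "p3"])[0][0])
# → planet
```

The sense the content providers collectively “voted” for, by linking to pages where it dominates, wins the disambiguation.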

Which reminds me of statistical machine translation, which replaced syntax-based methods years ago.

Perhaps PageRank is summing our linguistic preferences for particular word senses.

If that is the case, how would you incorporate it in ranking the results delivered to a user from a topic map? There are different possible search outcomes; how do we establish which one a user prefers?