Archive for the ‘Entity Resolution’ Category

Stanford CoreNLP v3.7.0 beta is out! [Time is short, comments, bug reports, now!]

Thursday, November 3rd, 2016

Stanford CoreNLP v3.7.0 beta

The tweets I saw from Stanford NLP Group read:

Stanford CoreNLP v3.7.0 beta is out—improved coreference, dep parsing—KBP relation annotator—Arabic pipeline #NLProc

We‘re doing an official CoreNLP beta release this time, so bugs, comments, and fixes especially appreciated over the next two weeks!

OK, so, what are you waiting for? 😉

Oh, the standard blurb for your boss on why Stanford CoreNLP should be taking up your time:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.

Choose Stanford CoreNLP if you need:

  • An integrated toolkit with a good range of grammatical analysis tools
  • Fast, reliable analysis of arbitrary texts
  • The overall highest quality text analytics
  • Support for a number of major (human) languages
  • Interfaces available for various major modern programming languages
  • Ability to run as a simple web service

Stanford CoreNLP is an integrated framework. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A CoreNLP tool pipeline can be run on a piece of plain text with just two lines of code. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Using the standard blurb about the Stanford CoreNLP has these advantages:

  • It’s copy-n-paste, you didn’t have to write it
  • It’s appeal to authority (Stanford)
  • It’s truthful

The truthful point is a throw-away these days but thought I should mention it. 😉

Visualizing Data Loss From Search

Thursday, April 14th, 2016

I used searches for “duplicate detection” (3,854) and “coreference resolution” (3290) in “Ironically, Entity Resolution has many duplicate names” [Data Loss] to illustrate potential data loss in searches.

Here is a rough visualization of the information loss if you use only one of those terms:


If you search for “duplicate detection,” you miss all the articles shaded in blue.

If you search for “coreference resolution,” you miss all the articles shaded in yellow.

Suggestions for improving this visualization?

It is a visualization that could be performed on client’s data, using their search engine/database.

In order to identify the data loss they are suffering now from search across departments.

With the caveat that not all data loss is bad and/or worth avoiding.

Imaginary example (so far): What if you could demonstrate no overlapping of terminology for two vendors for the United States Army and the Air Force. That is no query terms for one returned useful results for the other.

That is a starting point for evaluating the use of topic maps.

While the divergence in terminologies is a given, the next question is: What is the downside to that divergence? What capability is lost due to that divergence?

Assuming you can identify such a capacity, the next question is to evaluate the cost of reducing and/or eliminating that divergence versus the claimed benefit.

I assume the most relevant terms are going to be those internal to customers and/or potential customers.

Interest in working this up into a client prospecting/topic map marketing tool?

Separately I want to note my discovery (you probably already knew about it) of VennDIS: a JavaFX-based Venn and Euler diagram software to generate publication quality figures. Download here. (Apologies, the publication itself if firewalled.)

The export defaults to 800 x 800 resolution. If you need something smaller, edit the resulting image in Gimp.

It’s a testimony to the software that I was able to produce a useful image in less than a day. Kudos to the software!

“Ironically, Entity Resolution has many duplicate names” [Data Loss]

Wednesday, April 13th, 2016

Nancy Baym tweeted:

“Ironically, Entity Resolution has many duplicate names” – Lise Getoor


I can’t think of any subject that doesn’t have duplicate names.

Can you?

In a “search driven” environment, not knowing the “duplicate” names for a subject means data loss.

Data loss that could include “smoking gun” data.

Topic mappers have been making that pitch for decades but it never has really caught fire.

I don’t think anyone doubts that data loss occurs, but the gravity of that data loss remains elusive.

For example, let’s take three duplicate names for entity resolution from the slide, duplicate detection, reference reconciliation, coreference resolution.

Supplying all three as quoted strings to CiteSeerX, any guesses on the number of “hits” returned?

As of April 13, 2016:

  • duplicate detection – 3,854
  • reference reconciliation – 253
  • coreference resolution – 3,290

When running the query "duplicate detection" "coreference resolution", only 76 “hits” are returned, meaning that there are only 76 overlapping cases reported in the total of 7,144 for both of those terms separately.

That’s assuming CiteSeerX isn’t shorting me on results due to server load, etc. I would have to cross-check the data itself before I would swear to those figures.

But consider just the raw numbers I report today: duplicate detection – 3,854, coreference resolution – 3,290, with 76 overlapping cases.

That’s two distinct lines of research on the same problem, for the most part, ignoring the other.

What do you think the odds are of duplication of techniques, experiences, etc., spread out over those 7,144 articles?

Instead of you or your client duplicating a known-to-somebody solution, you could be building an enhanced solution.

Well, except for the data loss due to “duplicate names” in a search environment.

And that you would have to re-read all the articles in order to find which technique or advancement was made in each article.

Multiply that by everyone who is interested in a subject and its a non-trivial amount of effort.

How would you like to avoid data loss and duplication of effort?

How Entity-Resolved Data Dramatically Improves Analytics

Wednesday, June 10th, 2015

How Entity-Resolved Data Dramatically Improves Analytics by Marc Shichman.

From the post:

In my last two blog posts, I’ve written about how Novetta Entity Analytics resolves entity data from multiple sources and formats, and why its speed and scalability are so important when analyzing large volumes of data. Today I’m going to discuss how analysts can achieve much better results than ever before by utilizing entity-resolved data in analytics applications.

When data from all available sources is combined and entities are resolved, individual records about a real-world entity’s transactions, actions, behaviors, etc. are aggregated and assigned to that person, organization, location, automobile, ship or any other entity type. When an application performs analytics on this entity-resolved data, the results offer much greater context than analytics on the unlinked, unresolved data most applications use today.

Analytics that present a complete view of all actions of an individual entity are difficult to deliver today as they can require many time-consuming and expensive manual processes. With entity-resolved data, complete information about each entity’s specific actions and behaviors is automatically linked so applications can perform analytics quickly and easily. Below are some examples of how applications, such as enterprise search, data warehouse and link analysis visualization, can employ entity-resolved data from Novetta Entity Analytics to provide more powerful analytics.

Marc isn’t long on specifics of how Novetta Entity Analytics works in his prior posts but I think we can all agree on his recitation of the benefits of entity resolution in this post.

Once we know the resolution of an entity or subject identity as we would say in topic maps, the payoffs are immediate and worthwhile. Search results are more relevant, aggregated (merged) data speeds up queries and multiple links are simplified as they are merged.

How we would get there varies but Marc does a good job of describing the benefits!

Analysis of named entity recognition and linking for tweets

Wednesday, May 20th, 2015

Analysis of named entity recognition and linking for tweets by Leon Derczynski, et al.


Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

The questions addressed by the paper are:

RQ1 How robust are state-of-the-art named entity recognition and linking methods on short and noisy microblog texts?

RQ2 What problem areas are there in recognising named entities in microblog posts, and what are the major causes of false negatives and false positives?

RQ3 Which problems need to be solved in order to further the state-of-the-art in NER and NEL on this difficult text genre?

The ultimate conclusion is that entity recognition in microblog posts falls short of what has been achieved for newswire text but if you need results now or at least by tomorrow, this is a good guide to what is possible and where improvements can be made.

Wandora tutorial – OCR extractor and Alchemy API Entity extractor

Wednesday, March 18th, 2015

From the description:

Video reviews the OCR (Optical Character Recognition) extractor and the Alchemy API Entity extractor of Wandora application. First, the OCR extractor is used to recognize text out of PNG images. Next the Alchemy API Entity extractor is used to recognize entities out of the text. Wandora is an open source tool for people who collect and process information, especially networked knowledge and knowledge about WWW resources. For more information see

A great demo of some of the many options of Wandora! (Wandora has more options than a Swiss army knife.)

It is an impressive demonstration.

If you aren’t familiar with Wandora, take a close look at it:

Information Extraction framework in Python

Friday, November 7th, 2014

Information Extraction framework in Python

From the post:

IEPY is an open source tool for Information Extractionfocused on Relation Extraction.

To give an example of Relation Extraction, if we are trying to find a birth date in:

“John von Neumann (December 28, 1903 – February 8, 1957) was a Hungarian and American pure and applied mathematician, physicist, inventor and polymath.”

then IEPY’s task is to identify “John von Neumann” and “December 28, 1903” as the subject and object entities of the “was born in” relation.

It’s aimed at:
  • users needing to perform Information Extraction on a large dataset.
  • scientists wanting to experiment with new IE algorithms.

Your success with recognizing relationships will vary but every one successfully recognized is one less that must be coded by hand.

Speaking of relationships, I would prefer to also have the relationships between John von Neumann and “Hungarian and American pure and applied mathematician, physicist, inventor and polymath” recognized as well.

I first saw this in a tweet by Scientific Python.

Analysis of Named Entity Recognition and Linking for Tweets

Friday, October 24th, 2014

Analysis of Named Entity Recognition and Linking for Tweets by Leon Derczynski, et al.


Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identi cation, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

A detailed review of existing solutions for mining tweets, where they fail along and why.

A comparison to spur tweet research:

Tweets Per Day > 500,000,000 Derczynski, p. 2
Annotated Tweets < 10,000 Derczynski, p. 27

Let’s see: 500,000,000 / 10,000 = 50,000.

The number of tweet per day is more than 50,000 times the number of tweets annotated with named entity types.

It may just be me but that sounds like the sort of statement you would see in a grant proposal to increase the number of annotated tweets.


I first saw this in a tweet by Diana Maynard.

Named Entity Recognition: A Literature Survey

Friday, September 19th, 2014

Named Entity Recognition: A Literature Survey by Rahul Sharnagat.


In this report, we explore various methods that are applied to solve NER. In section 1, we introduce the named entity problem. In section 2, various named entity recognition methods are discussed in three three broad categories of machine learning paradigm and explore few learning techniques in them. In the first part, we discuss various supervised techniques. Subsequently we move to semi-supervised and unsupervised techniques. In the end we discuss about the method from deep learning to solve NER.

If you are new to the named entity recognition issue or want to pass on an introduction, this may be the paper for you. It covers all the high points with a three page bibliography to get your started in the literature.

I first saw this in a tweet by Christopher.

Getting Started with S4, The Self-Service Semantic Suite

Tuesday, September 16th, 2014

Getting Started with S4, The Self-Service Semantic Suite by Marin Dimitrov.

From the post:

Here’s how S4 developers can get started with The Self-Service Semantic Suite. This post provides you with practical information on the following topics:

  • Registering a developer account and generating API keys
  • RESTful services & free tier quotas
  • Practical examples of using S4 for text analytics and Linked Data querying

Ontotext is up front about the limitations on the “free” service:

  • 250 MB of text processed monthly (via the text analytics services)
  • 5,000 SPARQL queries monthly (via the LOD SPARQL service)

The number of pages in a megabyte of text varies depends on text content but assuming a working average of one (1) megabyte = five hundred (500) pages of text, you can analyze up to one hundred and twenty-five thousand (125,000) pages of text a month. Chump change for serious NLP but it is a free account.

The post goes on to detail two scenarios:

  • Annotate a news document via the News analytics service
  • Send a simple SPARQL query to the Linked Data service

Learn how effective entity recognition and SPARQL are with data of interest to you, at a minimum of investment.

I first saw this in a tweet by Tony Agresta.

New York Times Annotated Corpus Add-On

Wednesday, August 27th, 2014

New York Times corpus add-on annotations: MIDs and Entity Salience. (GitHub – Data)

From the webpage:

The data included in this release accompanies the paper, entitled “A New Entity Salience Task with Millions of Training Examples” by Jesse Dunietz and Dan Gillick (EACL 2014).

The training data includes 100,834 documents from 2003-2006, with 19,261,118 annotated entities. The evaluation data includes 9,706 documents from 2007, with 187,080 annotated entities.

An empty line separates each document annotation. The first line of a document’s annotation contains the NYT document id followed by the title. Each subsequent line refers to an entity, with the following tab-separated fields:

entity index automatically inferred salience {0,1} mention count (from our coreference system) first mention’s text byte offset start position for the first mention byte offset end position for the first mention MID (from our entity resolution system)

The background in Teaching machines to read between the lines (and a new corpus with entity salience annotations) by Dan Gillick and Dave Orr, will be useful.

From the post:

Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas.

Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.

We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people — we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.

Term ratios are a start, but we can do better. Search indexing these days is much more involved, using for example the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real word which we already already know quite a bit about. (emphasis added)

Truly an important data set but I’m rather partial to that last line. 😉

So the question is if we “recognize” a entity as salient, do we annotate the entity and:

  • Present the reader with a list of links, each to a separate mention with or without ads?
  • Present the reader with what is known about the entity, with or without ads?

I see enough divided posts and other information that forces readers to endure more ads that I consciously avoid buying anything for which I see a web ad. Suggest you do the same. (If possible.) I buy books, for example, because someone known to me recommends it, not because some marketeer pushes it at me across many domains.

Communicating and resolving entity references

Friday, June 27th, 2014

Communicating and resolving entity references by R.V. Guha.


Statements about entities occur everywhere, from newspapers and web pages to structured databases. Correlating references to entities across systems that use different identifiers or names for them is a widespread problem. In this paper, we show how shared knowledge between systems can be used to solve this problem. We present “reference by description”, a formal model for resolving references. We provide some results on the conditions under which a randomly chosen entity in one system can, with high probability, be mapped to the same entity in a different system.

An eye appointment is going to prevent me from reading this paper closely today.

From a quick scan, do you think Guha is making a distinction between entities and subjects (in the topic map sense)?

What do you make of literals having no identity beyond their encoding? (page 4, #3)

Redundant descriptions? (page 7) Would you say that defining a set of properties that must match would qualify? (Or even just additional subject indicators?)

Expect to see a lot more comments on this paper.


I first saw this in a tweet by Stefano Bertolo.

A New Entity Salience Task with Millions of Training Examples

Monday, March 10th, 2014

A New Entity Salience Task with Millions of Training Examples by Dan Gillick and Jesse Dunietz.


Although many NLP systems are moving toward entity-based processing, most still identify important phrases using classical keyword-based approaches. To bridge this gap, we introduce the task of entity salience: assigning a relevance score to each entity in a document. We demonstrate how a labeled corpus for the task can be automatically generated from a corpus of documents and accompanying abstracts. We then show how a classifier with features derived from a standard NLP pipeline outperforms a strong baseline by 34%. Finally, we outline initial experiments on further improving accuracy by leveraging background knowledge about the relationships between entities.

The article concludes:

We believe entity salience is an important task with many applications. To facilitate further research, our automatically generated salience annotations, along with resolved entity ids, for the subset of the NYT corpus discussed in this paper are available here:

A classic approach to a CS article: new approach/idea, data + experiments, plus results and code. It doesn’t get any better.

The results won’t be perfect, but the question is: Are they “acceptable results?”

Which presumes a working definition of “acceptable” that you have hammered out with your client.

I first saw this in a tweet by Stefano Bertolo.

Entity Recognition and Disambiguation Challenge

Wednesday, March 5th, 2014

Entity Recognition and Disambiguation Challenge by Evgeniy Gabrilovich.

Important Dates
March 10: Leaderboard and trial submission system online (tentative)
June 10: Trial runs end at 11:59AM PDT; Test begins at noon PDT
June 20: Team results announced
June 27: Workshop paper due
July 11: Workshop at SIGIR-2014, Gold Coast, Australia

From the post:

We are happy to announce the 2014 Entity Recognition and Disambiguation (ERD) Challenge! Participating teams will have the opportunity to not only win cash prizes in the total amount of US$1,500 but also be invited to publish and present their results at a SIGIR 2014 workshop in Gold Coast, Australia, co-sponsored by Google and Microsoft (

The objective of an ERD system is to recognize mentions of entities in a given text, disambiguate them, and map them to the known entities in a given collection or knowledge base. Building a good ERD system is challenging because:

* Entities may appear in different surface forms
* The context in which a surface form appears often constrains valid entity interpretations
* An ambiguous surface form may match multiple entity interpretations, especially in short text

The ERD Challenge will have two tracks, with one focusing on ERD for long texts (i.e., web documents) and the other on short texts (i.e., web search queries), respectively. Each team can elect to participate either in one or both tracks.

Open to the general public, participants are asked to build their systems as publicly accessible web services using whatever resources at their disposal. The entries to the Challenge are submitted in the form of URLs to the participants’ web services.

Participants will have a period of 3 months to test run their systems using development datasets hosted by the ERD Challenge website. The final evaluations and the determination of winners will be performed on held-out datasets that have similar properties to the development sets.

From the Microsoft version of the announcement:

  • The call for participation is now available here.
  • Please express your intent for joining the competition using the signup sheet.
  • A google group is created for discussion purpose. subscribe to the google group for news and announcements.
  • ERD 2014 will be a SIGIR 2014 workshop. See you in Australia!

The challenge website: Entity Recognition and Disambiguation Challenge

Beta versions of the data sets.

I would like to be on a team but someone else would have to make the trip to Australia. 😉

Duke 1.2 Released!

Sunday, February 16th, 2014

Lars Marius Garshol has released Duke 1.2!

From the homepage:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.2 (see ReleaseNotes).

Duke can find duplicate customer records, or other kinds of records in your database. Or you can use it to connect records in one data set with other records representing the same thing in another data set. Duke has sophisticated comparators that can handle spelling differences, numbers, geopositions, and more. Using a probabilistic model Duke can handle noisy data with good accuracy.


  • High performance.
  • Highly configurable.
  • Support for CSV, JDBC, SPARQL, and NTriples.
  • Many built-in comparators.
  • Plug in your own data sources, comparators, and cleaners.
  • Genetic algorithm for automatically tuning configurations.
  • Command-line client for getting started.
  • API for embedding into any kind of application.
  • Support for batch processing and continuous processing.
  • Can maintain database of links found via JNDI/JDBC.
  • Can run in multiple threads.

The GettingStarted page explains how to get started and has links to further documentation. The examples of use page lists real examples of using Duke, complete with data and configurations. This presentation has more the big picture and background.


Until you know which two or more records are talking about the same subject, it’s very difficult to know what to map together.


Tuesday, December 24th, 2013


From the webpage:

This project is an interactive entity resolution plugin for Elasticsearch based on Duke. Basically, it uses Bayesian probabilities to compute probability. You can pretty much use it an interactive deduplication engine.

To understand basics, go to Duke project documentation.

A list of available comparators is available here.

Interesting pairing of Duke (entity resolution/record linkage software by Lars Marius Garshol) with ElasticSearch.

Strings and user search behavior can only take an indexing engine so far. This is a step in the right direction.

A step more likely be followed with an Apache License as opposed to its current LGPLv3.

Reverse Entity Recognition? (Scrubbing)

Tuesday, December 10th, 2013

Improving privacy with language technologies by Rob Munro.

From the post:

One downside of the kind of technologies that we build at Idibon is that they can be used to compromise people’s privacy and, by extension, their safety. Any technology can be used for positive and negative purposes and as engineers we have a responsibility to ensure that what we create is for a better world.

For language technologies, the most negative application, by far, is eavesdropping: discovering information about people by monitoring their online communications and using that information in ways that harm the individuals. This can be something as direct and targeted as exposing the identities of at-risk individuals in a war-zone or it can be the broad expansion of government surveillance. The engineers at many technology companies announced their opposition to the latter with a loud, unified call today to reform government surveillance.

One way that privacy can be compromised at scale is the use of technology known as “named entity recognition”, which identifies the names of people, places, organizations, and other types of real-world entities in text. Given millions of sentences of text, named entity recognition can extract the names and addresses of everybody in the data in just a few seconds. But the same technology that can we used to uncover personally identifying information (PII) can also be used to remove the personally identifying information from the text. This is known as anonymizing or simply “scrubbing”.

Rob agrees that entity recognition can invade your personal privacy, but points out it can also protect your privacy.

You may think your “handle” on one or more networks provides privacy but it would not take much data to disappoint most people.

Entity recognition software can scrub data to remove “tells” that may identify you from it.

How much scrubbing is necessary depends on the data and the consequences of discovery.

Entity recognition is usually thought of as recognizing names, places, but it could just as easily be content analysis to recognize a particular author.

That would require more sophisticated “scrubbing” than entity recognition can support.

Entity Discovery using Mahout CollocDriver

Saturday, October 26th, 2013

Entity Discovery using Mahout CollocDriver by Sujit Pal.

From the post:

I spent most of last week trying out various approaches to extract “interesting” phrases from a collection of articles. The objective was to identify candidate concepts that could be added to our taxonomy. There are various approaches, ranging from simple NGram frequencies, to algorithms such as RAKE (Rapid Automatic Keyword Extraction), to rescoring NGrams using Log Likelihood or Chi-squared measures. In this post, I describe how I used Mahout’s CollocDriver (which uses the Log Likelihood measure) to find interesting phrases from a small corpus of about 200 articles.

The articles were in various formats (PDF, DOC, HTML), and I used Apache Tika to parse them into text (yes, I finally found the opportunity to learn Tika :-)). Tika provides parsers for many common formats, so all we have to do was to hook them up to produce text from the various file formats. Here is my code:

Think of this as winnowing the chaff that your human experts would otherwise read.

A possible next step would be to decorate the candidate “interesting” phrases with additional information before being viewed by your expert(s).

Understanding Entity Search [Better Late Than Never]

Thursday, October 10th, 2013

Understanding Entity Search by Paul Bruemmer.

From the post:

Over the past two decades, the Internet, search engines, and Web users have had to deal with unstructured data, which is essentially any data that has not been organized or classified according to any sort of pre-defined data model. Thus, search engines were able to identify patterns within webpages (keywords) but were not really able to attach meaning to those pages.

Semantic Search provides a method for classifying the data by labeling each piece of information as an entity — this is referred to as structured data. Consider retail product data, which contains enormous amounts of unstructured information. Structured data enables retailers and manufacturers to provide extremely granular and accurate product data for search engines (machines/bots) to consume, understand, classify and link together as a string of verified information.

Semantic or entity search will optimize much more than just retail product data. Take a look at’s schema types – these schemas represent the technical language required to create a structured Web of data (entities with unique identifiers) — and this becomes machine-readable. Machine-readable structured data is disambiguated and more reliable; it can be cross-verified when compared with other sources of linked entity data (unique identifiers) on the Web.

Interesting to see unstructured data defined as:

any data that has not been organized or classified according to any sort of pre-defined data model.

I suppose you can say that but is that how any of us write?

We all write with specific entities in minds, entities that represent subjects we could identify with additional properties if required.

So it is more accurate to say that unstructured data can be defined as:

any data that has not been explicitly identified by one or more properties.

Well, that’s the trick isn’t it? We look at an entity and see properties that a machine does not.

Explicit identification is a requirement. But on the other hand, a “unique” identifier is not.

That’s not just a topic map opinion but is in fact in play at the Global Biodiversity Information Facility (GBIF) I posted about yesterday.

GBIF realizes that ongoing identifications are never going to converge on that happy state where every entity has only one unique reference. In part because an on-going system has to account for all existing data as well as new data which could have new identifiers.

There isn’t enough time or resources to find all prior means of identifying an entity and replacing those with an new identifier. Rather than cutting the Gordian knot of multiple identifiers with a URI sword, GBIF understands multiple identifiers for an entity.

Robust entity search capabilities require the capturing of all identifiers for an entity. So no user is disadvantaged by the identification they know for an entity.

The properties of subjects represented by entities and their identifiers serve as the basis for mapping between identifiers.

None of which needs to be exposed to the user. All a user may see is whatever identifier they have for an entity returns the correct entity and information that was recorded using other identifiers (if they look closely).

What else should an interface disclose other than the result desired by the user?

PS: “Better Late Than Never,” refers to Steve Newcomb and Michel Biezunski promotion of the use of properties to identify the subject represented by entities since the 1990’s. The W3C approach is to replace existing identifiers with a URI. How an opaque URI is better than an opaque string isn’t apparent to me.

Webinar: Trubo-Charging Solr

Monday, October 7th, 2013

Turbo-charge your Solr instance with Entity Recognition, Business Rules and a Relevancy Workbench by Yann Yu.

Date: Thursday, October 17, 2013
Time: 10:00am Pacific Time

From the post:

LucidWorks has three new modules available in the Solr Marketplace that run on top of your existing Solr or LucidWorks Search instance. Join us for an overview of each module and learn how implementing one, two or all three will turbo-charge your Solr instance.

  • Business Rules Engine: Out of the box integration with Drools, the popular open-source business rules engine is now available for Solr and LucidWorks Search. With the LucidWorks Business Rules module, developers can write complex rules using declarative syntax with very little programming. Data can be modified, cleaned and enriched through multiple permutations and combinations.
  • Relevancy Workbench: Experiment with different search parameters to understand the impact of these changes to search results. With intuitive, color-code and side-by-side comparisons of results for different sets of parameters, users can quickly tune their application to produce the results they need. The Relevancy Workbench encourages experimentation with a visual “before and after” view of the results of parameter changes.
  • Entity Recognition: Enhance Search applications beyond simple keyword search by adding intelligence through metadata. Help classify common patterns from unstructured data/content into predefined categories. Examples include names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages etc.

All of these modules will be of interest to topic mappers who are processing bulk data.

Elasticsearch Entity Resolution

Thursday, September 12th, 2013

elasticsearch-entity-resolution by Yann Barraud.

From the webpage:

This project is an interactive entity resolution plugin for Elasticsearch based on Duke. Basically, it uses Bayesian probabilities to compute probability. You can pretty much use it an interactive deduplication engine.

It is usable as is, though cleaners are not yet implemented.

To understand basics, go to Duke project documentation.

A list of available comparators is available here.

Intereactive deduplication? Now that sounds very useful for topic map authoring.

Appropriate that I saw this in a Tweet by Duke‘s author, Lars Marius Garshol.

Entity Resolution for Big Data

Tuesday, August 20th, 2013

Entity Resolution for Big Data by Benjamin Bengfort.

From the post:

A Summary of the KDD 2013 Tutorial Taught by Dr. Lise Getoor and Dr. Ashwin Machanavajjhala

Entity Resolution is becoming an important discipline in Computer Science and in Big Data, especially with the recent release of Google’s Knowledge Graph and the open Freebase API. Therefore it is exceptionally timely that last week at KDD 2013, Dr. Lise Getoor of the University of Maryland and Dr. Ashwin Machanavajjhala of Duke University will be giving a tutorial on Entity Resolution for Big Data. We were fortunate enough to be invited to attend a run through workshop at the Center for Scientific Computation and Mathematical Modeling at College Park, and wanted to highlight some of the key points for those unable to attend.

A summary that makes you regret not seeing the tutorial!

Named Entity Recognition (NER) in Solr

Friday, August 2nd, 2013

Named Entity Recognition (NER) in Solr

From the post:

Named Entity Recognition, or NER for short, is a powerful paradigm which causes entities to be recognized within text. Typically these objects can be places, organizations or people. For example, given the phrase “Jon works at Searchbox”, a good NER would return that Jon is a person and Searchbox is an organization. Why is this powerful, especially in Solr? Using this information we can not only propose better suggestions for users searching for things, but using Solr faceting capability we’ll have the ability to facet directly on organizations (or people) without having to manually identify them in all of the documents.

In this blog post, extending from our two previous slideshares on how to develop search components and request handlers, we’ll teach you how to directly embed Stanford’s NER library into a production ready plugin which provides all of the mentioned benefits. We of course provide the full source code packaging here.

Very nice walk through on entity recognition with Solr.

Thought occurs to me that every instance of an entity that is recognized, could be presented to a user as occurrences of that entity. Plugging that search result into a topic that represents the subject.

So there is some static aspect to the topic map, the topic for that subject and a dynamic aspect, being the search results presented as occurrences.

You could enter information or relationships you discover in the occurrences on the static side of the map. Let software manage metadata from the document containing the occurrence.

Entity recognition with Scala and…

Wednesday, June 5th, 2013

Entity recognition with Scala and Stanford NLP Named Entity Recognizer by Gary Sieling.

From the post:

The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it’s fairly good at finding nouns, but not always at identifying the type of each noun.

In this example, the entities I’d like to see are different – companies, law firms, lawyers, etc, but this test is good enough. The default examples provided let you choose different sets of things that can be recognized: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. The process of extracting PDF data and processing it takes about five seconds.

For this text, selecting different options sometimes led to the classifier picking different options for a noun – one time it’s a person, another time it’s an organization, etc. One improvement might be to run several classifiers and to allow them to vote. This classifier also loses words sometimes – if a subject is listed with a first, middle, and last name, it sometimes picks just two words. I’ve noticed similar issues with company names.


The voting on entity recognition made me curious about interactive entity resolution where a user has a voice.

See the next post.

Named Entity Tutorial

Tuesday, May 21st, 2013

Named Entity Tutorial (LingPipe)

While looking for something else I ran across this named entity tutorial at LingPipe.

Other named entity tutorials that I should collect?

Preliminary evaluation of the CellFinder literature…

Friday, April 19th, 2013

Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts by Mariana Neves, Alexander Damaschun, Nancy Mah, Fritz Lekschas, Stefanie Seltmann, Harald Stachelscheid, Jean-Fred Fontaine, Andreas Kurtz, and Ulf Leser. (Database (2013) 2013 : bat020 doi: 10.1093/database/bat020)


Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it into specialized databases for structured delivery to users. It is a slow, error-prone, complex, costly and, yet, highly important task. Previous experiences have proven that text mining can assist in its many phases, especially, in triage of relevant documents and extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cell or anatomical part have been extracted. Validation of half of this data resulted in a precision of ∼50% of the extracted data, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve a better performance for event extraction.

Database URL:

Another extremely useful data curation project.

Do you get the impression that curation projects will continue to be outrun by data production?

And that will be the case, even with machine assistance?

Is there an alternative to falling further and further behind?

Such as abandoning some content (CNN?) to simply forever go uncurated? Or the same to be true for government documents/reports?

I am sure we all have different suggestions for what data to dump alongside the road to make room for the “important” stuff.

Suggestions on solutions other than simply dumping data?

Disambiguating Hilarys

Monday, April 15th, 2013

Hilary Mason (live, data scientist) writes about Google confusing her with Hilary Mason (deceased, actress) in Et tu, Google?

To be fair, Hilary Mason (live, data scientist), notes Bing has made the same mistake in the past.

Hilary Mason (live, data scientist) goes on to say:

I know that entity disambiguation is a hard problem. I’ve worked on it, though never with the kind of resources that I imagine Google can bring to it. And yet, this is absurd!

Is entity disambiguation a hard problem?

Or is entity disambiguation a hard problem after the act of authorship?

Authors (in general) know what entities they meant.

The hard part is inferring what entity they meant when they forgot to disambiguate between possible entities.

Rather than focusing on mining low grade ore (content where entities are not disambiguated), wouldn’t a better solution be authoring with automatic entity disambiguation?

We have auto-correction in word processing software now, why not auto-entity software that tags entities in content?

Presenting the author of content with disambiguated entities for them to accept, reject or change.

Won’t solve the problem of prior content with undistinguished entities but can keep the problem from worsening.

Learning from Big Data: 40 Million Entities in Context

Saturday, March 9th, 2013

Learning from Big Data: 40 Million Entities in Context by Dave Orr, Amar Subramanya, and Fernando Pereira, Google Research,

A fuller explanation of the Wikilinks Corpus from Google:

When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? This problem is called disambiguation (a word that is itself ambiguous), and while it’s necessary for communication, and humans are amazingly good at it (when was the last time you confused a fruit with a giant tech company?), computers need help.

To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages — over 100 times bigger than the next largest corpus (about 100,000 documents, see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we’ve discussed before), then the anchor text can be thought of as a mention of the corresponding entity.

Suggestions for using the data? The authors have those as well:

What might you do with this data? Well, we’ve already written one ACL paper on cross-document co-reference (and received lots of requests for the underlying data, which partly motivates this release). And really, we look forward to seeing what you are going to do with it! But here are a few ideas:

  • Look into coreference — when different mentions mention the same entity — or entity resolution — matching a mention to the underlying entity
  • Work on the bigger problem of cross-document coreference, which is how to find out if different web pages are talking about the same person or other entity
  • Learn things about entities by aggregating information across all the documents they’re mentioned in
  • Type tagging tries to assign types (they could be broad, like person, location, or specific, like amusement park ride) to entities. To the extent that the Wikipedia pages contain the type information you’re interested in, it would be easy to construct a training set that annotates the Wikilinks entities with types from Wikipedia.
  • Work on any of the above, or more, on subsets of the data. With existing datasets, it wasn’t possible to work on just musicians or chefs or train stations, because the sample sizes would be too small. But with 10 million Web pages, you can find a decent sampling of almost anything.

Those all sound like topic map tasks to me, especially if you capture your coreference results for merging with other coreference results.

…Wikilinks Corpus With 40M Mentions And 3M Entities

Saturday, March 9th, 2013

Google Research Releases Wikilinks Corpus With 40M Mentions And 3M Entities by Frederic Lardinois.

From the post:

Google Research just launched its Wikilinks corpus, a massive new data set for developers and researchers that could make it easier to add smart disambiguation and cross-referencing to their applications. The data could, for example, make it easier to find out if two web sites are talking about the same person or concept, Google says. In total, the corpus features 40 million disambiguated mentions found within 10 million web pages. This, Google notes, makes it “over 100 times bigger than the next largest corpus,” which features fewer than 100,000 mentions.

For Google, of course, disambiguation is something that is a core feature of the Knowledge Graph project, which allows you to tell Google whether you are looking for links related to the planet, car or chemical element when you search for ‘mercury,’ for example. It takes a large corpus like this one and the ability to understand what each web page is really about to make this happen.

Details follow on how to create this data set.

Very cool!

The only caution is that your entities, those specific to your enterprise, are unlikely to appear, even in 40M mentions.

But the Wikilinks Corpus + your entities, now that is something with immediate ROI for your enterprise.

Duke 1.0 Release!

Monday, March 4th, 2013

Duke 1.0 Release!

From the project page:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.0 (see ReleaseNotes).


  • High performance.
  • Highly configurable.
  • Support for CSV, JDBC, SPARQL, and NTriples DataSources.
  • Many built-in comparators.
  • Plug in your own data sources, comparators, and cleaners.
  • Command-line client for getting started.
  • API for embedding into any kind of application.
  • Support for batch processing and continuous processing.
  • Can maintain database of links found via JNDI/JDBC.
  • Can run in multiple threads.

The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This early presentation describes the ideas behind the engine and the intended architecture; a later and more up to date presentation has more practical detail and examples. There's also the ExamplesOfUse page, which lists real examples of using Duke, complete with data and configurations.

Excellent news on the data depulication front!

And for topic map authors as well (see the examples).

Kudos to Lars Marius Garshol!