Archive for the ‘Entities’ Category

“invisible entities having arcane but gravely important significances”

Sunday, June 19th, 2016

Allison Parrish tweeted: the “Other, Format” unicode category, full of invisible entities having arcane but gravely important significances

I just could not let a tweet with:

“invisible entities having arcane but gravely important significances”

pass without comment!

As of today, there are one hundred and fifty (150) such entities, all with multiple properties.

How many of these “invisible entities” are familiar to you?
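If you want to check the count yourself, here is a minimal sketch using Python's standard `unicodedata` module, which exposes the general category of each code point (`Cf` is "Other, Format"). The function name `format_characters` is mine, and the number you get depends on the Unicode version bundled with your Python, so it may not be 150.

```python
import sys
import unicodedata


def format_characters():
    """Return every code point whose general category is Cf ("Other, Format")."""
    return [
        cp for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Cf"
    ]


cf = format_characters()
print(f"Unicode {unicodedata.unidata_version}: {len(cf)} Cf characters")
for cp in cf[:5]:
    # Fall back to a placeholder in case a code point has no recorded name.
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

The printed total reflects whatever Unicode data your interpreter ships with, which is why the figure drifts over time.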

Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1)

Wednesday, February 4th, 2015

Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1)

From the webpage:

Researchers at Google annotated the English-language pages from the TREC KBA Stream Corpus 2014 with links to Freebase. The annotation was performed automatically and are imperfect. For each entity recognized with high confidence an annotation with a link to Freebase is provided (see the details below).

For any questions, join this discussion forum:

Data Description

The entity annotations are for the TREC KBA Stream Corpus 2014. These annotations are freely available. The annotation data for the corpus is provided as a collection of 2000 files (the partitioning is somewhat arbitrary) that total 196 GB, compressed (gz). Each file contains annotations for a batch of pages and the entities identified on the page. These annotations are freely available.

I first saw this in a tweet by Jeff Dalton.

Jeff has a blog post about this release at: Google Research Entity Annotations of the KBA Stream Corpus (FAKBA1). Jeff speculates on the application of this corpus to other TREC tasks.

Jeff suggests that you monitor Knowledge Data Releases for future data releases. I need to ping Jeff as the FAKBA1 release does not appear on the Knowledge Data Release page.

BTW, don’t be misled by the “9.4 billion entity annotations from over 496 million documents” statistic. Impressive but ask yourself, how many of your co-workers, their friends, families, relationships at work, projects where you work, etc. appear in Freebase? Sounds like there is a lot of work to be done with your documents and data that have little or nothing to do with Freebase. Yes?


Named Entity Recognition: A Literature Survey

Friday, September 19th, 2014

Named Entity Recognition: A Literature Survey by Rahul Sharnagat.


In this report, we explore various methods that are applied to solve NER. In section 1, we introduce the named entity problem. In section 2, various named entity recognition methods are discussed in three broad categories of machine learning paradigm and explore few learning techniques in them. In the first part, we discuss various supervised techniques. Subsequently we move to semi-supervised and unsupervised techniques. In the end we discuss about the method from deep learning to solve NER.

If you are new to the named entity recognition issue or want to pass on an introduction, this may be the paper for you. It covers all the high points with a three-page bibliography to get you started in the literature.

I first saw this in a tweet by Christopher.

Demystifying The Google Knowledge Graph

Monday, September 8th, 2014

Demystifying The Google Knowledge Graph by Barbara Starr.

Barbara covers:

  • Explicit vs. Implicit Entities (and how to determine which is which on your webpages)
  • How to improve your chances of being in “the Knowledge Graph” using schema.org and JSON-LD.
  • Thinking about “things, not strings.”

Is there something special about “events?” I remember one of the early Semantic Web motivations being the setting up of tennis matches between colleagues. The examples here are of sporting and music events.

If your users don’t know how to use TicketMaster, repeating delivery of that data on your site isn’t going to help them.

On the other hand, this is a good reminder to extract from schema.org all the “types” that would be useful for my blog.

PS: A “string” doesn’t become a “thing” simply because it has a longer token. Having an agreed-upon “longer token” from a vocabulary such as schema.org does provide more precise identification than an unadorned “string.”

Having said that, the power of having several key/value pairs and a declaration of which ones must, may or must not match, should be readily obvious. Particularly when those keys and values may themselves be collections of key/value pairs.
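A minimal sketch of that idea, under my own assumptions: a pattern of key/value pairs is compared to a candidate, with some keys declared "must match," some "must not match," and every other key treated as "may match." The function name `matches` and the sample records are hypothetical. Because Python compares dictionaries by content, a value that is itself a collection of key/value pairs needs no special handling.

```python
def matches(pattern, candidate, must=(), must_not=()):
    """Test a candidate description against a pattern of key/value pairs.

    must:     keys the candidate has to carry with the pattern's value
    must_not: keys the candidate may not carry with the pattern's value
    Keys in neither list are "may" keys and never block a match.
    Every key named in must or must_not is expected to be in the pattern.
    """
    for key in must:
        if key not in candidate or candidate[key] != pattern[key]:
            return False
    for key in must_not:
        if key in candidate and candidate[key] == pattern[key]:
            return False
    return True


# Hypothetical records; the "orbits" value is itself a set of key/value pairs.
planet = {"label": "Mercury", "type": "planet", "orbits": {"body": "Sun"}}
element = {"label": "Mercury", "type": "element", "symbol": "Hg"}

wanted = {"label": "Mercury", "orbits": {"body": "Sun"}}
print(matches(wanted, planet, must=("label", "orbits")))
print(matches(wanted, element, must=("label", "orbits")))
```

The same label alone would match both records; it is the declared rule over several properties that separates them.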

Is That An “Entity” On Your Webpage?

Sunday, March 30th, 2014

How To Tell Search Engines What “Entities” Are On Your Web Pages by Barbara Starr.

From the post:

Search engines have increasingly been incorporating elements of semantic search to improve some aspect of the search experience — for example, using markup to create enhanced displays in SERPs (as in Google’s rich snippets).

Elements of semantic search are now present at almost all stages of the search process, and the Semantic Web has played a key role. Read on for more detail and to learn how to take advantage of this opportunity to make your web pages more visible in this evolution of search.

The identifications are fairly coarse; that is, you get a pointer (URL) that identifies a subject but no idea why someone picked that URL.

But, we all know how well coarse pointers, document level pointers, have worked for the WWW.

Kinda surprising because we have had sub-document indexing for centuries.

Odd how simply pointing to a text blob suddenly became acceptable.

Think of the efforts by Google and schema.org as an attempt to recover indexing as it existed in the centuries before the advent of the WWW.

A New Entity Salience Task with Millions of Training Examples

Monday, March 10th, 2014

A New Entity Salience Task with Millions of Training Examples by Dan Gillick and Jesse Dunietz.


Although many NLP systems are moving toward entity-based processing, most still identify important phrases using classical keyword-based approaches. To bridge this gap, we introduce the task of entity salience: assigning a relevance score to each entity in a document. We demonstrate how a labeled corpus for the task can be automatically generated from a corpus of documents and accompanying abstracts. We then show how a classifier with features derived from a standard NLP pipeline outperforms a strong baseline by 34%. Finally, we outline initial experiments on further improving accuracy by leveraging background knowledge about the relationships between entities.
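To make the labeling idea concrete, here is a rough sketch under a simplifying assumption of mine: an entity counts as salient if it is mentioned in both the document and its abstract. The abstract quoted above only says labels are generated from documents and accompanying abstracts, so the exact rule, the function name `salience_labels` and the sample entity names are all illustrative, not the authors' pipeline.

```python
def salience_labels(document_entities, abstract_entities):
    """Label each distinct document entity: True if the abstract mentions it too."""
    in_abstract = {entity.lower() for entity in abstract_entities}
    labels = {}
    for entity in document_entities:
        # setdefault keeps one label per distinct entity, in first-seen order.
        labels.setdefault(entity, entity.lower() in in_abstract)
    return labels


# Hypothetical entity mentions, already extracted and resolved.
doc = ["Acme Corp", "Jane Doe", "Springfield", "Jane Doe"]
abstract = ["acme corp", "Jane Doe"]
print(salience_labels(doc, abstract))
```

Labels produced this way could then serve as training targets for a classifier, which is the part the paper actually evaluates.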

The article concludes:

We believe entity salience is an important task with many applications. To facilitate further research, our automatically generated salience annotations, along with resolved entity ids, for the subset of the NYT corpus discussed in this paper are available here:

A classic approach to a CS article: new approach/idea, data + experiments, plus results and code. It doesn’t get any better.

The results won’t be perfect, but the question is: Are they “acceptable results?”

Which presumes a working definition of “acceptable” that you have hammered out with your client.

I first saw this in a tweet by Stefano Bertolo.

Reverse Entity Recognition? (Scrubbing)

Tuesday, December 10th, 2013

Improving privacy with language technologies by Rob Munro.

From the post:

One downside of the kind of technologies that we build at Idibon is that they can be used to compromise people’s privacy and, by extension, their safety. Any technology can be used for positive and negative purposes and as engineers we have a responsibility to ensure that what we create is for a better world.

For language technologies, the most negative application, by far, is eavesdropping: discovering information about people by monitoring their online communications and using that information in ways that harm the individuals. This can be something as direct and targeted as exposing the identities of at-risk individuals in a war-zone or it can be the broad expansion of government surveillance. The engineers at many technology companies announced their opposition to the latter with a loud, unified call today to reform government surveillance.

One way that privacy can be compromised at scale is the use of technology known as “named entity recognition”, which identifies the names of people, places, organizations, and other types of real-world entities in text. Given millions of sentences of text, named entity recognition can extract the names and addresses of everybody in the data in just a few seconds. But the same technology that can be used to uncover personally identifying information (PII) can also be used to remove the personally identifying information from the text. This is known as anonymizing or simply “scrubbing”.

Rob agrees that entity recognition can invade your personal privacy, but points out it can also protect your privacy.

You may think your “handle” on one or more networks provides privacy but it would not take much data to disappoint most people.

Entity recognition software can scrub data to remove the “tells” that may identify you.

How much scrubbing is necessary depends on the data and the consequences of discovery.

Entity recognition is usually thought of as recognizing names and places, but it could just as easily be content analysis to recognize a particular author.

That would require more sophisticated “scrubbing” than entity recognition can support.
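As a toy illustration of scrubbing (not of Idibon's technology), the sketch below stands in for a real entity recognizer with two crude assumptions: e-mail addresses follow a simple pattern, and personal names come from a list you supply. The names `EMAIL`, `scrub` and the placeholder tags are mine. It does nothing about writing style, which is the harder problem noted above.

```python
import re

# A deliberately simple e-mail pattern; real addresses can be messier.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub(text, known_names):
    """Replace e-mail addresses and listed names with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    # Longest names first, so "Ann Lee" is replaced before "Ann".
    for name in sorted(known_names, key=len, reverse=True):
        pattern = rf"\b{re.escape(name)}\b"
        text = re.sub(pattern, "[NAME]", text, flags=re.IGNORECASE)
    return text


print(scrub("Contact Ann Lee at ann.lee@example.org today.", ["Ann", "Ann Lee"]))
```

A trained recognizer would find names that are not on any list, but the replace-with-placeholder step would look much the same.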

CoIN: a network analysis for document triage

Wednesday, November 13th, 2013

CoIN: a network analysis for document triage by Yi-Yu Hsu and Hung-Yu Kao. (Database (2013) 2013 : bat076 doi: 10.1093/database/bat076)


In recent years, there was a rapid increase in the number of medical articles. The number of articles in PubMed has increased exponentially. Thus, the workload for biocurators has also increased exponentially. Under these circumstances, a system that can automatically determine in advance which article has a higher priority for curation can effectively reduce the workload of biocurators. Determining how to effectively find the articles required by biocurators has become an important task. In the triage task of BioCreative 2012, we proposed the Co-occurrence Interaction Nexus (CoIN) for learning and exploring relations in articles. We constructed a co-occurrence analysis system, which is applicable to PubMed articles and suitable for gene, chemical and disease queries. CoIN uses co-occurrence features and their network centralities to assess the influence of curatable articles from the Comparative Toxicogenomics Database. The experimental results show that our network-based approach combined with co-occurrence features can effectively classify curatable and non-curatable articles. CoIN also allows biocurators to survey the ranking lists for specific queries without reviewing meaningless information. At BioCreative 2012, CoIN achieved a 0.778 mean average precision in the triage task, thus finishing in second place out of all participants.

Database URL:

From the introduction:

Network analysis concerns the relationships between processing entities. For example, the nodes in a social network are people, and the links are the friendships between the nodes. If we apply these concepts to the ACT, PubMed articles are the nodes, while the co-occurrences of gene–disease, gene–chemical and chemical–disease relationships are the links. Network analysis provides a visual map and a graph-based technique for determining co-occurrence relationships. These graphical properties, such as size, degree, centralities and similar features, are important. By examining the graphical properties, we can gain a global understanding of the likely behavior of the network. For this purpose, this work focuses on two themes concerning the applications of biocuration: using the co-occurrence–based approach to obtain a normalized co-occurrence score and using the network-based approach to measure network properties, e.g. betweenness and PageRank. CoIN integrates co-occurrence features and network centralities when curating articles. The proposed method combines the co-occurrence frequency with the network construction from text. The co-occurrence networks are further analyzed to obtain the linking and shortest path features of the network centralities.

The authors ultimately conclude that the network-based approaches perform better than collocation-based approaches.

If this post sounds hauntingly familiar, you may be thinking about Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information, which was the first place finisher at BioCreative 2012 with a mean average precision (MAP) score of 0.8030.
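The network described in the introduction quoted above can be sketched in a few lines. This is my own simplified reading, not the CoIN implementation: articles are nodes, two articles are linked when they share a co-occurring entity pair, and PageRank is computed by plain power iteration. The article ids, entity names and both function names are hypothetical, and the normalized co-occurrence score from the paper is not reproduced.

```python
from collections import defaultdict
from itertools import combinations


def build_article_graph(article_pairs):
    """Link two articles when they share at least one co-occurring entity pair.

    article_pairs maps an article id to a set of (entity, entity) pairs.
    """
    by_pair = defaultdict(set)
    for article, pairs in article_pairs.items():
        for pair in pairs:
            by_pair[pair].add(article)
    graph = {article: set() for article in article_pairs}
    for articles in by_pair.values():
        for a, b in combinations(sorted(articles), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a non-empty undirected graph of adjacency sets."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # Rank held by isolated nodes is spread evenly over all nodes.
        dangling = sum(rank[v] for v in graph if not graph[v])
        new_rank = {}
        for node in graph:
            incoming = sum(rank[v] / len(graph[v]) for v in graph[node])
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


# Hypothetical articles and the entity pairs found in each.
articles = {
    "A": {("gene1", "disease1"), ("gene2", "chemical1")},
    "B": {("gene1", "disease1")},
    "C": {("gene2", "chemical1")},
    "D": {("gene9", "disease9")},
}
graph = build_article_graph(articles)
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Article A shares a pair with both B and C, so it ends up the best-connected node, while D shares nothing and stays isolated.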

Understanding Entity Search [Better Late Than Never]

Thursday, October 10th, 2013

Understanding Entity Search by Paul Bruemmer.

From the post:

Over the past two decades, the Internet, search engines, and Web users have had to deal with unstructured data, which is essentially any data that has not been organized or classified according to any sort of pre-defined data model. Thus, search engines were able to identify patterns within webpages (keywords) but were not really able to attach meaning to those pages.

Semantic Search provides a method for classifying the data by labeling each piece of information as an entity — this is referred to as structured data. Consider retail product data, which contains enormous amounts of unstructured information. Structured data enables retailers and manufacturers to provide extremely granular and accurate product data for search engines (machines/bots) to consume, understand, classify and link together as a string of verified information.

Semantic or entity search will optimize much more than just retail product data. Take a look at schema.org’s schema types – these schemas represent the technical language required to create a structured Web of data (entities with unique identifiers) — and this becomes machine-readable. Machine-readable structured data is disambiguated and more reliable; it can be cross-verified when compared with other sources of linked entity data (unique identifiers) on the Web.

Interesting to see unstructured data defined as:

any data that has not been organized or classified according to any sort of pre-defined data model.

I suppose you can say that but is that how any of us write?

We all write with specific entities in mind, entities that represent subjects we could identify with additional properties if required.

So it is more accurate to say that unstructured data can be defined as:

any data that has not been explicitly identified by one or more properties.

Well, that’s the trick isn’t it? We look at an entity and see properties that a machine does not.

Explicit identification is a requirement. But on the other hand, a “unique” identifier is not.

That’s not just a topic map opinion but is in fact in play at the Global Biodiversity Information Facility (GBIF) I posted about yesterday.

GBIF realizes that ongoing identifications are never going to converge on that happy state where every entity has only one unique reference. In part because an on-going system has to account for all existing data as well as new data which could have new identifiers.

There isn’t enough time or resources to find all prior means of identifying an entity and replacing those with a new identifier. Rather than cutting the Gordian knot of multiple identifiers with a URI sword, GBIF understands multiple identifiers for an entity.

Robust entity search capabilities require the capturing of all identifiers for an entity. So no user is disadvantaged by the identification they know for an entity.

The properties of subjects represented by entities and their identifiers serve as the basis for mapping between identifiers.

None of which needs to be exposed to the user. All a user may see is that whatever identifier they have for an entity returns the correct entity, along with information that was recorded using other identifiers (if they look closely).

What else should an interface disclose other than the result desired by the user?
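A minimal sketch of search over multiple identifiers, assuming a simple in-memory index (this is not how GBIF is implemented). The class name `EntityIndex`, its methods and the sample properties are hypothetical. Any identifier leads to the same record, and records that turn out to share an identifier are merged.

```python
class EntityIndex:
    """Look up one entity record through any of its known identifiers."""

    def __init__(self):
        self._records = {}        # internal key -> record
        self._by_identifier = {}  # any identifier -> internal key
        self._next_key = 0

    def add(self, identifiers, properties):
        """Register a record; merge with existing records sharing an identifier."""
        keys = {self._by_identifier[i] for i in identifiers
                if i in self._by_identifier}
        record = {"identifiers": set(identifiers), "properties": dict(properties)}
        for key in keys:
            old = self._records.pop(key)
            record["identifiers"] |= old["identifiers"]
            # Earlier properties are kept unless the newer data overrides them.
            record["properties"] = {**old["properties"], **record["properties"]}
        new_key = self._next_key
        self._next_key += 1
        self._records[new_key] = record
        for identifier in record["identifiers"]:
            self._by_identifier[identifier] = new_key

    def lookup(self, identifier):
        """Return the record for an identifier, or None if it is unknown."""
        key = self._by_identifier.get(identifier)
        return None if key is None else self._records[key]


index = EntityIndex()
index.add(["Puma concolor", "cougar"], {"class": "Mammalia"})
index.add(["Felis concolor", "cougar"], {"note": "older scientific name"})
print(index.lookup("Felis concolor")["properties"])
```

A user who only knows the older name still gets the information recorded under the newer one, and never has to see the mapping.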

PS: “Better Late Than Never” refers to Steve Newcomb and Michel Biezunski’s promotion of the use of properties to identify the subject represented by entities since the 1990s. The W3C approach is to replace existing identifiers with a URI. How an opaque URI is better than an opaque string isn’t apparent to me.

Under the Hood: The entities graph

Sunday, June 9th, 2013

Under the Hood: The entities graph (Eric Sun is a tech lead on the entities team, and Venky Iyer is an engineering manager on the entities team.)

From the post:

Facebook’s social graph now comprises over 1 billion monthly active users, 600 million of whom log in every day. What unites each of these people is their social connections, and one way we map them is by traversing the graph of their friendships.

But this is only a small portion of the connections on Facebook. People don’t just have connections to other people—they may use Facebook to check in to restaurants and other points of interest, they might show their favorite books and movies on their timeline, and they may also list their high school, college, and workplace. These 100+ billion connections form the entity graph.

There are even connections between entities themselves: a book has an author, a song has an artist, and movies have actors. All of these are represented by different kinds of edges in the graph, and the entities engineering team at Facebook is charged with building, cleaning, and understanding this graph.

Instructive read on building an entity graph.

Differs from NSA data churning in several important ways:

  1. The participants want their data to be found with like data. Participants generally have no motive to lie or hide.
  2. The participants seek out similar users and data.
  3. The participants correct bad data for the benefit of others.

None of those characteristics can be attributed to the victims of NSA data collection efforts.

Disambiguating Hilarys

Monday, April 15th, 2013

Hilary Mason (live, data scientist) writes about Google confusing her with Hilary Mason (deceased, actress) in Et tu, Google?

To be fair, Hilary Mason (live, data scientist) notes that Bing has made the same mistake in the past.

Hilary Mason (live, data scientist) goes on to say:

I know that entity disambiguation is a hard problem. I’ve worked on it, though never with the kind of resources that I imagine Google can bring to it. And yet, this is absurd!

Is entity disambiguation a hard problem?

Or is entity disambiguation a hard problem after the act of authorship?

Authors (in general) know what entities they meant.

The hard part is inferring what entity they meant when they forgot to disambiguate between possible entities.

Rather than focusing on mining low grade ore (content where entities are not disambiguated), wouldn’t a better solution be authoring with automatic entity disambiguation?

We have auto-correction in word processing software now, why not auto-entity software that tags entities in content?

Presenting the author of content with disambiguated entities for them to accept, reject or change.

Won’t solve the problem of prior content with undistinguished entities but can keep the problem from worsening.
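A rough sketch of what such "auto-entity" authoring support might look like, under heavy assumptions: candidates come from a small hand-built table, matching is exact, and the accepted choice is recorded with a made-up `<entity ref="...">` tag. The function names and the tag format are hypothetical.

```python
import re

# Hypothetical candidate table: surface form -> possible entities.
CANDIDATES = {
    "Hilary Mason": [
        "Hilary Mason (data scientist)",
        "Hilary Mason (actress)",
    ],
}


def suggestions(text, candidates):
    """List (start, end, surface form, options) for each known name in the text."""
    found = []
    for name, options in candidates.items():
        for match in re.finditer(rf"\b{re.escape(name)}\b", text):
            found.append((match.start(), match.end(), name, options))
    return sorted(found)


def apply_choice(text, suggestion, choice):
    """Wrap the mention in a tag naming the entity the author accepted."""
    start, end, name, options = suggestion
    if choice not in options:
        raise ValueError(f"{choice!r} is not a candidate for {name!r}")
    return f'{text[:start]}<entity ref="{choice}">{name}</entity>{text[end:]}'


text = "Hilary Mason writes about Google."
first = suggestions(text, CANDIDATES)[0]
print(apply_choice(text, first, "Hilary Mason (data scientist)"))
```

The point is the workflow: the software proposes, and the author, who knows which entity was meant, accepts, rejects or changes the proposal at the time of writing.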

From Records to a Web of Library Data – Pt1 Entification

Saturday, March 16th, 2013

From Records to a Web of Library Data – Pt1 Entification by Richard Wallis.

From the post:


Entification – a bit of an ugly word, but in my day to day existence one I am hearing more and more. What an exciting life I lead…

What is it, and why should I care, you may be asking.

I spend much of my time convincing people of the benefits of Linked Data to the library domain, both as a way to publish and share our rich resources with the wider world, and also as a potential stimulator of significant efficiencies in the creation and management of information about those resources. Taking those benefits as being accepted, for the purposes of this post, brings me into discussion with those concerned with the process of getting library data into a linked data form.

As you know, I am far from convinced about the “benefits” of Linked Data, at least with its current definition.

Who knows what definition “Linked Data” may have in some future vision of the W3C? (URL Homonym Problem: A Topic Map Solution, a tale of how the W3C decided to redefine URL.)

But Richard’s point about the ugliness and utility of “entification” is well taken.

So long as you remember that every term can be described “in terms of other things.”

There are no primitive terms, not one.

Learning from Big Data: 40 Million Entities in Context

Saturday, March 9th, 2013

Learning from Big Data: 40 Million Entities in Context by Dave Orr, Amar Subramanya, and Fernando Pereira, Google Research.

A fuller explanation of the Wikilinks Corpus from Google:

When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? This problem is called disambiguation (a word that is itself ambiguous), and while it’s necessary for communication, and humans are amazingly good at it (when was the last time you confused a fruit with a giant tech company?), computers need help.

To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages — over 100 times bigger than the next largest corpus (about 100,000 documents, see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we’ve discussed before), then the anchor text can be thought of as a mention of the corresponding entity.
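The mention-finding step described above can be approximated with the Python standard library. This is a sketch, not Google's pipeline: the 0.8 similarity threshold, the use of `difflib` for "closely matches," and the names `WikipediaLinks` and `mentions` are all my assumptions.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class WikipediaLinks(HTMLParser):
    """Collect (anchor text, Wikipedia page title) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._title = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        url = urlparse(dict(attrs).get("href") or "")
        host = url.netloc.lower()
        on_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
        if on_wikipedia and url.path.startswith("/wiki/"):
            self._title = unquote(url.path[len("/wiki/"):]).replace("_", " ")
            self._text = []

    def handle_data(self, data):
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._title is not None:
            self.links.append(("".join(self._text).strip(), self._title))
            self._title = None


def mentions(html, threshold=0.8):
    """Keep links whose anchor text is close to the target page title."""
    parser = WikipediaLinks()
    parser.feed(html)
    parser.close()
    return [
        (text, title) for text, title in parser.links
        if SequenceMatcher(None, text.lower(), title.lower()).ratio() >= threshold
    ]


page = (
    '<p><a href="https://en.wikipedia.org/wiki/Freddie_Mercury">Freddie Mercury</a>, '
    '<a href="https://en.wikipedia.org/wiki/Mercury_(element)">click here</a>, '
    '<a href="https://example.com/">Mercury</a></p>'
)
print(mentions(page))
```

Only the first link survives: the second has anchor text unrelated to its target title, and the third does not point at Wikipedia at all.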

Suggestions for using the data? The authors have those as well:

What might you do with this data? Well, we’ve already written one ACL paper on cross-document co-reference (and received lots of requests for the underlying data, which partly motivates this release). And really, we look forward to seeing what you are going to do with it! But here are a few ideas:

  • Look into coreference — when different mentions mention the same entity — or entity resolution — matching a mention to the underlying entity
  • Work on the bigger problem of cross-document coreference, which is how to find out if different web pages are talking about the same person or other entity
  • Learn things about entities by aggregating information across all the documents they’re mentioned in
  • Type tagging tries to assign types (they could be broad, like person, location, or specific, like amusement park ride) to entities. To the extent that the Wikipedia pages contain the type information you’re interested in, it would be easy to construct a training set that annotates the Wikilinks entities with types from Wikipedia.
  • Work on any of the above, or more, on subsets of the data. With existing datasets, it wasn’t possible to work on just musicians or chefs or train stations, because the sample sizes would be too small. But with 10 million Web pages, you can find a decent sampling of almost anything.

Those all sound like topic map tasks to me, especially if you capture your coreference results for merging with other coreference results.

…Wikilinks Corpus With 40M Mentions And 3M Entities

Saturday, March 9th, 2013

Google Research Releases Wikilinks Corpus With 40M Mentions And 3M Entities by Frederic Lardinois.

From the post:

Google Research just launched its Wikilinks corpus, a massive new data set for developers and researchers that could make it easier to add smart disambiguation and cross-referencing to their applications. The data could, for example, make it easier to find out if two web sites are talking about the same person or concept, Google says. In total, the corpus features 40 million disambiguated mentions found within 10 million web pages. This, Google notes, makes it “over 100 times bigger than the next largest corpus,” which features fewer than 100,000 mentions.

For Google, of course, disambiguation is something that is a core feature of the Knowledge Graph project, which allows you to tell Google whether you are looking for links related to the planet, car or chemical element when you search for ‘mercury,’ for example. It takes a large corpus like this one and the ability to understand what each web page is really about to make this happen.

Details follow on how to create this data set.

Very cool!

The only caution is that your entities, those specific to your enterprise, are unlikely to appear, even in 40M mentions.

But the Wikilinks Corpus + your entities, now that is something with immediate ROI for your enterprise.

Marketplace in Query Libraries? Marketplace in Identified Entities?

Thursday, February 7th, 2013

Using SPARQL Query Libraries to Generate Simple Linked Data API Wrappers by Tony Hirst.

From the post:

A handful of open Linked Data have appeared through my feeds in the last couple of days, including (via RBloggers) SPARQL with R in less than 5 minutes, which shows how to query US Linked Data and then Leigh Dodds’ Brief Review of the Land Registry Linked Data.

I was going to post a couple of examples merging those two posts – showing how to access Land Registry data via Leigh’s example queries in R, then plotting some of the results using ggplot2, but another post of Leigh’s today – SPARQL-doc – a simple convention for documenting individual SPARQL queries, has sparked another thought…

For some time I’ve been intrigued by the idea of a marketplace in queries over public datasets, as well as the public sharing of generally useful queries. A good query is like a good gold pan, or a good interview question – it can get a dataset to reveal something valuable that may otherwise have laid hidden. Coming up with a good query in part requires having a good understanding of the structure of a dataset, in part having an eye for what sorts of secret the data may contain: the next step is crafting a well phrased query that can tease that secret out. Creating the query might take some time, some effort, and some degree of expertise in query optimisation to make it actually runnable in reasonable time (which is why I figure there may be a market for such things*) but once written, the query is there. And if it can be appropriately parameterised, it may generalise.
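As a small illustration of a parameterised query in a shared library, here is a sketch using Python's `string.Template`. It is not Tony's code: the library layout, the query name `items_with_label` and the helper names are hypothetical, and only the `rdfs:label` vocabulary term is standard. Note that the parameter is escaped before substitution rather than pasted in raw.

```python
from string import Template

# A hypothetical library: each entry pairs documentation with a template.
QUERY_LIBRARY = {
    "items_with_label": {
        "doc": "Find subjects whose rdfs:label equals the given label.",
        "template": Template(
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
            "SELECT ?s WHERE { ?s rdfs:label $label } LIMIT $limit"
        ),
    },
}


def sparql_literal(value):
    """Quote a Python string as a SPARQL string literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def build_query(name, label, limit=10):
    """Fill in a library query with an escaped label and an integer limit."""
    entry = QUERY_LIBRARY[name]
    return entry["template"].substitute(
        label=sparql_literal(label), limit=int(limit)
    )


print(build_query("items_with_label", 'He said "hi"', limit=5))
```

Once a query is packaged this way, the expertise that went into writing it can be reused by anyone who can supply the parameters.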

Tony’s marketplace of queries has a great deal of potential.

But I don’t think they need to be limited to SPARQL queries.

By extension his arguments should be true for searches on Google, Bing, etc., as well as vendor specialized search interfaces.

I would take that a step further into libraries for post-processing the results of such queries and presenting users with enhanced presentations and/or content.

And as part of that post-processing, I would add robust identification of entities as an additional feature of such a library/service.

For example, what if you have curated some significant portion of the ACM Digital Library and, when passed what could be an ambiguous reference to a concept, you return to the user the properties that separate their reference into several groups?

Which frees every user from wading through unrelated papers and proceedings, when that reference comes up.

Would that be a service users would pay for?

I suppose that depends on how valuable their time is to them and/or their employers.

On ranking relevant entities in heterogeneous networks…

Tuesday, January 22nd, 2013

On ranking relevant entities in heterogeneous networks using a language-based model by Laure Soulier, Lamjed Ben Jabeur, Lynda Tamine, Wahiba Bahsoun. (Soulier, L., Jabeur, L. B., Tamine, L. and Bahsoun, W. (2013), On ranking relevant entities in heterogeneous networks using a language-based model. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22762)


A new challenge, accessing multiple relevant entities, arises from the availability of linked heterogeneous data. In this article, we address more specifically the problem of accessing relevant entities, such as publications and authors within a bibliographic network, given an information need. We propose a novel algorithm, called BibRank, that estimates a joint relevance of documents and authors within a bibliographic network. This model ranks each type of entity using a score propagation algorithm with respect to the query topic and the structure of the underlying bi-type information entity network. Evidence sources, namely content-based and network-based scores, are both used to estimate the topical similarity between connected entities. For this purpose, authorship relationships are analyzed through a language model-based score on the one hand and on the other hand, non topically related entities of the same type are detected through marginal citations. The article reports the results of experiments using the Bibrank algorithm for an information retrieval task. The CiteSeerX bibliographic data set forms the basis for the topical query automatic generation and evaluation. We show that a statistically significant improvement over closely related ranking models is achieved.

Note the “estimat[ion] of topic similarity between connected entities.”

Very good work but rather than a declaration of similarity (topic maps) we have an estimate of similarity.

Before you protest about the volume of literature/data, recall that some author wrote the documents in question and selected the terms and references found therein.

Rather than guessing what may be similar to what the author wrote, why not devise a method to allow the author to say?

And build upon similarity/sameness declarations across heterogeneous networks of data.

Prioritizing PubMed articles…

Wednesday, November 21st, 2012

Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information by Sun Kim, Won Kim, Chih-Hsuan Wei, Zhiyong Lu and W. John Wilbur.


The Comparative Toxicogenomics Database (CTD) contains manually curated literature that describes chemical–gene interactions, chemical–disease relationships and gene–disease relationships. Finding articles containing this information is the first and an important step to assist manual curation efficiency. However, the complex nature of named entities and their relationships make it challenging to choose relevant articles. In this article, we introduce a machine learning framework for prioritizing CTD-relevant articles based on our prior system for the protein–protein interaction article classification task in BioCreative III. To address new challenges in the CTD task, we explore a new entity identification method for genes, chemicals and diseases. In addition, latent topics are analyzed and used as a feature type to overcome the small size of the training set. Applied to the BioCreative 2012 Triage dataset, our method achieved 0.8030 mean average precision (MAP) in the official runs, resulting in the top MAP system among participants. Integrated with PubTator, a Web interface for annotating biomedical literature, the proposed system also received a positive review from the CTD curation team.

An interesting summary of entity recognition issues in bioinformatics occurs in this article:

The second problem is that chemical and disease mentions should be identified along with gene mentions. Named entity recognition (NER) has been a main research topic for a long time in the biomedical text-mining community. The common strategy for NER is either to apply certain rules based on dictionaries and natural language processing techniques (5–7) or to apply machine learning approaches such as support vector machines (SVMs) and conditional random fields (8–10). However, most NER systems are class specific, i.e. they are designed to find only objects of one particular class or set of classes (11). This is natural because chemical, gene and disease names have specialized terminologies and complex naming conventions. In particular, gene names are difficult to detect because of synonyms, homonyms, abbreviations and ambiguities (12,13). Moreover, there are no specific rules of how to name a gene that are actually followed in practice (14). Chemicals have systematic naming conventions, but finding chemical names from text is still not easy because there are various ways to express chemicals (15,16). For example, they can be mentioned as IUPAC names, brand names, generic names or even molecular formulas. However, disease names in literature are more standardized (17) compared with gene and chemical names. Hence, using terminological resources such as Medical Subject Headings (MeSH) and Unified Medical Language System (UMLS) Metathesaurus help boost the identification performance (17,18). But, a major drawback of identifying disease names from text is that they often use general English terms.

Having a common representative for a group of identifiers for a single entity should simplify the creation of mappings between entities.
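To make that concrete, here is a minimal Python sketch of my own (nothing like it appears in the article): every known identifier for an entity is sent to one representative, so comparing two mentions becomes a comparison of representatives. The synonym groups and the names `SYNONYM_GROUPS`, `representative` and `same_entity` are assumptions for illustration only.

```python
from typing import Optional


def normalize(mention: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(mention.lower().split())


# Illustrative synonym groups (an assumption, not data from CTD).
# The first entry of each group acts as the representative.
SYNONYM_GROUPS = [
    ["TP53", "p53", "tumor protein p53"],
    ["aspirin", "acetylsalicylic acid", "2-acetoxybenzoic acid"],
]

# Build identifier -> representative once, so each lookup is one dict access.
REPRESENTATIVE = {
    normalize(name): group[0]
    for group in SYNONYM_GROUPS
    for name in group
}


def representative(mention: str) -> Optional[str]:
    """Return the representative for a mention, or None if it is unknown."""
    return REPRESENTATIVE.get(normalize(mention))


def same_entity(a: str, b: str) -> bool:
    """Two mentions denote the same entity when they share a representative."""
    rep_a, rep_b = representative(a), representative(b)
    return rep_a is not None and rep_a == rep_b
```

With the representative in place, a mapping between two vocabularies only has to be stated once per entity rather than once per pair of synonyms.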


Kiji Project [Framework for HBase]

Wednesday, November 14th, 2012

Kiji Project: An Open Source Framework for Building Big Data Applications with Apache HBase by Aaron Kimball.

From the post:

Our team at WibiData has been developing applications on Hadoop since 2010 and we’ve helped many organizations transform how they use data by deploying Hadoop. HBase in particular has allowed companies of all types to drive their business using scalable, high performance storage. Organizations have started to leverage these capabilities for various big data applications, including targeted content, personalized recommendations, enhanced customer experience and social network analysis.

While building many of these applications, we have seen emerging tools, design patterns and best practices repeated across projects. One of the clear lessons learned is that Hadoop and HBase provide very low-level interfaces. Each large-scale application we have built on top of Hadoop has required a great deal of scaffolding and data management code. This repetitive programming is tedious, error-prone, and makes application interoperability more challenging in the long run.

Today, we are proud to announce the launch of the Kiji project, as well as the first Kiji component: KijiSchema. The Kiji project was developed to host a suite of open source components built on top of Apache HBase and Apache Hadoop, that makes it easier for developers to:

  1. Use HBase as a real-time data storage and serving layer for applications
  2. Maximize HBase performance using data management best practices
  3. Get started building data applications quickly with easy startup and configuration

Kiji is open source and licensed under the Apache 2.0 license. The Kiji project is modularized into separate components to simplify adoption and encourage clean separation of functionality. Our approach emphasizes interoperability with other systems, leveraging the open source HBase, Avro and MapReduce projects, enabling you to easily fit Kiji into your development process and applications.

KijiSchema: Schema Management for HBase

The first component within the Kiji project is KijiSchema, which provides layout and schema management on top of HBase. KijiSchema gives developers the ability to easily store both structured and unstructured data within HBase using Avro serialization. It supports a variety of rich schema features, including complex, compound data types, HBase column key and time-series indexing, as well as cell-level evolving schemas that dynamically encode version information.

KijiSchema promotes the use of entity-centric data modeling, where all information about a given entity (user, mobile device, ad, product, etc.), including dimensional and transaction data, is encoded within the same row. This approach is particularly valuable for user-based analytics such as targeting, recommendations, and personalization.

This looks important!

Reading further about their “entity-centric” approach:

Entity-Centric Data Model

KijiSchema’s data model is entity-centric. Each row typically holds information about a single entity in your information scheme. As an example, a consumer e-commerce web site may have a row representing each user of their site. The entity-centric data model enables easier analysis of individual entities. For example, to recommend products to a user, information such as the user’s past purchases, previously viewed items, search queries, etc. all need to be brought together. The entity-centric model stores all of these attributes of the user in the same row, allowing for efficient access to relevant information.

The entity-centric data model stands in comparison to a more typical log-based approach to data collection. Many MapReduce systems import log files for analysis. Logs are action-centric; each action performed by a user (adding an item to a shopping cart, checking out, performing a search, viewing a product) generates a new log entry. Collecting all the data required for a per-user analysis thus requires a scan of many logs. The entity-centric model is a “pivoted” form of this same information. By pivoting the information as the data is loaded into KijiSchema, later analysis can be run more efficiently, either in a MapReduce job operating over all users, or in a more narrowly-targeted fashion if individual rows require further computation.
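The “pivot” described above can be sketched in a few lines of Python. This is not the KijiSchema API; plain dictionaries stand in for HBase rows and columns, and the log records, field names and the `pivot` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical action-centric log: one record per user action.
LOG = [
    {"user": "u1", "ts": 1, "action": "search", "value": "tent"},
    {"user": "u2", "ts": 2, "action": "view", "value": "stove"},
    {"user": "u1", "ts": 3, "action": "view", "value": "tent-2p"},
    {"user": "u1", "ts": 4, "action": "purchase", "value": "tent-2p"},
]


def pivot(log):
    """Group log entries into one row per user, one column per action type.

    Each column holds (timestamp, value) cells, sorted oldest first.
    """
    rows = defaultdict(lambda: defaultdict(list))
    for entry in log:
        rows[entry["user"]][entry["action"]].append((entry["ts"], entry["value"]))
    # Convert to plain dicts and sort the cells in each column by timestamp.
    return {
        user: {column: sorted(cells) for column, cells in columns.items()}
        for user, columns in rows.items()
    }


ROWS = pivot(LOG)
```

After the pivot, everything known about user `u1` is one lookup, `ROWS["u1"]`, instead of a scan over the whole log.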

I’m already convinced about a single representative for an entity. 😉

Need to work through the documentation on capturing diverse information about a single entity in one row.

I suspect that the structures that capture data aren’t entities for purposes of this model.

Still, will be an interesting exploration.

Semi-Supervised Named Entity Recognition:… [Marketing?]

Sunday, June 3rd, 2012

Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision by David Nadeau (PhD Thesis, University of Ottawa, 2007).


Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules, and towards machine learning approaches. Still, recent machine learning approaches have a problem with annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems. In this thesis, we present an NER system built with very little supervision. Human supervision is indeed limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprised of 100 NE types. Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution. We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances that compare to baseline supervised systems in the task of annotating NEs in texts.

Nadeau demonstrates the successful construction of a Named Entity Recognition (NER) system using a few supplied examples for each entity.

But what explains the lack of annotation where the entities are well known? The King James Bible? Search for “Joseph.” We know not all of the occurrences of “Joseph” represent the same entity.

Looking at the client list for Infoglutton, is there a lack of interest in named entity recognition?

Have we focused on techniques and issues that interest us, and then, as an afterthought, tried to market the results to consumers?

A Survey of Named Entity Recognition and Classification

Sunday, June 3rd, 2012

A Survey of Named Entity Recognition and Classification by David Nadeau, Satoshi Sekine (Lingvisticae Investigationes 30:1; 2007)


The term “Named Entity”, now widely used in Natural Language Processing, was coined for the Sixth Message Understanding Conference (MUC-6) (R. Grishman & Sundheim 1996). At that time, MUC was focusing on Information Extraction (IE) tasks where structured information of company activities and defense related activities is extracted from unstructured text, such as newspaper articles. In defining the task, people noticed that it is essential to recognize information units like names, including person, organization and location names, and numeric expressions including time, date, money and percent expressions. Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called “Named Entity Recognition and Classification (NERC)”. We present here a survey of fifteen years of research in the NERC field, from 1991 to 2006. While early systems were making use of handcrafted rule-based algorithms, modern systems most often resort to machine learning techniques. We survey these techniques as well as other critical aspects of NERC such as features and evaluation methods. It was indeed concluded in a recent conference that the choice of features is at least as important as the choice of technique for obtaining a good NERC system (E. Tjong Kim Sang & De Meulder 2003). Moreover, the way NERC systems are evaluated and compared is essential to progress in the field. To the best of our knowledge, NERC features, techniques, and evaluation methods have not been surveyed extensively yet. The first section of this survey presents some observations on published work from the point of view of activity per year, supported languages, preferred textual genre and domain, and supported entity types. It was collected from the review of a hundred English language papers sampled from the major conferences and journals. 
We do not claim this review to be exhaustive or representative of all the research in all languages, but we believe it gives a good feel for the breadth and depth of previous work. Section 2 covers the algorithmic techniques that were proposed for addressing the NERC task. Most techniques are borrowed from the Machine Learning (ML) field. Instead of elaborating on techniques themselves, the third section lists and classifies the proposed features, i.e., descriptions and characteristic of words for algorithmic consumption. Section 4 presents some of the evaluation paradigms that were proposed throughout the major forums. Finally, we present our conclusions.

A bit dated now (2007) but a good starting point for named entity recognition research. The bibliography runs a little over four (4) pages and running those citations forward should capture most of the current research.

A Resource-Based Method for Named Entity Extraction and Classification

Sunday, June 3rd, 2012

A Resource-Based Method for Named Entity Extraction and Classification by Pablo Gamallo and Marcos Garcia. (Lecture Notes in Computer Science, vol. 7026, Springer-Verlag, 610-623. ISSN: 0302-9743).


We propose a resource-based Named Entity Classification (NEC) system, which combines named entity extraction with simple language-independent heuristics. Large lists (gazetteers) of named entities are automatically extracted making use of semi-structured information from the Wikipedia, namely infoboxes and category trees. Language independent heuristics are used to disambiguate and classify entities that have been already identified (or recognized) in text. We compare the performance of our resource-based system with that of a supervised NEC module implemented for the FreeLing suite, which was the winner system in CoNLL-2002 competition. Experiments were performed over Portuguese text corpora taking into account several domains and genres.

Of particular interest if you want to add NEC resources to the FreeLing project.

The introduction starts off:

Named Entity Recognition and Classification (NERC) is the process of identifying and classifying proper names of people, organizations, locations, and other Named Entities (NEs) within text.

Curious, what happens if you don’t have a “named” entity? That is, an entity mentioned in the text that doesn’t (yet) have a proper name?

Thinking of legal texts where some provision may apply to all corporations that engage in activity Y and that have a gross annual income in excess of amount X.

I may want to “recognize” that entity so I can then put a name with that entity.

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas

Friday, May 18th, 2012

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas by Valentin Spitkovsky and Peter Norvig (Google Research Team).

From the post:

Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google’s core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.

How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia’s groupings of articles into hierarchical categories.

(examples omitted)

The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data. (emphasis added)
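To get a feel for what a string-concept dictionary buys you, here is a small Python sketch. It does not read the released files (see their README for the real format); the link observations are invented, and `build_dictionary` and `concept_scores` are hypothetical names. The idea is only that counting anchor-text links gives, for each string, a ranked list of the concepts it has pointed to.

```python
from collections import Counter, defaultdict

# Invented (anchor text, article title) observations, for illustration only.
LINKS = [
    ("jaguar", "Jaguar"),
    ("Jaguar", "Jaguar"),
    ("jaguar", "Jaguar_Cars"),
    ("big cat", "Jaguar"),
]


def build_dictionary(links):
    """Count how often each (lower-cased) string links to each concept."""
    counts = defaultdict(Counter)
    for text, concept in links:
        counts[text.lower()][concept] += 1
    return counts


def concept_scores(dictionary, text):
    """Return [(concept, share of links)] for a string, most frequent first."""
    seen = dictionary.get(text.lower())
    if not seen:
        return []
    total = sum(seen.values())
    return [(concept, n / total) for concept, n in seen.most_common()]


DICTIONARY = build_dictionary(LINKS)
```

A noisy, recall-oriented resource like the one described would leave it to the caller to set a cut-off on those shares.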

Did you catch those numbers?

Now there is a truly remarkable resource.

What will you make out of it?

Entity Matching for Semistructured Data in the Cloud

Friday, February 24th, 2012

Entity Matching for Semistructured Data in the Cloud by Marcus Paradies.

From the slides:

Main Idea

  • Use MapReduce and ChuQL to process semistructured data
  • Use a search-based blocking to generate candidate pairs
  • Apply similarity functions to candidate pairs within a block
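The blocking-then-compare pattern in those bullets can be sketched without MapReduce or ChuQL. The Python below is my own simplification, not the method from the slides: it swaps the search-based blocking for a much cruder key (the first word of the title) and uses Jaccard similarity over title tokens. The records, the threshold and all function names are hypothetical, and titles are assumed to be non-empty.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical bibliographic records.
RECORDS = [
    {"id": 1, "title": "Entity Matching for Semistructured Data"},
    {"id": 2, "title": "Entity matching for semi-structured data in the cloud"},
    {"id": 3, "title": "Learning to Rank Homepages"},
    {"id": 4, "title": "Entity Search Evaluation"},
]


def tokens(title):
    """Lower-case, drop hyphens, and split a title into a set of words."""
    return set(title.lower().replace("-", "").split())


def block_key(record):
    """Blocking key: the first word of the title, lower-cased."""
    return record["title"].lower().split()[0]


def candidate_pairs(records):
    """Yield pairs of records that share a block; other pairs are never compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


def jaccard(a, b):
    """Share of title tokens the two records have in common."""
    ta, tb = tokens(a["title"]), tokens(b["title"])
    return len(ta & tb) / len(ta | tb)


def match(records, threshold=0.5):
    """Return id pairs whose similarity within a block meets the threshold."""
    return [
        (a["id"], b["id"])
        for a, b in candidate_pairs(records)
        if jaccard(a, b) >= threshold
    ]
```

Blocking is what keeps this tractable: four records give six possible pairs, but only the three pairs inside the “entity” block are ever scored.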

Uses two of my favorite sources, CiteSeer and Wikipedia.

Looks like the start of an authoring stage of topic map work flow to me. You?

Workshop on Entity-Oriented Search (EOS) – Beijing – Proceedings

Monday, January 16th, 2012

Workshop on Entity-Oriented Search (EOS) – Beijing – Proceedings (PDF file)

There you will find:

Session 1:

  • High Performance Clustering for Web Person Name Disambiguation Using Topic Capturing by Zhengzhong Liu, Qin Lu, and Jian Xu (The Hong Kong Polytechnic University)
  • Extracting Dish Names from Chinese Blog Reviews Using Suffix Arrays and a Multi-Modal CRF Model by Richard Tzong-Han Tsai (Yuan Ze University, Taiwan)
  • LADS: Rapid Development of a Learning-To-Rank Based Related Entity Finding System using Open Advancement by Bo Lin, Kevin Dela Rosa, Rushin Shah, and Nitin Agarwal (Carnegie Mellon University)
  • Finding Support Documents with a Logistic Regression Approach by Qi Li and Daqing He (University of Pittsburgh)
  • The Sindice-2011 Dataset for Entity-Oriented Search in the Web of Data by Stephane Campinas (National University of Ireland), Diego Ceccarelli (University of Pisa), Thomas E. Perry (National University of Ireland), Renaud Delbru (National University of Ireland), Krisztian Balog (Norwegian University of Science and Technology) and Giovanni Tummarello (National University of Ireland)

Session 2

  • Cross-Domain Bootstrapping for Named Entity Recognition by Ang Sun and Ralph Grishman (New York University)
  • Semi-supervised Statistical Inference for Business Entities Extraction and Business Relations Discovery by Raymond Y.K. Lau and Wenping Zhang (City University of Hong Kong)
  • Unsupervised Related Entity Finding by Olga Vechtomova (University of Waterloo)

Session 3

  • Learning to Rank Homepages For Researcher-Name Queries by Sujatha Das, Prasenjit Mitra, and C. Lee Giles (The Pennsylvania State University)
  • An Evaluation Framework for Aggregated Temporal Information Extraction by Enrique Amigó (UNED University), Javier Artiles (City University of New York), Heng Ji (City University of New York) and Qi Li (City University of New York)
  • Entity Search Evaluation over Structured Web Data by Roi Blanco (Yahoo! Research), Harry Halpin (University of Edinburgh), Daniel M. Herzig (Karlsruhe Institute of Technology), Peter Mika (Yahoo! Research), Jeffrey Pound (University of Waterloo), Henry S. Thompson (University of Edinburgh) and Thanh Tran Duc (Karlsruhe Institute of Technology)

A good start on what promises to be a strong conference series on entity-oriented search.

John Giannandrea on Freebase – A Rosetta Stone for Entities

Tuesday, November 15th, 2011

John Giannandrea on Freebase – A Rosetta Stone for Entities by Daniel Tunkelang.

From the post:

John started by introducing Freebase as a representation of structured objects corresponding to real-world entities and connected by a directed graph of relationships. In other words, a semantic web. While it isn’t quite web-scale, Freebase is a large and growing knowledge base consisting of 25 million entities and 500 million connections — and doubling annually. The core concept in Freebase is a type, and an entity can have many types. For example, Arnold Schwarzenegger is a politician and an actor. John emphasized the messiness of the real world. For example, most actors are people, but what about the dog who played Lassie? It’s important to support exceptions.

The main technical challenge for Freebase is reconciliation — that is, determining how similar a set of data is to existing Freebase topics. John pointed out how critical it is for Freebase to avoid duplication of content, since the utility of Freebase depends on unique nodes in its graph corresponding to unique objects in the world. Freebase obtains many of its entities by reconciling large, open-source knowledge bases — including Wikipedia, WordNet, Library of Congress Authorities, and metadata from the Stanford Library. Freebase uses a variety of tools to implement reconciliation, including Google Refine (formerly known as Freebase Gridworks) and Matchmaker, a tool for gathering human judgments. While reconciliation is a hard technical problem, it is made possible by making inferences across the web of relationships that link entities to one another.

John then presented Freebase as a Rosetta Stone for entities on the web. Since an entity is simply a collection of keys (one of which is its name), Freebase’s job is to reverse engineer the key-value store that is distributed among the entity’s web references, e.g., the structured databases backing web sites and encoding keys in URL parameters. He noted that Freebase itself is schema-less (it is a graph database), and that even the concept of a type is itself an entity (“Type type is the only type that is an instance of itself”). Google makes Freebase available through an API and the Metaweb Query Language (MQL).

(emphasis added)

<tedious-self-justification>…., entity is a collection of keys indeed! Key/value pairs I would say, with no presumptions about the structure of either one.</tedious-self-justification>

There is not now nor will there ever be agreement on the “unique objects in the world.” And why should that be a value? If we have the key/value pairs, we can each arrive at our own conclusions about whether certain “unique nodes” correspond to what we think of as “unique objects in the world.”
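A small Python sketch of what I mean, with everything in it hypothetical (the entities, the keys, and the `same_by` and `unique_objects` helpers): the entities are nothing but key/value pairs, and the identity rule is supplied by whoever is asking.

```python
# Hypothetical entities, each nothing more than a bag of key/value pairs.
ENTITIES = [
    {"name": "J. Smith", "email": "jsmith@example.org"},
    {"name": "John Smith", "email": "jsmith@example.org", "employer": "Acme"},
    {"name": "John Smith", "employer": "Initech"},
]


def same_by(*keys):
    """Build an identity rule: two entities match if both carry every key
    and agree on its value."""
    def rule(a, b):
        return all(k in a and k in b and a[k] == b[k] for k in keys)
    return rule


def unique_objects(entities, rule):
    """Greedy grouping: an entity joins the first group whose first member
    matches it under the rule, otherwise it starts a new group."""
    groups = []
    for entity in entities:
        for group in groups:
            if rule(group[0], entity):
                group.append(entity)
                break
        else:
            groups.append([entity])
    return groups


by_email = unique_objects(ENTITIES, same_by("email"))
by_name = unique_objects(ENTITIES, same_by("name"))
```

Both rules find two “unique objects” in the same three entities, but not the same two: by email the first two entities merge, by name the last two do. Neither answer is built into the data.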

I suspect, but don’t know having never asked former President Bush II, that we disagree on the existence of any unique objects in the world and it is unlikely there is any evidence that would persuade either one of us to change.

Remember the Rosetta Stone had three (3) versions of the same inscription. It did not try to say one version was closer to the original than the others.

The Rosetta Stone is one of the earliest honorings of semantic diversity, unlike systems that try to push only one common semantic or vision.