Archive for the ‘Entity Extraction’ Category

Analysis of Named Entity Recognition and Linking for Tweets

Friday, October 24th, 2014

Analysis of Named Entity Recognition and Linking for Tweets by Leon Derczynski, et al.

Abstract:

Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identi cation, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

A detailed review of existing solutions for mining tweets, where they fail along and why.

A comparison to spur tweet research:

Tweets Per Day > 500,000,000 Derczynski, p. 2
Annotated Tweets < 10,000 Derczynski, p. 27

Let’s see: 500,000,000 / 10,000 = 50,000.

The number of tweet per day is more than 50,000 times the number of tweets annotated with named entity types.

It may just be me but that sounds like the sort of statement you would see in a grant proposal to increase the number of annotated tweets.

Yes?

I first saw this in a tweet by Diana Maynard.

Evaluating Entity Linking with Wikipedia

Monday, April 28th, 2014

Evaluating Entity Linking with Wikipedia by Ben Hachey, et al.

Abstract:

Named Entity Linking (NEL) grounds entity mentions to their corresponding node in a Knowledge Base (KB). Recently, a number of systems have been proposed for linking entity mentions in text to Wikipedia pages. Such systems typically search for candidate entities and then disambiguate them, returning either the best candidate or NIL. However, comparison has focused on disambiguation accuracy, making it difficult to determine how search impacts performance. Furthermore, important approaches from the literature have not been systematically compared on standard data sets.

We reimplement three seminal NEL systems and present a detailed evaluation of search strategies. Our experiments find that coreference and acronym handling lead to substantial improvement, and search strategies account for much of the variation between systems. This is an interesting finding, because these aspects of the problem have often been neglected in the literature, which has focused largely on complex candidate ranking algorithms.

A very deep survey of entity linking literature (including record linkage) and implementation of three complete entity linking systems for comparison.

At forty-eight (48) pages it isn’t a quick read but should be your starting point for pushing the boundaries on entity linking research.

I first saw this in a tweet by Alyona Medelyan.

Using NLTK for Named Entity Extraction

Sunday, April 27th, 2014

Using NLTK for Named Entity Extraction by Emily Daniels.

From the post:

Continuing on from the previous project, I was able to augment the functions that extract character names using NLTK’s named entity module and an example I found online, building my own custom stopwords list to run against the returned names to filter out frequently used words like “Come”, “Chapter”, and “Tell” which were caught by the named entity functions as potential characters but are in fact just terms in the story.

Whether you are trusting your software or using human proofing, named entity extraction is a key task in mining data.

Having extracted named entities, the harder task is uncovering relationships between them that may not be otherwise identified.

Challenging with the text of Oliver Twist but even more difficult when mining donation records and the Congressional record.

CoIN: a network analysis for document triage

Wednesday, November 13th, 2013

CoIN: a network analysis for document triage by Yi-Yu Hsu and Hung-Yu Kao. (Database (2013) 2013 : bat076 doi: 10.1093/database/bat076)

Abstract:

In recent years, there was a rapid increase in the number of medical articles. The number of articles in PubMed has increased exponentially. Thus, the workload for biocurators has also increased exponentially. Under these circumstances, a system that can automatically determine in advance which article has a higher priority for curation can effectively reduce the workload of biocurators. Determining how to effectively find the articles required by biocurators has become an important task. In the triage task of BioCreative 2012, we proposed the Co-occurrence Interaction Nexus (CoIN) for learning and exploring relations in articles. We constructed a co-occurrence analysis system, which is applicable to PubMed articles and suitable for gene, chemical and disease queries. CoIN uses co-occurrence features and their network centralities to assess the influence of curatable articles from the Comparative Toxicogenomics Database. The experimental results show that our network-based approach combined with co-occurrence features can effectively classify curatable and non-curatable articles. CoIN also allows biocurators to survey the ranking lists for specific queries without reviewing meaningless information. At BioCreative 2012, CoIN achieved a 0.778 mean average precision in the triage task, thus finishing in second place out of all participants.

Database URL: http://ikmbio.csie.ncku.edu.tw/coin/home.php

From the introduction:

Network analysis concerns the relationships between processing entities. For example, the nodes in a social network are people, and the links are the friendships between the nodes. If we apply these concepts to the ACT, PubMed articles are the nodes, while the co-occurrences of gene–disease, gene–chemical and chemical–disease relationships are the links. Network analysis provides a visual map and a graph-based technique for determining co-occurrence relationships. These graphical properties, such as size, degree, centralities and similar features, are important. By examining the graphical properties, we can gain a global understanding of the likely behavior of the network. For this purpose, this work focuses on two themes concerning the applications of biocuration: using the co-occurrence–based approach to obtain a normalized co-occurrence score and using the network-based approach to measure network properties, e.g. betweenness and PageRank. CoIN integrates co-occurrence features and network centralities when curating articles. The proposed method combines the co-occurrence frequency with the network construction from text. The co-occurrence networks are further analyzed to obtain the linking and shortest path features of the network centralities.

The authors’ ultimately conclude that the network-based approaches perform better than collocation-based approaches.

If this post sounds hauntingly familiar, you may be thinking about Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information, which was the first place finisher at BioCreative 2012 with a mean average precision (MAP) score of 0.8030.

Named Entity Recognition (NER) in Solr

Friday, August 2nd, 2013

Named Entity Recognition (NER) in Solr

From the post:

Named Entity Recognition, or NER for short, is a powerful paradigm which causes entities to be recognized within text. Typically these objects can be places, organizations or people. For example, given the phrase “Jon works at Searchbox”, a good NER would return that Jon is a person and Searchbox is an organization. Why is this powerful, especially in Solr? Using this information we can not only propose better suggestions for users searching for things, but using Solr faceting capability we’ll have the ability to facet directly on organizations (or people) without having to manually identify them in all of the documents.

In this blog post, extending from our two previous slideshares on how to develop search components and request handlers, we’ll teach you how to directly embed Stanford’s NER library into a production ready plugin which provides all of the mentioned benefits. We of course provide the full source code packaging here.

Very nice walk through on entity recognition with Solr.

Thought occurs to me that every instance of an entity that is recognized, could be presented to a user as occurrences of that entity. Plugging that search result into a topic that represents the subject.

So there is some static aspect to the topic map, the topic for that subject and a dynamic aspect, being the search results presented as occurrences.

You could enter information or relationships you discover in the occurrences on the static side of the map. Let software manage metadata from the document containing the occurrence.

Named Entities in Law & Order Episodes

Monday, July 22nd, 2013

Named Entities in Law & Order Episodes by Yhat.

A worked example of using natural language processing on a corpus of viewer summaries of episodes of Law & Order and Law & Order: Special Victims Unit.

The data is here.

Makes me wonder if there is a archive of the soap operas that have been on for decades?

They survived because they have supporting audiences. Suspect a resource about the same would as well.

Poor man’s “entity” extraction with Solr

Friday, June 28th, 2013

Poor man’s “entity” extraction with Solr by Erik Hatcher.

From the post:

My work at LucidWorks primarily involves helping customers build their desired solutions. Recently, more than one customer has inquired about doing “entity extraction”. Entity extraction, as defined on Wikipedia, “seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.” When drilling down into the specifics of the requirements from our customers, it turns out that many of them have straightforward solutions using built-in (Solr 4.x) components, such as:

  • Acronyms as facets
  • Key words or phrases, from a fixed list, as facets
  • Lat/long mentions as geospatial points

This article will describe and demonstrate how to do these, and as a bonus we’ll also extract URLs found in text too. Let’s start with an example input and the corresponding output all of the described techniques provides.

If you have been thinking about experimenting with Solr, Erik touches on some of its features by example.

A Survey of Stochastic and Gazetteer Based Approaches for Named Entity Recognition

Thursday, April 18th, 2013

A Survey of Stochastic and Gazetteer Based Approaches for Named Entity Recognition – Part 2 by Benjamin Bengfort.

From the post:

Generally speaking, the most effective named entity recognition systems can be categorized as rule-based, gazetteer and machine learning approaches. Within each of these approaches are a myriad of sub-approaches that combine to varying degrees each of these top-level categorizations. However, because of the research challenge posed by each approach, typically one or the other is focused on in the literature.

Rule-based systems utilize pattern-matching techniques in text as well as heuristics derived either from the morphology or the semantics of the input sequence. They are generally used as classifiers in machine-learning approaches, or as candidate taggers in gazetteers. Some applications can also make effective use of stand-alone rule-based systems, but they are prone to both overreach and skipping over named entities. Rule-based approaches are discussed in (10), (12), (13), and (14).

Gazetteer approaches make use of some external knowledge source to match chunks of the text via some dynamically constructed lexicon or gazette to the names and entities. Gazetteers also further provide a non-local model for resolving multiple names to the same entity. This approach requires either the hand crafting of name lexicons or some dynamic approach to obtaining a gazette from the corpus or another external source. However, gazette based approaches achieve better results for specific domains. Most of the research on this topic focuses on the expansion of the gazetteer to more dynamic lexicons, e.g. the use of Wikipedia or Twitter to construct the gazette. Gazette based approaches are discussed in (15), (16), and (17).

Stochastic approaches fare better across domains, and can perform predictive analysis on entities that are unknown in a gazette. These systems use statistical models and some form of feature identification to make predictions about named entities in text. They can further be supplemented with smoothing for universal coverage. Unfortunately these approaches require large amounts of annotated training data in order to be effective, and they don’t naturally provide a non-local model for entity resolution. Systems implemented with this approach are discussed in (7), (8), (4), (9), and (6).

Benjamin continues his excellent survey of named entity recognition techniques.

All of these techniques may prove to be useful in constructing topic maps from source materials.

An Introduction to Named Entity Recognition…

Wednesday, April 17th, 2013

An Introduction to Named Entity Recognition in Natural Language Processing – Part 1 by Benjamin Bengfort.

From the post:

Abstract:

The task of identifying proper names of people, organizations, locations, or other entities is a subtask of information extraction from natural language documents. This paper presents a survey of techniques and methodologies that are currently being explored to solve this difficult subtask. After a brief review of the challenges of the task, as well as a look at previous conventional approaches, the focus will shift to a comparison of stochastic and gazetteer based approaches. Several machine-learning approaches are identified and explored, as well as a discussion of knowledge acquisition relevant to recognition. This two-part white paper will show that applications that require named entity recognition will be served best by some combination of knowledge- based and non-deterministic approaches.

Introduction:

In school we were taught that a proper noun was “a specific person, place, or thing,” thus extending our definition from a concrete noun. Unfortunately, this seemingly simple mnemonic masks an extremely complex computational linguistic task—the extraction of named entities, e.g. persons, organizations, or locations from corpora (1). More formally, the task of Named Entity Recognition and Classification can be described as the identification of named entities in computer readable text via annotation with categorization tags for information extraction.

Not only is named entity recognition a subtask of information extraction, but it also plays a vital role in reference resolution, other types of disambiguation, and meaning representation in other natural language processing applications. Semantic parsers, part of speech taggers, and thematic meaning representations could all be extended with this type of tagging to provide better results. Other, NER-specific, applications abound including question and answer systems, automatic forwarding, textual entailment, and document and news searching. Even at a surface level, an understanding of the named entities involved in a document provides much richer analytical frameworks and cross-referencing.

Named entities have three top-level categorizations according to DARPA’s Message Understanding Conference: entity names, temporal expressions, and number expressions (2). Because the entity names category describes the unique identifiers of people, locations, geopolitical bodies, events, and organizations, these are usually referred to as named entities and as such, much of the literature discussed in this paper focuses solely on this categorization, although it is easy to imagine extending the proposed systems to cover the full MUC-7 task. Further, the CoNLL-2003 Shared Task, upon which the standard of evaluation for such systems is based, only evaluates the categorization of organizations, persons, locations, and miscellaneous named entities. For example:

(ORG S.E.C.) chief (PER Mary Shapiro) to leave (LOC Washington) in December.

This sentence contains three named entities that demonstrate many of the complications associated with named entity recognition. First, S.E.C. is an acronym for the Securities and Exchange Commission, which is an organization. The two words “Mary Shapiro” indicate a single person, and Washington, in this case, is a location and not a name. Note also that the token “chief” is not included in the person tag, although it very well could be. In this scenario, it is ambiguous if “S.E.C. chief Mary Shapiro” is a single named entity, or if multiple, nested tags would be required.

Nice introduction to the area and ends with a great set of references.

Looking forward to part 2!

Named entity extraction

Thursday, February 28th, 2013

Named entity extraction

From the webpage:

The techniques we discussed in the Cleanup and Reconciliation parts come in very handy when your data is already in a structured format. However, many fields (notoriously description) contain unstructured text, yet they usually convey a high amount of interesting information. To capture this in machine-processable format, named entity recognition can be used.

A Google Refine / OpenRefine extension developed by Multimedia Lab (ELIS — Ghent University / iMinds) and MasTIC (Université Libre de Bruxelles.

Described in: Named-Entity Recognition: A Gateway Drug for Cultural Heritage Collections to the Linked Data Cloud?

Abstract:

Unstructured metadata fields such as ‘description’ offer tremendous value for users to understand cultural heritage objects. However, this type of narrative information is of little direct use within a machine-readable context due to its unstructured nature. This paper explores the possibilities and limitations of Named-Entity Recognition (NER) to mine such unstructured metadata for meaningful concepts. These concepts can be used to leverage otherwise limited searching and browsing operations, but they can also play an important role to foster Digital Humanities research. In order to catalyze experimentation with NER, the paper proposes an evaluation of the performance of three thirdparty NER APIs through a comprehensive case study, based on the descriptive fields of the Smithsonian Cooper-Hewitt National Design Museum in New York. A manual analysis is performed of the precision, recall, and F-score of the concepts identified by the third party NER APIs. Based on the outcomes of the analysis, the conclusions present the added value of NER services, but also point out to the dangers of uncritically using NER, and by extension Linked Data principles, within the Digital Humanities. All metadata and tools used within the paper are freely available, making it possible for researchers and practitioners to repeat the methodology. By doing so, the paper offers a significant contribution towards understanding the value of NER for the Digital Humanities.

I commend the paper to you for a very close reading, particularly those of you in the humanities.

To conclude, the Digital Humanities need to launch a broader debate on how we can incorporate within our work the probabilistic character of tools such as NER services. Drucker eloquently states that ‘we use tools from disciplines whose epistemological foundations are at odds with, or even hostile to, the humanities. Positivistic, quantitative and reductive, these techniques preclude humanistic methods because of the very assumptions on which they are designed: that objects of knowledge can be understood as ahistorical and autonomous.’

Drucker, J. (2012), Debates in the Digital Humanities, Minesota Press, chapter Humanistic Theory and Digital Scholarship, pp. 85–95.

…that objects of knowledge can be understood as ahistorical and autonomous.

Certainly possible, but lossy, very lossy, in my view.

You?

Text Processing (part 1) : Entity Recognition

Sunday, February 17th, 2013

Text Processing (part 1) : Entity Recognition by Ricky Ho.

From the post:

Entity recognition is commonly used to parse unstructured text document and extract useful entity information (like location, person, brand) to construct a more useful structured representation. It is one of the most common text processing to understand a text document.

I am planning to write a blog series on text processing. In this first blog of a series of basic text processing algorithm, I will introduce some basic algorithm for entity recognition.

Looking forward to this series!

Developing CODE for a Research Database

Tuesday, December 11th, 2012

Developing CODE for a Research Database by Ian Armas Foster.

From the post:

The fact that there are a plethora of scientific papers readily available online would seem helpful to researchers. Unfortunately, the truth is that the volume of these articles has grown such that determining which information is relevant to a specific project is becoming increasingly difficult.

Austrian and German researchers are thus developing CODE, or Commercially Empowered Linked Open Data Ecosystems in Research, to properly aggregate research data from its various forms, such as PDFs of academic papers and data tables upon which those papers are based, into a single system. The project is in a prototype stage, with the goal being to integrate all forms into one platform by the project’s second year.

The researchers from the University of Passau in Germany and the Know-Center in Graz, Austria explored the challenges to CODE and how the team intends to deal with those challenges in this paper. The goal is to meliorate the research process by making it easier to not only search for both text and numerical data in the same query but also to use both varieties in concert. The basic architecture for the project is shown below.

Stop me if you have heard this one before: “There was this project that was going to disambiguate entities and create linked data….”

I would be the first one to cheer if such a project were successful. But, a few paragraphs in a paper, given the long history of entity resolution and its difficulties, isn’t enough to raise my hopes.

You?

Prioritizing PubMed articles…

Wednesday, November 21st, 2012

Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information by Sun Kim, Won Kim, Chih-Hsuan Wei, Zhiyong Lu and W. John Wilbur.

Abstract:

The Comparative Toxicogenomics Database (CTD) contains manually curated literature that describes chemical–gene interactions, chemical–disease relationships and gene–disease relationships. Finding articles containing this information is the first and an important step to assist manual curation efficiency. However, the complex nature of named entities and their relationships make it challenging to choose relevant articles. In this article, we introduce a machine learning framework for prioritizing CTD-relevant articles based on our prior system for the protein–protein interaction article classification task in BioCreative III. To address new challenges in the CTD task, we explore a new entity identification method for genes, chemicals and diseases. In addition, latent topics are analyzed and used as a feature type to overcome the small size of the training set. Applied to the BioCreative 2012 Triage dataset, our method achieved 0.8030 mean average precision (MAP) in the official runs, resulting in the top MAP system among participants. Integrated with PubTator, a Web interface for annotating biomedical literature, the proposed system also received a positive review from the CTD curation team.

An interesting summary of entity recognition issues in bioinformatics occurs in this article:

The second problem is that chemical and disease mentions should be identified along with gene mentions. Named entity recognition (NER) has been a main research topic for a long time in the biomedical text-mining community. The common strategy for NER is either to apply certain rules based on dictionaries and natural language processing techniques (5–7) or to apply machine learning approaches such as support vector machines (SVMs) and conditional random fields (8–10). However, most NER systems are class specific, i.e. they are designed to find only objects of one particular class or set of classes (11). This is natural because chemical, gene and disease names have specialized terminologies and complex naming conventions. In particular, gene names are difficult to detect because of synonyms, homonyms, abbreviations and ambiguities (12,13). Moreover, there are no specific rules of how to name a gene that are actually followed in practice (14). Chemicals have systematic naming conventions, but finding chemical names from text is still not easy because there are various ways to express chemicals (15,16). For example, they can be mentioned as IUPAC names, brand names, generic names or even molecular formulas. However, disease names in literature are more standardized (17) compared with gene and chemical names. Hence, using terminological resources such as Medical Subject Headings (MeSH) and Unified Medical Language System (UMLS) Metathesaurus help boost the identification performance (17,18). But, a major drawback of identifying disease names from text is that they often use general English terms.

Having a common representative for a group of identifiers for a single entity, should simplify the creation of mappings between entities.

Yes?

BioContext: an integrated text mining system…

Wednesday, August 8th, 2012

BioContext: an integrated text mining system for large-scale extraction and contextualization of biomolecular events by Martin Gerner, Farzaneh Sarafraz, Casey M. Bergman, and Goran Nenadic. (Bioinformatics (2012) 28 (16): 2154-2161. doi: 10.1093/bioinformatics/bts332)

Abstract:

Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extract information about biological events are focused on either limited-scale studies and/or abstracts, with data extracted lacking context and rarely available to support further research.

Results: Here we present BioContext, an integrated TM system which extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization. Application of our system to 10.9 million MEDLINE abstracts and 234 000 open-access full-text articles from PubMed Central yielded over 36 million mentions representing 11.4 million distinct events. Event participants included over 290 000 distinct genes/proteins that are mentioned more than 80 million times and linked where possible to Entrez Gene identifiers. Over a third of events contain contextual information such as the anatomical location of the event occurrence or whether the event is reported as negated or speculative.

Availability: The BioContext pipeline is available for download (under the BSD license) at http://www.biocontext.org, along with the extracted data which is also available for online browsing.

If you are interested in text mining by professionals, this is a good place to start.

Should be of particular interest to anyone interested in mining literature for construction of a topic map.

Semi-Supervised Named Entity Recognition:… [Marketing?]

Sunday, June 3rd, 2012

Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision by David Nadeau (PhD Thesis, University of Ottawa, 2007).

Abstract:

Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules, and towards machine learning approaches. Still, recent machine learning approaches have a problem with annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems. In this thesis, we present an NER system built with very little supervision. Human supervision is indeed limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprised of 100 NE types. Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution. We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances that compare to baseline supervised systems in the task of annotating NEs in texts.

Nadeau demonstrates the successful construction of a Named Entity Recognition (NER) system using a few supplied examples for each entity.

But what explains the lack of annotation where the entities are well known? The King James Bible? Search for “Joseph.” We know not all of the occurrences of “Joseph” represent the same entity.

Looking at the client list for Infoglutton, is there a lack of interest in named entity recognition?

Have we focused on techniques and issues that interest us, and then, as an afterthought, tried to market the results to consumers?

A Survey of Named Entity Recognition and Classification

Sunday, June 3rd, 2012

A Survey of Named Entity Recognition and Classification by David Nadeau, Satoshi Sekine (Journal of Linguisticae Investigationes 30:1 ; 2007)

Abstract:

The term “Named Entity”, now widely used in Natural Language Processing, was coined for the Sixth Message Understanding Conference (MUC-6) (R. Grishman & Sundheim 1996). At that time, MUC was focusing on Information Extraction (IE) tasks where structured information of company activities and defense related activities is extracted from unstructured text, such as newspaper articles. In defining the task, people noticed that it is essential to recognize information units like names, including person, organization and location names, and numeric expressions including time, date, money and percent expressions. Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called “Named Entity Recognition and Classification (NERC)”. We present here a survey of fifteen years of research in the NERC field, from 1991 to 2006. While early systems were making use of handcrafted rule-based algorithms, modern systems most often resort to machine learning techniques. We survey these techniques as well as other critical aspects of NERC such as features and evaluation methods. It was indeed concluded in a recent conference that the choice of features is at least as important as the choice of technique for obtaining a good NERC system (E. Tjong Kim Sang & De Meulder 2003). Moreover, the way NERC systems are evaluated and compared is essential to progress in the field. To the best of our knowledge, NERC features, techniques, and evaluation methods have not been surveyed extensively yet. The first section of this survey presents some observations on published work from the point of view of activity per year, supported languages, preferred textual genre and domain, and supported entity types. It was collected from the review of a hundred English language papers sampled from the major conferences and journals. We do not claim this review to be exhaustive or representative of all the research in all languages, but we believe it gives a good feel for the breadth and depth of previous work. Section 2 covers the algorithmic techniques that were proposed for addressing the NERC task. Most techniques are borrowed from the Machine Learning (ML) field. Instead of elaborating on techniques themselves, the third section lists and classifies the proposed features, i.e., descriptions and characteristic of words for algorithmic consumption. Section 4 presents some of the evaluation paradigms that were proposed throughout the major forums. Finally, we present our conclusions.

A bit dated now (2007) but a good starting point for named entity recognition research. The bibliography runs a little over four (4) pages and running those citations forward should capture most of the current research.

Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity

Sunday, June 3rd, 2012

Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity by David Nadeau, Peter D. Turney and Stan Matwin.

Abstract:

In this paper, we propose a named-entity recognition (NER) system that addresses two major limitations frequently discussed in the field. First, the system requires no human intervention such as manually labeling training data or creating gazetteers. Second, the system can handle more than the three classical named-entity types (person, location, and organization). We describe the system’s architecture and compare its performance with a supervised system. We experimentally evaluate the system on a standard corpus, with the three classical named-entity types, and also on a new corpus, with a new named-entity type (car brands).

The authors confide successful application of their techniques to more than 50 named-entity types.

They also recite heuristics that they apply to texts during the mining process.

Is there a common repository of observations or heuristics for mining texts? Just curious.

Source code for the project: http://balie.sourceforge.net.

Answer to the question I just posed?

A Resource-Based Method for Named Entity Extraction and Classification

Sunday, June 3rd, 2012

A Resource-Based Method for Named Entity Extraction and Classification by Pablo Gamallo and Marcos Garcia. (Lecture Notes in Computer Science, vol. 7026, Springer-Verlag, 610-623. ISNN: 0302-9743).

Abstract:

We propose a resource-based Named Entity Classification (NEC) system, which combines named entity extraction with simple language-independent heuristics. Large lists (gazetteers) of named entities are automatically extracted making use of semi-structured information from the Wikipedia, namely infoboxes and category trees. Language independent heuristics are used to disambiguate and classify entities that have been already identified (or recognized) in text. We compare the performance of our resource-based system with that of a supervised NEC module implemented for the FreeLing suite, which was the winner system in CoNLL-2002 competition. Experiments were performed over Portuguese text corpora taking into account several domains and genres.

Of particular interest if you are interested in adding NEC resources to the FreeLing project.

The introduction starts off:

Named Entity Recognition and Classification (NERC) is the process of identifying and classifying proper names of people, organizations, locations, and other Named Entities (NEs) within text.

Curious, what happens if you don’t have a “named” entity? That is an entity mentioned in the text but that doesn’t (yet) have a proper name?

Thinking of legal texts where some provision may apply to all corporations that engage in activity Y and that have a gross annual income in excess of amount X.

I may want to “recognize” that entity so I can then put a name with that entity.

Joint International Workshop on Entity-oriented and Semantic Search

Thursday, May 31st, 2012

1st Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012

Important Dates:

  • Submissions Due: July 2, 2012
  • Notification of Acceptance: July 23, 2012
  • Camera Ready: August 1, 2012
  • Workshop date: August 16th, 2012

Located at the 35th ACM SIGIR Conference, Portland, Oregon, USA, August 12–16, 2012.

From the homepage of the workshop:

About the Workshop:

The workshop encompasses various tasks and approaches that go beyond the traditional bag-of-words paradigm and incorporate an explicit representation of the semantics behind information needs and relevant content. This kind of semantic search, based on concepts, entities and relations between them, has attracted attention both from industry and from the research community. The workshop aims to bring people from different communities (IR, SW, DB, NLP, HCI, etc.) and backgrounds (both academics and industry practitioners) together, to identify and discuss emerging trends, tasks and challenges. This joint workshop is a sequel of the Entity-oriented Search and Semantic Search Workshop series held at different conferences in previous years.

Topics

The workshop aims to gather all works that discuss entities along three dimensions: tasks, data and interaction. Tasks include entity search (search for entities or documents representing entities), relation search (search entities related to an entity), as well as more complex tasks (involving multiple entities, spatio-temporal relations inclusive, involving multiple queries). In the data dimension, we consider (web/enterprise) documents (possibly annotated with entities/relations), Linked Open Data (LOD), as well as user generated content. The interaction dimension gives room for research into user interaction with entities, also considering how to display results, as well as whether to aggregate over multiple entities to construct entity profiles. The workshop especially encourages submissions on the interface of IR and other disciplines, such as the Semantic Web, Databases, Computational Linguistics, Data Mining, Machine Learning, or Human Computer Interaction. Examples of topic of interest include (but are not limited to):

  • Data acquisition and processing (crawling, storage, and indexing)
  • Dealing with noisy, vague and incomplete data
  • Integration of data from multiple sources
  • Identification, resolution, and representation of entities (in documents and in queries)
  • Retrieval and ranking
  • Semantic query modeling (detecting, modeling, and understanding search intents)
  • Novel entity-oriented information access tasks
  • Interaction paradigms (natural language, keyword-based, and hybrid interfaces) and result representation
  • Test collections and evaluation methodology
  • Case studies and applications

We particularly encourage formal evaluation of approaches using previously established evaluation benchmarks: Semantic Search Challenge 2010, Semantic Search Challenge 2011, TREC Entity Search Track.

All workshops are special to someone. This one sounds more special than most. Collocated with the ACM SIGIR 2012 meeting. Perhaps that’s the difference.

An Efficient Trie-based Method for Approximate Entity Extraction…

Monday, March 12th, 2012

An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints by Dong Deng, Guoliang Li, and Jianhua Feng. (PDF)

Abstract:

Dictionary-based entity extraction has attracted much attention from the database community recently, which locates substrings in a document into predefined entities (e.g., person names or locations). To improve extraction recall, a recent trend is to provide approximate matching between substrings of the document and entities by tolerating minor errors. In this paper we study dictionary-based approximate entity extraction with edit-distance constraints. Existing methods have several limitations. Firstly, they need to tune many parameters to achieve a high performance. Secondly, they are inefficient for large editdistance thresholds. We propose a trie-based method to address these problems. We partition each entity into a set of segments. We prove that if a substring of the document is similar to an entity, it must contain a segment of the entity. Based on this observation, we first search segments from the document, and then extend the matching segments in both entities and the document to find similar pairs. To facilitate searching segments, we use a trie structure to index segments and develop an efficient trie-based algorithm. We develop an extension-based method to efficiently find similar string pairs by extending the matching segments. We optimize our partition scheme and select the best partition strategy to improve the extraction performance. The experimental results show that our method achieves much higher performance compared with state-of-the-art studies.

Project page with author contact information. Code coming soon.

In case you are wondering about the path for the project including the word “taste:”

To address these problems, we propose a trie-based method for dictionary-based approximate entity extraction with edit distance constraints, called TASTE. TASTE does not need to tune parameters. Moreover TASTE achieves much higher performance, even for large edit-distance thresholds.

Is there a word for a person who creates acronyms? Acronymist perhaps?

Deeply interesting paper on the use of tries for entity extraction. Interesting due to its performance but also because of its approach.

You do remember that tries were what made the original e-version of the OED (Oxford English Dictionary) possible? Extremely responsive on less powerful machines than in your smart phone.

One wonders how starting from a set of entities this approach would fare against the TREC legal archives? But that would be “seeding” the application with entities. It may be that being given queries against a dark corpus isn’t all that realistic.

Populating the Semantic Web…

Saturday, March 3rd, 2012

Populating the Semantic Web – Combining Text and Relational Databases as RDF Graphs by Kate Bryne.

I ran across this while looking for RDF graph material today. Delighted to find someone interested in the problem of what do we do with existing data, even if new data is in some semantic web format?

Abstract:

The Semantic Web promises a way of linking distributed information at a granular level by interconnecting compact data items instead of complete HTML pages. New data is gradually being added to the SemanticWeb but there is a need to incorporate existing knowledge. This thesis explores ways to convert a coherent body of information from various structured and unstructured formats into the necessary graph form. The transformation work crosses several currently active disciplines, and there are further research questions that can be addressed once the graph has been built.

Hybrid databases, such as the cultural heritage one used here, consist of structured relational tables associated with free text documents. Access to the data is hampered by complex schemas, confusing terminology and difficulties in searching the text effectively. This thesis describes how hybrid data can be unified by assembly into a graph. A major component task is the conversion of relational database content to RDF. This is an active research field, to which this work contributes by examining weaknesses in some existing methods and proposing alternatives.

The next significant element of the work is an attempt to extract structure automatically from English text using natural language processing methods. The first claim made is that the semantic content of the text documents can be adequately captured as a set of binary relations forming a directed graph. It is shown that the data can then be grounded using existing domain thesauri, by building an upper ontology structure from these. A schema for cultural heritage data is proposed, intended to be generic for that domain and as compact as possible.

Another hypothesis is that use of a graph will assist retrieval. The structure is uniform and very simple, and the graph can be queried even if the predicates (or edge labels) are unknown. Additional benefits of the graph structure are examined, such as using path length between nodes as a measure of relatedness (unavailable in a relational database where there is no equivalent concept of locality), and building information summaries by grouping the attributes of nodes that share predicates.

These claims are tested by comparing queries across the original and the new data structures. The graph must be able to answer correctly queries that the original database dealt with, and should also demonstrate valid answers to queries that could not previously be answered or where the results were incomplete.

This will take some time to read but it looks quite enjoyable.

Workshop on Entity-Oriented Search (EOS) – Beijing – Proceedings

Monday, January 16th, 2012

Workshop on Entity-Oriented Search (EOS) – Beijing – Proceedings (PDF file)

There you will find:

Session 1:

  • High Performance Clustering for Web Person Name Disambiguation Using Topic Capturing by Zhengzhong Liu, Qin Lu, and Jian Xu (The Hong Kong Polytechnic University)
  • Extracting Dish Names from Chinese Blog Reviews Using Suffix Arrays and a Multi-Modal CRF Model by Richard Tzong-Han Tsai (Yuan Ze University, Taiwan)
  • LADS: Rapid Development of a Learning-To-Rank Based Related Entity Finding System using Open Advancement by Bo Lin, Kevin Dela Rosa, Rushin Shah, and Nitin Agarwal (Carnegie Mellon University)
  • Finding Support Documents with a Logistic Regression Approach by Qi Li and Daqing He (University of Pittsburgh)
  • The Sindice-2011 Dataset for Entity-Oriented Search in the Web of Data by Stephane Campinas (National University of Ireland), Diego Ceccarelli (University of Pisa), Thomas E. Perry (National University of Ireland), Renaud Delbru (National University of Ireland), Krisztian Balog (Norwegian University of Science and Technology) and Giovanni Tummarello (National University of Ireland)

Session 2

  • Cross-Domain Bootstrapping for Named Entity Recognition by Ang Sun and Ralph Grishman (New York University)
  • Semi-supervised Statistical Inference for Business Entities Extraction and Business Relations Discovery by Raymond Y.K. Lau and Wenping Zhang (City University of Hong Kong)
  • Unsupervised Related Entity Finding by Olga Vechtomova (University of Waterloo)

Session 3

  • Learning to Rank Homepages For Researcher-Name Queries by Sujatha Das, Prasenjit Mitra, and C. Lee Giles (The Pennsylvania State University)
  • An Evaluation Framework for Aggregated Temporal Information Extraction by Enrique Amigó, (UNED University), Javier Artiles (City University of New York), Heng Hi (City University of New York) and Qi Li (City University of New York)
  • Entity Search Evaluation over Structured Web Data by Roi Blanco (Yahoo! Research), Harry Halpin (University of Edinburgh), Daniel M. Herzig (Karlsruhe Institute of Technology), Peter Mika (Yahoo! Research), Jeffrey Pound (University of Waterloo), Henry S. Thompson (University of Edinburgh) and Thanh Tran Duc (Karlsruhe Institute of Technology)

A good start on what promises to be a strong conference series on entity-oriented search.

IBM Redbooks Reveals Content Analytics

Saturday, December 17th, 2011

IBM Redbooks Reveals Content Analytics

From Beyond Search:

IBM Redbooks has put out some juicy reading for the azure chip consultants wanting to get smart quickly with IBM Content Analytics Version 2.2: Discovering Actionable Insight from Your Content. The sixteen chapters of this book take the reader from an overview of IBM content analytics, through understanding the details, to troubleshooting tips. The above link provides an abstract of the book, as well as links to download it as a PDF, view in HTML/Java, or order a hardcopy.

Abstract:

With IBM® Content Analytics Version 2.2, you can unlock the value of unstructured content and gain new business insight. IBM Content Analytics Version 2.2 provides a robust interface for exploratory analytics of unstructured content. It empowers a new class of analytical applications that use this content. Through content analysis, IBM Content Analytics provides enterprises with tools to better identify new revenue opportunities, improve customer satisfaction, and provide early problem detection.

To help you achieve the most from your unstructured content, this IBM Redbooks® publication provides in-depth information about Content Analytics. This book examines the power and capabilities of Content Analytics, explores how it works, and explains how to design, prepare, install, configure, and use it to discover actionable business insights.

This book explains how to use the automatic text classification capability, from the IBM Classification Module, with Content Analytics. It explains how to use the LanguageWare® Resource Workbench to create custom annotators. It also explains how to work with the IBM Content Assessment offering to timely decommission obsolete and unnecessary content while preserving and using content that has business value.

The target audience of this book is decision makers, business users, and IT architects and specialists who want to understand and use their enterprise content to improve and enhance their business operations. It is also intended as a technical guide for use with the online information center to configure and perform content analysis with Content Analytics.

The cover article points out the Redbooks have an IBM slant, which isn’t surprising. When you need big iron for an enterprise project, that IBM is one of a handful of possible players isn’t surprising either.

Template-Based Information Extraction without the Templates

Monday, November 28th, 2011

Template-Based Information Extraction without the Templates by Nathanael Chambers and Dan Jurafsky.

Abstract:

Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.

Can you say association?

Definitely points towards a pipeline approach to topic map authoring. To abuse the term, perhaps a “dashboard” that allows selection of data sources followed by the construction of workflows with preliminary analysis being displayed at “breakpoints” in the processing. No particular reason why stages have to be wired together other than tradition.

Just looking a little bit into the future, imagine that some entities weren’t being recognized at a high enough rate. So you shift that part of the data to several thousand human entity processors and take the average of their results, higher than what you were getting and feed that back into the system. Could have knowledge workers who work full time but shift from job to job performing tasks too difficult to program effectively.

Entities, Relationships, and Semantics: the State of Structured Search

Saturday, November 12th, 2011

Entities, Relationships, and Semantics: the State of Structured Search

Jeff Dalton’s notes on a panel discussion moderated by Daniel Tunkelang. The panel consisted of Andrew Hogue (Google NY), Breck Baldwin (alias-i), Evan Sandhause (NY Times), and Wlodek Zadrozny (IBM. Watson).

Read the notes, watch the discussion.

BTW, Sandhause (New York Times) points out that librarians have been working with structured data for a very long time.

So, libraries want to be more like web search engines and the folks building search engines want to be more like libraries.

Sounds to me like both communities need to spend more time reading each others blogs, cross-attending conferences, etc.

Introducing fise, the Open Source RESTful Semantic Engine

Saturday, October 22nd, 2011

Introducing fise, the Open Source RESTful Semantic Engine

From the post:

fise is now known as the Stanbol Enhancer component of the Apache Stanbol incubating project.

As a member of the IKS european project Nuxeo contributes to the development of an Open Source software project named fise whose goal is to help bring new and trendy semantic features to CMS by giving developers a stack of reusable HTTP semantic services to build upon.

Presenting the software in Q/A form:

What is a Semantic Engine?

A semantic engine is a software component that extracts the meaning of a electronic document to organize it as partially structured knowledge and not just as a piece of unstructured text content.

Current semantic engines can typically:

  • categorize documents (is this document written in English, Spanish, Chinese? is this an article that should be filed under the  Business, Lifestyle, Technology categories? …);
  • suggest meaningful tags from a controlled taxonomy and assert there relative importance with respect to the text content of the document;
  • find related documents in the local database or on the web;
  • extract and recognize mentions of known entities such as famous people, organizations, places, books, movies, genes, … and link the document to there knowledge base entries (like a biography for a famous person);
  • detect yet unknown entities of the same afore mentioned types to enrich the knowledge base;
  • extract knowledge assertions that are present in the text to fill up a knowledge base along with a reference to trace the origin of the assertion. Examples of such assertions could be the fact that a company is buying another along with the amount of the transaction, the release date of a movie, the new club of a football player…

During the last couple of years, many such engines have been made available through web-based API such as Open Calais, Zemanta and Evri just to name a few. However to our knowledge there aren't many such engines distributed under an Open Source license to be used offline, on your private IT infrastructure with your sensitive data.

Impressive work that I found through a later post on using this software on Wikipedia. See Mining Wikipedia with Hadoop and Pig for Natural Language Processing.

Traditional Entity Extraction’s Six Weaknesses

Wednesday, September 28th, 2011

Traditional Entity Extraction’s Six Weaknesses

From the post:

Most university programming courses ignore entity extraction. Some professors talk about the challenges of identifying people, places, things, events, Social Security Numbers and leave the rest to the students. Other professors may have an assignment related to parsing text and detecting anomalies or bound phrases. But most of those emerging with a degree in computer science consign the challenge of entity extraction to the Miscellaneous file.

Entity extraction means processing text to identify, tag, and properly account for those elements that are the names of person, numbers, organizations, locations, and expressions such as a telephone number, among other items. An entity can consist of a single word like Cher or a bound sequence of words like White House. The challenge of figuring out names is tough one for several reasons. Many names exist in richly varied forms. You can find interesting naming conventions in street addresses in Madrid, Spain, and for the owner of a falafel shop in Tripoli.

Entities, as information retrieval experts have learned since the first DARPA conference on the subject in 1987, are quite important to certain types of content analysis. Digital Reasoning has been working for more than 11 years on entity extraction and related content processing problems. Entity oriented analytics have become a very important issue these days as companies deal with too much data, the need to understand the meaning and not the just the statistics of the data and finally to understand entities in context – critical to understanding code terms, etc.

I want to highlight the six weaknesses of traditional entity extraction and highlight Digital Reasoning’s patented, fully automated method. Let’s look at the weaknesses.

www.digitalreasoning.com

For my library class: No, I am not endorsing this product and yes it is a promotional piece. You are going to encounter those as librarians for your entire careers. And you are going to need to be able to ask questions that focus on the information needs of your library and its patrons. Not what the software is said to do well.

Read the full piece and visit the product’s website. What would you ask? Why? What more information do you think you would need?

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Monday, June 20th, 2011

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction by Guoliang Li, Dong Deng, and Jianhua Feng in SIGMOD ’11.

Abstract:

Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) from a document. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings in the document that approximately match entities in a given dictionary. Existing methods to address this problem support either token-based similarity (e.g., Jaccard Similarity) or character-based dissimilarity (e.g., Edit Distance). It calls for a unified method to support various similarity/dissimilarity functions, since a unified method can reduce the programming efforts, hardware requirements, and the manpower. In addition, many substrings in the document have overlaps, and we have an opportunity to utilize the shared computation across the overlaps to avoid unnecessary redundant computation. In this paper, we propose a unified framework to support many similarity/dissimilarity functions, such as jaccard similarity, cosine similarity, dice similarity, edit similarity, and edit distance. We devise efficient filtering algorithms to utilize the shared computation and develop effective pruning techniques to improve the performance. The experimental results show that our method achieves high performance and outperforms state-of-the-art studies.

Entity extraction should be high on your list of topic map skills. A good set of user defined dictionaries is a good start. Creating those dictionaries as topic maps is an even better start.

On string metrics, you might want to visit Sam’s String Metrics, which lists some thirty algorithms.

Workshop on Entity-Oriented Search (EOS) – Beijing

Wednesday, June 1st, 2011

The First International Workshop on Entity-Oriented Search (EOS)

Important Dates

Submissions due: June 10, 2011
Notification of acceptance: June 25, 2011
Camera-ready submission: July 1, 2011 (provisional, awaiting confirmation)
Workshop date: July 28, 2011

From the website:

Workshop Theme

Many user information needs concern entities: people, organizations, locations, products, etc. These are better answered by returning specific objects instead of just any type of documents. Both commercial systems and the research community are displaying an increased interest in returning “objects”, “entities”, or their properties in response to a user’s query. While major search engines are capable of recognizing specific types of objects (e.g., locations, events, celebrities), true entity search still has a long way to go.

Entity retrieval is challenging as “objects” unlike documents, are not directly represented and need to be identified and recognized in the mixed space of structured and unstructured Web data. While standard document retrieval methods applied to textual representations of entities do seem to provide reasonable performance, a big open question remains how much influence the entity type should have on the ranking algorithms developed.

Avoiding repeated document searching by successive users will require identification as suggested here. Sub-document addressing and retrieval of portions of documents is another aspect to the entity issue.

Disease Named Entity Recognition

Tuesday, March 22nd, 2011

Disease named entity recognition using semisupervised learning and conditional random fields.

Nichalin, S., Zhu, Z., & Hsinchun, C. (2011). Disease named entity recognition using semisupervised learning and conditional random fields. Journal of the American Society for Information Science & Technology, 62(4), 727-737.

Abstract:

Information extraction is an important text-mining task that aims at extracting prespecified types of information from large text collections and making them available in structured representations such as databases. In the biomedical domain, information extraction can be applied to help biologists make the most use of their digital-literature archives. Currently, there are large amounts of biomedical literature that contain rich information about biomedical substances. Extracting such knowledge requires a good named entity recognition technique. In this article, we combine conditional random fields (CRFs), a state-of-the-art sequence-labeling algorithm, with two semisupervised learning techniques, bootstrapping and feature sampling, to recognize disease names from biomedical literature. Two data-processing strategies for each technique also were analyzed: one sequentially processing unlabeled data partitions and another one processing unlabeled data partitions in a round-robin fashion. The experimental results showed the advantage of semisupervised learning techniques given limited labeled training data. Specifically, CRFs with bootstrapping implemented in sequential fashion outperformed strictly supervised CRFs for disease name recognition.

Not to take anything away from this sort of technique, which would stand in good stead for topic map construction, but I am left feeling like it stops short of the mark.

In other words, say that I am happy with the result of its recognition, how do I share that with someone else, who has another set of identified subjects, perhaps from the same data?

Or for that matter, how do I combine it with data that I myself have extracted from the same data?

Can’t very well ask the software why it “recognized” one name or another can I?

Thinking I would have to add what seemed to me to be useful information to the name, in order to re-use it with other data.

Starting to sound like a topic map isn’t it?