The task of identifying proper names of people, organizations, locations, or other entities is a subtask of information extraction from natural language documents. This paper presents a survey of techniques and methodologies that are currently being explored to solve this difficult subtask. After a brief review of the challenges of the task, as well as a look at previous conventional approaches, the focus will shift to a comparison of stochastic and gazetteer based approaches. Several machine-learning approaches are identified and explored, as well as a discussion of knowledge acquisition relevant to recognition. This two-part white paper will show that applications that require named entity recognition will be served best by some combination of knowledge- based and non-deterministic approaches.
In school we were taught that a proper noun was “a specific person, place, or thing,” thus extending our definition from a concrete noun. Unfortunately, this seemingly simple mnemonic masks an extremely complex computational linguistic task—the extraction of named entities, e.g. persons, organizations, or locations from corpora (1). More formally, the task of Named Entity Recognition and Classification can be described as the identification of named entities in computer readable text via annotation with categorization tags for information extraction.
Not only is named entity recognition a subtask of information extraction, but it also plays a vital role in reference resolution, other types of disambiguation, and meaning representation in other natural language processing applications. Semantic parsers, part of speech taggers, and thematic meaning representations could all be extended with this type of tagging to provide better results. Other, NER-specific, applications abound including question and answer systems, automatic forwarding, textual entailment, and document and news searching. Even at a surface level, an understanding of the named entities involved in a document provides much richer analytical frameworks and cross-referencing.
Named entities have three top-level categorizations according to DARPA’s Message Understanding Conference: entity names, temporal expressions, and number expressions (2). Because the entity names category describes the unique identifiers of people, locations, geopolitical bodies, events, and organizations, these are usually referred to as named entities and as such, much of the literature discussed in this paper focuses solely on this categorization, although it is easy to imagine extending the proposed systems to cover the full MUC-7 task. Further, the CoNLL-2003 Shared Task, upon which the standard of evaluation for such systems is based, only evaluates the categorization of organizations, persons, locations, and miscellaneous named entities. For example:
(ORG S.E.C.) chief (PER Mary Shapiro) to leave (LOC Washington) in December.
This sentence contains three named entities that demonstrate many of the complications associated with named entity recognition. First, S.E.C. is an acronym for the Securities and Exchange Commission, which is an organization. The two words “Mary Shapiro” indicate a single person, and Washington, in this case, is a location and not a name. Note also that the token “chief” is not included in the person tag, although it very well could be. In this scenario, it is ambiguous if “S.E.C. chief Mary Shapiro” is a single named entity, or if multiple, nested tags would be required.
Nice introduction to the area and ends with a great set of references.