A Survey of Stochastic and Gazetteer Based Approaches for Named Entity Recognition – Part 2 by Benjamin Bengfort.
From the post:
Generally speaking, the most effective named entity recognition systems can be categorized as rule-based, gazetteer and machine learning approaches. Within each of these approaches are a myriad of sub-approaches that combine to varying degrees each of these top-level categorizations. However, because of the research challenge posed by each approach, typically one or the other is focused on in the literature.
Rule-based systems utilize pattern-matching techniques in text as well as heuristics derived either from the morphology or the semantics of the input sequence. They are generally used as classifiers in machine-learning approaches, or as candidate taggers in gazetteers. Some applications can also make effective use of stand-alone rule-based systems, but they are prone to both overreach and skipping over named entities. Rule-based approaches are discussed in (10), (12), (13), and (14).
Gazetteer approaches make use of some external knowledge source to match chunks of the text via some dynamically constructed lexicon or gazette to the names and entities. Gazetteers also further provide a non-local model for resolving multiple names to the same entity. This approach requires either the hand crafting of name lexicons or some dynamic approach to obtaining a gazette from the corpus or another external source. However, gazette based approaches achieve better results for specific domains. Most of the research on this topic focuses on the expansion of the gazetteer to more dynamic lexicons, e.g. the use of Wikipedia or Twitter to construct the gazette. Gazette based approaches are discussed in (15), (16), and (17).
Stochastic approaches fare better across domains, and can perform predictive analysis on entities that are unknown in a gazette. These systems use statistical models and some form of feature identification to make predictions about named entities in text. They can further be supplemented with smoothing for universal coverage. Unfortunately these approaches require large amounts of annotated training data in order to be effective, and they don’t naturally provide a non-local model for entity resolution. Systems implemented with this approach are discussed in (7), (8), (4), (9), and (6).
Benjamin continues his excellent survey of named entity recognition techniques.
All of these techniques may prove useful when constructing topic maps from source materials.
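To make the excerpt a little more concrete, here is a minimal sketch of a rule-based candidate tagger feeding a gazetteer lookup, as the quoted passage describes. It is my own illustration, not code from Benjamin's survey: the `GAZETTE` entries, the entity identifiers, the capitalization rule, and the `tag_entities` function are all assumptions made for the example.

```python
import re

# Hypothetical hand-crafted gazette: lowercased surface form -> entity id.
# Several names map to one id, a crude stand-in for the "non-local model
# for resolving multiple names to the same entity".
GAZETTE = {
    "international business machines": "IBM",
    "ibm": "IBM",
    "big blue": "IBM",
    "new york city": "New_York_City",
    "nyc": "New_York_City",
}

# Assumed rule: a candidate is a run of capitalized words separated by
# single spaces. Real rule sets are far richer than this.
CANDIDATE = re.compile(r"\b[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*")


def tag_entities(text):
    """Return (surface, entity_id) pairs for candidates found in the gazette."""
    found = []
    for match in CANDIDATE.finditer(text):
        surface = match.group(0)
        entity = GAZETTE.get(surface.lower())
        if entity is not None:  # drop candidates the gazette does not know
            found.append((surface, entity))
    return found


if __name__ == "__main__":
    sentence = ("Analysts say IBM, also known as Big Blue, "
                "will open an office in New York City.")
    for surface, entity in tag_entities(sentence):
        print(surface, "->", entity)
```

Both weaknesses mentioned in the excerpt show up even at this scale. The capitalization rule overreaches by proposing "Analysts" as a candidate (the gazette filters it out), and it skips a lowercase "nyc" entirely, even though the gazette has an entry for it. For topic map purposes, the shared entity id is the interesting part: several names landing on one id looks a lot like several names attached to one topic.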