Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 20, 2011

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Filed under: Algorithms,Entity Extraction,String Matching — Patrick Durusau @ 3:31 pm

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction by Guoliang Li, Dong Deng, and Jianhua Feng in SIGMOD ’11.

Abstract:

Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) from a document. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings in the document that approximately match entities in a given dictionary. Existing methods to address this problem support either token-based similarity (e.g., Jaccard Similarity) or character-based dissimilarity (e.g., Edit Distance). It calls for a unified method to support various similarity/dissimilarity functions, since a unified method can reduce the programming efforts, hardware requirements, and the manpower. In addition, many substrings in the document have overlaps, and we have an opportunity to utilize the shared computation across the overlaps to avoid unnecessary redundant computation. In this paper, we propose a unified framework to support many similarity/dissimilarity functions, such as jaccard similarity, cosine similarity, dice similarity, edit similarity, and edit distance. We devise efficient filtering algorithms to utilize the shared computation and develop effective pruning techniques to improve the performance. The experimental results show that our method achieves high performance and outperforms state-of-the-art studies.

Entity extraction should be high on your list of topic map skills. A good set of user defined dictionaries is a good start. Creating those dictionaries as topic maps is an even better start.

On string metrics, you might want to visit Sam’s String Metrics, which lists some thirty algorithms.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress