Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 21, 2011

Reusable TokenStreams

Filed under: Lucene,Text Analytics — Patrick Durusau @ 7:21 pm

Reusable TokenStreams by Chris Male.

Abstract:

This white paper covers how Lucene’s text analysis system works today and explores the system and provides an understanding of what a TokenStream is, what the difference between Analyzers, TokenFilters and Tokenizers are, and how reuse impacts the design and implementation of each of these components.

A useful treatment of Lucene’s text analysis features. Those features are still developing, and further changes are promised, though the details are left rather vague.

One feature covered in the paper that is of particular interest is the ability to associate geographic location data with terms deemed to represent locations.

It occurs to me that such a feature could also be used during text analysis to annotate terms with subject identifiers.

An application doesn’t have to “understand” that terms have different meanings so long as it can distinguish one from another based on annotations. (Or map them together despite different identifiers.)
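To make the idea concrete, here is a minimal sketch in plain Python. It does not use Lucene or imitate its API. The Token class, the tokenize, annotate and group_by_subject functions, the two lookup tables and the example.org identifiers are all hypothetical names invented for illustration. The sketch assumes that some earlier step has already decided which lookup table applies to which text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Set


@dataclass
class Token:
    """A term plus an optional annotation (hypothetical, not a Lucene class)."""
    term: str
    position: int
    subject_id: Optional[str] = None


def tokenize(text: str) -> Iterator[Token]:
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    for position, raw in enumerate(text.split()):
        term = raw.strip(".,;:!?\"'()").lower()
        if term:
            yield Token(term, position)


def annotate(tokens: Iterable[Token], subjects: Dict[str, str]) -> Iterator[Token]:
    """Attach a subject identifier to every token found in the lookup table."""
    for token in tokens:
        token.subject_id = subjects.get(token.term)
        yield token


def group_by_subject(tokens: Iterable[Token]) -> Dict[str, Set[str]]:
    """Collect the distinct terms seen under each subject identifier."""
    groups: Dict[str, Set[str]] = {}
    for token in tokens:
        if token.subject_id is not None:
            groups.setdefault(token.subject_id, set()).add(token.term)
    return groups


# Assumed lookup tables: the same term gets a different identifier per domain.
PLANET = "http://example.org/subject/planet-mercury"
ELEMENT = "http://example.org/subject/element-mercury"
ASTRONOMY = {"mercury": PLANET}
CHEMISTRY = {"mercury": ELEMENT, "quicksilver": ELEMENT}

astro = list(annotate(tokenize("Mercury orbits the Sun."), ASTRONOMY))
chem = list(annotate(tokenize("Quicksilver is another name for mercury."), CHEMISTRY))
groups = group_by_subject(astro + chem)

for subject_id, terms in sorted(groups.items()):
    print(subject_id, sorted(terms))
```

The code never needs to know what “mercury” means. The two occurrences are kept apart because their identifiers differ, and “quicksilver” and “mercury” are mapped together in the chemistry text because they share one.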
