Lucene’s TokenStreams are actually graphs!

Mike McCandless starts:

Lucene’s TokenStream class produces the sequence of tokens to be indexed for a document’s fields. The API is an iterator: you call incrementToken to advance to the next token, and then query specific attributes to obtain the details for that token. For example, CharTermAttribute holds the text of the token; OffsetAttribute has the character start and end offset into the original string corresponding to this token, for highlighting purposes. There are a number of standard token attributes, and some tokenizers add their own attributes.
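The iterator pattern Mike describes can be sketched in plain Java. This is a toy model and not Lucene's own classes: ToyTokenStream, its public fields, and the describe helper are all hypothetical names made up for illustration, and the whitespace splitting is an assumption standing in for a real tokenizer. The fields only mimic the roles of CharTermAttribute and OffsetAttribute.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a token stream; NOT Lucene's real TokenStream class. */
class ToyTokenStream {
    private final String text;
    private int pos = 0;   // scan position in the original string

    // "Attributes" of the current token, overwritten on every advance.
    String term;           // plays the role of CharTermAttribute
    int startOffset;       // plays the role of OffsetAttribute (start)
    int endOffset;         // plays the role of OffsetAttribute (end, exclusive)

    ToyTokenStream(String text) {
        this.text = text;
    }

    /** Advances to the next whitespace-separated token; false when exhausted. */
    boolean incrementToken() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (pos >= text.length()) {
            return false;
        }
        startOffset = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        endOffset = pos;
        term = text.substring(startOffset, endOffset);
        return true;
    }
}

class TokenStreamSketch {
    /** Drains the stream, formatting each token as term[start,end). */
    static List<String> describe(String text) {
        ToyTokenStream ts = new ToyTokenStream(text);
        List<String> out = new ArrayList<>();
        while (ts.incrementToken()) {
            // Query the per-token state only after a successful advance.
            out.add(ts.term + "[" + ts.startOffset + "," + ts.endOffset + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        describe("fast wi fi network").forEach(System.out::println);
    }
}
```

The point of the sketch is the calling convention: the consumer never receives a token object, it advances the stream and then reads whatever state the stream currently exposes.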

He goes on to illustrate how SynonymFilter creates graphs, and discusses other aspects of building graphs from TokenStreams, including where graph production still needs to be added and issues in the current implementation.
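To make the "graph" reading concrete, here is a rough sketch of how a flat token sequence can encode arcs. As I understand it, Lucene tokens carry a position increment (PositionIncrementAttribute) and a position length (PositionLengthAttribute): the increment says how far the token's start node is from the previous token's start node, and the length says how many positions the token spans. The GraphToken class, the arcs helper, and the token values below are hypothetical and hand-written for illustration; they are not output captured from SynonymFilter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token record: term plus position increment and position length. */
class GraphToken {
    final String term;
    final int posInc;   // distance from the previous token's start node
    final int posLen;   // number of positions this token spans

    GraphToken(String term, int posInc, int posLen) {
        this.term = term;
        this.posInc = posInc;
        this.posLen = posLen;
    }
}

class TokenGraphSketch {
    /** Turns a token sequence into arcs of the form "from -term-> to". */
    static List<String> arcs(List<GraphToken> tokens) {
        List<String> out = new ArrayList<>();
        int from = -1;  // so that a first token with posInc = 1 leaves node 0
        for (GraphToken t : tokens) {
            from += t.posInc;          // posInc = 0 stacks on the previous start node
            int to = from + t.posLen;  // posLen > 1 skips over intermediate nodes
            out.add(from + " -" + t.term + "-> " + to);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed example: "wifi" injected as a synonym spanning "wi fi".
        List<GraphToken> tokens = List.of(
            new GraphToken("fast", 1, 1),
            new GraphToken("wi", 1, 1),
            new GraphToken("wifi", 0, 2),
            new GraphToken("fi", 1, 1),
            new GraphToken("network", 1, 1));
        arcs(tokens).forEach(System.out::println);
    }
}
```

In this assumed example, "wifi" leaves the same node as "wi" but arrives where "fi" arrives, so the sequence describes two parallel paths between the same pair of nodes.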

If you see any of the Neo4j folks at the graph meetup in Chicago later today, you might want to mention Mike’s post to them.
