Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 26, 2011

Computing Document Similarity using Lucene Term Vectors

Filed under: Lucene,Similarity,Vectors — Patrick Durusau @ 6:58 pm

Computing Document Similarity using Lucene Term Vectors

From the post:

Someone asked me a question recently about implementing document similarity, and since he was using Lucene, I pointed him to the Lucene Term Vector API. I hadn’t used the API myself, but I knew in general how it worked, so when he came back and told me that he could not make it work, I wanted to try it out for myself, to give myself a basis for asking further questions.

I already had a Lucene index (built by SOLR) of about 3000 medical articles for whose content field I had enabled term vectors as part of something I was trying out for highlighting, so I decided to use that. If you want to follow along and have to build your index from scratch, you can either use a field definition in your SOLR schema.xml file similar to this:

Nice walk through on document vectors.

Plus a reminder that “document” similarity can only take you so far. Once you find a relevant document, you still have to search for the subject of interest. Not to mention that you view that subject absent its relationship to other subjects, etc.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress