Computing Document Similarity using Lucene Term Vectors
From the post:
Someone asked me a question recently about implementing document similarity, and since he was using Lucene, I pointed him to the Lucene Term Vector API. I hadn’t used the API myself, but I knew in general how it worked, so when he came back and told me that he could not make it work, I wanted to try it out for myself, to give myself a basis for asking further questions.
I already had a Lucene index (built by SOLR) of about 3000 medical articles for whose content field I had enabled term vectors as part of something I was trying out for highlighting, so I decided to use that. If you want to follow along and have to build your index from scratch, you can either use a field definition in your SOLR schema.xml file similar to this:
A nice walk-through of working with document term vectors.
Plus a reminder that “document” similarity can only take you so far. Once you find a relevant document, you still have to search within it for the subject of interest. And even then, you see that subject cut off from its relationships to other subjects.
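For readers who have not met the idea before, similarity between term vectors is commonly measured as the cosine of the angle between them. The sketch below is only an illustration of that general technique in plain Python. It is not the code from the quoted post and does not use the Lucene Term Vector API; the function names `term_vector` and `cosine_similarity` are hypothetical, and it assumes naive whitespace tokenization with raw term counts rather than anything an analyzer or TF-IDF weighting would produce.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector (assumption: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(freq * b[term] for term, freq in a.items() if term in b)
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction, so treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = term_vector("heart disease risk factors")
    doc_b = term_vector("risk factors for heart attack")
    doc_c = term_vector("lucene index term vectors")
    print(cosine_similarity(doc_a, doc_b))  # shares three terms with doc_a
    print(cosine_similarity(doc_a, doc_c))  # shares no terms with doc_a
```

Note that the score says only that two documents use similar vocabulary. It says nothing about where in either document the subject of interest appears, which is the limitation raised above.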