TF-IDF Weight Vectors With Lucene And Mahout

How To Easily Build And Observe TF-IDF Weight Vectors With Lucene And Mahout

From the website:

You have a collection of text documents, and you want to build their TF-IDF weight vectors, probably before doing some clustering on the collection or other related tasks.

You would like to be able for instance to see what are the tokens with the biggest TF-IDF weights in any given document of the collection.

Lucene and Mahout can help you to do that almost in a snap.

Why is this important for topic maps?

Wikipedia reports:

The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query. (cited in this posting)
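To make that definition concrete, here is a minimal sketch in plain Python. It does not use Lucene or Mahout and is not the method from the linked article. It assumes one common variant of the scheme (raw term count multiplied by log(N / df)) and naive whitespace tokenization. The function names `tfidf_vectors` and `top_tokens` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {token: weight} dict per document.

    Assumptions (not taken from the article): lowercase whitespace
    tokenization, tf = raw count in the document, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each token occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {tok: count * math.log(n_docs / df[tok]) for tok, count in tf.items()}
        )
    return vectors


def top_tokens(vector, k=3):
    """Return the k highest-weighted (token, weight) pairs; ties sort by token."""
    return sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    # Hypothetical three-document collection.
    docs = [
        "topic maps merge subjects",
        "lucene indexes text and lucene searches text",
        "mahout clusters text vectors",
    ]
    for doc, vec in zip(docs, tfidf_vectors(docs)):
        print(doc, "->", top_tokens(vec, k=2))
```

A token that occurs in every document gets weight zero under this variant, which is the "offset by the frequency of the word in the corpus" part of the definition. Real toolkits typically apply smoothing and length normalization on top of this.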

Knowing the important terms in a document collection is one step towards a useful topic map. It may not be definitive, but it is a step in the right direction.
