TF-IDF using flambo by Muslim Baig.
From the post:
flambo is a Clojure DSL for Spark created by the data team at Yieldbot. It allows you to create and manipulate Spark data structures using idiomatic Clojure. The following tutorial demonstrates typical flambo API usage and facilities by implementing the classic tf-idf algorithm.
The complete runnable file of the code presented in this tutorial is located under the flambo.example.tfidf namespace, under the flambo /test/flambo/example directory. We recommend you download flambo and follow along in your REPL.
…
Working through the Clojure code, you will get a better understanding of the TF-IDF algorithm.
I don’t know if it was intentional, but the division of the data into “documents” illustrates one of the fundamental questions for most indexing techniques:
What do you mean by document?
It is a non-trivial question and one that has a major impact on the results of the algorithm.
If I get to choose what is considered a “document,” then I can weight the results while using the same algorithm as everyone else.
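Think about it. I can split the text so that the term “example” appears once in each of my “documents,” as opposed to three times in a single document. In the first case the term occurs in every document, so its inverse document frequency, and with it its weight, falls toward zero; in the second case it stays distinctive. See the last section in the Wikipedia article tf-idf for the impact of such splitting.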
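To make the effect concrete, here is a minimal sketch in plain Python (not flambo, and not the tutorial's code). It assumes one common tf-idf variant: raw term counts for tf and an unsmoothed log(N / df) for idf. The function name `tf_idf` and both toy corpora are made up for illustration; the two corpora contain exactly the same words, only the document boundaries differ.

```python
import math
from collections import Counter


def tf_idf(docs, term):
    """Return the tf-idf score of `term` in each document.

    Assumed variant: tf is the raw count of the term in a document,
    idf is log(N / df), where N is the number of documents and df is
    the number of documents that contain the term.
    """
    tokenized = [doc.split() for doc in docs]
    df = sum(1 for tokens in tokenized if term in tokens)
    if df == 0:
        return [0.0] * len(docs)
    idf = math.log(len(docs) / df)
    return [Counter(tokens)[term] * idf for tokens in tokenized]


# Split 1: "example" appears three times, all in one document.
one_doc = [
    "example example example of usage",
    "a note on spark",
    "clojure dsl for spark",
]

# Split 2: the same words, but "example" appears once in every document.
spread = [
    "example of usage",
    "example a note on spark",
    "example clojure dsl for spark",
]

scores_one = tf_idf(one_doc, "example")
scores_spread = tf_idf(spread, "example")

print(scores_one)
print(scores_spread)
```

With the first split, “example” scores 3 × log(3) in the one document that contains it. With the second split it appears in every document, so log(3 / 3) is zero and the term carries no weight anywhere. Same words, same algorithm, different answer.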
Other algorithms are subject to similar manipulation. It is never enough to know the algorithms applied to data; you need to see the data itself.