Text Feature Extraction (tf-idf) – Part 1 by Christian Perone.
To give you a taste of the post:
Short introduction to Vector Space Model (VSM)
In information retrieval or text mining, the term frequency – inverse document frequency also called tf-idf, is a well know method to evaluate how important is a word in a document. tf-idf are also a very interesting way to convert the textual representation of information into a Vector Space Model (VSM), or into sparse features, we’ll discuss more about it later, but first, let’s try to understand what is tf-idf and the VSM.
VSM has a very confusing past, see for example the paper The most influential paper Gerard Salton Never Wrote that explains the history behind the ghost cited paper which in fact never existed; in sum, VSM is an algebraic model representing textual information as a vector, the components of this vector could represent the importance of a term (tf–idf) or even the absence or presence (Bag of Words) of it in a document; it is important to note that the classical VSM proposed by Salton incorporates local and global parameters/information (in a sense that it uses both the isolated term being analyzed as well the entire collection of documents). VSM, interpreted in a lato sensu, is a space where text is represented as a vector of numbers instead of its original string textual representation; the VSM represents the features extracted from the document.
The link to the The most influential paper Gerard Salton Never Wrote fails. Try the cached copy at CiteSeer: The most influential paper Gerard Salton Never Wrote.