Archive for the ‘Vector Space Model (VSM)’ Category

An Indexing Structure for Dynamic Multidimensional Data in Vector Space

Saturday, September 1st, 2012

An Indexing Structure for Dynamic Multidimensional Data in Vector Space by Elena Mikhaylova, Boris Novikov and Anton Volokhov. (Advances in Databases and Information Systems, Advances in Intelligent Systems and Computing, 2013, Volume 186, 185-193, DOI: 10.1007/978-3-642-32741-4_17)

Abstract:

The multidimensional k – NN (k nearest neighbors) query problem is relevant to a large variety of database applications, including information retrieval, natural language processing, and data mining. To solve it efficiently, the database needs an indexing structure that provides this kind of search. However, attempts to find an exact solution are hardly feasible in multidimensional space. In this paper, a novel indexing technique for the approximate solution of k – NN problem is described and analyzed. The construction of the indexing tree is based on clustering. Indexing structure is implemented on top of high-performance industrial DBMS.

The review of recent work is helpful but when the paper reaches the algorithm for indexing “…dynamic multidimensional data…,” it slips away from me.

Where is the dynamic nature of the data that is being overcome by the indexing?

I ask because we are human observers are untroubled by the curse of dimensionality, even when data is dynamically changing.

Although those are two important aspects when we process it by machine:

  • The number of dimensions of data, and
  • The rate at which the data is changing.

Documents as geometric objects: how to rank documents for full-text search

Wednesday, January 25th, 2012

Documents as geometric objects: how to rank documents for full-text search Michael Nielsen on July 7, 2011.

From the post:

When we type a query into a search engine – say “Einstein on relativity” – how does the search engine decide which documents to return? When the document is on the web, part of the answer to that question is provided by the PageRank algorithm, which analyses the link structure of the web to determine the importance of different webpages. But what should we do when the documents aren’t on the web, and there is no link structure? How should we determine which documents most closely match the intent of the query?

In this post I explain the basic ideas of how to rank different documents according to their relevance. The ideas used are very beautiful. They are based on the fearsome-sounding vector space model for documents. Although it sounds fearsome, the vector space model is actually very simple. The key idea is to transform search from a linguistic problem into a geometric problem. Instead of thinking of documents and queries as strings of letters, we adopt a point of view in which both documents and queries are represented as vectors in a vector space. In this point of view, the problem of determining how relevant a document is to a query is just a question of determining how parallel the query vector and the document vector are. The more parallel the vectors, the more relevant the document is.

This geometric way of treating documents turns out to be very powerful. It’s used by most modern web search engines, including (most likely) web search engines such as Google and Bing, as well as search libraries such as Lucene. The ideas can also be used well beyond search, for problems such as document classification, and for finding clusters of related documents. What makes this approach powerful is that it enables us to bring the tools of geometry to bear on the superficially very non-geometric problem of understanding text.

Very much looking forward to future posts in this series. There is no denying the power of “vector space model” but that leaves unasked what is lost in the transition from linguistic to geometric space?

We Are Not Alone!

Thursday, October 20th, 2011

While following some references I ran across: A proposal for transformation of topic-maps into similarities of topics (pdf) by Dr. Dominik Kuropka.

Abstract:

Newer information filtering and retrieval models like the Fuzzy Set Model or the Topic-based Vector Space Model consider term dependencies by means of numerical similarities between two terms. This leads to the question from what and how these numerical values can be deduced? This paper proposes an algorithm for the transformation of topic-maps into numerical similarities of paired topics. Further the relation of this work towards the above named information filtering and retrieval models is discussed.

Based in part on his paper Topic-Based Vector Space (2003).

This usage differs from ours in part because the work is designed to work at the document level in a traditional IR type context. “Topic maps,” in the ISO sense, are not limited to retrieval of documents or comparison by a particular method, however useful that method may be.

Still, it is good to get to know one’s neighbours so I will be sending him a note about our efforts.

Text Feature Extraction (tf-idf) – Part 1

Sunday, September 18th, 2011

Text Feature Extraction (tf-idf) – Part 1 by Christian Perone.

To give you a taste of the post:

Short introduction to Vector Space Model (VSM)

In information retrieval or text mining, the term frequency – inverse document frequency also called tf-idf, is a well know method to evaluate how important is a word in a document. tf-idf are also a very interesting way to convert the textual representation of information into a Vector Space Model (VSM), or into sparse features, we’ll discuss more about it later, but first, let’s try to understand what is tf-idf and the VSM.

VSM has a very confusing past, see for example the paper The most influential paper Gerard Salton Never Wrote that explains the history behind the ghost cited paper which in fact never existed; in sum, VSM is an algebraic model representing textual information as a vector, the components of this vector could represent the importance of a term (tf–idf) or even the absence or presence (Bag of Words) of it in a document; it is important to note that the classical VSM proposed by Salton incorporates local and global parameters/information (in a sense that it uses both the isolated term being analyzed as well the entire collection of documents). VSM, interpreted in a lato sensu, is a space where text is represented as a vector of numbers instead of its original string textual representation; the VSM represents the features extracted from the document.

The link to the The most influential paper Gerard Salton Never Wrote fails. Try the cached copy at CiteSeer: The most influential paper Gerard Salton Never Wrote.

Recommended reading.