Archive for the ‘Indirect Inference’ Category

Semantic Vectors

Tuesday, August 16th, 2011

Semantic Vectors

From the webpage:

The Semantic Vectors package creates semantic vector indexes by applying a Random Projection algorithm to term-document matrices created using Apache Lucene. The package was created as part of a project by the University of Pittsburgh Office of Technology Management, and is now developed and maintained by contributors from the University of Texas, Queensland University of Technology, the Austrian Research Institute for Artificial Intelligence, Google Inc., and other institutions and individuals.

The package creates a WordSpace model, of the kind developed by Stanford University’s Infomap Project and other researchers during the 1990s and early 2000s. Such models are designed to represent words and documents in terms of underlying concepts, and as such can be used for many semantic (concept-aware) matching tasks such as automatic thesaurus generation, knowledge representation, and concept matching.

The Semantic Vectors package uses a Random Projection algorithm, a form of automatic semantic analysis. Other methods supported by the package include Latent Semantic Analysis (LSA) and Reflective Random Indexing. Unlike many other methods, Random Projection does not rely on the use of computationally intensive matrix decomposition algorithms like Singular Value Decomposition (SVD). This makes Random Projection a much more scalable technique in practice. Our application of Random Projection for Natural Language Processing (NLP) is descended from Pentti Kanerva’s work on Sparse Distributed Memory; in semantic analysis and text mining, this method has also been called Random Indexing. A growing number of researchers have applied Random Projection to NLP tasks, demonstrating:

  • Semantic performance comparable with other forms of Latent Semantic Analysis.
  • Significant computational performance advantages in creating and maintaining models.
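
To make the idea concrete, here is a minimal sketch of random projection applied to a tiny term-document matrix. It is not the Semantic Vectors API and does not use Lucene; the function name `random_projection`, the toy matrix, and the dense +1/-1 random vectors are all assumptions made for illustration. The point is only that each term's reduced vector is a weighted sum of random document vectors, with no matrix decomposition involved.

```python
import random


def random_projection(term_doc, out_dim, seed=0):
    """Map each row of a terms-by-documents count matrix to `out_dim` dimensions."""
    rng = random.Random(seed)
    n_docs = len(term_doc[0])
    # One random +1/-1 vector per document column. Dense here for brevity;
    # sparse ternary vectors are a common choice in practice.
    doc_vectors = [
        [rng.choice((-1.0, 1.0)) for _ in range(out_dim)] for _ in range(n_docs)
    ]
    reduced = []
    for row in term_doc:
        vec = [0.0] * out_dim
        for doc_idx, count in enumerate(row):
            if count:  # skip documents the term does not occur in
                for k in range(out_dim):
                    vec[k] += count * doc_vectors[doc_idx][k]
        reduced.append(vec)
    return reduced


# Toy matrix: 3 terms (rows) by 6 documents (columns).
term_doc = [
    [2, 0, 1, 0, 0, 1],
    [2, 0, 1, 0, 0, 1],
    [0, 3, 0, 1, 0, 0],
]
reduced = random_projection(term_doc, out_dim=4)
```

Terms with identical document distributions (the first two rows here) end up with identical reduced vectors, and adding a new document only requires generating one more random vector, which is where the cheap model maintenance comes from.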

So, after reading about random indexing, etc., you can take those techniques out for a spin. It doesn’t get any better than that!

Distributional Semantics

Tuesday, August 16th, 2011

Distributional Semantics.

Trevor Cohen’s page on distributional semantics (Cohen is co-author, with Roger Schvaneveldt and Dominic Widdows, of Reflective Random Indexing and indirect inference…), which starts with:

Empirical Distributional Semantics is an emerging discipline that is primarily concerned with the derivation of semantics (or meaning) from the distribution of terms in natural language text. My research in DS is concerned primarily with spatial models of meaning, in which terms are projected into high-dimensional semantic space, and an estimate of their semantic relatedness is derived from the distance between them in this space.

The relations derived by these models have many useful applications in biomedicine and beyond. A particularly interesting property of distributional semantics models is their capacity to recognize connections between terms that do not occur together in the same document, as this has implications for knowledge discovery. In many instances it is possible also to reveal a plausible pathway linking these terms by using the distances estimated by distributional semantic models to generate a network representation, and using Pathfinder networks (PFNETS) to reveal the most significant links in this network, as shown in the example below:

His page has links to projects, software and other cool stuff! I am making a separate post on one of his software libraries.

Introduction to Random Indexing

Tuesday, August 16th, 2011

Introduction to Random Indexing by Magnus Sahlgren.

I thought this would be useful alongside Reflective Random Indexing and indirect inference….

Just a small sample of what you will find:

Note that this methodology constitutes a radically different way of conceptualizing how context vectors are constructed. In the “traditional” view, we first construct the co-occurrence matrix and then extract context vectors. In the Random Indexing approach, on the other hand, we view the process backwards, and first accumulate the context vectors. We may then construct a cooccurrence matrix by collecting the context vectors as rows of the matrix.
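
A rough sketch of the "backwards" process described in that passage, under assumptions of my own: every word gets a fixed sparse ternary index vector, and a word's context vector is accumulated by adding the index vectors of its neighbours in a sliding window. The names `index_vector` and `random_indexing`, and the values for `dim`, `nonzeros` and `window`, are illustrative choices, not taken from Sahlgren's paper.

```python
import random
from collections import defaultdict


def index_vector(dim, nonzeros, rng):
    """Sparse ternary vector: mostly zeros, a few randomly placed +1/-1 entries."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(tokens, dim=300, nonzeros=6, window=2, seed=0):
    rng = random.Random(seed)
    # Step 1: assign each distinct word a fixed random index vector.
    index = {}
    for word in tokens:
        if word not in index:
            index[word] = index_vector(dim, nonzeros, rng)
    # Step 2: accumulate context vectors directly, one occurrence at a time.
    context = defaultdict(lambda: [0] * dim)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                neighbour = index[tokens[j]]
                ctx = context[word]
                for k in range(dim):
                    ctx[k] += neighbour[k]
    return index, dict(context)


tokens = "the cat sat on the mat".split()
index, context = random_indexing(tokens, dim=50, nonzeros=4, window=1)
# Only now, if wanted, collect the context vectors as rows of a matrix.
matrix = [context[word] for word in sorted(context)]
```

Note that the matrix is built last and is already reduced to `dim` columns; the full word-by-word co-occurrence matrix is never materialized.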

I like non-traditional approaches. Some work (like random indexing) and some don’t.

What new/non-traditional approaches have you tried in the last week? We learn as much (if not more) from failure as success.

Reflective Random Indexing and indirect inference…

Tuesday, August 16th, 2011

Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections by Trevor Cohen, Roger Schvaneveldt, Dominic Widdows.


The discovery of implicit connections between terms that do not occur together in any scientific document underlies the model of literature-based knowledge discovery first proposed by Swanson. Corpus-derived statistical models of semantic distance such as Latent Semantic Analysis (LSA) have been evaluated previously as methods for the discovery of such implicit connections. However, LSA in particular is dependent on a computationally demanding method of dimension reduction as a means to obtain meaningful indirect inference, limiting its ability to scale to large text corpora. In this paper, we evaluate the ability of Random Indexing (RI), a scalable distributional model of word associations, to draw meaningful implicit relationships between terms in general and biomedical language. Proponents of this method have achieved comparable performance to LSA on several cognitive tasks while using a simpler and less computationally demanding method of dimension reduction than LSA employs. In this paper, we demonstrate that the original implementation of RI is ineffective at inferring meaningful indirect connections, and evaluate Reflective Random Indexing (RRI), an iterative variant of the method that is better able to perform indirect inference. RRI is shown to lead to more clearly related indirect connections and to outperform existing RI implementations in the prediction of future direct co-occurrence in the MEDLINE corpus.
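
To get a feel for why an extra iteration helps, here is a small sketch of one possible reading of the reflective idea. It is not the authors' implementation, and every name and parameter in it (`reflective_random_indexing`, `dim`, `nonzeros`, `cycles`) is an assumption of mine. Documents start with random vectors, term vectors are built from them (plain document-based RI), then document vectors are rebuilt from the term vectors and the term vectors are rebuilt once more.

```python
import math
import random


def sparse_random_vector(dim, nonzeros, rng):
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0


def term_vectors_from_docs(docs, doc_vectors, dim):
    """Each term vector is the normalized sum of the vectors of documents containing it."""
    terms = {}
    for doc, dvec in zip(docs, doc_vectors):
        for term in doc:
            acc = terms.setdefault(term, [0.0] * dim)
            for k in range(dim):
                acc[k] += dvec[k]
    return {term: normalize(vec) for term, vec in terms.items()}


def doc_vectors_from_terms(docs, term_vectors, dim):
    """Each document vector is the normalized sum of the vectors of its terms."""
    out = []
    for doc in docs:
        acc = [0.0] * dim
        for term in doc:
            for k in range(dim):
                acc[k] += term_vectors[term][k]
        out.append(normalize(acc))
    return out


def reflective_random_indexing(docs, dim=1000, nonzeros=10, cycles=1, seed=0):
    rng = random.Random(seed)
    doc_vectors = [sparse_random_vector(dim, nonzeros, rng) for _ in docs]
    basic = term_vectors_from_docs(docs, doc_vectors, dim)  # plain RI
    reflected = basic
    for _ in range(cycles):  # the "reflective" step (assumed form)
        doc_vectors = doc_vectors_from_terms(docs, reflected, dim)
        reflected = term_vectors_from_docs(docs, doc_vectors, dim)
    return basic, reflected


# "a" and "c" never occur in the same document; "b" occurs with both.
docs = [["a", "b"], ["b", "c"]]
basic, reflected = reflective_random_indexing(docs)
sim_basic = cosine(basic["a"], basic["c"])
sim_reflected = cosine(reflected["a"], reflected["c"])
```

In this toy corpus the plain RI vectors for "a" and "c" are built from two unrelated random document vectors, so `sim_basic` stays close to zero, while after one reflective cycle `sim_reflected` is much higher because both rebuilt document vectors now contain the vector for "b".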

The term “direct inference” is used for establishing a relationship between terms with a shared “bridging” term. That is, the terms don’t co-occur in a text but share a third term that occurs in both texts. “Indirect inference,” that is, finding terms with no shared “bridging” term, is the focus of this paper.

BTW, if you don’t have access to the Journal of Biomedical Informatics version, try the draft: Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections