Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections by Trevor Cohen, Roger Schvaneveldt, Dominic Widdows.

Abstract:

The discovery of implicit connections between terms that do not occur together in any scientific document underlies the model of literature-based knowledge discovery first proposed by Swanson. Corpus-derived statistical models of semantic distance such as Latent Semantic Analysis (LSA) have been evaluated previously as methods for the discovery of such implicit connections. However, LSA in particular is dependent on a computationally demanding method of dimension reduction as a means to obtain meaningful indirect inference, limiting its ability to scale to large text corpora. In this paper, we evaluate the ability of Random Indexing (RI), a scalable distributional model of word associations, to draw meaningful implicit relationships between terms in general and biomedical language. Proponents of this method have achieved comparable performance to LSA on several cognitive tasks while using a simpler and less computationally demanding method of dimension reduction than LSA employs. In this paper, we demonstrate that the original implementation of RI is ineffective at inferring meaningful indirect connections, and evaluate Reflective Random Indexing (RRI), an iterative variant of the method that is better able to perform indirect inference. RRI is shown to lead to more clearly related indirect connections and to outperform existing RI implementations in the prediction of future direct co-occurrence in the MEDLINE corpus.

The term “direct inference” is used for establishing a relationship between terms with a shared “bridging” term. That is the terms don’t co-occur in a text but share a third term that occurs in both texts. “Indirect inference,” that is finding terms with no shared “bridging” term is the focus of this paper.

BTW, if you don’t have access to the Journal of Biomedical Informatics version, try the draft: Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections

[…] Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity « Reflective Random Indexing and indirect inference… […]

[…] Reflective Random Indexing and indirect inference… cites Kanerva as follows: Random Indexing (RI) [cites omitted] has recently emerged as a scalable alternative to LSA for the derivation of spatial models of semantic distance from large text corpora. For a thorough introduction to Random Indexing and hyper-dimensional computing in general, see [Kanerva, this paper] [cite omitted]. […]

[…] Cohen’s, co-author with Roger Schvaneveldt, and Dominic Widdows of Reflective Random Indexing and indirect inference…, page on distributional semantics which starts with: Empirical Distributional Semantics is an […]

Some comments on the Reflective Indexing approach.

I am interested in feedback on my observations:

We start from D, the term-document matrix (terms are rows, documents are columns).

The Random Indexing approach of Shalgren computes Rd (random indexed D), as follows:

1. Create a highly sparse random term matrix Tr (rows are sparse random ternary vectors {-1|0|1} = term signatures)

2. Compute Rd = D’ * Tr (rows of Rd are random indexed documents)

The Reflective Indexing of Cohen et. al. is concerned with terms rather than documents, but we can look at it the other way around. The idea is to first randomly index terms (with sparse ternary random document signatures) and then use these term signatures to index the documents. The motivation appears to be that now term signatures are not just random near-orthogonal hash codes; terms that co occur in documents will have ‘similar’ or ‘related’ signatures.

When we index documents with these term signatures we will get similar encoding for documents which share an extensively overlapping vocabulary – and semantically related terms.

This proceeds as follows:

1. Create a highly sparse random document matrix Dr (rows are sparse document random ternary vectors {-1|0|1})

2. Compute Rt = D * Dr (rows of Rd are random indexed terms = term signatures)

3. Compute Rd = D’ * Rt = D’ * D * Rt

But this is actually just random indexed (D’*D) instead of random indexed D. A document is now represented vector of dot-products with all other documents. So instead of representing a document by its content, with this approach we represent a document by its overlap (dot-product) with all other documents.

This may make sense in clustering – two documents that have a similar relationship to all other documents are probably similar to each other. For clustering this should work.

The same applies if we prefer to work with semantic term spaces, we just work with terms signatures rather than documents as the final product.

While this makes sense for clustering, it is not clear why this scheme should work well for searching (indeed, it doesn’t work well in my experiments).

Repeating Reflective Indexing, a la Cohen et. al. can be looked at as follows:

1. We start the second iteration with the document signatures, to random index terms, and then we random index the documents again

Rd(2) = D’ * D * Rd(1) = D’ * D * D’ * D * Rt

Rd(2) = (D’ * D)^2 * Rt

2. Doing this N times:

Rd(N) = (D’ * D)^N * Rt

There is no more information in this representation.

We just apply the symmetric Matrix (D’ * D) to itself N times and randomly project that matrix.

This leads to convergence pretty quickly and soon enough we find that Rd(N+1) is proportional to Rd(N)

( – that’s why the numerical analysis power method works for finding eigenvalues of a matrix)

This suggests to me that we cannot get any more benefit from repeating the mapping.

Any comments on this analysis?

Shlomo