ThisPlusThat.me: A Search Engine That Lets You ‘Add’ Words as Vectors by Christopher Moody.
From the post:
Natural language isn’t that great for searching. When you type a search query into Google, you miss out on a wide spectrum of human concepts and human emotions. Queried words have to be present in the web page, and then those pages are ranked according to the number of inbound and outbound links. That’s great for filtering out the cruft on the internet — and there’s a lot of that out there. What it doesn’t do is understand the relationships between words and understand the similarities or dissimilarities.
That’s where ThisPlusThat.me comes in, a search site I built to experiment with the word2vec algorithm recently released by Google. word2vec allows you to add and subtract concepts as if they were vectors, and get out sensible and interesting results. I applied it to the Wikipedia corpus, and in doing so, tried creating an interactive search site that would allow users to put word2vec through its paces.
For example, word2vec allows you to evaluate a query like King – Man + Woman and get the result Queen. This means you can do some totally new searches.
… (examples omitted)
word2vec is a type of distributed word representation algorithm that trains a neural network in order to assign a vector to every word. Each of the dimensions in the vector tries to encapsulate some property of the word. Crudely speaking, one dimension could encode that man, woman, king and queen are all ‘people,’ whereas other dimensions could encode associations or dissociations with ‘royalty’ and ‘gender’. These traits are learned by trying to predict the context in a sentence and learning from correct and incorrect guesses.
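To make the arithmetic in the quoted post concrete, here is a minimal sketch in Python. The three dimensions (person, royalty, gender) and their values are hand-picked by me for illustration. They are not learned by word2vec, and real word2vec dimensions are not this cleanly interpretable. The helper names are hypothetical as well.

```python
import math

# Toy vectors, assumed for illustration only (not word2vec output).
# Dimensions: (person, royalty, gender), with gender +1 = male, -1 = female.
VECTORS = {
    "king":   (1.0, 1.0, 1.0),
    "queen":  (1.0, 1.0, -1.0),
    "man":    (1.0, 0.0, 1.0),
    "woman":  (1.0, 0.0, -1.0),
    "castle": (0.0, 1.0, 0.0),
}


def add(a, b):
    """Component-wise sum of two vectors."""
    return tuple(x + y for x, y in zip(a, b))


def sub(a, b):
    """Component-wise difference of two vectors."""
    return tuple(x - y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(target, exclude=()):
    """Return the known word whose vector is most similar to target."""
    candidates = (w for w in VECTORS if w not in exclude)
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


# King - Man + Woman
result = add(sub(VECTORS["king"], VECTORS["man"]), VECTORS["woman"])
print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```

The query words are excluded from the candidates, since otherwise one of the inputs could come back as its own nearest neighbor.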
Learning those vectors with word2vec requires large sets of training data. That is no doubt a useful venture if you are seeking to discover or document the word vectors in a domain.
But what if you wanted to declare vectors for words?
And then run word2vec-style arithmetic (or something very similar) across the declared vectors?
I am thinking along the lines of a topic map construct that has a “word” property with a non-null value. All the properties that follow are key/value pairs representing the positive and negative dimensions that give that word meaning.
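A rough sketch of what such a declared construct might look like, in Python. Everything here is an assumption on my part: the make_construct and combine helpers, the "dimensions" key, and the dimension names and values are all hypothetical, and no topic map standard defines them.

```python
def make_construct(word, **dimensions):
    """Build a construct: a non-null 'word' plus declared key/value dimensions."""
    if not word:
        raise ValueError("the 'word' property must have a non-null value")
    return {"word": word, "dimensions": dict(dimensions)}


def combine(terms):
    """Add or subtract declared vectors.

    terms is a list of (sign, construct) pairs, with sign +1 or -1.
    Dimensions a construct does not declare are treated as zero, and
    dimensions that cancel out are dropped from the result.
    """
    total = {}
    for sign, construct in terms:
        for dim, value in construct["dimensions"].items():
            total[dim] = total.get(dim, 0.0) + sign * value
    return {dim: value for dim, value in total.items() if value != 0.0}


# Declared by hand rather than learned from a corpus.
king = make_construct("king", person=1.0, royalty=1.0, gender=1.0)
queen = make_construct("queen", person=1.0, royalty=1.0, gender=-1.0)
man = make_construct("man", person=1.0, gender=1.0)
woman = make_construct("woman", person=1.0, gender=-1.0)

result = combine([(+1, king), (-1, man), (+1, woman)])
print(result == queen["dimensions"])  # True
```

Because the vectors are sparse key/value pairs, a construct only needs to declare the dimensions that matter to it.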
Associations are collections of vector sums, each identifying a subject that takes part in the association.
If we do all addressing by vector sums, we lose the need to track and collect system identifiers.
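As a sketch of addressing by vector sum, assume declared integer-valued sparse vectors, so that exact matching is safe. The vector_key and vector_sum helpers and the sample dimensions are hypothetical. With learned, real-valued vectors an exact lookup like this would almost never hit, and a nearest-neighbor search would be needed instead.

```python
def vector_sum(*vectors):
    """Sum sparse vectors given as {dimension: value} dicts."""
    total = {}
    for vec in vectors:
        for dim, value in vec.items():
            total[dim] = total.get(dim, 0) + value
    return total


def vector_key(vector):
    """Canonical, hashable form of a sparse vector: sorted pairs, zeros dropped."""
    return tuple(sorted((d, v) for d, v in vector.items() if v != 0))


# Declared vectors (illustrative values only).
monarch = {"person": 1, "royalty": 1}
female = {"gender": -1}
queen = {"person": 1, "royalty": 1, "gender": -1}

# The index is keyed by the vector itself, not by a system identifier.
index = {vector_key(queen): "queen"}

address = vector_key(vector_sum(monarch, female))
print(index.get(address))  # queen
```

Two parties that declare the same dimensions and values arrive at the same address without ever exchanging an identifier.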
I think this could have legs.
PS: For efficiency reasons, I suspect we should allow storage of computed vector sum(s) on a construct. But that would not prohibit another analysis reaching a different vector sum for different purposes.
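One way the stored sums could work, sketched under my own assumptions: an association keeps its computed sums in a dict keyed by the purpose of the analysis, so a second analysis can store a different sum alongside the first. The Association class, the purpose labels, and the keep parameter are all hypothetical.

```python
def sum_vectors(vectors, keep=None):
    """Sum sparse vectors, optionally restricted to the dimensions in keep."""
    total = {}
    for vec in vectors:
        for dim, value in vec.items():
            if keep is None or dim in keep:
                total[dim] = total.get(dim, 0) + value
    return total


class Association:
    def __init__(self, members):
        self.members = list(members)
        self.stored_sums = {}  # purpose label -> computed vector sum

    def vector_sum(self, purpose, keep=None):
        """Return the stored sum for this purpose, computing it on first use."""
        if purpose not in self.stored_sums:
            self.stored_sums[purpose] = sum_vectors(self.members, keep)
        return self.stored_sums[purpose]


king = {"person": 1, "royalty": 1, "gender": 1}
queen = {"person": 1, "royalty": 1, "gender": -1}
royal_couple = Association([king, queen])

full = royal_couple.vector_sum("full")
rank_only = royal_couple.vector_sum("rank-only", keep={"royalty"})
print(full)       # {'person': 2, 'royalty': 2, 'gender': 0}
print(rank_only)  # {'royalty': 2}
```

The stored sums are a convenience only. Nothing stops a later analysis from ignoring them and computing its own.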