Archive for the ‘Vectors’ Category

Automating Family/Party Feud

Monday, February 15th, 2016

Semantic Analysis of the Reddit Hivemind

From the webpage:

Our neural network read every comment posted to Reddit in 2015, and built a semantic map using word2vec and spaCy.

Try searching for a phrase that’s more than the sum of its parts to see what the model thinks it means. Try your favourite band, slang words, technical things, or something totally random.

Lynn Cherny suggested in a tweet to use “actually.”

If you are interested in the background on this tool, see: Sense2vec with spaCy and Gensim by Matthew Honnibal.

From the post:

If you were doing text analytics in 2015, you were probably using word2vec. Sense2vec (Trask et al., 2015) is a new twist on word2vec that lets you learn more interesting, detailed and context-sensitive word vectors. This post motivates the idea, explains our implementation, and comes with an interactive demo that we’ve found surprisingly addictive.

Polysemy: the problem with word2vec

When humans write dictionaries and thesauruses, we define concepts in relation to other concepts. For automatic natural language processing, it’s often more effective to use dictionaries that define concepts in terms of their usage statistics. The word2vec family of models are the most popular way of creating these dictionaries. Given a large sample of text, word2vec gives you a dictionary where each definition is just a row of, say, 300 floating-point numbers. To find out whether two entries in the dictionary are similar, you ask how similar their definitions are – a well-defined mathematical operation.

Certain to be a hit at technical conferences and parties.

SGML wasn’t mentioned even once during 2015 in Reddit Comments.

Try some your favorites words and phrases.


Infinite Dimensional Word Embeddings [Variable Representation, Death to Triples]

Thursday, November 19th, 2015

Infinite Dimensional Word Embeddings by Eric Nalisnick and Sachin Ravi.


We describe a method for learning word embeddings with stochastic dimensionality. Our Infinite Skip-Gram (iSG) model specifies an energy-based joint distribution over a word vector, a context vector, and their dimensionality, which can be defined over a countably infinite domain by employing the same techniques used to make the Infinite Restricted Boltzmann Machine (Cote & Larochelle, 2015) tractable. We find that the distribution over embedding dimensionality for a given word is highly interpretable and leads to an elegant probabilistic mechanism for word sense induction. We show qualitatively and quantitatively that the iSG produces parameter-efficient representations that are robust to language’s inherent ambiguity.

Even better from the introduction:

To better capture the semantic variability of words, we propose a novel embedding method that produces vectors with stochastic dimensionality. By employing the same mathematical tools that allow the definition of an Infinite Restricted Boltzmann Machine (Côté & Larochelle, 2015), we describe ´a log-bilinear energy-based model–called the Infinite Skip-Gram (iSG) model–that defines a joint distribution over a word vector, a context vector, and their dimensionality, which has a countably infinite domain. During training, the iSGM allows word representations to grow naturally based on how well they can predict their context. This behavior enables the vectors of specific words to use few dimensions and the vectors of vague words to elongate as needed. Manual and experimental analysis reveals this dynamic representation elegantly captures specificity, polysemy, and homonymy without explicit definition of such concepts within the model. As far as we are aware, this is the first word embedding method that allows representation dimensionality to be variable and exhibit data-dependent growth.

Imagine a topic map model that “allow[ed] representation dimensionality to be variable and exhibit data-dependent growth.

Simple subjects, say the sort you find at, can have simple representations.

More complex subjects, say the notion of “person” in U.S. statutory law (no, I won’t attempt to list them here), can extend its dimensional representation as far as is necessary.

Of course in this case, the dimensions are learned from a corpus but I don’t see any barrier to the intentional creation of dimensions for subjects and/or a combined automatic/directed creation of dimensions.

Or as I put it in the title, Death to All Triples.

More precisely, not just triples but any pre-determined limit on representation.

Looking forward to taking a slow read on this article and those it cites. Very promising.

Use Google’s Word2Vec for movie reviews

Saturday, January 10th, 2015

Use Google’s Word2Vec for movie reviews Kaggle Tutorial.

From the webpage:

In this tutorial competition, we dig a little “deeper” into sentiment analysis. Google’s Word2Vec is a deep-learning inspired method that focuses on the meaning of words. Word2Vec attempts to understand meaning and semantic relationships among words. It works in a way that is similar to deep approaches, such as recurrent neural nets or deep neural nets, but is computationally more efficient. This tutorial focuses on Word2Vec for sentiment analysis.

Sentiment analysis is a challenging subject in machine learning. People express their emotions in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which could be very misleading for both humans and computers. There’s another Kaggle competition for movie review sentiment analysis. In this tutorial we explore how Word2Vec can be applied to a similar problem.

Mark Needham mentions this Kaggle tutorial in Thoughts on Software Development Python NLTK/Neo4j:….

The description also mentions:

Since deep learning is a rapidly evolving field, large amounts of the work has not yet been published, or exists only as academic papers. Part 3 of the tutorial is more exploratory than prescriptive — we experiment with several ways of using Word2Vec rather than giving you a recipe for using the output.

To achieve these goals, we rely on an IMDB sentiment analysis data set, which has 100,000 multi-paragraph movie reviews, both positive and negative.

Movie, book, TV, etc., reviews are fairly common.

Where would you look for a sentiment analysis data set on contemporary U.S. criminal proceedings?

New Directions in Vector Space Models of Meaning

Tuesday, September 16th, 2014

New Directions in Vector Space Models of Meaning by Edward Grefenstette, Karl Moritz Hermann, Georgiana Dinu, and Phil Blunsom. (video)

From the description:

This is the video footage, aligned with slides, of the ACL 2014 Tutorial on New Directions in Vector Space Models of Meaning, by Edward Grefenstette (Oxford), Karl Moritz Hermann (Oxford), Georgiana Dinu (Trento) and Phil Blunsom (Oxford).

This tutorial was presented at ACL 2014 in Baltimore by Ed, Karl and Phil.

The slides can be found at

Running time is 2:45:12 so you had better get a cup of coffee before you start.

Includes a review of distributional models of semantics.

The sound isn’t bad but the acoustics are so you will have to listen closely. Having the slides in front of you helps as well.

The semantics part starts to echo topic map theory with the realization that having a single token isn’t going to help you with semantics. Tokens don’t stand alone but in a context of other tokens. Each of which has some contribution to make to the meaning of a token in question.

Topic maps function in a similar way with the realization that identifying any subject of necessity involves other subjects, which have their own identifications. For some purposes, we may assume some subjects are sufficiently identified without specifying the subjects that in our view identify it, but that is merely a design choice that others may choose to make differently.

Working through this tutorial and the cited references (one advantage to the online version) will leave you with a background in vector space models and the contours of the latest research.

I first saw this in a tweet by Kevin Safford. [Topic Vectors?]

Tuesday, December 10th, 2013 A Search Engine That Lets You ‘Add’ Words as Vectors by Christopher Moody.

From the post:

Natural language isn’t that great for searching. When you type a search query into Google, you miss out on a wide spectrum of human concepts and human emotions. Queried words have to be present in the web page, and then those pages are ranked according to the number of inbound and outbound links. That’s great for filtering out the cruft on the internet — and there’s a lot of that out there. What it doesn’t do is understand the relationships between words and understand the similarities or dissimilarities.

That’s where comes in — a search site I built to experiment with the word2vec algorithm recently released by Google. word2vec allows you to add and subtract concepts as if they were vectors, and get out sensible, and interesting results. I applied it to the Wikipedia corpus, and in doing so, tried creating an interactive search site that would allow users to put word2vec through it’s paces.

For example, word2vec allows you to evaluate a query like King – Man + Woman and get the result Queen. This means you can do some totally new searches.

… (examples omitted)

word2vec is a type of distributed word representation algorithm that trains a neural network in order to assign a vector to every word. Each of the dimensions in the vector tries to encapsulate some property of the word. Crudely speaking, one dimension could encode that man, woman, king and queen are all ‘people,’ whereas other dimensions could encode associations or dissociations with ‘royalty’ and ‘gender’. These traits are learned by trying to predict the context in a sentence and learning from correct and incorrect guesses.



Doing it with word2vec requires large training sets of data. No doubt a useful venture if you are seeking to discover or document the word vectors in a domain.

But what if you wanted to declare vectors for words?

And then run word2vec (or something very similar) across the declared vectors.

Thinking along the lines of a topic map construct that has a “word” property with a non-null value. All the properties that follow are key/value pairs representing the positive and negative dimensions that are dimensions that give that word meaning.

Associations are collections of vector sums that identify subjects that take part in an association.

If we do all addressing by vector sums, we lose the need to track and collect system identifiers.

I think this could have legs.


PS: For efficiency reasons, I suspect we should allow storage of computed vector sum(s) on a construct. But that would not prohibit another analysis reaching a different vector sum for different purposes.

Computing Document Similarity using Lucene Term Vectors

Wednesday, October 26th, 2011

Computing Document Similarity using Lucene Term Vectors

From the post:

Someone asked me a question recently about implementing document similarity, and since he was using Lucene, I pointed him to the Lucene Term Vector API. I hadn’t used the API myself, but I knew in general how it worked, so when he came back and told me that he could not make it work, I wanted to try it out for myself, to give myself a basis for asking further questions.

I already had a Lucene index (built by SOLR) of about 3000 medical articles for whose content field I had enabled term vectors as part of something I was trying out for highlighting, so I decided to use that. If you want to follow along and have to build your index from scratch, you can either use a field definition in your SOLR schema.xml file similar to this:

Nice walk through on document vectors.

Plus a reminder that “document” similarity can only take you so far. Once you find a relevant document, you still have to search for the subject of interest. Not to mention that you view that subject absent its relationship to other subjects, etc.

We Are Not Alone!

Thursday, October 20th, 2011

While following some references I ran across: A proposal for transformation of topic-maps into similarities of topics (pdf) by Dr. Dominik Kuropka.


Newer information filtering and retrieval models like the Fuzzy Set Model or the Topic-based Vector Space Model consider term dependencies by means of numerical similarities between two terms. This leads to the question from what and how these numerical values can be deduced? This paper proposes an algorithm for the transformation of topic-maps into numerical similarities of paired topics. Further the relation of this work towards the above named information filtering and retrieval models is discussed.

Based in part on his paper Topic-Based Vector Space (2003).

This usage differs from ours in part because the work is designed to work at the document level in a traditional IR type context. “Topic maps,” in the ISO sense, are not limited to retrieval of documents or comparison by a particular method, however useful that method may be.

Still, it is good to get to know one’s neighbours so I will be sending him a note about our efforts.

Counting Triangles

Saturday, October 8th, 2011

Counting Triangles

From the post:

Recently I’ve heard from or read about people who use Hadoop because their analytic jobs can’t achieve the same level of performance in a database. In one case, a professor I visited said his group uses Hadoop to count triangles “because a database doesn’t perform the necessary joins efficiently.”

Perhaps I’m being dense but I don’t understand why a database doesn’t efficiently support these use-cases. In fact, I have a hard time believing they wouldn’t perform better in a columnar, MPP database like Vertica – where memory and storage are laid out and accessed efficiently, query jobs are automatically tuned by the optimizer, and expression execution is vectorized at run-time. There are additional benefits when several, similar jobs are run or data is updated and the same job is re-run multiple times. Of course, performance isn’t everything; ease-of-use and maintainability are important factors that Vertica excels at as well.

Since the “gauntlet was thrown down”, to steal a line from Good Will Hunting, I decided to take up the challenge of computing the number of triangles in a graph (and include the solutions in GitHub so others can experiment – more on this at the end of the post).

I don’t think you will surprised at the outcome but the exercise is instructive in a number of ways. Primarily don’t assume performance without testing. If all the bellowing leads to more competition and close work on software and algorithms, I think there will be some profit from it.

Practical Aggregation of Semantical Program Properties for Machine Learning Based Optimization

Tuesday, September 13th, 2011

Practical Aggregation of Semantical Program Properties for Machine Learning Based Optimization by Mircea Namolaru, Albert Cohen, Grigori Fursin, Ayal Zaks, and Ari Freund.


Iterative search combined with machine learning is a promising approach to design optimizing compilers harnessing the complexity of modern computing systems. While traversing a program optimization space, we collect characteristic feature vectors of the program, and use them to discover correlations across programs, target architectures, data sets, and performance. Predictive models can be derived from such correlations, effectively hiding the time-consuming feedback-directed optimization process from the application programmer.

One key task of this approach, naturally assigned to compiler experts, is to design relevant features and implement scalable feature extractors, including statistical models that filter the most relevant information from millions of lines of code. This new task turns out to be a very challenging and tedious one from a compiler construction perspective. So far, only a limited set of ad-hoc, largely syntactical features have been devised. Yet machine learning is only able to discover correlations from information it is fed with: it is critical to select topical program features for a given optimization problem in order for this approach to succeed.

We propose a general method for systematically generating numerical features from a program. This method puts no restrictions on how to logically and algebraically aggregate semantical properties into numerical features. We illustrate our method on the difficult problem of selecting the best possible combination of 88 available optimizations in GCC. We achieve 74% of the potential speedup obtained through iterative compilation on a wide range of benchmarks and four different general-purpose and embedded architectures. Our work is particularly relevant to embedded system designers willing to quickly adapt the optimization heuristics of a mainstream compiler to their custom ISA, microarchitecture, benchmark suite and workload. Our method has been integrated with the publicly released MILEPOST GCC [14].

Read the portions on extracting features, inference of new relations, extracting relations from programs, extracting features from relations and tell me this isn’t a description of pre-topic map processing! 😉

Shogun – Google Summer of Code 2011

Sunday, April 3rd, 2011

Shogun – Google Summer of Code 2011

Students! Here is your change to work on a cutting edge software library for machine learning!

Posted ideas, or submit your own.

From the website:

SHOGUN is a machine learning toolbox, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines for classification and regression, hidden Markov models, multiple kernel learning, linear discriminant analysis, linear programming machines, and perceptrons. Most of the specific algorithms are able to deal with several different data classes, including dense and sparse vectors and sequences using floating point or discrete data types. We have used this toolbox in several applications from computational biology, some of them coming with no less than 10 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond.

SHOGUN is implemented in C++ and interfaces to MATLAB, R, Octave, Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at

This summer we are looking to extend the library in four different ways: Improving interfaces to other machine learning libraries or integrating them when appropriate, improved i/o support, framework improvements and new machine algorithms. Here is listed a set of suggestions for projects.

A prior post on Shogun.

Learning to classify text using support vector machines

Friday, March 4th, 2011

I saw a tweet recently that pointed to: Learning to classify text using support vector machines, which is Thorsten Joachims’ dissertation, The Maximum-Margin Approach to Learning Text Classifiers as published by Kluwer (not Springer).

Of possible greater interest would be Joachims more recent work found at his homepage, which includes software from his dissertation as well as more recent projects.

I am sure his dissertation will repay close study but at > $150 U.S., I am going to have to wait for an library ILL to find its way to me.

Introducing Vector Maps

Monday, February 21st, 2011

Introducing Vector Maps

From the post:

Modern distributed data stores such as CouchDB and Riak, use variants of Multi-Version Concurrency Control to detect conflicting database updates and present these as multi-valued responses.

So, if I and my buddy Ola both update the same data record concurrently, the result may be that the data record now has multiple values – both mine and Ola’s – and it will be up to the eventual consumer of the data record to resolve the problem. The exact schemes used to manage the MVCC differs from system to system, but the effect is the same; the client is left with the turd to sort out.

This let me to an idea, of trying to create a data structure which is by it’s very definition itself able to be merged, and then store such data in these kinds of databases. So, if you are handed two versions, there is a reconciliation function that will take those two records and “merge” them into one sound record, by some definition of “sound”.

Seems to me that reconciliation should not be limited to records differing based on time stamps. 😉

Will have to think about this one for while but it looks deeply similar to issues we are confronting in topic maps.

BTW, saw this first at Alex Popescu’s myNoSQL blog.

Spectral Based Information Retrieval

Saturday, December 25th, 2010

Spectral Based Information Retrieval Author: Laurence A. F. Park (2003)

Every now and again I run into a dissertation that is an interesting and useful survey of a field and an original contribution to the literature.

Not often but it does happen.

It happened in this case with Park’s dissertation.

The beginning of an interesting threat of research that treats terms in a document as a spectrum and then applies spectral transformations to the retrieval problem.

The technique has been developed and extended since the appearance of Park’s work.

Highly recommended, particularly if you are interested in tracing the development of this technique in information retrieval.

My interest is in the use of spectral representations of text in information retrieval as part of topic map authoring and its potential as a subject identity criteria.

Actually I should broaden that to include retrieval of images and other data as well.


  1. Prepare an annotated bibliography of ten (10) recent papers usually spectral analysis for information retrieval.
  2. Spectral analysis helps retrieve documents but what if you are searching for ideas? Does spectral analysis offer any help?
  3. How would you extend the current state of spectral based information retrieval? (5-10 pages, project proposal, citations)


Saturday, December 11th, 2010


From the website:

Matrix algebra underpins the way many Big Data algorithms and data structures are composed: full-text search can be viewed as doing matrix multiplication of the term-document matrix by the query vector (giving a vector over documents where the components are the relevance score), computing co-occurrences in a collaborative filtering context (people who viewed X also viewed Y, or ratings-based CF like the Netflix Prize contest) is taking the squaring the user-item interation matrix, calculating users who are k-degrees separated from each other in a social network or web-graph can be found by looking at the k-fold product of the graph adjacency matrix, and the list goes on (and these are all cases where the linear structure of the matrix is preserved!)
Currently implemented: Singular Value Decomposition using the Asymmetric Generalized Hebbian Algorithm outlined in Genevieve Gorrell & Brandyn Webb’s paper and there is a Lanczos implementation, both single-threaded, and in the contrib/hadoop subdirectory, as a hadoop map-reduce (series of) job(s). Coming soon: stochastic decomposition.

This code is in the process of being absorbed into the Apache Mahout Machine Learning Project.

Useful in learning to use search technology but also for recognizing at a very fundamental level, the limitations of that technology.

Document and query vectors are constructed without regard to the semantics of their components.

Using co-occurrence, for example, doesn’t give a search engine greater access to the semantics of the terms in question.

It simply makes the vector longer and so matches are less frequent and hopefully, less frequent = more precise.

That may or may not be the case. It also doesn’t account for case where the vectors are different but the subject in question is the same.