Archive for the ‘GraphChi’ Category

Co-EM algorithm in GraphChi

Wednesday, February 20th, 2013

Co-EM algorithm in GraphChi by Danny Bickson.

From the post:

Following the previous post about label propagation, as well as some request from US based startup to implement this method in GraphChi, I have decided to write a quick tutorial for Co-EM algorithm.

You cannot live by press reports alone. ;-)

Enjoy some time with GraphChi!

Label propagation in GraphChi

Monday, February 11th, 2013

Label propagation in GraphChi by Danny Bickson.

From the post:

A few days ago I got a request from Jidong, from the Chinese Renren company to implement label propagation in GraphChi. The algorithm is very simple described here: Zhu, Xiaojin, and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.

The basic idea is that we start with a group of users that we have some information about the categories they are interested in. Following the weights in the social network, we propagate the label probabilities from the user seed node (the ones we have label information about) into the general social network population. After several iterations, the algorithm converges and the output is labels for the unknown nodes.

I assume there is more unlabeled data for topic maps than labeled data.

Depending upon your requirements, this could prove to be a useful technique for completing those unlabeled nodes.

Setting up Java GraphChi development environment…

Sunday, February 10th, 2013

Setting up Java GraphChi development environment – and running sample ALS by Danny Bickson.

From the post:

As you may know, our GraphChi collaborative filtering toolkit in C is becoming more and more popular. Recently, Aapo Kyrola did a great effort for porting GraphChi C into Java and implementing more methods on top of it.

In this blog post I explain how to setup GraphChi Java development environment in Eclipse and run alternating least squares algorithm (ALS) on a small subset of Netflix data.

Based on the level of user feedback I am going to receive for this blog post, we will consider porting more methods to Java. So email me if you are interested in trying it out.

If you are interested in more machine learning methods in Java, here’s your chance!

Not to mention your interest in graph based solutions.

Case study: million songs dataset

Sunday, February 3rd, 2013

Case study: million songs dataset by Danny Bickson.

From the post:

A couple of days ago I wrote about the million songs dataset. Our man in London, Clive Cox from Rummble Labs, suggested we should implement rankings based on item similarity.

Thanks to Clive suggestion, we have now an implementation of Fabio Aiolli’s cost function as explained in the paper: A Preliminary Study for a Recommender System for the Million Songs Dataset, which is the winning method in this contest.

Following are detailed instructions on how to utilize GraphChi CF toolkit on the million songs dataset data, for computing user ratings out of item similarities. 

Just in case you need some data for practice with your GraphChi installation. ;-)

Seriously, nice way to gain familiarity with the data set.

What value you extract from it is up to you.

GraphChi version 0.2 released!

Thursday, January 24th, 2013

GraphChi version 0.2 released! by Danny Bickson.

From the post:

GraphChi version 0.2 is the first major update to the GraphChi software for disk-based computation on massive graphs. This upgrade brings two major changes: compressed data storage (shards) and support for dynamically sized edges.

We also thank for your interest on GraphChi so far: since the release in July 9th 2012, there has been over 8,000 unique visitors to the Google Code project page, at least 2,000 downloads of the source package, several blog posts and hundreds of tweets. GraphChi is a research project, and your feedback has helped us tremendously in our work.

Excellent news!

A link for the Dynamic Edge Data tutorial was omitted from the original post.

GraphChi visual toolkit – or understanding your data

Saturday, November 24th, 2012

GraphChi visual toolkit – or understanding your data by Danny Bickson.

Danny walks through using GraphChi to visual the Orange d4d data set. (cell phone usage).

Easy instructions and heavy on interesting graphics so you need to read the original post.

Very cool!

Item based similarity with GraphChi

Tuesday, September 11th, 2012

Item based similarity with GraphChi by Danny Bickson.

From the post:

Item based collaborative filtering is one of the most popular collaborative filtering methods used in more than 70% of the companies I am talking to. Following my mega collaborator Justin Yan‘s advice, I have started to implement some item based similarity methods in GraphChi.

Item based methods compare all pairs of items together, for computing similarity metric between each pair of items. This task becomes very quickly computation intensive. For example, Netflix data has around 17K movies. If we want to compute all pairs of movies to find the most similar ones, we need to compare around 290M pairs!

If we use a symmetric similarity measure, the distance between movie A and B, is similar to the distance between movie B and A. Thus for the Netflix example we have around 145M pairs to compute. To reduce the work furthermore, we only compare movies which where watched together by at least X users, for example X=5. Otherwise, those movies are not considered similar.

When the dataset is big, it is not possible to load it fully into memory at a single machine. That is where GraphChi comes in. My preliminary implementation of the item similarity computes similarity between all pairs without fully reading the dataset into memory. The idea is to load a chunk of the items (called pivots) into memory, and then stream over the rest of the items by comparing the pivots to the rest of the items.

I need to check on Danny’s blog and the GraphChi/GraphLab webpages more often!

GraphChi parsers toolkit

Tuesday, September 11th, 2012

GraphChi parsers toolkit by Danny Bickson.

From the post:

To the request of Baoqiang Cao I have started a parsers toolkits in GraphChi to be used for preparing data to be used in GraphLab/ Graphchi. The parsers should be used as template which can be easy customized to user specific needs.

Danny starts us off with an LDA parser (with worked example of its use) and then adds a Twitter parser that creates a graph of retweets.

Enjoy!

GraphLab and GraphChi group on LinkedIn

Tuesday, August 21st, 2012

GraphLab and GraphChi group on LinkedIn by Igor Carron.

From the post:

Danny just started the GraphLab and GraphChi group on LinkedIn. If you want to be part of that disruptive discussion, we may want to join.

OK, I just hit “join.” What about you?

Collaborative filtering with GraphChi

Sunday, August 19th, 2012

Collaborative filtering with GraphChi by Danny Bickson.

From the post:

A couple of weeks ago I covered GraphChi by Aapo Kyrola in my blog.

Here is a quick tutorial for trying out GraphChi collaborative filtering. Currently it supports ALS (alternating least squares), SGD (stochastic gradient descent), bias-SGD (biased stochastic gradient descent) and SVD++, but I am soon going to implement several more algorithms.

If you are already experimenting with GraphChi, you will really like this post.

Web Scale with a Laptop? [GraphChi]

Thursday, July 19th, 2012

GraphChi promises in part:

The promise of GraphChi is to bring web-scale graph computation, such as analysis of social networks, available to anyone with a modern laptop.

Well, that certainly draws a line in the sand doesn’t it?

A bit more from the introduction:

GraphChi is a spin-off of the GraphLab ( http://www.graphlab.org ) -project from the Carnegie Mellon University. It is based on research by Aapo Kyrola ( http://www.cs.cmu.edu/~akyrola/) and his advisors.

GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in the vertex-centric model, proposed by GraphLab and Google's Pregel. GraphChi runs vertex-centric programs asynchronously (i.e changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and removal of edges from the graph. Section 'Performance' contains some examples of applications implemented for GraphChi and their running times on GraphChi.

The promise of GraphChi is to bring web-scale graph computation, such as analysis of social networks, available to anyone with a modern laptop. It saves you from the hassle and costs of working with a distributed cluster or cloud services. We find it much easier to debug applications on a single computer than trying to understand how a distributed algorithm is executed.

In some cases GraphChi can solve bigger problems in reasonable time than many other available distributed frameworks. GraphChi also runs efficiently on servers with plenty of memory, and can use multiple disks in parallel by striping the data.

Even if you do require the processing power of high-performance clusters, GraphChi can be an excellent tool for developing and debugging your algorithms prior to deploying them to the cluster. For high-performance graph computation in the distributed setting, we direct you to GraphLab's new version (v2.1), which can now handle large graphs in astonishing speed. GraphChi supports also most of the new GraphLab v2.1 API (with some restrictions), making the transition easy.

GraphChi is implemented in plain C++, and available as open-source under the flexible Apache License 2.0.

Java version

Java-version of GraphChi: http://code.google.com/p/graphchi-java

The performance numbers are impressive.

Not sure I would want to run production code on a laptop in any case but performance should be enough for on-the-road experiments.

Good documentation and examples that should ease you into experimenting with GraphChi.

I first saw this at High Scalability.