Archive for the ‘GraphChi’ Category

Large-Scale Graph Computation on Just a PC

Thursday, May 8th, 2014

Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense by Aapo Kyroa.

If you are looking for an overview of Kyroa’s work, this is the resource for you.

Slide 8: “Benefits of single machine systems Assuming it can handle your big problems…”, currently reads:

  1. Programmer productivity – Global state, debuggers…
  2. Inexpensive to install, administer, less power.
  3. Scalability – Use cluster of single-machine systems to solve many tasks in parallel. Idea: Trade latency for throughput < 32K bits/sec 8

I would add:

  1. A single machine forces creation of efficient data structures.

Think of it as using computation resources more effectively as opposed to scaling out to accommodate a problem.

GraphChi Users Survey

Saturday, April 19th, 2014

GraphChi Users Survey

From the form:

This survey is used to find out about experiences of users of GraphChi. These results will be used in Aapo Kyrola’s Ph.D. thesis.

If you are using GraphChi, your experiences can help with Aapo Kyrola’s Ph.D. thesis.

Pass this along to anyone you know using GraphChi (and try GraphChi yourself).

Tutorial Statistical Graph Analysis

Thursday, April 17th, 2014

Tutorial Statistical Graph Analysis by Aapo Kyrola.

From the post:

GraphChi-DB can be used for efficiently querying induced subgraphs from very large networks. Thus, you can for example, easily sample a vertex, and retrive induced neighborhood graph of the vertex. Or you can choose a random set of vertices and compute their induced subgraph.

Assuming you have the data loaded in database, and the GraphChiDatabase object in a value “DB”, here is how you request edges for the induced subgraph of a set of vertices:

GraphChi-DB is released two days ago, GraphChi-DB [src released] and today you have a tutorial written for it.

Not bad, not bad at all.

GraphChi-DB [src released]

Tuesday, April 15th, 2014

GraphChi-DB

From the webpage:

GraphChi-DB is a scalable, embedded, single-computer online graph database that can also execute similar large-scale graph computation as GraphChi. it has been developed by Aapo Kyrola as part of his Ph.D. thesis.

GraphChi-DB is written in Scala, with some Java code. Generally, you need to know Scala quite well to be able to use it.

IMPORTANT: GraphChi-DB is early release, research code. It is buggy, it has awful API, and it is provided with no guarantees. DO NOT USE IT FOR ANYTHING IMPORTANT.

GraphChi-DB source code arrives!

Enjoy!

Think globally and solve locally:… (GraphChi News)

Tuesday, March 18th, 2014

Think globally and solve locally: secondary memory-based network learning for automated multi-species function prediction by Marco Mesiti, Matteo Re, and Giorgio Valentini.

Abstract:

Background: Network-based learning algorithms for Automated Function Prediction (AFP) are negatively a ffected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence their application to model organisms is often restricted to well characterized biological processes and pathways, and their e ffectiveness with poorly annotated species is relatively limited. A possible solution to this problem might consist in the construction of big networks including multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the usage of big computers could in principle respond to these issues, but raises further algorithmic problems and require resources not satis able with simple off -the-shelf computers.

Results: We propose a novel framework for scalable network-based learning of multi-species protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve “locally” the AFP problem, by designing “vertex-centric” implementations of network-based algorithms, but we do not give up thinking “globally” by exploiting the overall topology of the network. This is made possible by the adoption of secondary memory-based technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off -the-shelf computers. This approach has been applied to the analysis of a large multi-species network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 Eukaryotic species. To our knowledge this is the fi rst work where secondary-memory based network analysis has been applied to multi-species function prediction using biological networks with hundreds of thousands of proteins.

Conclusions: The combination of these algorithmic and technological approaches makes feasible the analysis of large multi-species networks using ordinary computers with limited speed and primary memory, and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using well-equipped stand-alone machines.

The biomolecular network material may be deep wading but you will find GraphChi making a significant difference in this use case.

What I found particularly interesting in Table 7 (page 20) was the low impact that additional RAM has on GraphChi.

I take that to mean that GraphChi can run efficiently on low-end boxes (4 GB RAM).

Yes?

I first saw this in a tweet by Aapo Kyrola.

GraphChi-DB: Simple Design…

Thursday, March 6th, 2014

GraphChi-DB: Simple Design for a Scalable Graph Database System — on Just a PC by Aapo Kyrola and Carlos Guestrin.

Abstract:

We propose a new data structure, Parallel Adjacency Lists (PAL), for efficiently managing graphs with billions of edges on disk. The PAL structure is based on the graph storage model of GraphChi (Kyrola et. al., OSDI 2012), but we extend it to enable online database features such as queries and fast insertions. In addition, we extend the model with edge and vertex attributes. Compared to previous data structures, PAL can store graphs more compactly while allowing fast access to both the incoming and the outgoing edges of a vertex, without duplicating data. Based on PAL, we design a graph database management system, GraphChi-DB, which can also execute powerful analytical graph computation.

We evaluate our design experimentally and demonstrate that GraphChi-DB achieves state-of-the-art performance on graphs that are much larger than the available memory. GraphChi-DB enables anyone with just a laptop or a PC to work with extremely large graphs.

Open source will be released at: https://github.com/GraphChi.

With data structure improvements like you find with GraphChi-DB, it won’t be long until the average laptop becomes a weapons grade munition. 😉

Study the Partitioned Adjacency List (PAL) details carefully to follow up on the suggestions of using PAL for RDBMS and RDF storage (topic maps?).

Highly recommended!

BPP: Large Graph Storage for Efficient Disk Based Processing

Monday, January 13th, 2014

BPP: Large Graph Storage for Efficient Disk Based Processing by Kamran Najeebullah, Kifayat Ullah Khan, Waqas Nawaz, and Young-Koo Lee.

Abstract:

Processing very large graphs like social networks, biological and chemical compounds is a challenging task. Distributed graph processing systems process the billion-scale graphs efficiently but incur overheads of efficient partitioning and distribution of the graph over a cluster of nodes. Distributed processing also requires cluster management and fault tolerance. In order to overcome these problems GraphChi was proposed recently. GraphChi significantly outperformed all the representative distributed processing frameworks. Still, we observe that GraphChi incurs some serious degradation in performance due to 1) high number of non-sequential I/Os for processing every chunk of graph; and 2) lack of true parallelism to process the graph. In this paper we propose a simple yet powerful engine BiShard Parallel Processor (BPP) to efficiently process billions-scale graphs on a single PC. We extend the storage structure proposed by GraphChi and introduce a new processing model called BiShard Parallel (BP). BP enables full CPU parallelism for processing the graph and significantly reduces the number of non-sequential I/Os required to process every chunk of the graph. Our experiments on real large graphs show that our solution significantly outperforms GraphChi.

…[B]illion-scale graph on a single PC.

Cool!

Err, but the experimental results in the paper are based on “7 thousand plus vertices and more than 1 hundred thousand edges.”

I’m not sure how I get to a “billion-scale” graph?

Interesting results and quite possibly will lead to other breakthroughs in graph processing.

A bit more attention to making the abstract match the results would be appreciated.

Not to mention finding acronyms that don’t conflict with better known ones, like “BP.”

Searching for “BP” isn’t likely to find this paper even in a very long tail of results.

I first saw this in a tweet by Stefano Bertolo.

Large-Scale Machine Learning and Graphs

Saturday, December 7th, 2013

Large-Scale Machine Learning and Graphs by Carlos Guestrin.

The presentation starts with a history of the evolution of GraphLab, which is interesting in and of itself.

Carlos then goes beyond a history lesson and gives a glimpse of a very exciting future.

Such as: installing GraphLab with Python, using Python for local development, running the same Python with Graphlab in the cloud.

Thought that might catch your eye.

Something to remember when people talk about scaling graph analysis.

If you are interested in seeing one possible future of graph processing today, not some day, check out: GraphLab Notebook (Beta).

BTW, Carlos mentions a technique call “think like a vertex” which involves distributing vertexes across machines rather than splitting graphs on edges.

Seems to me that would work to scale the processing of topic maps by splitting topics as well. Once “merging” has occurred on different machines, then “merge” the relevant topics back together across machines.

GraphChi – Moving on up!

Wednesday, July 24th, 2013

GraphChi 0.2 and a new home!

GraphChi-cpp/java have reached 0.2 and are now hosted at Github!

From the release notes:

!GraphChi version 0.2 is the first major update to the !GraphChi software for disk-based computation on massive graphs. This upgrade brings two major changes: compressed data storage (shards) and support for dynamically sized edges.

We also thank for your interest on GraphChi so far: since the release in July 9th 2012, there has been over 8,000 unique visitors to the Google Code project page, at least 2,000 downloads of the source package, several blog posts and hundreds of tweets. GraphChi is a research project, and your feedback has helped us tremendously in our work.

Enjoy!

Co-EM algorithm in GraphChi

Wednesday, February 20th, 2013

Co-EM algorithm in GraphChi by Danny Bickson.

From the post:

Following the previous post about label propagation, as well as some request from US based startup to implement this method in GraphChi, I have decided to write a quick tutorial for Co-EM algorithm.

You cannot live by press reports alone. 😉

Enjoy some time with GraphChi!

Label propagation in GraphChi

Monday, February 11th, 2013

Label propagation in GraphChi by Danny Bickson.

From the post:

A few days ago I got a request from Jidong, from the Chinese Renren company to implement label propagation in GraphChi. The algorithm is very simple described here: Zhu, Xiaojin, and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.

The basic idea is that we start with a group of users that we have some information about the categories they are interested in. Following the weights in the social network, we propagate the label probabilities from the user seed node (the ones we have label information about) into the general social network population. After several iterations, the algorithm converges and the output is labels for the unknown nodes.

I assume there is more unlabeled data for topic maps than labeled data.

Depending upon your requirements, this could prove to be a useful technique for completing those unlabeled nodes.

Setting up Java GraphChi development environment…

Sunday, February 10th, 2013

Setting up Java GraphChi development environment – and running sample ALS by Danny Bickson.

From the post:

As you may know, our GraphChi collaborative filtering toolkit in C is becoming more and more popular. Recently, Aapo Kyrola did a great effort for porting GraphChi C into Java and implementing more methods on top of it.

In this blog post I explain how to setup GraphChi Java development environment in Eclipse and run alternating least squares algorithm (ALS) on a small subset of Netflix data.

Based on the level of user feedback I am going to receive for this blog post, we will consider porting more methods to Java. So email me if you are interested in trying it out.

If you are interested in more machine learning methods in Java, here’s your chance!

Not to mention your interest in graph based solutions.

Case study: million songs dataset

Sunday, February 3rd, 2013

Case study: million songs dataset by Danny Bickson.

From the post:

A couple of days ago I wrote about the million songs dataset. Our man in London, Clive Cox from Rummble Labs, suggested we should implement rankings based on item similarity.

Thanks to Clive suggestion, we have now an implementation of Fabio Aiolli’s cost function as explained in the paper: A Preliminary Study for a Recommender System for the Million Songs Dataset, which is the winning method in this contest.

Following are detailed instructions on how to utilize GraphChi CF toolkit on the million songs dataset data, for computing user ratings out of item similarities. 

Just in case you need some data for practice with your GraphChi installation. 😉

Seriously, nice way to gain familiarity with the data set.

What value you extract from it is up to you.

GraphChi version 0.2 released!

Thursday, January 24th, 2013

GraphChi version 0.2 released! by Danny Bickson.

From the post:

GraphChi version 0.2 is the first major update to the GraphChi software for disk-based computation on massive graphs. This upgrade brings two major changes: compressed data storage (shards) and support for dynamically sized edges.

We also thank for your interest on GraphChi so far: since the release in July 9th 2012, there has been over 8,000 unique visitors to the Google Code project page, at least 2,000 downloads of the source package, several blog posts and hundreds of tweets. GraphChi is a research project, and your feedback has helped us tremendously in our work.

Excellent news!

A link for the Dynamic Edge Data tutorial was omitted from the original post.

GraphChi visual toolkit – or understanding your data

Saturday, November 24th, 2012

GraphChi visual toolkit – or understanding your data by Danny Bickson.

Danny walks through using GraphChi to visual the Orange d4d data set. (cell phone usage).

Easy instructions and heavy on interesting graphics so you need to read the original post.

Very cool!

Item based similarity with GraphChi

Tuesday, September 11th, 2012

Item based similarity with GraphChi by Danny Bickson.

From the post:

Item based collaborative filtering is one of the most popular collaborative filtering methods used in more than 70% of the companies I am talking to. Following my mega collaborator Justin Yan‘s advice, I have started to implement some item based similarity methods in GraphChi.

Item based methods compare all pairs of items together, for computing similarity metric between each pair of items. This task becomes very quickly computation intensive. For example, Netflix data has around 17K movies. If we want to compute all pairs of movies to find the most similar ones, we need to compare around 290M pairs!

If we use a symmetric similarity measure, the distance between movie A and B, is similar to the distance between movie B and A. Thus for the Netflix example we have around 145M pairs to compute. To reduce the work furthermore, we only compare movies which where watched together by at least X users, for example X=5. Otherwise, those movies are not considered similar.

When the dataset is big, it is not possible to load it fully into memory at a single machine. That is where GraphChi comes in. My preliminary implementation of the item similarity computes similarity between all pairs without fully reading the dataset into memory. The idea is to load a chunk of the items (called pivots) into memory, and then stream over the rest of the items by comparing the pivots to the rest of the items.

I need to check on Danny’s blog and the GraphChi/GraphLab webpages more often!

GraphChi parsers toolkit

Tuesday, September 11th, 2012

GraphChi parsers toolkit by Danny Bickson.

From the post:

To the request of Baoqiang Cao I have started a parsers toolkits in GraphChi to be used for preparing data to be used in GraphLab/ Graphchi. The parsers should be used as template which can be easy customized to user specific needs.

Danny starts us off with an LDA parser (with worked example of its use) and then adds a Twitter parser that creates a graph of retweets.

Enjoy!

GraphLab and GraphChi group on LinkedIn

Tuesday, August 21st, 2012

GraphLab and GraphChi group on LinkedIn by Igor Carron.

From the post:

Danny just started the GraphLab and GraphChi group on LinkedIn. If you want to be part of that disruptive discussion, we may want to join.

OK, I just hit “join.” What about you?

Collaborative filtering with GraphChi

Sunday, August 19th, 2012

Collaborative filtering with GraphChi by Danny Bickson.

From the post:

A couple of weeks ago I covered GraphChi by Aapo Kyrola in my blog.

Here is a quick tutorial for trying out GraphChi collaborative filtering. Currently it supports ALS (alternating least squares), SGD (stochastic gradient descent), bias-SGD (biased stochastic gradient descent) and SVD++, but I am soon going to implement several more algorithms.

If you are already experimenting with GraphChi, you will really like this post.

Web Scale with a Laptop? [GraphChi]

Thursday, July 19th, 2012

GraphChi promises in part:

The promise of GraphChi is to bring web-scale graph computation, such as analysis of social networks, available to anyone with a modern laptop.

Well, that certainly draws a line in the sand doesn’t it?

A bit more from the introduction:

GraphChi is a spin-off of the GraphLab ( http://www.graphlab.org ) -project from the Carnegie Mellon University. It is based on research by Aapo Kyrola ( http://www.cs.cmu.edu/~akyrola/) and his advisors.

GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in the vertex-centric model, proposed by GraphLab and Google's Pregel. GraphChi runs vertex-centric programs asynchronously (i.e changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and removal of edges from the graph. Section 'Performance' contains some examples of applications implemented for GraphChi and their running times on GraphChi.

The promise of GraphChi is to bring web-scale graph computation, such as analysis of social networks, available to anyone with a modern laptop. It saves you from the hassle and costs of working with a distributed cluster or cloud services. We find it much easier to debug applications on a single computer than trying to understand how a distributed algorithm is executed.

In some cases GraphChi can solve bigger problems in reasonable time than many other available distributed frameworks. GraphChi also runs efficiently on servers with plenty of memory, and can use multiple disks in parallel by striping the data.

Even if you do require the processing power of high-performance clusters, GraphChi can be an excellent tool for developing and debugging your algorithms prior to deploying them to the cluster. For high-performance graph computation in the distributed setting, we direct you to GraphLab's new version (v2.1), which can now handle large graphs in astonishing speed. GraphChi supports also most of the new GraphLab v2.1 API (with some restrictions), making the transition easy.

GraphChi is implemented in plain C++, and available as open-source under the flexible Apache License 2.0.

Java version

Java-version of GraphChi: http://code.google.com/p/graphchi-java

The performance numbers are impressive.

Not sure I would want to run production code on a laptop in any case but performance should be enough for on-the-road experiments.

Good documentation and examples that should ease you into experimenting with GraphChi.

I first saw this at High Scalability.