Archive for the ‘Pregel’ Category

GPS: A Graph Processing System

Wednesday, June 18th, 2014

GPS: A Graph Processing System

From the post:

GPS is an open-source system for scalable, fault-tolerant, and easy-to-program execution of algorithms on extremely large graphs. GPS is similar to Google’s proprietary Pregel system, and Apache Giraph.GPS is a distributed system designed to run on a cluster of machines, such as Amazon’s EC2.

In systems such as GPS and Pregel, the input graph (directed, possibly with values on edges) is distributed across machines and vertices send each other messages to perform a computation. Computation is divided into iterations called supersteps. Analogous to the map() and reduce() functions of the MapReduce framework, in each superstep a user-defined function called vertex.compute() is applied to each vertex in parallel. The user expresses the logic of the computation by implementing vertex.compute(). This design is based on Valiant’s Bulk Synchronous Parallel model of computation. A detailed description can be found in the original Pregel paper.

There are five main differences between Pregel and GPS:

  • GPS is open-source.
  • GPS extends Pregel’s API with a master.compute() function, which enables easy and efficient implementation of algorithms that are composed of multiple vertex-centric computations, combined with global computations
  • GPS has an optional dynamic repartitioning scheme, which reassigns vertices to different machines during graph computation to improve performance, based on observing communication patterns.
  • GPS has an optimization called LALP that reduces the network I/O in when running certain algorithms on real-world graphs that have skewed degree distributions.
  • GPS programs can be implemented using a higher-level domain specific language called Green-Marl, and automatically compiled into native GPS code. Green-Marl is a traditional imperative language with several graph-specific language constructs that enable intuitive and simple expression of complicated algorithms.

We have completed an initial version of GPS, which is available to download. We have run GPS on up to 100 Amazon EC2 large instances and on graphs of up to 250 million vertices and 10 billion edges. (emphasis added)

In light of the availability and performance statement, I suppose we can overlook the choice of a potentially confusing acronym. 😉

The Green-Marl compiler can be used to implement algorithms for GPS. Consult the Green-Marl paper before deciding its assumptions about processing will fit your use cases.

The team also wrote: Optimizing Graph Algorithms on Pregel-like Systems, due to appear in VLDB 2014.

I first saw this in a tweet by James Thornton.

Mizan

Sunday, November 3rd, 2013

Mizan

From the webpage:

What is Mizan?

Mizan is an advanced clone to Google’s graph processing system Pregel that utilizes online graph vertex migrations to dynamically optimizes the execution of graph algorithms. You can use our Mizan system to develop any vertex centric graph algorithm and run in parallel over a local cluster or over cloud infrastructure. Mizan is compatible with Pregel’s API, written in C++ and uses MPICH2 for communication. You can download a copy of Mizan and start using it today on your local machine or try Mizan on EC2. We also welcome programers who are interested to go deeper into our Mizan code to optimize or tweak.

Mizan is published in EuroSys 13 as “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing“. We have an earlier work of Mizan as “Mizan: Optimizing Graph Mining in Large Parallel Systems“, which we recently changed it to Libra to avoid confusions. We show below the abstract for Mizan’s EuroSys 13 paper. We also include Mizan’s general architecture and its API available for users.

Abstract

Pregel was recently introduced as a scalable graph mining system that can provide significant performance improvements over traditional MapReduce implementations. Existing implementations focus primarily on graph partitioning as a preprocessing step to balance computation across compute nodes. In this paper, we examine the runtime characteristics of a Pregel system. We show that graph partitioning alone is insufficient for minimizing end-to-end computation. Especially where data is very large or the runtime behavior of the algorithm is unknown, an adaptive approach is needed. To this end, we introduce Mizan, a Pregel system that achieves efficient load balancing to better adapt to changes in computing needs. Unlike known implementations of Pregel, Mizan does not assume any a priori knowledge of the structure of the graph or behavior of the algorithm. Instead, it monitors the runtime characteristics of the system. Mizan then performs efficient fine-grained vertex migration to balance computation and communication. We have fully implemented Mizan; using extensive evaluation we show that—especially for highly-dynamic workloads— Mizan provides up to 84% improvement over techniques leveraging static graph pre-partitioning.

Post like this one make me want to build a local cluster at home. 😉

Graph Landscape Survey

Monday, May 20th, 2013

Improving options for unlocking your graph data by Ben Lorica.

From the post:

The popular open source project GraphLab received a major boost early this week when a new company comprised of its founding developers, raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push the limits of graph computation and develop new ideas”, but having a commercial company will accelerate development, and allow the hiring of resources dedicated to improving usability and documentation.

While social media placed graph data on the radar of many companies, similar data sets can be found in many domains including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. Because tools were a bit too complex for casual users, in the past this meant graph data analytics was the province of specialists. Fortunately graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data, is at the upcoming GraphLab Workshop (on July 1st in SF).
(…)

Ben summarizes graph resources for:

  • Data wrangling: creating graphs
  • Data management and search
  • Graph-parallel frameworks
  • Machine-learning and analytics
  • Visualization

It would be hard to find a better starting place for investigating the buzz about graphs.

I first saw this in An Overview of Graph Processing Frameworks by Danny Bickson.

Graph processing platform Apache Giraph reaches 1.0

Friday, May 10th, 2013

Graph processing platform Apache Giraph reaches 1.0

From the post:

Used by Facebook and Yahoo, the Apache Giraph project for distributed graph processing has released version 1.0. This is the first new version since the project left incubation and became a top-level project in May 2012, though for some reason it has yet to make it to the Apache index of top level projects.

Giraph allows social graphs and other richly interconnected data structures with many billions of edges to be analysed using hundreds of machines. It is inspired by the Bulk Synchronous Parallel abstract computer model and the Google Pregel system for large scale graph-processing. The developers of Giraph say that unlike those systems, Giraph is an open source, scalable platform built atop of the Apache Hadoop infrastructure which has no single point of failure by design. The documentation includes an introduction to Giraph’s iterative graph processing and how to implement graph processing functions in Java. The Giraph project has seen contributions from Yahoo!, Twitter, Facebook and LinkedIn and from academic institutions around the world.

It’s a little early to be downloading software for the weekend but why not? 😉

Enjoy!

Google Pregel vs Signal Collect for distributed Graph Processing – pros and cons

Tuesday, February 21st, 2012

Google Pregel vs Signal Collect for distributed Graph Processing – pros and cons

René Pickhardt summarizes two of the papers for tomorrow’s meeting on graph databases:

One of the reading club assignments was to read the paper about Google Pregel and Signal Collect, compare them and point out pros and cons of both approaches.

So after I read both papers as well as Claudios overview on Pregel clones and took some notes here are my thoughts but first a short summary of both papers.

What are your thoughts on these or some of the other readings for tomorrow?

Google MapReduce/Pregel – Graph Reading Club – 15 February 2012

Thursday, February 16th, 2012

Some thoughts on Google Mapeduce and Google Pregel after our discussions in the Reading Club by René Pickhardt.

From the post:

The first meeting of our reading club was quite a success. Everyone was well prepared and we discussed some issues about Google’s Map Reduce framework and I had the feeling that everyone now better understands what is going on there. I will now post a summary of what has been discussed and will also post some feedback and reading for next week to the end of this post. Most importantly: The reading club will meet next week Wednesday February 22nd at 2 o’clock pm CET.

René includes some rules/guidance for the next meeting and a very interesting looking reading list!

Pregel

Saturday, January 28th, 2012

Pregel by Michael Nielsen.

From the post:

http://tm.durusau.net/wp-admin/post-new.php

In this post, I describe a simple but powerful framework for distributed computing called Pregel. Pregel was developed by Google, and is described in a 2010 paper written by seven Googlers. In 2009, the Google Research blog announced that the Pregel system was being used in dozens of applications within Google.

Pregel is a framework oriented toward graph-based algorithms. I won’t formally define graph-based algorithms here – we’ll see an example soon enough – but roughly speaking a graph-based algorithm is one which can be easily expressed in terms of the vertices of a graph, and their adjacent edges and vertices. Examples of problems which can be solved by graph-based algorithms include determining whether two vertices in a graph are connected, where there are clusters of connected vertices in a graph, and many other well-known graph problems. As a concrete example, in this post I describe how Pregel can be used to determine the PageRank of a web page.

What makes Pregel special is that it’s designed to scale very easily on a large-scale computer cluster. Typically, writing programs for clusters requires the programmer to get their hands dirty worrying about details of the cluster architecture, communication between machines in the cluster, considerations of fault-tolerance, and so on. The great thing about Pregel is that Pregel programs can be scaled (within limits) automatically on a cluster, without requiring the programmer to worry about the details of distributing the computation. Instead, they can concentrate on the algorithm they want to implement. In this, Pregel is similar to the MapReduce framework. Like MapReduce, Pregel gains this ability by concentrating on a narrow slice of problems. What makes Pregel interesting and different to MapReduce is that it is well-adapted to a somewhat different class of problems.

What class of problems would you say Pregel is “well-adapted” to solve?

I ask because I am unaware of any data structure that a graph is cannot represent. If there is an issue, it isn’t one of representation, at least in theory.

Is it a problem in practice/implementation?

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Sunday, November 20th, 2011

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Slides from Jeff’s presentation and numerous references, including to a live blogging summary by Jeff Dalton.

In terms of the new analytical platform, I would strongly suggest that you take Cloudera’s substrate:

Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HFDS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), PigLatin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.

Rather than asking the usual questions, how to make this faster, more storage, etc., all of which are important, ask the more difficult questions:

  1. In or between which of these elements, would human analysis/judgment have the greatest impact?
  2. Would human analysis/judgment be best made by experts or crowds?
  3. What sort of interface would elicit the best human analysis/judgment? (visual/aural; contest/game/virtual)
  4. Performance with feedback or homeostasis mechanisms?

That is a very crude and uninformed starter set of questions.

Putting higher speed access to more data with better tools at our fingertips expands the questions we can ask of interfaces and our interaction with the data. (Before we ever ask questions of the data.)

Google Pregel: the Rise of the Clones

Wednesday, September 14th, 2011

Google Pregel: the Rise of the Clones

Claudio Martella gives a quick overview of Pregel “clones,” Apache Hama, GoldenOrb, Giraph, and Phoebus.

Claudio concludes:

So, here it is, fire up your Hadoop pseudo-cluster and get back to me if you have something to add.

Graph Processing versus Graph Databases

Tuesday, August 30th, 2011

Graph Processing versus Graph Databases

Jim Webber describes the different problems addressed by graph processing and graph databases. Worth reading so you will pick the correct tool for the problem you are facing.

Webber visualizes the following distinctions:

What Pregel and Hadoop have in common is their tendency towards the data analytics (OLAP) end of the spectrum, rather than being focussed on transaction processing. This is in stark contrast to graph databases like Neo4j which optimise storage and querying of connected data for online transaction processing (OLTP) scenarios – much like a regular RDBMS, only with a more expressive and powerful data model.

See the post for the graphic.

Hama

Sunday, April 3rd, 2011

Hama

Apache Incubator project that describes itself as:

Hama is a distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations.

A little better explanation appears on the Hama blog when answering the question: “How will Hama BSP different from Pregel?:”

Hama BSP is a computing engine, based on BSP model, like a Pregel, and it’ll be compatible with existing HDFS cluster, or any FileSystem and Database in the future. However, we believe that the BSP computing model is not limited to a problems of graph; it can be used for widely distributed software such as Map/Reduce. In addition to a field of graph, there are many other algorithms, which have similar problems with graph processing using Map/Reduce. Actually, the BSP model has been researched for many years in the field of matrix computation, too. http://blogs.apache.org/hama/

Wikipedia has a short article on Bulk synchronous parallel (BSP) computing techniques with some references.