Archive for the ‘Pregel’ Category

Graph Landscape Survey

Monday, May 20th, 2013

Improving options for unlocking your graph data by Ben Lorica.

From the post:

The popular open source project GraphLab received a major boost early this week when a new company, comprised of its founding developers, raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push the limits of graph computation and develop new ideas”, but having a commercial company will accelerate development, and allow the hiring of resources dedicated to improving usability and documentation.

While social media placed graph data on the radar of many companies, similar data sets can be found in many domains including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. Because tools were a bit too complex for casual users, in the past this meant graph data analytics was the province of specialists. Fortunately graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data is the upcoming GraphLab Workshop (on July 1st in SF).
(…)

Ben summarizes graph resources for:

  • Data wrangling: creating graphs
  • Data management and search
  • Graph-parallel frameworks
  • Machine-learning and analytics
  • Visualization

It would be hard to find a better starting place for investigating the buzz about graphs.

I first saw this in An Overview of Graph Processing Frameworks by Danny Bickson.

Graph processing platform Apache Giraph reaches 1.0

Friday, May 10th, 2013

Graph processing platform Apache Giraph reaches 1.0

From the post:

Used by Facebook and Yahoo, the Apache Giraph project for distributed graph processing has released version 1.0. This is the first new version since the project left incubation and became a top-level project in May 2012, though for some reason it has yet to make it to the Apache index of top level projects.

Giraph allows social graphs and other richly interconnected data structures with many billions of edges to be analysed using hundreds of machines. It is inspired by the Bulk Synchronous Parallel abstract computer model and the Google Pregel system for large scale graph-processing. The developers of Giraph say that unlike those systems, Giraph is an open source, scalable platform built atop the Apache Hadoop infrastructure which has no single point of failure by design. The documentation includes an introduction to Giraph’s iterative graph processing and how to implement graph processing functions in Java. The Giraph project has seen contributions from Yahoo!, Twitter, Facebook and LinkedIn and from academic institutions around the world.

It’s a little early to be downloading software for the weekend but why not? ;-)

Enjoy!

Google Pregel vs Signal Collect for distributed Graph Processing – pros and cons

Tuesday, February 21st, 2012

Google Pregel vs Signal Collect for distributed Graph Processing – pros and cons

René Pickhardt summarizes two of the papers for tomorrow’s meeting on graph databases:

One of the reading club assignments was to read the paper about Google Pregel and Signal Collect, compare them and point out pros and cons of both approaches.

So after I read both papers as well as Claudio’s overview on Pregel clones and took some notes, here are my thoughts, but first a short summary of both papers.

What are your thoughts on these or some of the other readings for tomorrow?

Google MapReduce/Pregel – Graph Reading Club – 15 February 2012

Thursday, February 16th, 2012

Some thoughts on Google MapReduce and Google Pregel after our discussions in the Reading Club by René Pickhardt.

From the post:

The first meeting of our reading club was quite a success. Everyone was well prepared and we discussed some issues about Google’s Map Reduce framework and I had the feeling that everyone now better understands what is going on there. I will now post a summary of what has been discussed and will also post some feedback and reading for next week to the end of this post. Most importantly: The reading club will meet next week Wednesday February 22nd at 2 o’clock pm CET.

René includes some rules/guidance for the next meeting and a very interesting looking reading list!

Pregel

Saturday, January 28th, 2012

Pregel by Michael Nielsen.

From the post:

In this post, I describe a simple but powerful framework for distributed computing called Pregel. Pregel was developed by Google, and is described in a 2010 paper written by seven Googlers. In 2009, the Google Research blog announced that the Pregel system was being used in dozens of applications within Google.

Pregel is a framework oriented toward graph-based algorithms. I won’t formally define graph-based algorithms here – we’ll see an example soon enough – but roughly speaking a graph-based algorithm is one which can be easily expressed in terms of the vertices of a graph, and their adjacent edges and vertices. Examples of problems which can be solved by graph-based algorithms include determining whether two vertices in a graph are connected, where there are clusters of connected vertices in a graph, and many other well-known graph problems. As a concrete example, in this post I describe how Pregel can be used to determine the PageRank of a web page.

What makes Pregel special is that it’s designed to scale very easily on a large-scale computer cluster. Typically, writing programs for clusters requires the programmer to get their hands dirty worrying about details of the cluster architecture, communication between machines in the cluster, considerations of fault-tolerance, and so on. The great thing about Pregel is that Pregel programs can be scaled (within limits) automatically on a cluster, without requiring the programmer to worry about the details of distributing the computation. Instead, they can concentrate on the algorithm they want to implement. In this, Pregel is similar to the MapReduce framework. Like MapReduce, Pregel gains this ability by concentrating on a narrow slice of problems. What makes Pregel interesting and different to MapReduce is that it is well-adapted to a somewhat different class of problems.
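The vertex-centric style described above can be illustrated with a minimal single-process sketch of PageRank. This is not Google's Pregel API or Michael Nielsen's code: the function name `pregel_pagerank`, the adjacency-dict input, and the assumption that every vertex has at least one outgoing edge are all made up for illustration. Each pass of the loop stands in for one superstep: every vertex sends messages along its out-edges, and only after all messages are sent does each vertex update its value.

```python
from collections import defaultdict


def pregel_pagerank(out_edges, supersteps=30, damping=0.85):
    """Single-process sketch of vertex-centric PageRank in synchronous supersteps.

    out_edges maps each vertex to the list of vertices it links to.
    Assumption: every vertex has at least one outgoing edge.
    """
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        # Message phase: each vertex sends rank / out_degree along its out-edges.
        inbox = defaultdict(float)
        for v, targets in out_edges.items():
            share = rank[v] / len(targets)
            for t in targets:
                inbox[t] += share
        # Barrier: messages are read only after every vertex has sent.
        rank = {v: (1.0 - damping) / n + damping * inbox[v] for v in out_edges}
    return rank


# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pregel_pagerank(graph)
```

In a real cluster deployment the message phase and the update would run on many machines, with the framework handling partitioning and delivery; the sketch only shows the shape of the per-vertex logic.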

What class of problems would you say Pregel is “well-adapted” to solve?

I ask because I am unaware of any data structure that a graph cannot represent. If there is an issue, it isn’t one of representation, at least in theory.

Is it a problem in practice/implementation?

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Sunday, November 20th, 2011

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Slides from Jeff’s presentation and numerous references, including a live-blogging summary by Jeff Dalton.

In terms of the new analytical platform, I would strongly suggest that you take Cloudera’s substrate:

Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HDFS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), PigLatin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.

Rather than asking only the usual questions (how to make this faster, how to add more storage, and so on), all of which are important, ask the more difficult questions:

  1. In or between which of these elements would human analysis/judgment have the greatest impact?
  2. Would human analysis/judgment be best made by experts or crowds?
  3. What sort of interface would elicit the best human analysis/judgment? (visual/aural; contest/game/virtual)
  4. Performance with feedback or homeostasis mechanisms?

That is a very crude and uninformed starter set of questions.

Putting higher speed access to more data with better tools at our fingertips expands the questions we can ask of interfaces and our interaction with the data. (Before we ever ask questions of the data.)

Google Pregel: the Rise of the Clones

Wednesday, September 14th, 2011

Google Pregel: the Rise of the Clones

Claudio Martella gives a quick overview of the Pregel “clones”: Apache Hama, GoldenOrb, Giraph, and Phoebus.

Claudio concludes:

So, here it is, fire up your Hadoop pseudo-cluster and get back to me if you have something to add.

Graph Processing versus Graph Databases

Tuesday, August 30th, 2011

Graph Processing versus Graph Databases

Jim Webber describes the different problems addressed by graph processing and graph databases. Worth reading so you will pick the correct tool for the problem you are facing.

Webber visualizes the following distinctions:

What Pregel and Hadoop have in common is their tendency towards the data analytics (OLAP) end of the spectrum, rather than being focussed on transaction processing. This is in stark contrast to graph databases like Neo4j which optimise storage and querying of connected data for online transaction processing (OLTP) scenarios – much like a regular RDBMS, only with a more expressive and powerful data model.

See the post for the graphic.
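A rough way to see the distinction in code, assuming a toy in-memory adjacency dict stands in for both kinds of system (the data and the function names `friends_of_friends` and `connected_components` are hypothetical, not from Webber's post): the OLTP-style query touches only the neighbourhood of one vertex, while the OLAP-style job has to visit the whole graph.

```python
from collections import deque

# Hypothetical toy graph: person -> set of friends (undirected).
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann"},
    "dan": {"bob"},
    "eve": set(),
}


def friends_of_friends(graph, start):
    """OLTP-style query: reads only the two-hop neighbourhood of one vertex."""
    direct = graph[start]
    two_hops = set()
    for f in direct:
        two_hops |= graph[f]
    return two_hops - direct - {start}


def connected_components(graph):
    """OLAP-style job: visits every vertex and edge in the graph."""
    seen, components = set(), []
    for v in graph:
        if v in seen:
            continue
        queue, comp = deque([v]), set()
        seen.add(v)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for w in graph[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(comp)
    return components
```

The first function's cost depends on the degree of the starting vertex, which is the kind of work a graph database is tuned for; the second grows with the size of the whole graph, which is where batch frameworks such as Pregel or Hadoop fit.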

Hama

Sunday, April 3rd, 2011

Hama

Apache Incubator project that describes itself as:

Hama is a distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations.

A somewhat better explanation appears on the Hama blog, in answer to the question “How will Hama BSP different from Pregel?”:

Hama BSP is a computing engine, based on BSP model, like a Pregel, and it’ll be compatible with existing HDFS cluster, or any FileSystem and Database in the future. However, we believe that the BSP computing model is not limited to a problems of graph; it can be used for widely distributed software such as Map/Reduce. In addition to a field of graph, there are many other algorithms, which have similar problems with graph processing using Map/Reduce. Actually, the BSP model has been researched for many years in the field of matrix computation, too. http://blogs.apache.org/hama/

Wikipedia has a short article on Bulk synchronous parallel (BSP) computing techniques with some references.
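To illustrate the point that BSP is not limited to graph problems, here is a minimal sketch of a two-superstep BSP computation that sums numbers spread across several peers. It uses Python threads and `threading.Barrier` from the standard library as a stand-in for a cluster; it is not Hama's API, and the names `bsp_sum`, `peer`, and `inbox` are hypothetical.

```python
import threading


def bsp_sum(chunks):
    """Sum numbers in two BSP supersteps; no graph is involved.

    chunks is a non-empty list of lists, one list per peer.
    """
    n = len(chunks)
    barrier = threading.Barrier(n)
    inbox = [[] for _ in range(n)]  # inbox[i] holds messages sent to peer i
    lock = threading.Lock()
    result = {}

    def peer(pid):
        # Superstep 0: local computation, then send the partial sum to peer 0.
        partial = sum(chunks[pid])
        with lock:
            inbox[0].append(partial)
        barrier.wait()  # barrier synchronization ends the superstep
        # Superstep 1: peer 0 combines the messages delivered at the barrier.
        if pid == 0:
            result["total"] = sum(inbox[0])

    threads = [threading.Thread(target=peer, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"]


total = bsp_sum([[1, 2], [3, 4], [5]])
```

The three ingredients of the model are all visible: local computation, message passing, and a barrier that guarantees every message from one superstep is available before the next one starts.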