Archive for the ‘Giraph’ Category

Wikipedia in Python, Gephi, and Neo4j

Thursday, January 8th, 2015

Wikipedia in Python, Gephi, and Neo4j: Vizualizing relationships in Wikipedia by Matt Krzus.

From the introduction:

g3

We have had a bit of a stretch here where we used Wikipedia for a good number of things. From Doc2Vec to experimenting with word2vec layers in deep RNNs, here are a few of those cool visualization tools we’ve used along the way.

Cool things you will find in this post:

  • Building relationship links between Categories and Subcategories
  • Visualization with Networkx (think Betweenness Centrality and PageRank)
  • Neo4j and Cypher (the author thinks avoiding the Giraph learning curve is a plus, I leave that for you to decide)
  • Visualization with Gephi

Enjoy!

GPS: A Graph Processing System

Wednesday, June 18th, 2014

GPS: A Graph Processing System

From the post:

GPS is an open-source system for scalable, fault-tolerant, and easy-to-program execution of algorithms on extremely large graphs. GPS is similar to Google’s proprietary Pregel system, and Apache Giraph.GPS is a distributed system designed to run on a cluster of machines, such as Amazon’s EC2.

In systems such as GPS and Pregel, the input graph (directed, possibly with values on edges) is distributed across machines and vertices send each other messages to perform a computation. Computation is divided into iterations called supersteps. Analogous to the map() and reduce() functions of the MapReduce framework, in each superstep a user-defined function called vertex.compute() is applied to each vertex in parallel. The user expresses the logic of the computation by implementing vertex.compute(). This design is based on Valiant’s Bulk Synchronous Parallel model of computation. A detailed description can be found in the original Pregel paper.

There are five main differences between Pregel and GPS:

  • GPS is open-source.
  • GPS extends Pregel’s API with a master.compute() function, which enables easy and efficient implementation of algorithms that are composed of multiple vertex-centric computations, combined with global computations
  • GPS has an optional dynamic repartitioning scheme, which reassigns vertices to different machines during graph computation to improve performance, based on observing communication patterns.
  • GPS has an optimization called LALP that reduces the network I/O in when running certain algorithms on real-world graphs that have skewed degree distributions.
  • GPS programs can be implemented using a higher-level domain specific language called Green-Marl, and automatically compiled into native GPS code. Green-Marl is a traditional imperative language with several graph-specific language constructs that enable intuitive and simple expression of complicated algorithms.

We have completed an initial version of GPS, which is available to download. We have run GPS on up to 100 Amazon EC2 large instances and on graphs of up to 250 million vertices and 10 billion edges. (emphasis added)

In light of the availability and performance statement, I suppose we can overlook the choice of a potentially confusing acronym. 😉

The Green-Marl compiler can be used to implement algorithms for GPS. Consult the Green-Marl paper before deciding its assumptions about processing will fit your use cases.

The team also wrote: Optimizing Graph Algorithms on Pregel-like Systems, due to appear in VLDB 2014.

I first saw this in a tweet by James Thornton.

Faceboook Gets Smarter with Graph Engine Optimization

Saturday, April 12th, 2014

Faceboook Gets Smarter with Graph Engine Optimization by Alex Woodie.

From the post:

Last fall, the folks in Facebook’s engineering team talked about how they employed the Apache Giraph engine to build a graph on its Hadoop platform that can host more than a trillion edges. While the Graph Search engine is capable of massive graphing tasks, there were some workloads that remained outside the company’s technical capabilities–until now.

Facebook turned to the Giraph engine to power its new Graph Search offering, which it unveiled in January 2013 as a way to let users perform searches on other users to determine, for example, what kind of music their Facebook friends like, what kinds of food they’re into, or what activities they’ve done recently. An API for Graph Search also provides advertisers with a new revenue source for Facebook. It’s likely the world’s largest graph implementation, and a showcase of what graph engines can do.

The company picked Giraph because it worked on their existing Hadoop implementation, including HDFS and its MapReduce infrastructure stack (known as Corona). Compared to running the computation workload on Hive, an internal Facebook test of a 400-billion edge graph ran 126x faster on Giraph, and had a 26x performance advantage, as we explained in a Datanami story last year.

When Facebook scaled its internal test graph up to 1 trillion edges, they were able to keep the processing of each iteration of the graph under four minutes on a 200-server cluster. That amazing feat was done without any optimization, the company claimed. “We didn’t cheat,” Facebook developer Avery Ching declared in a video. “This is a random hashing algorithm, so we’re randomly assigning the vertices to different machines in the system. Obviously, if we do some separation and locality optimization, we can get this number down quite a bit.”

High level view with technical references on how Facebook is optimizing its Apache Giraph engine.

If you are interested in graphs, this is much more of a real world scenario than building “big” graphs out of uniform time slices.

Write and Run Giraph Jobs on Hadoop

Sunday, February 9th, 2014

Write and Run Giraph Jobs on Hadoop by Mirko Kämpf.

From the post:

Create a test environment for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.

Apache Giraph is a scalable, fault-tolerant implementation of graph-processing algorithms in Apache Hadoop clusters of up to thousands of computing nodes. Giraph is in use at companies like Facebook and PayPal, for example, to help represent and analyze the billions (or even trillions) of connections across massive datasets. Giraph was inspired by Google’s Pregel framework and integrates well with Apache Accumulo, Apache HBase, Apache Hive, and Cloudera Impala.

Currently, the upstream “quick start” document explains how to deploy Giraph on a Hadoop cluster with two nodes running Ubuntu Linux. Although this setup is appropriate for lightweight development and testing, using Giraph with an enterprise-grade CDH-based cluster requires a slightly more robust approach.

In this how-to, you will learn how to use Giraph 1.0.0 on top of CDH 4.x using a simple example dataset, and run example jobs that are already implemented in Giraph. You will also learn how to set up your own Giraph-based development environment. The end result will be a setup (not intended for production) for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets. (In future posts, I will explain how to implement your own graph algorithms and graph generators as well as how to export your results to Gephi, the “Adobe Photoshop for graphs”, through Impala and JDBC for further inspection.)

The first in a series of posts on Giraph.

This is great stuff!

It should keep you busy during your first conference call and/or staff meeting on Monday morning.

Monday won’t seem so bad. 😉

Benchmarking Graph Databases

Wednesday, September 25th, 2013

Benchmarking Graph Databases by Alekh Jindal.

Speaking of data skepticism.

From the post:

Graph data management has recently received a lot of attention, particularly with the explosion of social media and other complex, inter-dependent datasets. As a result, a number of graph data management systems have been proposed. But this brings us to the question: What happens to the good old relational database systems (RDBMSs) in the context of graph data management?

The article names some of the usual graph database suspects.

But for its comparison, it selects only one (Neo4j) and compares it against three relational databases, MySQL, Vertica and VoltDB.

What’s missing? How about expanding to include GraphLab (GraphLab – Next Generation [Johnny Come Lately VCs]) and Giraph (Scaling Apache Giraph to a trillion edges) or some of the other heavy hitters (insert your favorite) in the graph world?

Nothing against Neo4j. It is making rapid progress on a query language and isn’t hard to learn. But it lacks the raw processing power of an application like Apache Giraph. Giraph, after all, is used to process the entire Facebook data set, not a “4k nodes and 88k edges” Facebook sample as in this comparison.

Not to mention that only two algorithms were used in this comparison: PageRank and Shortest Paths.

Personally I can imagine users being interested in running more than two algorithms. But that’s just me.

Every benchmarking project has to start somewhere but this sort of comparison doesn’t really advance the discussion of competing technologies.

Not that any comparison would be complete without a discussion of typical uses cases and user observations on how each candidate did or did not meet their expectations.

Scaling Apache Giraph to a trillion edges

Friday, September 13th, 2013

Scaling Apache Giraph to a trillion edges by Avery Ching.

From the post:

Graph structures are ubiquitous: they provide a basic model of entities with connections between them that can represent almost anything. Flight routes connect airports, computers communicate to one another via the Internet, webpages have hypertext links to navigate to other webpages, and so on. Facebook manages a social graph that is composed of people, their friendships, subscriptions, and other connections. Open graph allows application developers to connect objects in their applications with real-world actions (such as user X is listening to song Y).

Analyzing these real world graphs at the scale of hundreds of billions or even a trillion (10^12) edges with available software was impossible last year. We needed a programming framework to express a wide range of graph algorithms in a simple way and scale them to massive datasets. After the improvements described in this article, Apache Giraph provided the solution to our requirements.

In the summer of 2012, we began exploring a diverse set of graph algorithms across many different Facebook products as well as academic literature. We selected a few representative use cases that cut across the problem space with different system bottlenecks and programming complexity. Our diverse use cases and the desired features of the programming framework drove the requirements for our system infrastructure. We required an iterative computing model, graph-based API, and fast access to Facebook data. Based on these requirements, we selected a few promising graph-processing platforms including Apache Hive, GraphLab, and Apache Giraph for evaluation.

For your convenience:

Apache Giraph

Apache Hive

GraphLab

Your appropriate scale is probably less than a trillion edges but everybody likes a great scaling story.

This is a great scaling story.

Apache Bigtop: The “Fedora of Hadoop”…

Wednesday, June 26th, 2013

Apache Bigtop: The “Fedora of Hadoop” is Now Built on Hadoop 2.x by Roman Shaposhnik.

From the post:

Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: The very first release of a fully integrated Big Data management distribution built on the currently most advanced Hadoop 2.x, Hadoop 2.0.5-alpha.

Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which utilizes its packaging code from Bigtop — Cloudera takes pride in developing open source packaging code and contributing the same back to the community.

The very astute readers of this blog will notice that given our quarterly release schedule, Bigtop 0.6.0 should have been called Bigtop 0.7.0. It is true that we skipped a quarter. Our excuse is that we spent all this extra time helping the Hadoop community stabilize the Hadoop 2.x code line and making it a robust kernel for all the applications that are now part of the Bigtop distribution.

And speaking of applications, we haven’t forgotten to grow the Bigtop family: Bigtop 0.6.0 adds Apache HCatalog and Apache Giraph to the mix. The full list of Hadoop applications available as part of the Bigtop 0.6.0 release is:

  • Apache Zookeeper 3.4.5
  • Apache Flume 1.3.1
  • Apache HBase 0.94.5
  • Apache Pig 0.11.1
  • Apache Hive 0.10.0
  • Apache Sqoop 2 (AKA 1.99.2)
  • Apache Oozie 3.3.2
  • Apache Whirr 0.8.2
  • Apache Mahout 0.7
  • Apache Solr (SolrCloud) 4.2.1
  • Apache Crunch (incubating) 0.5.0
  • Apache HCatalog 0.5.0
  • Apache Giraph 1.0.0
  • LinkedIn DataFu 0.0.6
  • Cloudera Hue 2.3.0

And we were just talking about YARN and applications weren’t we? 😉

Enjoy!

(Participate if you can but at least send a note of appreciation to Cloudera.)

Graph processing platform Apache Giraph reaches 1.0

Friday, May 10th, 2013

Graph processing platform Apache Giraph reaches 1.0

From the post:

Used by Facebook and Yahoo, the Apache Giraph project for distributed graph processing has released version 1.0. This is the first new version since the project left incubation and became a top-level project in May 2012, though for some reason it has yet to make it to the Apache index of top level projects.

Giraph allows social graphs and other richly interconnected data structures with many billions of edges to be analysed using hundreds of machines. It is inspired by the Bulk Synchronous Parallel abstract computer model and the Google Pregel system for large scale graph-processing. The developers of Giraph say that unlike those systems, Giraph is an open source, scalable platform built atop of the Apache Hadoop infrastructure which has no single point of failure by design. The documentation includes an introduction to Giraph’s iterative graph processing and how to implement graph processing functions in Java. The Giraph project has seen contributions from Yahoo!, Twitter, Facebook and LinkedIn and from academic institutions around the world.

It’s a little early to be downloading software for the weekend but why not? 😉

Enjoy!

MapReduce: Detecting Cycles in Network Graph [Merging Duplicate Identifiers]

Tuesday, January 8th, 2013

MapReduce: Detecting Cycles in Network Graph by Ricky Ho.

From the post:

I recently received an email from an audience of my blog on Map/Reduce algorithm design regarding how to detect whether a graph is acyclic using Map/Reduce. I think this is an interesting problem and can imagine there can be wide range of application to it.

Although I haven’t solved this exact problem in the past, I’d like to sketch out my thoughts on a straightforward approach, which may not be highly optimized. My goal is to invite other audience who has solved this problem to share their tricks.

To define the problem: Given a simple directed graph, we want to tell whether it contains any cycles.

Relevant to processing of identifiers in topic maps which may occur on more than one topic (prior to merging).

What is your solution in a mapreduce context?

Constructing Case-Control Studies With Hadoop

Sunday, April 15th, 2012

Constructing Case-Control Studies With Hadoop by Josh Wills.

From the post:

San Francisco seems to be having an unusually high number of flu cases/searches this April, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on Crunch, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming massive quantities of orange juice. But this bit of downtime also gave us an opportunity to focus on solving a large-scale data science problem that helps some of the people who help humanity the most: epidemiologists.

Case-Control Studies

A case-control study is a type of observational study in which a researcher attempts to identify the factors that contribute to a medical condition by comparing a set of subjects who have that condition (the ‘cases’) to a set of subjects who do not have the condition, but otherwise resemble the case subjects (the ‘controls’). They are useful for exploratory analysis because they are relatively cheap to perform, and have led to many important discoveries- most famously, the link between smoking and lung cancer.

Epidemiologists and other researchers now have access to data sets that contain tens of millions of anonymized patient records. Tens of thousands of these patient records may include a particular disease that a researcher would like to analyze. In order to find enough unique control subjects for each case subject, a researcher may need to execute tens of thousands of queries against a database of patient records, and I have spoken to researchers who spend days performing this laborious task. Although they would like to parallelize these queries across multiple machines, there is a constraint that makes this problem a bit more interesting: each control subject may only be matched with at most one case subject. If we parallelize the queries across the case subjects, we need to check to be sure that we didn’t assign a control subject to multiple cases. If we parallelize the queries across the control subjects, we need to be sure that each case subject ends up with a sufficient number of control subjects. In either case, we still need to query the data an arbitrary number of times to ensure that the matching of cases and controls we come up with is feasible, let alone optimal.

Analyzing a case-control study is a problem for a statistician. Constructing a case-control study is a problem for a data scientist.

Great walk through on constructing a case-control study, including the use of the Apache Giraph library.

Apache Giraph

Saturday, February 25th, 2012

Apache Giraph

From the webpage:

Web and online social graphs have been rapidly growing in size and scale during the past decade. In 2008, Google estimated that the number of web pages reached over a trillion. Online social networking and email sites, including Yahoo!, Google, Microsoft, Facebook, LinkedIn, and Twitter, have hundreds of millions of users and are expected to grow much more in the future. Processing these graphs plays a big role in relevant and personalized information for users, such as results from a search engine or news in an online social networking site.

Graph processing platforms to run large-scale algorithms (such as page rank, shared connections, personalization-based popularity, etc.) have become quite popular. Some recent examples include Pregel and HaLoop. For general-purpose big data computation, the map-reduce computing model has been well adopted and the most deployed map-reduce infrastructure is Apache Hadoop. We have implemented a graph-processing framework that is launched as a typical Hadoop job to leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph builds upon the graph-oriented nature of Pregel but additionally adds fault-tolerance to the coordinator process with the use of ZooKeeper as its centralized coordination service.

Giraph follows the bulk-synchronous parallel model relative to graphs where vertices can send messages to other vertices during a given superstep. Checkpoints are initiated by the Giraph infrastructure at user-defined intervals and are used for automatic application restarts when any worker in the application fails. Any worker in the application can act as the application coordinator and one will automatically take over if the current application coordinator fails.

Giraph 0.1-incubating released. (Feb. 6th, 2012)

Another graph contender.

How many do you have on your system?

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Sunday, November 20th, 2011

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Slides from Jeff’s presentation and numerous references, including to a live blogging summary by Jeff Dalton.

In terms of the new analytical platform, I would strongly suggest that you take Cloudera’s substrate:

Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HFDS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), PigLatin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.

Rather than asking the usual questions, how to make this faster, more storage, etc., all of which are important, ask the more difficult questions:

  1. In or between which of these elements, would human analysis/judgment have the greatest impact?
  2. Would human analysis/judgment be best made by experts or crowds?
  3. What sort of interface would elicit the best human analysis/judgment? (visual/aural; contest/game/virtual)
  4. Performance with feedback or homeostasis mechanisms?

That is a very crude and uninformed starter set of questions.

Putting higher speed access to more data with better tools at our fingertips expands the questions we can ask of interfaces and our interaction with the data. (Before we ever ask questions of the data.)

Google Pregel: the Rise of the Clones

Wednesday, September 14th, 2011

Google Pregel: the Rise of the Clones

Claudio Martella gives a quick overview of Pregel “clones,” Apache Hama, GoldenOrb, Giraph, and Phoebus.

Claudio concludes:

So, here it is, fire up your Hadoop pseudo-cluster and get back to me if you have something to add.