Archive for the ‘GraphX’ Category

Congressional PageRank… [How To Avoid Bribery Charges]

Saturday, October 17th, 2015

Congressional PageRank – Analyzing US Congress With Neo4j and Apache Spark by William Lyon.

From the post:

As we saw previously, legis-graph is an open source software project that imports US Congressional data from Govtrack into the Neo4j graph database. This post shows how we can apply graph analytics to US Congressional data to find influential legislators in Congress. Using the Mazerunner open source graph analytics project we are able to use Apache Spark GraphX alongside Neo4j to run the PageRank algorithm on a collaboration graph of US Congress.

While Neo4j is a powerful graph database that allows for efficient OLTP queries and graph traversals using the Cypher query language, it is not optimized for global graph algorithms, such as PageRank. Apache Spark is a distributed in-memory large-scale data processing engine with a graph processing framework called GraphX. GraphX with Apache Spark is very efficient at performing global graph operations, like the PageRank algorithm. By using Spark alongside Neo4j we can enhance our analysis of US Congress using legis-graph.

Excellent walk-through to get you started on analyzing influence in congress, with modern data analysis tools. Getting a good grip on all these tools with be valuable.

Political scientists, among others, have studied the question of influence in Congress for decades so if you don’t want to repeat the results of others, being by consulting the American Political Science Review for prior work in this area.

An article that reports counter-intuitive results is: The Influence of Campaign Contributions on the Legislative Process by Lynda W. Powell.

From the introduction:

Do campaign donors gain disproportionate influence in the legislative process? Perhaps surprisingly, political scientists have struggled to answer this question. Much of the research has not identified an effect of contributions on policy; some political scientists have concluded that money does not matter; and this bottom line has been picked up by reporters and public intellectuals.1 It is essential to answer this question correctly because the result is of great normative importance in a democracy.

It is important to understand why so many studies find no causal link between contributions and policy outcomes. (emphasis added)

Linda cites much of the existing work on the influence of donations on process so her work makes a great starting point for further research.

As far as the lack of a “casual link between contributions and policy outcomes,” I think the answer is far simpler than Linda suspects.

The existence of a quid-pro-quo, the exchange of value for a vote on a particular bill, is the essence of the crime of public bribery. For the details (in the United States), see: 18 U.S. Code § 201 – Bribery of public officials and witnesses

What isn’t public bribery is to donate funds to an office holder on a regular basis, unrelated to any particular vote or act on the part of that official. Think of it as bribery on an installment plan.

When U.S. officials, such as former Secretary of State Hillary Clinton complain of corruption in other governments, they are criticizing quid-pro-quo bribery and not installment plan bribery as it is practiced in the United States.

Regular contributions gains ready access to legislators and, not surprisingly, more votes will go in your favor than random chance would allow.

Regular contributions are more expensive than direct bribes but avoiding the “causal link” is essential for all involved.

Spark Release 1.5.0

Thursday, September 10th, 2015

Spark Release 1.5.0

From the post:

Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0 visit the downloads page.

You can consult JIRA for the detailed changes. We have curated a list of high level changes here:

Time for your Fall Spark Upgrade!


The impact of fast networks on graph analytics, part 1

Monday, July 13th, 2015

The impact of fast networks on graph analytics, part 1 by Frank McSherry.

From the post:

This is a joint post with Malte Schwarzkopf, cross-blogged here and at the CamSaS blog.

tl;dr: A recent NSDI paper argued that data analytics stacks don’t get much faster at tasks like PageRank when given better networking, but this is likely just a property of the stack they evaluated (Spark and GraphX) rather than generally true. A different framework (timely dataflow) goes 6x faster than GraphX on a 1G network, which improves by 3x to 15-17x faster than GraphX on a 10G network.

I spent the past few weeks visiting the CamSaS folks at the University of Cambridge Computer Lab. Together, we did some interesting work, which we – Malte Schwarzkopf and I – are now going to tell you about.

Recently, a paper entitled “Making Sense of Performance in Data Analytics Frameworks” appeared at NSDI 2015. This paper contains some surprising results: in particular, it argues that data analytics stacks are limited more by CPU than they are by network or disk IO. Specifically,

“Network optimizations can only reduce job completion time by a median of at most 2%. The network is not a bottleneck because much less data is sent over the network than is transferred to and from disk. As a result, network I/O is mostly irrelevant to overall performance, even on 1Gbps networks.” (§1)

The measurements were done using Spark, but the authors argue that they generalize to other systems. We thought that this was surprising, as it doesn’t match our experience with other data processing systems. In this blog post, we will look into whether these observations do indeed generalize.

One of the three workloads in the paper is the BDBench query set from Berkeley, which includes a “page-rank-like computation”. Moreover, PageRank also appears as an extra example in the NSDI slide deck (slide 38-39), used there to illustrate that at most a 10% improvement in job completion time can be had even for a network-intensive workload.

This was especially surprising to us because of the recent discussion around whether graph computations require distributed data processing systems at all. Several distributed systems get beat by a simple, single-threaded implementation on a laptop for various graph computations. The common interpretation is that graph computations are communication-limited; the network gets in the way, and you are better off with one machine if the computation fits.[footnote omitted]

The authors introduce Rust and timely dataflow to achieve rather remarkable performance gains. That is if you think a 4x-16x speedup over GraphX on the same hardware is a performance gain. (Most do.)

Code and instructions are available so you can test their conclusions for yourself. Hardware is your responsibility.

While you are waiting for part 2 to arrive, try Frank’s homepage for some fascinating reading.

Running Spark GraphX algorithms on Library of Congress subject heading SKOS

Monday, May 4th, 2015

Running Spark GraphX algorithms on Library of Congress subject heading SKOS by Bob Ducharme.

From the post:

Well, one algorithm, but a very cool one.

Last month, in Spark and SPARQL; RDF Graphs and GraphX, I described how Apache Spark has emerged as a more efficient alternative to MapReduce for distributing computing jobs across clusters. I also described how Spark’s GraphX library lets you do this kind of computing on graph data structures and how I had some ideas for using it with RDF data. My goal was to use RDF technology on GraphX data and vice versa to demonstrate how they could help each other, and I demonstrated the former with a Scala program that output some GraphX data as RDF and then showed some SPARQL queries to run on that RDF.

Today I’m demonstrating the latter by reading in a well-known RDF dataset and executing GraphX’s Connected Components algorithm on it. This algorithm collects nodes into groupings that connect to each other but not to any other nodes. In classic Big Data scenarios, this helps applications perform tasks such as the identification of subnetworks of people within larger networks, giving clues about which products or cat videos to suggest to those people based on what their friends liked.

As so typically happens when you are reading one Bob DuCharme post, you see another that one requires reading!

Bob covers storing RDF in RDD (Resilient Distributed Dataset), the basic Spark data structure, creating the report on connected components and ends with heavily commented code for his program.

Sadly the “related” values assigned by the Library of Congress don’t say how or why the values are related, such as:

“Hiding places”





Related values could be useful in some cases but if I am searching on “privacy,” as in the sense of being free from government intrusion, then “solitude,” “loneliness,” and “hiding places” aren’t likely to be helpful.

That’s not a problem with Spark or SKOS, but a limitation of the data being provided.

Spark 1.2.0 released

Monday, December 22nd, 2014

Spark 1.2.0 released

From the post:

We are happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is the third release on the API-compatible 1.X line. It is Spark’s largest release ever, with contributions from 172 developers and more than 1,000 commits!

This release brings operational and performance improvements in Spark core including a new network transport subsytem designed for very large shuffles. Spark SQL introduces an API for external data sources along with Hive 13 support, dynamic partitioning, and the fixed-precision decimal type. MLlib adds a new pipeline-oriented package ( for composing multiple algorithms. Spark Streaming adds a Python API and a write ahead log for fault tolerance. Finally, GraphX has graduated from alpha and introduces a stable API.

Visit the release notes to read about the new features, or download the release today.

It looks like Christmas came a bit early this year. 😉

Lots of goodies to try out!

A Docker Image for Graph Analytics on Neo4j with Apache Spark GraphX

Thursday, November 27th, 2014

A Docker Image for Graph Analytics on Neo4j with Apache Spark GraphX by Kenny Bastani.

From the post:

I’ve just released a useful new Docker image for graph analytics on a Neo4j graph database with Apache Spark GraphX. This image deploys a container with Apache Spark and uses GraphX to perform ETL graph analysis on subgraphs exported from Neo4j. This docker image is a great addition to Neo4j if you’re looking to do easy PageRank or community detection on your graph data. Additionally, the results of the graph analysis are applied back to Neo4j.

This gives you the ability to optimize your recommendation-based Cypher queries by filtering and sorting on the results of the analysis.

This rocks!

If you were looking for an excuse to investigate Docker or Spark or GraphX or Neo4j, it has arrived!


Mazerunner – Update – Neo4J – GraphX

Saturday, November 8th, 2014

Three new algorithms have been added to Mazerunner:

  • Triangle Count
  • Connected Components
  • Strongly Connected Components

From: Using Apache Spark and Neo4j for Big Data Graph Analytics

Mazerunner uses a message broker to distribute graph processing jobs to Apache Spark’s GraphX module. When an agent job is dispatched, a subgraph is exported from Neo4j and written to Apache Hadoop HDFS.

That’s good news!

I first saw this in a tweet by Kenny Bastani

GraphX: Graph Processing in a Distributed Dataflow Framework

Monday, September 15th, 2014

GraphX: Graph Processing in a Distributed Dataflow Framework by Joseph Gonzalez, Reynold Xin, Ankur Dave, Dan Crankshaw, Michael Franklin, Ion Stoica.


In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distributed dataflow frameworks, GraphX brings low-cost fault tolerance to graph processing. We evaluate GraphX on real workloads and demonstrate that GraphX achieves an order of magnitude performance gain over the base dataflow framework and matches the performance of specialized graph processing systems while enabling a wider range of computation.

GraphX: Graph Processing in a Distributed Dataflow Framework (as PDF file)

The “other” systems for comparison were GraphLab and Giraph. Those systems were tuned in cooperation with experts in their use. These are some of the “fairest” benchmarks you are likely to see this year. Quite different from “shiny graph engine” versus lame or misconfigured system benchmarks.

Definitely the slow-read paper for this week!

I first saw this in a tweet by Arnon Rotem-Gal-Oz.

Spark Graduates Apache Incubator

Wednesday, February 19th, 2014

Spark Graduates Apache Incubator by Tiffany Trader.

From the post:

As we’ve touched on before, Hadoop was designed as a batch-oriented system, and its real-time capabilities are still emerging. Those eagerly awaiting this next evolution will be pleased to hear about the graduation of Apache Spark from the Apache Incubator. On Sunday, the Apache Spark Project committee unanimously voted to promote the fast data-processing tool out of the Apache Incubator.

Databricks refers to Apache Spark as “a powerful open source processing engine for Hadoop data built around speed, ease of use, and sophisticated analytics.” The computing framework supports Java, Scala, and Python and comes with a set of more than 80 high-level operators baked-in.

Spark runs on top of existing Hadoop clusters and is being pitched as a “more general and powerful alternative to Hadoop’s MapReduce.” Spark promises performance gains up to 100 times faster than Hadoop MapReduce for in-memory datasets, and 10 times faster when running on disk.

BTW, the most recent release, 0.90, includes GraphX.

Spark homepage.

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

Thursday, February 13th, 2014

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics by Reynold S Xin, et. al.


From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems using external storage systems, leading to extensive data movement and complicated programming model.

To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small, core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques such as automatic join rewrites to efficiently implement these graph-parallel operators. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves comparable performance as specialized graph computation systems, while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use.

Contributions of the paper:

1. a data model that unifies graphs and collections as composable first-class objects and enables both data-parallel and graph-parallel operations.

2. identifying a “narrow-waist” for graph computation, consisting of a small, core set of graph-operators cast in classic relational algebra; we believe these operators can express all graph computations in previous graph parallel systems, including Pregel and GraphLab.

3. an efficient distributed graph representation embedded in horizontally partitioned collections and indices, and a collection of execution strategies that achieve efficient graph computations by exploiting properties of graph computations.

UPDATE GraphX merged into Spark 0.9.0 release:

You will want to be familiar with Spark.

I first saw this in a tweet by Stefano Bertolo.

Analytics and Machine Learning at Scale [Aug. 29-30]

Tuesday, August 27th, 2013

AMP Camp Three – Analytics and Machine Learning at Scale

From the webpage:

AMP Camp Three – Analytics and Machine Learning at Scale will be held in Berkeley California, August 29-30, 2013. AMP Camp 3 attendees and online viewers will learn to solve big data problems using components of the Berkeley Data Analytics Stack (BDAS) and cutting edge machine learning algorithms.

Live streaming!

Sessions will cover (among other things): Mesos, Spark, Shark, Spark Streaming, BlinkDB, MLbase, Tachyon and GraphX.

Talk about a jolt before the weekend!

GraphX preview: Interactive Graph Processing on Spark [July 2, 2013 6:30 PM]

Monday, June 17th, 2013

GraphX preview: Interactive Graph Processing on Spark

Tuesday, July 2, 2013 6:30 PM

Flurry 360 3rd Street, 4th Floor, San Francisco, CA (map)

From the announcement:

This meetup will feature the first public preview of GraphX, a new graph processing framework for Spark. GraphX implements many ideas from an existing specialized graph processing system (in particular GraphLab) to make graph computation efficient. The new graph APIs builds on Spark’s existing RDD abstraction, and thus allows programmers to blend graph and tabular (RDD) views of graph data. For example, GraphX users will be able to use Spark to construct the graph from data on HDFS, and run graph computation (e.g. PageRank) directly on it in the same Spark cluster.

The proposed API enables extremely concise implementation of many graph algorithms. We provide implementations of 4 standard graph algorithms, including PageRank, Connected Components, Shortest Path all in less than 10 lines of code; we also implement the ALS algorithm for collaborative filtering in 40 lines of code. This simple API, coupled with the Scala REPL, enables users can use GraphX to interactively mine graph data in the console.

The talk will be presented by Reynold Xin and Joey Gonzalez. We would like to thank Flurry for providing the space and food.

Spots are going fast!

Register, attend, blog about it!

GraphX: A Resilient Distributed Graph System on Spark

Monday, May 20th, 2013

GraphX: A Resilient Distributed Graph System on Spark by Reynold Xin, Joseph Gonzalez, Michael Franklin, Ion Stoica.


From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has lead to the development of new graph-parallel systems (e.g. Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining.

We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.

Of particular note is the use of an immutable graph as the core data structure for GraphX.

The authors report that GraphX performs less well than PowerGraph (GraphLab 2.1) but promise performance gains and offsetting gains in productivity.

I didn’t find any additional resources at AMPLab on GraphX but did find:

Spark project homepage, and,

Screencasts on Spark

Both will benefit you when more information emerges on GraphX.