Using Apache Spark and Neo4j for Big Data Graph Analytics by Kenny Bastani.
From the post:
…
Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement. Still, where does all that data come from? Where does it go when the analysis is done?
Graph databases
I’ve been working with graph database technologies for the last few years and I have yet to become jaded by their powerful ability to combine the transformation of data with analysis. Graph databases like Neo4j are solving problems that relational databases cannot.
Graph processing at scale from a graph database like Neo4j is a tremendously valuable power.
But if you wanted to run PageRank on a dump of Wikipedia articles in less than 2 hours on a laptop, you’d be hard pressed to be successful. More so, what if you wanted the power of a high-performance transactional database that seamlessly handled graph analysis at this scale?
Mazerunner for Neo4j
Mazerunner is a Neo4j unmanaged extension and distributed graph processing platform that extends Neo4j to do big data graph processing jobs while persisting the results back to Neo4j.
Mazerunner uses a message broker to distribute graph processing jobs to Apache Spark’s GraphX module. When an agent job is dispatched, a subgraph is exported from Neo4j and written to Apache Hadoop HDFS.
…
Mazerunner is an alpha release with PageRank as its only algorithm.
It has a great deal of potential, so it is worth your time to investigate further.
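For readers who have not met PageRank before, here is a minimal sketch of the algorithm as a plain power iteration over a tiny in-memory edge list. It is not Mazerunner's or GraphX's implementation. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the four-edge sample graph are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges.

    Illustrative only: the parameters and defaults are assumptions,
    not values taken from Mazerunner or GraphX.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly,
        # so the ranks keep summing to 1.
        dangling = sum(ranks[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {node: base for node in nodes}

        # Each node splits its damped rank equally among its targets.
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                new_ranks[dst] += damping * ranks[src] / len(targets)

        ranks = new_ranks

    return ranks


# Hypothetical sample graph: A links to B and C, B links to C, C links to A.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
ranks = pagerank(edges)

for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

On a graph this small the loop finishes instantly. The difficulty described in the post comes from running the same iteration over something the size of a Wikipedia dump, which is where a distributed engine such as Spark's GraphX comes in.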