TinkerPop 2.3.0 has been unleashed by Marko A. Rodriguez.
Release notes for:
Distributed Graph Computing with Gremlin by Marko A. Rodriguez.
From the post:
script-step in Faunus’ Gremlin allows for the arbitrary execution of a Gremlin script against all vertices in the Faunus graph. This simple idea has interesting ramifications for Gremlin-based distributed graph computing. For instance, it is possible to evaluate a Gremlin script on every vertex in the source graph (e.g. Titan) in parallel while maintaining data/process locality. This section will discuss the following two use cases.
- Global graph mutations: parallel update vertices/edges in a Titan cluster given some arbitrary computation.
- Global graph algorithms: propagate information to arbitrary depths in a Titan cluster in order to compute some algorithm in a parallel fashion.
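To make the "same script on every vertex, in parallel" idea concrete, here is a minimal Python sketch. It is not Faunus or Gremlin code: the in-memory `VERTICES` table, the `script` function, and `run_on_all_vertices` are all hypothetical names, and a thread pool merely stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: vertex id -> properties (invented for illustration).
VERTICES = {
    1: {"name": "saturn", "type": "titan"},
    2: {"name": "jupiter", "type": "god"},
    3: {"name": "hercules", "type": "demigod"},
}

def script(vertex_id, props):
    """Stand-in for the per-vertex script: returns a mutated copy of the properties."""
    updated = dict(props)
    updated["name_length"] = len(props["name"])
    return vertex_id, updated

def run_on_all_vertices(vertices, fn, workers=4):
    """Apply fn to every vertex independently; each call sees only its own vertex."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda item: fn(*item), vertices.items())
        return dict(results)

mutated = run_on_all_vertices(VERTICES, script)
```

Because each call touches only its own vertex, the work can be split across workers in any order, which is the property the post relies on for locality.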
Another must read post from Marko A. Rodriguez!
Also a reminder that I need to pull out my Oxford Classical Dictionary to add some material to the mythology graph.
Titan-Android by David Wu.
From the webpage:
Titan-Android is a port/fork of Titan for the Android platform. It is meant to be a light-weight implementation of a graph database on mobile devices. The port removes HBase and Cassandra support as their usage makes little sense on a mobile device (convince me otherwise!). Gremlin is only supported via the Java interface as I have not been able to port Groovy successfully. Nevertheless, Titan-Android supports a local storage backend via BerkeleyDB and supports the Tinkerpop stack natively.
Just in case there was an Android under the tree!
I first saw this in a tweet by Marko A. Rodriguez.
Big Graph Data on Hortonworks Data Platform by Marko Rodriguez.
The Hortonworks Data Platform (HDP) conveniently integrates numerous Big Data tools in the Hadoop ecosystem. As such, it provides cluster-oriented storage, processing, monitoring, and data integration services. HDP simplifies the deployment and management of a production Hadoop-based system.
In Hadoop, data is represented as key/value pairs. In HBase, data is represented as a collection of wide rows. These atomic structures make global data processing (via MapReduce) and row-specific reading/writing (via HBase) simple. However, writing queries is nontrivial if the data has a complex, interconnected structure that needs to be analyzed (see Hadoop joins and HBase joins). Without an appropriate abstraction layer, processing highly structured data is cumbersome. Indeed, choosing the right data representation and associated tools opens up otherwise unimaginable possibilities. One such data representation that naturally captures complex relationships is a graph (or network). This post presents Aurelius’ Big Graph Data technology suite in concert with Hortonworks Data Platform. Moreover, for a real-world grounding, a GitHub clone is described in this context to help the reader understand how to use these technologies for building scalable, distributed, graph-based systems.
If you like graphs at all or have been looking at graph solutions, you are going to like this post.
Conditional Traversals With Gremlin by Max Lincoln.
An eligibility test that depends upon the ability to traverse to a particular node in the graph.
Reminded me of my musings on transient properties/edges.
Is not choosing an edge the same thing as the edge not being present? For all cases?
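A small sketch may help with that question. The Python below is not Max's Gremlin; the edge list, the labels, and the `can_reach` helper are invented for illustration. It treats eligibility as reachability and compares ignoring an edge label with deleting those edges outright.

```python
from collections import deque

# Hypothetical toy graph: each edge is (source, label, target).
EDGES = [
    ("applicant", "member_of", "club"),
    ("club", "sponsors", "program"),
    ("applicant", "follows", "friend"),
]

def can_reach(edges, start, goal, allow=lambda label: True):
    """Breadth-first search that only follows edges whose label passes `allow`."""
    adjacency = {}
    for source, label, target in edges:
        if allow(label):
            adjacency.setdefault(source, []).append(target)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False

# Eligibility as reachability: can the applicant get to the program?
eligible = can_reach(EDGES, "applicant", "program")

# Case 1: the "sponsors" edge exists but the traversal chooses not to follow it.
ignored = can_reach(EDGES, "applicant", "program", allow=lambda l: l != "sponsors")

# Case 2: the "sponsors" edge is removed from the graph entirely.
removed = can_reach([e for e in EDGES if e[1] != "sponsors"], "applicant", "program")
```

For a pure reachability test the two cases agree. They would stop agreeing as soon as something else inspects the graph, such as a count of edges or a different traversal that does choose the edge.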
Max mentions that NoSQL Distilled says this use case isn’t the typical one for graphs.
My suggestion is to experiment and rely on your own requirements and experiences.
Authors have to paint with a very broad brush or their books would all look like the Oxford English Dictionary (OED). Fascinating but not for the faint of heart.
The Dutch dictionary Woordenboek der Nederlandsche Taal, which has similar aims to the OED, is the largest and it took twice as long to complete.
I don’t read Dutch but the dictionary is reported to be available for free at: http://gtb.inl.nl/
If you read Dutch, please confirm/deny the report. I would like to send a little note along to the OED crowd about access as a public service. (Like they would care what I think. Still, doesn’t hurt to comment every now and again.)
Alex Popescu at myNoSQL captures a slide deck by Pierre De Wilde, “A Walk in Graph Databases.”
Has extensive examples using Gremlin after a short graph theory introduction.
Amusing graphic of everything looking like a table if all you have is a relational database.
Truth is that everything looks like a graph from a certain point of view.
Design question: What graph qualities, if any, are appropriate for your data and goals?
Always possible that graph representation or properties are inappropriate for your project.
A message from Marko Rodriguez announced the release of TinkerPop2 with notes on the major features of each:
- Massive changes to blueprints-core API
- TreePipe added for exposing the spanning tree of a traversal
- Automatic path and query optimizations
- https://github.com/tinkerpop/gremlin/downloads (download)
- FramedGraph is simply a wrapper graph in the Blueprints sense
- Synchronicity with the Blueprints API
- https://github.com/tinkerpop/rexster/downloads (download)
BTW, Marko says:
As you may know, there are big changes to the API: package renaming, new core API method names, etc. While this may be shocking, it is all worth it. In 2 weeks, there is going to be a release of something very big for which TinkerPop2 will be a central piece of the puzzle. Stay tuned and get ready for a summer of insane, crazy graph madness.
So, something to look forward to!
Visualizing a set of Hiveplots with Neo4j by Max De Marzi.
If you want to learn more about Hive Plots, take a look at his website and this presentation (it is quite large at 20 MB). I cannot do it justice in this short blog post, and in all honesty haven’t had the time to study it properly.
Today I just want to give you a little taste of Hiveplots. I am going to visualize the github graphs of nine languages you might not have heard of: Boo, Dylan, Factor, Gosu, Mirah, Nemerle, Nu, Parrot, Self. I’m not going to show you how to create the graph this time, because this is real data we are using. You can take a look at it on the data folder in github.
The graph is basically: (Language)–(Repository)–(User). There are two relationships between Repository and User, wrote and forked.
Hive plots are an effort by Martin Krzywinski to enable viewers of a graph visualization to distinguish between two or more graphs and to recognize key features of those graphs. His website is: http://www.hiveplot.com/.
Exploring Wikipedia with Gremlin Graph Traversals by Marko Rodriguez.
From the post:
There are numerous ways in which Wikipedia can be represented as a graph. The articles and the href hyperlinks between them is one way. This type of graph is known as a single-relational graph because all the edges have the same meaning: a hyperlink. A more complex rendering could represent the people discussed in the articles as “people-vertices” who know other “people-vertices” and that live in particular “city-vertices” and work for various “company-vertices”, so forth and so on until what emerges is a multi-relational concept graph. For the purpose of this post, a middle ground representation is used. The vertices are Wikipedia articles and Wikipedia categories. The edges are hyperlinks between articles as well as taxonomical relations amongst the categories.
If you aren’t interested in graph representations of data before reading this post, it is likely you will be afterwards.
Take a few minutes to read it and then let me know what you think.
A Well-Woven Study of Graphs, Brains, and Gremlins by Marko Rodriguez.
From the post:
What do graphs and brains have in common? First, they both share a relatively similar structure: Vertices/neurons are connected to each other by edges/axons. Second, they both share a similar process: traversers/action potentials propagate to effect some computation that is a function of the topology of the structure. If there exists a mapping between two domains, then it is possible to apply the processes of one domain (the brain) to the structure of the other (the graph). The purpose of this post is to explore the application of neural algorithms to graph systems.
Entertaining and informative post by Marko Rodriguez comparing graphs, brains and the graph query language Gremlin.
I agree with Marko on the potential of graphs but am less certain than I read him to be on how well we understand the brain. Both the brain and graphs have many dark areas yet to be explored. As we shine new light on one place, more unknown places are just beyond the reach of our light.
Romiko Derbynew writes:
The Neo4jClient now supports Cypher as a query language with Neo4j. However I noticed the following:
- Simple graph traversals are much more efficient when using Gremlin
- Queries in Gremlin are 30-50% faster for simple traversals
- Cypher is ideal for complex traversals where back tracking is required
- Cypher is our choice of query language for reporting
- Gremlin is our choice of query language for simple traversals where projections are not required
- Cypher has an intrinsic table projection model, where Gremlin’s table projection model relies on AS steps which can be cumbersome when backtracking e.g. Back(), As() and _CopySplit, where Cypher is just comma separated matches
- Cypher is much better suited for outer joins than Gremlin, to achieve similar results in gremlin requires parallel querying with CopySplit.
- Gremlin is ideal when you need to retrieve very simple data structures
- Table projection in gremlin can be very powerful, however outer joins can be very verbose
So in a nutshell, we like to use Cypher when we need tabular data back from Neo4j and is especially useful in outer joins.
Excellent comparison of Gremlin vs. Cypher. Both have their advantages.
Max Flow with Gremlin and Transactions
Max De Marzi writes:
The maximum flow problem was formulated by T.E. Harris as follows:
Consider a rail network connecting two cities by way of a number of intermediate cities, where each link of the network has a number assigned to it representing its capacity. Assuming a steady state condition, find a maximal flow from one given city to the other.
Back in the mid 1950s the US Military had an interest in finding out how much capacity the Soviet railway network had to move cargo from the Western Soviet Union to Eastern Europe. This led to the Maximum Flow problem and the Ford–Fulkerson algorithm to solve it.
If you’ve been reading the Neo4j Gremlin Plugin documentation, you’ll remember it has a section on Flow algorithms with Gremlin. Let’s add a couple of things and bring this example to life.
If that sounds like an out-dated Cold War problem, consider Max’s conclusion:
The max flow and related problems manifest in many ways. Water or sewage through underground pipes, passengers on a subway system, data through a network (the internet is just a series of tubes!), roads and highway planning, airline routes, even determining which sports teams have been eliminated from the playoffs.
What else can be modeled as max flow or related problems? Drug/weapons smuggling? Oil/gas/electricity transport? Others?
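For readers who want to try the rail problem on paper first, here is a minimal Python sketch of Edmonds-Karp, the variant of Ford–Fulkerson that finds augmenting paths by breadth-first search. It is not the Gremlin from Max's post or the Neo4j plugin documentation; the `RAIL` network and its capacities are invented.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: Ford-Fulkerson with shortest augmenting paths found by BFS."""
    if source == sink:
        return 0
    # Residual capacities, with a zero-capacity reverse edge for every link.
    residual = {}
    for u, targets in capacity.items():
        for v, c in targets.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)
    total = 0
    while True:
        # BFS from the source for any path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left
        # Walk back from the sink to find the bottleneck on this path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck along the path and open up the reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

# Hypothetical rail network: city -> {next city: trains per day}.
RAIL = {
    "A": {"B": 3, "C": 2},
    "B": {"C": 1, "D": 2},
    "C": {"D": 3},
    "D": {},
}

flow = max_flow(RAIL, "A", "D")
```

The reverse edges are what let a later path undo an earlier, poorer routing choice. The same function works for pipes, subway lines, or network links once they are written as a capacity table.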
A quick list of the new features:
Wait! Did I say Cypher and Gremlin!?
Looks like this graph querying stuff is spreading.
Even if you are not working in bioinformatics, Bio4j is worth more than a quick look.
I ran across this Wikipedia book while working on one of the data structures posts for today.
I think you may find it useful but some cautions:
First, being a collection of Wikipedia articles, it doesn’t have a consistent editorial voice. That is more than being fussy, the depth and usefulness of explanations will vary from article to article.
Second, you will find topics that are “stubs,” and hence not very useful.
Third, I think with the advent of Neo4j, Gremlin, Cypher and other graph databases/software, future entries should have in addition to text, exercises that users can perform with common software to reinforce their understanding of entries.
From the post:
In a time long, long right now and a place far, far within, there exists a little green gremlin named…well, Gremlin. Gremlin lives in a place known as TinkerPop. For those who think of a “place” as some terrestrial surface coating a sphere that is circling one of the many massive fiery nuclear reactors in the known universe, TinkerPop is that, yet at the same time, a wholly different type of place indeed.
In a day of obscure (are there any other kind?) errors and annoyances, this is an absolute delight!
New homepage design: http://tinkerpop.com
Blueprints 1.1 (Blueberry):
Gremlin 1.4 (Ain’t No Thing But a Chicken Wing):
You didn’t really want to spend all weekend holiday shopping and hanging out with relatives did you?
OK the real title is: JVM Language Implementations. I like mine better.
From the webpage:
Gremlin is a style of graph traversing that can be hosted in any number of languages. The benefit of this is that users can make use of the programming language they are most comfortable with and still be able to evaluate Gremlin-style traversals. This model is different than, let’s say, using SQL in Java where the query is evaluated by passing a string representation of the query to the SQL engine. On the contrary, with native Gremlin support for other JVM languages, there is no string passing. Instead, simple method chaining in Gremlin’s fluent style. However, the drawback of this model is that for each JVM language, there are syntactic variations that must be accounted for.
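The difference between passing a query string and chaining methods is easy to show in miniature. The Python sketch below is an assumption-laden toy: the `Traversal` class, its `out`, `where` and `values` methods, and the dict-based graph are hypothetical and are not Gremlin's actual API in any host language.

```python
class Traversal:
    """Hypothetical fluent traversal over a dict-based graph (not Gremlin's real API)."""

    def __init__(self, graph, vertices):
        self.graph = graph
        self.vertices = list(vertices)

    def out(self, label):
        """Step to the vertices reached by outgoing edges with the given label."""
        reached = [
            target
            for vertex in self.vertices
            for edge_label, target in self.graph["edges"].get(vertex, [])
            if edge_label == label
        ]
        return Traversal(self.graph, reached)

    def where(self, predicate):
        """Keep only the vertices whose property dict satisfies the predicate."""
        kept = [v for v in self.vertices if predicate(self.graph["props"][v])]
        return Traversal(self.graph, kept)

    def values(self, key):
        """End the chain by reading one property from each remaining vertex."""
        return [self.graph["props"][v][key] for v in self.vertices]


# Invented toy data.
graph = {
    "props": {
        1: {"name": "marko", "age": 29},
        2: {"name": "vadas", "age": 27},
        3: {"name": "josh", "age": 32},
    },
    "edges": {1: [("knows", 2), ("knows", 3)]},
}

# No query string anywhere: each step is an ordinary method call in the host language.
names = Traversal(graph, [1]).out("knows").where(lambda p: p["age"] > 30).values("name")
```

Because the predicate is a plain host-language function, the compiler or interpreter checks it like any other code, which is the benefit the quoted passage describes. The cost is that each host language spells the chain a little differently.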
Seeing is believing.
From the readme file:
Pilot is a graph database operator that allows you to perform common application-level operations on graph databases without delving into the details of their implementation or requiring knowledge of the component technologies.
Pilot aims to support graph databases conforming to the property graph model. Pilot employs technologies from the Tinkerpop stack — specifically Blueprints and Gremlin — for general access and manipulation of the underlying graph database, but also uses native graph database APIs to further optimize performance for certain operations. In addition, Pilot also handles multithreading and transaction management, while keeping all of these abstracted away from the calling application. As such, Pilot is ideally suited for use in concurrent web applications.
- Supported graph database providers:
- Some of the functionality currently supported by Pilot include:
- Get edges between given vertices
- Get neighbors of a given vertex
- Retrieving vertices corresponding to some properties (see Property Graph Model)
- Transaction management
- Thread synchronization for multithreaded access
- Large commit optimization
- Application profiling
- Planned additions:
- Support for Furnace
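To give a feel for the kind of application-level operations in that list, here is a minimal in-memory property graph in Python. It is only a sketch: `PropertyGraph` and its method names are hypothetical and are not Pilot's API, and it leaves out transactions and threading entirely.

```python
class PropertyGraph:
    """Toy property graph: vertices and edges both carry key/value properties."""

    def __init__(self):
        self.vertices = {}  # vertex id -> property dict
        self.edges = []     # (source id, label, target id, property dict)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, source, label, target, **props):
        self.edges.append((source, label, target, props))

    def neighbors(self, vid):
        """Vertices adjacent to vid, ignoring edge direction."""
        outgoing = {t for s, _, t, _ in self.edges if s == vid}
        incoming = {s for s, _, t, _ in self.edges if t == vid}
        return outgoing | incoming

    def edges_between(self, a, b):
        """All edges joining a and b, in either direction."""
        return [e for e in self.edges if {e[0], e[2]} == {a, b}]

    def find_vertices(self, **props):
        """Ids of vertices whose properties match every given key/value pair."""
        return [
            vid
            for vid, stored in self.vertices.items()
            if all(stored.get(k) == v for k, v in props.items())
        ]


g = PropertyGraph()
g.add_vertex("a", kind="person", name="Ann")
g.add_vertex("b", kind="person", name="Bob")
g.add_vertex("c", kind="city", name="Oslo")
g.add_edge("a", "knows", "b", since=2010)
g.add_edge("a", "lives_in", "c")

people = g.find_vertices(kind="person")
```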
Graph databases aren’t a new idea. I don’t have the reference at hand but once ran across a relational database that was implemented as a hypergraph. It may be that computing power has finally gotten to the point that graph databases, or at least their capabilities, will be the common expectation.
An Introduction to Tinkerpop by Takahiro Inoue.
Excellent introduction to the Tinkerpop stack.
Marko Rodriguez posted the following note to the Gremlin-users mailing list today:
For many months, the TinkerPop community has been trying to realize the best way to go about providing a graph analysis package to the TinkerPop stack ( http://bit.ly/qCMlcP ). With the increased flexibility and power of Pipes and the partitioning of Gremlin into multiple JVM languages, we feel that the stack is organized correctly now to support Furnace — A Property Graph Algorithms Package.
The project is currently just stubbed, but over time you can expect the ability to evaluate standard (and non-standard) graph analysis algorithms over Blueprints-enabled graphs in a way that respects explicit and implicit associations in the graph. In short, it will implement the ideas articulated in:
This will be possible due to Pipes and the ability to represent abstract relationships using Pipes, Gremlin_groovy (and the upcoming Gremlin_scala). Moreover, while more thought is needed, there will be a way to talk at the Frames-levels (http://frames.tinkerpop.com) and thus, calculate graph algorithms according to one’s domain model. Ultimately, in time, as Furnace develops, we will see a Rexster-Kibble that supports the evaluation of algorithms via Rexster.
While the project is still developing, please feel free to contribute ideas and/or participate in the development process. To conclude, we hope people are excited about the promises that Furnace will bring by raising the processing abstraction level above the imperative representations of Pipes/Gremlin.
You have been waiting for the opportunity to contribute to the Tinkerpop stack, particularly on graph analysis, so here is your chance! Seriously, you need to forward this to every graph person, graph project and graduate student taking graph theory.
We can use simple graphs and hope (pray?) the world is a simple place. Or use more complex graphs to model the world. Do you feel lucky? Do you?
A Graph-Based Movie Recommender Engine by Marko A. Rodriguez.
From the post:
A recommender engine helps a user find novel and interesting items within a pool of resources. There are numerous types of recommendation algorithms and a graph can serve as a general-purpose substrate for evaluating such algorithms. This post will demonstrate how to build a graph-based movie recommender engine using the publicly available MovieLens dataset, the graph database Neo4j, and the graph traversal language Gremlin. Feel free to follow along in the Gremlin console as the post will go step-by-step from data acquisition, to parsing, and ultimately, to traversing.
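One common traversal for this kind of recommender goes from a user to the movies they rated highly, on to other users who rated those movies highly, and then to what else those users liked. The Python sketch below walks that path over a tiny invented ratings table. It is not Marko's Gremlin and does not use the MovieLens data; `RATINGS`, `recommend`, and the four-star threshold are assumptions.

```python
from collections import Counter

# Hypothetical ratings: user -> {movie: stars}. Invented data, not MovieLens.
RATINGS = {
    "ann": {"Alien": 5, "Brazil": 4},
    "bob": {"Alien": 5, "Casablanca": 5, "Dune": 2},
    "cat": {"Brazil": 5, "Casablanca": 4, "Eraserhead": 4},
    "dan": {"Dune": 5},
}

def recommend(ratings, user, min_stars=4):
    """Rank unseen movies by the number of user -> movie -> user -> movie paths."""
    liked = {m for m, s in ratings[user].items() if s >= min_stars}
    scores = Counter()
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        their_liked = {m for m, s in their_ratings.items() if s >= min_stars}
        shared = liked & their_liked
        if not shared:
            continue  # no path from the user to this person
        for movie in their_liked:
            if movie not in ratings[user]:
                # One path per shared movie leads through `other` to this movie.
                scores[movie] += len(shared)
    return [movie for movie, _ in scores.most_common()]

suggestions = recommend(RATINGS, "ann")
```

Counting paths is the simplest possible scoring. Weighting by rating or by genre would change the ranking but not the shape of the traversal.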
As important as graph engines, algorithms and research are at present, and as important as they will become, I think the Neo4j community itself is worthy of direct study. There are stellar contributors to the technology and the community, but is that what makes it such an up and coming community? Or perhaps how they contributed? It would take a raft (is that the term for a group of sociologists?) of sociologists and perhaps there are existing studies of online communities that might have some clues. I mention that because there are other groups I would like to see duplicate the success of the Neo4j community.
Marko takes you from data import to a useful (albeit limited) application in less than 2500 words. (measured to the end of the conclusion, excluding further reading)
And leaves you with suggestions for further exploring.
That is a blog post that promotes a paradigm. (And for anyone who takes offense at that observation, it applies to my efforts as well. There are other ways to promote a paradigm but you have to admit, this is a fairly compelling one.)
Put Marko’s post on your read with evening coffee list.
Marko Rodriguez announced a new round of Tinkerpop Stack Releases today:
The TinkerPop stack went through another round of releases this morning.
- Blueprints 1.0 (Blueprints): = https://github.com/tinkerpop/blueprints/wiki/Release-Notes
- Pipes 0.8 (Cleaner): = https://github.com/tinkerpop/pipes/wiki/Release-Notes
- Frames 0.5 (Beams): = https://github.com/tinkerpop/frames/wiki/Release-Notes
- Gremlin 1.3 (On the Case): = https://github.com/tinkerpop/gremlin/wiki/Release-Notes
- Rexster 0.6 (Dalmatian): = https://github.com/tinkerpop/rexster/wiki/Release-Notes
- Rexster-Kibbles 0.6 = http://rexster-kibbles.tinkerpop.com
For those using Gremlin, Pipes, and Rexster, be sure to look through the release notes as APIs have changed slightly. Here are the main points of this release:
- Blueprints now has transaction buffers and Neo4jBatchGraph for bulk loading a Neo4j graph.
- Pipes makes use of FluentPipeline and PipeFunction which yields great expressivity and further opens up the framework to other JVM languages.
- Gremlin is ~2.5x faster in many situations and has relegated most of its functionality to Pipes and native Java.
- Rexster supports Neo4j High Availability and more updates to its REST API.
From the wiki:
After the first public release of Spring Data Graph in April 2011 we mainly focused on user feedback.
With the improved documentation around the tooling and an upgraded AspectJ version we addressed many of the AspectJ issues that were reported by users. With the latest STS and Eclipse and hopefully with Idea11 it is possible to develop Spring Data Graph applications without the red wiggles. To further ease the development we also provided sample build scripts for ant/ivy and a plugin for gradle.
Of course we kept pace with development of Neo4j, currently using the latest stable release of Neo4j (1.4.1).
So we strove to support it on all levels. Now, it is possible to execute Cypher queries from Spring Data Graph Repositories, from the Neo4j-Template but also as part of dynamic field annotations and via the introduced entity methods. The same goes for Gremlin scripts. What’s possible with this new expressive power? Let’s take a look. …
OK, better? Worse? About the same? Projects can’t improve without your feedback. Issues discussed only around water coolers can’t be addressed. Yes?
There’s some famous so-and-so’s Law about non-reported comments but I can’t find the reference. You?
On the Nature of Pipes by Marko Rodriguez.
From the post:
Pipes is a data flow framework developed by TinkerPop. The graph traversal language Gremlin is a Groovy-based domain-specific language for processing Blueprints-enabled graph databases with Pipes. Since the release of Pipes 0.7 on August 1, 2011, much of the functionality in Gremlin has been generalized and made available through Pipes. This has opened up the door for other JVM languages (e.g. JRuby, Jython, Clojure, etc.) to serve as host languages for graph traversal DSLs. In order to promote this direction, this post will explain Pipes from the vantage point of Gremlin.
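The data flow idea can be imitated with Python generators, where each pipe lazily consumes one stream and produces another. This is a rough analogy under stated assumptions, not the Pipes API: `out_pipe`, `filter_pipe`, `pipeline`, and the `GRAPH` table are hypothetical names.

```python
def out_pipe(graph, label):
    """Return a pipe that maps each incoming vertex to its labeled out-neighbors."""
    def pipe(vertices):
        for vertex in vertices:
            for edge_label, target in graph.get(vertex, []):
                if edge_label == label:
                    yield target
    return pipe

def filter_pipe(predicate):
    """Return a pipe that passes along only the items satisfying the predicate."""
    def pipe(items):
        return (item for item in items if predicate(item))
    return pipe

def pipeline(*pipes):
    """Compose pipes left to right; nothing runs until the result is iterated."""
    def run(source):
        stream = iter(source)
        for pipe in pipes:
            stream = pipe(stream)
        return stream
    return run

# Invented toy graph: vertex -> list of (edge label, target vertex).
GRAPH = {
    "a": [("knows", "b"), ("knows", "c"), ("created", "x")],
    "b": [("created", "x"), ("created", "y")],
    "c": [("created", "z")],
}

# What did the people "a" knows create, other than "x"?
friends_projects = pipeline(
    out_pipe(GRAPH, "knows"),
    out_pipe(GRAPH, "created"),
    filter_pipe(lambda vertex: vertex != "x"),
)
result = list(friends_projects(["a"]))
```

Since a pipeline is just a composition of functions, any language with iterators and closures can host the same pattern, which is the direction the post describes for other JVM languages.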
You may not be a graph database enthusiast after reading Marko’s post but you will increase your understanding of them.
That you are not then a graph database enthusiast will be your own fault.
Good news from Marko Rodriguez:
TinkerPop just released a new round of stable releases.
Blueprints 0.9 (Mavin) – https://github.com/tinkerpop/blueprints/wiki/Release-Notes
Pipes 0.7 (PVC) – https://github.com/tinkerpop/pipes/wiki/Release-Notes
Frames 0.4 (Studs) – https://github.com/tinkerpop/frames/wiki/Release-Notes
Gremlin 1.2 (New Sheriff in Town) – https://github.com/tinkerpop/gremlin/wiki/Release-Notes
Rexster 0.5 (Dog Star) – https://github.com/tinkerpop/rexster/wiki/Release-Notes
Here are the main points with each release:
The Pathology of Graph Databases by Marko A. Rodriguez.
If you want to learn Gremlin as a graph traversal language you would be hard pressed to find a better starting place.
From the Overview:
Bulbs is an open-source Python persistence framework for graph databases and the first piece of a larger Web-development toolkit that will be released in the upcoming weeks.
It’s like an ORM for graphs, but instead of SQL, you use the graph-traversal language Gremlin to query the database.
This means your code is portable because you can plug into different graph database backends without worrying about vendor lock in.
Bulbs was developed in the process of building Whybase, a startup that will open for preview this fall. Whybase needed a persistence layer to model its complex relationships, and Bulbs is an open-source version of that framework.
Will be watching for future developments!
From the webpage:
This is a special edition of OrientDB with these TinkerPop technologies in bundle:
- Blueprints provides a collection of interfaces and implementations to common, complex data structures. In short, Blueprints provides a one stop shop for implemented interfaces to help developers create software without being tied to particular underlying data management systems.
- Gremlin is a Turing-complete, graph-based programming language designed for key/value-pair multi-relational graphs. Gremlin makes use of an XPath-like syntax to support complex graph traversals. This language has application in the areas of graph query, analysis, and manipulation.
- Pipes is a graph-based data flow framework for Java 1.6+. A process graph is composed of a set of process vertices connected to one another by a set of communication edges. Pipes supports the splitting, merging, and transformation of data from input to output.
The graph community just keeps getting stronger.
From the post:
What do graphs and brains have in common? First, they both share a relatively similar structure: Vertices/neurons are connected to each other by edges/axons. Second, they both share a similar process: traversers/action potentials propagate to effect some computation that is a function of the topology of the structure. If there exists a mapping between two domains, then it is possible to apply the processes of one domain (the brain) to the structure of the other (the graph). The purpose of this post is to explore the application of neural algorithms to graph systems.
As only Marko could answer the question: “What do graphs and brains have in common?”
I am particularly interested in the use of spreading activation for subject recognition. How do we capture such a recognition and/or communicate it to others?
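For anyone who has not met spreading activation, here is one simple formulation in Python. It is a sketch, not the algorithm from Marko's post: the even split among neighbors, the `decay` factor, the fixed number of steps, and the `LINKS` graph are all assumptions.

```python
def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Push energy outward from seed vertices, splitting it evenly and decaying per hop."""
    total = dict(seeds)     # accumulated energy per vertex
    frontier = dict(seeds)  # energy that still has to be passed on
    for _ in range(steps):
        next_frontier = {}
        for vertex, energy in frontier.items():
            neighbors = graph.get(vertex, [])
            if not neighbors:
                continue  # energy at a dead end goes no further
            share = energy * decay / len(neighbors)
            for neighbor in neighbors:
                next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + share
        for vertex, energy in next_frontier.items():
            total[vertex] = total.get(vertex, 0.0) + energy
        frontier = next_frontier
    return total

# Invented toy graph: vertex -> list of out-neighbors.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

activation = spread_activation(LINKS, {"a": 1.0})
```

Vertex "d" is never a seed, yet it collects energy along two paths. A vertex that lights up from several seeds at once is a candidate for "the subject these clues have in common", which is where the recognition question starts.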
Instructions on creating a local copy of the Gremlin wiki (posted to the email@example.com mailing list by Pierre De Wilde).
The instructions (with minor formatting changes) from his post:
For those who want a local copy of Gremlin wiki:
git clone https://github.com/tinkerpop/gremlin.wiki.git
cd gremlin.wiki
gollum
Open your browser at http://localhost:4567 and ta-da…
Moreover, the wiki is searchable and (unlike the github version) it’s printer-friendly.
Gollum is a simple wiki system built on top of Git that powers GitHub Wikis.
To install Gollum, use RubyGems (http://rubygems.org/):
[sudo] gem install gollum
Of course, the same procedure may be applied for other Tinkerpop repositories (blueprints, pipes, frames, rexster, rexster-kibbles).
Unfortunately, gollum cannot access multiple repositories at once, so you will need to launch several instances, each with a different port (gollum --port xxxx).