Archive for the ‘Gremlin’ Category

Visualizing your Titan graph database:…

Friday, June 17th, 2016

Visualizing your Titan graph database: An update by Marco Liberati.

From the post:

Last summer, we wrote a blog with our five simple steps to visualizing your Titan graph database with KeyLines. Since then TinkerPop has emerged from the Apache Incubator program with TinkerPop3, and the Titan team have released v1.0 of their graph database:

  • TinkerPop3 is the latest major reincarnation of the graph project, pulling together the multiple ventures into a single united ecosystem.
  • Titan 1.0 is the first stable release of the Titan graph database, based on the TinkerPop3 stack.

We thought it was about time we updated our five-step process, so here’s:

Not exactly five (5) steps because you have to acquire a KeyLines trial key, etc.

A great endorsement of the much-improved installation process for TinkerPop3 and Titan 1.0.


Nine Inch Gremlins

Saturday, April 23rd, 2016

Nine Inch Gremlins


Stephen Mallette writes:

On the back of TinkerPop 3.1.2-incubating comes TinkerPop 3.2.0-incubating. Yes – a dual release – an unprecedented and daring move you’ve come to expect and not expect from the TinkerPop clan! Be sure to review the upgrade documentation in full as you may find some changes that introduce some incompatibilities.

The release artifacts can be found at this location:

The online docs can be found here: (user docs) (upgrade docs) (core javadoc) (full javadoc)

The release notes are available here:

The Central Maven repo has sync’d as well:

Another impressive release!

In reading the documentation I discovered that Ketrina Yim is responsible for drawing Gremlin and his TinkerPop friends.

I was relieved to find that Marko was only responsible for the Gremlin/TinkerPop code/prose and not the graphics as well. That would be too much talent for any one person! 😉


Planet TinkerPop [+ 2 New Graph Journals]

Tuesday, April 12th, 2016

Planet TinkerPop

From the webpage:

Planet TinkerPop is a vendor-agnostic, community-driven site aimed at advancing graph technology in general and Apache TinkerPop™ in particular. Graph technology is used to manage, query, and analyze complex information topologies composed of numerous heterogenous relationships and is currently benefiting companies such as Amazon, Google, and Facebook. For all companies to ultimately adopt graph technology, vendor-agnostic graph standards and graph knowledge must be promulgated. For the former, TinkerPop serves as an Apache Software Foundation governed community that develops a standard graph data model (the property graph) and query language (Gremlin). Apache TinkerPop is a widely supported graph computing framework that has been adopted by leading graph system vendors and interfaced with by numerous graph-based applications across various industries. For educating the public on graphs, Planet TinkerPop’s Technology journal publishes articles about TinkerPop-related graph research and development. The Use Cases journal promotes articles on the industrial use of graphs and TinkerPop. The articles are contributed by members of the Apache TinkerPop community and additional contributions are welcomed and strongly encouraged. We hope you enjoy your time learning about graphs here at Planet TinkerPop.

If you are reading about Planet TinkerPop I can skip the usual “graphs are…” introductory comments. 😉

Planet TinkerPop is a welcome addition to the online resources on graphs in general and TinkerPop in particular.

So they aren’t buried in the prose, let me highlight two new journals at Planet TinkerPop:

TinkerPop Technology journal  publishes articles about TinkerPop-related graph research and development.

TinkerPop Use Cases journal  promotes articles on the industrial use of graphs and TinkerPop.

Both are awaiting your contributions!


PS: I prepended “TinkerPop” to the journal names and suggest that an ISSN would be appropriate for both journals.

Gremlin Users – Beware the Double-Underscore!

Wednesday, February 3rd, 2016

A user recently posted this example from the Gremlin documentation:

out()).values(‘name’) [apologies for the line wrap]

which returned:

“No such property: _ for class: Script121”

Marko Rodriguez responded:

Its a double underscore, not a single underscore.

__ vs. _
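For readers without a Gremlin console handy, here is a rough Python analogy of what went wrong. It is only a sketch: `AnonymousTraversal` and `run_script` are names I made up, the `'knows'` label is a placeholder, and none of this is TinkerPop's actual API. The only point is that a script scope which binds `__` knows nothing about `_`.

```python
class AnonymousTraversal:
    """Toy stand-in for an anonymous traversal; it only records step names."""

    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def out(self, *labels):
        return AnonymousTraversal(self.steps + (("out", labels),))

    def values(self, *keys):
        return AnonymousTraversal(self.steps + (("values", keys),))


def run_script(script):
    """Evaluate a one-line script in a scope where only `__` is bound."""
    scope = {"__builtins__": {}, "__": AnonymousTraversal()}
    try:
        return eval(script, scope).steps
    except NameError as err:
        # Roughly the Python counterpart of "No such property: _"
        return "error: " + str(err)


# Double underscore: the name is bound, so the steps are recorded.
print(run_script("__.out('knows').values('name')"))
# Single underscore: one character short, and the lookup fails.
print(run_script("_.out('knows').values('name')"))
```

The two scripts differ by a single character, which is exactly why the mistake is so easy to make and so hard to spot.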

I mention this to benefit beginning Gremlin users who haven’t developed an underscore stutter but also as a plea for sanity in syntax design.

It is easy to type two successive underscores, but the obviousness of a double underscore versus a single underscore depends on local typography.

To say nothing of the fact that what might be obvious to the eyes of a twenty-something may not be as obvious to the eyes of a fifty-something+.

In syntax design, answer the question:

Do you want to be clever or clear?

A Practical Guide to Graph Databases

Wednesday, January 20th, 2016

A Practical Guide to Graph Databases by Matthias Broecheler.

Slides from Graph Day 2016 @ Austin.

If you notice any of the “trash talking” on social media about graphs and graph databases, you will find slide 15 quite amusing.

Not everyone agrees on the relative position of graph products. 😉

I haven’t seen a video of Matthias’ presentation. If you happen across one, give me a ping. Thanks!

Quantum Walks with Gremlin [Graph Day, Austin]

Wednesday, November 25th, 2015

Quantum Walks with Gremlin by Marko A. Rodriguez, Jennifer H. Watkins.


A quantum walk places a traverser into a superposition of both graph location and traversal “spin.” The walk is defined by an initial condition, an evolution determined by a unitary coin/shift-operator, and a measurement based on the sampling of the probability distribution generated from the quantum wavefunction. Simple quantum walks are studied analytically, but for large graph structures with complex topologies, numerical solutions are typically required. For the quantum theorist, the Gremlin graph traversal machine and language can be used for the numerical analysis of quantum walks on such structures. Additionally, for the graph theorist, the adoption of quantum walk principles can transform what are currently side-effect laden traversals into pure, stateless functional flows. This is true even when the constraints of quantum mechanics are not fully respected (e.g. reversible and unitary evolution). In sum, Gremlin allows both types of theorist to leverage each other’s constructs for the advancement of their respective disciplines.
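To make the three ingredients in the abstract (initial condition, coin/shift evolution, measurement) a little more concrete, here is a minimal sketch of a discrete-time quantum walk on a cycle. It is my own toy in plain Python, not code from the paper and not Gremlin: the Hadamard coin, the cycle topology, and the name `hadamard_walk` are all assumptions chosen for brevity.

```python
import math


def hadamard_walk(num_nodes, steps, start=0):
    """Discrete-time quantum walk on a cycle with a Hadamard coin.

    Returns the probability of measuring the walker at each node.
    """
    # amp[node] = [amplitude with spin 0, amplitude with spin 1]
    amp = [[0j, 0j] for _ in range(num_nodes)]
    amp[start][0] = 1 + 0j  # initial condition: at `start`, spin 0

    s = 1 / math.sqrt(2)
    for _ in range(steps):
        nxt = [[0j, 0j] for _ in range(num_nodes)]
        for node, (a0, a1) in enumerate(amp):
            # Coin: unitary Hadamard mix of the two spin amplitudes.
            c0 = s * (a0 + a1)
            c1 = s * (a0 - a1)
            # Shift: spin 0 steps to node - 1, spin 1 steps to node + 1.
            nxt[(node - 1) % num_nodes][0] += c0
            nxt[(node + 1) % num_nodes][1] += c1
        amp = nxt

    # Measurement: squared magnitude, summed over both spins.
    return [abs(a0) ** 2 + abs(a1) ** 2 for a0, a1 in amp]


probabilities = hadamard_walk(num_nodes=8, steps=5)
print(sum(probabilities))  # stays at 1 up to rounding, since each step is unitary
```

Because the coin and the shift are both unitary, the total probability is conserved at every step, which is the property the abstract says may be relaxed when quantum walk ideas are borrowed for ordinary traversals.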

Best not to tackle this new paper on Gremlin and quantum graph walks after a heavy meal. 😉

Marko will be presenting at Graph Day, 17 January 2016, Austin, Texas. Great opportunity to hear him speak along with other cutting edge graph folks.

The walk Marko describes is located in a Hilbert space. Understandable because numerical solutions require the use of a metric space.

However, if you are modeling semantics in different universes of discourse, realize that semantics don’t possess metric spaces. Semantics lie outside of metric space, although I concede that many have imposed varying arbitrary metrics on semantics.

For example, if I am mapping the English term “black,” as in a color, to the term “schwarz” in German, I need a “traverser” that enables the existence of both terms at separate locations, one for each universe in the graph.

You may protest that this is overly complex for the representation of synonyms, but consider that “schwarz” occupies a different location in the universe of German and etymology from “black.”

For advertising, subtleties of language may not be useful, but for reading medical or technical works, an “approximate” or “almost right” meaning may be more damaging than helpful.

Who knows? Perhaps quantum computers will come closer to modeling semantics across domains than any computer to date. Not perfectly but closer.

Query the Northwind Database as a Graph Using Gremlin

Wednesday, October 21st, 2015

Query the Northwind Database as a Graph Using Gremlin by Mark Kromer.

From the post:

One of the most popular and interesting topics in the world of NoSQL databases is graph. At DataStax, we have invested in graph computing through the acquisition of Aurelius, the company behind TitanDB, and are especially committed to ensuring the success of the Gremlin graph traversal language. Gremlin is part of the open source Apache TinkerPop graph framework project and is a graph traversal language used by many different graph databases.

I wanted to introduce you to a superb web site that our own Daniel Kuppitz maintains called “SQL2Gremlin” which I think is a great way to start learning how to query graph databases for those of us who come from the traditional relational database world. It is full of excellent sample SQL queries from the popular public domain RDBMS dataset Northwind and demonstrates how to produce the same results by using Gremlin. For me, learning by example has been a great way to get introduced to graph querying and I think that you’ll find it very useful as well.

I’m only going to walk through a couple of examples here as an intro to what you will find at the full site. But if you are new to graph databases and Gremlin, then I highly encourage you to visit the sql2gremlin site for the rest of the complete samples. There is also a nice example of an interactive visualization / filtering, search tool here that helps visualize the Northwind data set as it has been converted into a graph model.

I’ve worked with (and worked for) Microsoft SQL Server for a very long time. Since Daniel’s examples use T-SQL, we’ll stick with SQL Server for this blog post as an intro to Gremlin and we’ll use the Northwind samples for SQL Server 2014. You can download the entire Northwind sample database here. Load that database into your SQL Server if you wish to follow along.
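To give a flavor of the "same result, two models" idea without installing anything, here is a small Python sketch. It is not taken from SQL2Gremlin and does not use Gremlin or the real Northwind schema: the two tables, the `inCategory` edge label, and every row below are invented for illustration.

```python
import sqlite3

# Hypothetical toy rows, not real Northwind data.
categories = [(1, "Drinks"), (2, "Bakery")]
products = [(1, "Tea", 1), (2, "Bread", 2), (3, "Coffee", 1)]


def names_via_sql(category):
    """Relational view: join products to categories and filter by name."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute(
        "CREATE TABLE products "
        "(id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER)"
    )
    db.executemany("INSERT INTO categories VALUES (?, ?)", categories)
    db.executemany("INSERT INTO products VALUES (?, ?, ?)", products)
    rows = db.execute(
        "SELECT p.name FROM products p "
        "JOIN categories c ON p.category_id = c.id "
        "WHERE c.name = ? ORDER BY p.name",
        (category,),
    ).fetchall()
    db.close()
    return [name for (name,) in rows]


def names_via_graph(category):
    """Graph view: start at the category vertex, walk edges back to products."""
    category_vertices = {("category", cid): name for cid, name in categories}
    product_vertices = {("product", pid): name for pid, name, _ in products}
    edges = [
        (("product", pid), "inCategory", ("category", cid))
        for pid, _, cid in products
    ]
    starts = {v for v, name in category_vertices.items() if name == category}
    return sorted(
        product_vertices[src]
        for src, label, dst in edges
        if label == "inCategory" and dst in starts
    )


print(names_via_sql("Drinks"))
print(names_via_graph("Drinks"))
```

The join condition in the SQL version becomes an explicit edge in the graph version, which is the mental shift the SQL2Gremlin examples are meant to teach.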

When I first saw the title to this post,

Query the Northwind Database as a Graph Using Gremlin (emphasis added)

I thought this was something else. A database about the Northwind album.

Little did I suspect that the Northwind Database is a test database for SQL Server 2005 and SQL Server 2008. Yikes!

Still, I thought some of you might have access to such legacy software and so I am pointing you to this post. 😉


Support for SQL Server 2005 ends April 12, 2016 (that’s next April)

Support for SQL Server 2008 ended July 8, 2014. Ouch! You are more than a year into a dangerous place. Upgrade, migrate or get another job. Hard times are coming and blame will be assigned.

The Gremlin Graph Traversal Language (slides)

Wednesday, August 19th, 2015

The Gremlin Graph Traversal Language by Marko Rodriguez.

Forty-five (45) out of fifty (50) slides have working Gremlin code!

Ninety percent (90%) of the slides have code you can enter!

It isn’t as complete as The Gremlin Graph Traversal Machine and Language, but on the other hand, it is a hell of a lot easier to follow along.


The Gremlin Graph Traversal Machine and Language

Tuesday, August 18th, 2015

The Gremlin Graph Traversal Machine and Language by Marko A. Rodriguez.


Gremlin is a graph traversal machine and language designed, developed, and distributed by the Apache TinkerPop project. Gremlin, as a graph traversal machine, is composed of three interacting components: a graph $G$, a traversal $\Psi$, and a set of traversers $T$. The traversers move about the graph according to the instructions specified in the traversal, where the result of the computation is the ultimate locations of all halted traversers. A Gremlin machine can be executed over any supporting graph computing system such as an OLTP graph database and/or an OLAP graph processor. Gremlin, as a graph traversal language, is a functional language implemented in the user’s native programming language and is used to define the $\Psi$ of a Gremlin machine. This article provides a mathematical description of Gremlin and details its automaton and functional properties. These properties enable Gremlin to naturally support imperative and declarative querying, host language agnosticism, user-defined domain specific languages, an extensible compiler/optimizer, single- and multi-machine execution models, hybrid depth- and breadth-first evaluation, as well as the existence of a Universal Gremlin Machine and its respective entailments.
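The three components in the abstract can be caricatured in a few lines of Python. This is my own toy, assuming a graph stored as an adjacency dictionary and steps written as plain functions; the names `run`, `out`, and `is_not` are invented here and are not the paper's formalism or TinkerPop's API.

```python
def run(graph, traversal, traversers):
    """Apply each step of the traversal to every traverser in turn."""
    for step in traversal:
        # Each step maps one traverser location to zero or more locations.
        traversers = [nxt for t in traversers for nxt in step(graph, t)]
    return traversers  # the locations of the halted traversers


def out(graph, vertex):
    """Step: move along every outgoing edge of the current vertex."""
    return graph.get(vertex, [])


def is_not(excluded):
    """Build a filter step that drops traversers sitting on `excluded`."""
    def step(graph, vertex):
        return [] if vertex == excluded else [vertex]
    return step


G = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # the graph
Psi = [out, out]                                # the traversal
T = ["a"]                                       # the initial traversers

print(run(G, Psi, T))
print(run(G, [out, is_not("c")], T))
```

Even in this caricature the result of the computation is simply where the traversers end up, and a filter step is nothing more than a step that returns no locations.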

Why Marko wants to overload terms, like Gremlin, I don’t know. (Hi! Marko!) Despite that overloading, if you are fond of Gremlin (in any sense) and looking for a challenging read, you have found it.

This won’t be a quick read. 😉

Gremlin3 is a declarative query language. (Excellent!)

Static results are an edge case of the dynamic systems they purport to represent. (An insight suggested to me by Sam Hunting.)

Think about it for a moment. There are no non-dynamic systems, only non-dynamic representations of dynamic systems. Which makes non-dynamic representations false, albeit sometimes useful, but also an edge case.

Sorry, didn’t mean to get distracted.

Deeply recommend this “Gremlin”-overloaded paper and that you check out the recently released TinkerPop3!

Modified 25 November 2015 to point out Gremlin3 is a declarative query language. Thanks to Marko for the catch!

Graphs, Source Code Auditing, Vulnerabilities

Tuesday, August 18th, 2015

While I was skimming the Praetorian website, I ran across this blog entry: Why You Should Add Joern to Your Source Code Audit Toolkit by Kelby Ludwig.

From the post:

What is Joern?

Joern is a static analysis tool for C / C++ code. It builds a graph that models syntax. The graphs are built out using Joern’s fuzzy parser. The fuzzy parser allows for Joern to parse code that is not necessarily in a working state (i.e., does not have to compile). Joern builds this graph with multiple useful properties that allow users to define meaningful traversals. These traversals can be used to identify potentially vulnerable code with a low false-positive rate.

Joern is easy to set up and import code with. The graph traversals, which are written using a graph database query language called Gremlin, are simple to write and easy to understand.

Why use Joern?

Joern builds a Code Property Graph out of the imported source code. Code Property Graphs combine the properties of Abstract Syntax Trees, Control Flow Graphs, and Program Dependence Graphs. By leveraging various properties from each of these three source code representations, Code Property Graphs can model many different types of vulnerabilities. Code Property Graphs are explained in much greater detail in the whitepaper on the subject. Example queries can be found in a presentation on Joern’s capabilities. While the presentation does an excellent job of demonstrating the impact of running Joern on the source code for the Linux kernel (running two queries led to seven 0-days out of the 11 total results!), we will be running a slightly more general query on a simple code snippet. By following the query outlined in the presentation, we can write similar queries for other potentially dangerous methods.

Graphs, Gremlin, discovery of zero-day vulnerabilities: this is a post that pushes so many buttons!

Consider it to be a “lite” introduction to Joern, which I have mentioned before.

TinkerPop 3.0.0.M6 Released — A Gremlin Rāga in 7/16 Time

Tuesday, December 2nd, 2014

TinkerPop 3.0.0.M6 Released — A Gremlin Rāga in 7/16 Time by Marko A. Rodriguez.

From the post:

Dear ladies and gentlemen of the TinkerPop,

TinkerPop productions, in association with Gremlin Studios, presents a Gremlin-Users codebase, featuring TinkerPop-Contributors…TinkerPop 3.0.0.M6. Starring, Gremlin as himself.




Gremlin Console:
Gremlin Server:

If you want a better sense of graphs than “Everything is a graph!” type promotionals, see: How Whitepages turned the phone book into a graph using Titan and Cassandra. BTW, Whitepages offers an API for email verification.

Don’t be the last one to submit a bug for this milestone release!

At the same time, check out the Whitepages API.

Bye-bye Giraph-Gremlin, Hello Hadoop-Gremlin with GiraphGraphComputer Support

Tuesday, November 25th, 2014

Bye-bye Giraph-Gremlin, Hello Hadoop-Gremlin with GiraphGraphComputer Support by Marko A. Rodriguez.

There are days when I wonder if Marko ever sleeps or if the problem of human cloning has already been solved.

This is one of those days:

The other day Dan LaRocque and I were working on a Hadoop-based GraphComputer for Titan so we could do bulk loading into Titan. First we wrote the BulkLoading VertexProgram: bulkloader/
…and then realized, “huh, we can just execute this with GiraphGraph. Huh! We can just execute this with TinkerGraph!” In fact, as a side note, the BulkLoaderVertexProgram is general enough to work for any TinkerPop Graph.

So great, we can just use GiraphGraph (or any other TinkerPop implementation that has a GraphComputer (e.g. TinkerGraph)). However, Titan is all about scale and when the size of your graph is larger than the total RAM in your cluster, we will still need a MapReduce-based GraphComputer. Thinking over this, it was realized: Giraph-Gremlin is very little Giraph and mostly just Hadoop — InputFormats, HDFS interactions, MapReduce wrappers, Configuration manipulations, etc. Why not make GiraphGraphComputer just a particular GraphComputer supported by Gremlin-Hadoop (a new package).

With that, Giraph-Gremlin no longer exists. Hadoop-Gremlin now exists. Hadoop-Gremlin behaves the exact same way as Giraph-Gremlin, save that we will be adding a MapReduceGraphComputer to Hadoop-Gremlin. In this way, Hadoop-Gremlin will support two GraphComputer: GiraphGraphComputer and MapReduceGraphComputer.

The master/ branch is updated and the docs for Giraph have been re-written, though I suspect there will be some dangling references in the docs here and there for a while.

Up next, Matthias and I will create MapReduceGraphComputer that is smart about “partitioned vertices” — so you don’t get the Faunus scene where if a vertex doesn’t fit in memory, an exception. This will allow vertices with as many edges as you want (though your data model is probably shoddy if you have 100s of millions of edges on one vertex 😉 ) … Matthias will be driving that effort and I’m excited to learn about the theory of vertex partitioning (i.e. splitting a single vertex across machines).


TinkerPop 3.0.0.M4 Released (A Gremlin Rāga in 7/16 Time)

Tuesday, October 21st, 2014

TinkerPop 3.0.0.M4 Released (A Gremlin Rāga in 7/16 Time) by Marko Rodriguez.

From the post:

TinkerPop is happy to announce the release of TinkerPop 3.0.0.M4.



User Documentation:
Core JavaDoc: [user javadocs]
Full JavaDoc: [vendor javadocs]


Gremlin Console:
Gremlin Server:

There were lots of updates in this release — with a lot of valuable feedback provided by Titan (Matthias), Titan-Hadoop (Dan), FoundationDB (Mike), PostgreSQL-Gremlin (Pieter), and Gremlin-Scala (Mike).

We are very close to a GA. We think that either there will be a “minor M5” or the next release will be GA. Why the delay? We are currently working closely with the Titan team to see if there are any problems in our interfaces/test-suites/etc. The benefit of working with the Titan team is that they are doing both OLTP and OLAP so are covering the full gamut of the TinkerPop3 API. Of course, we have had lots of experience with these APIs for both Neo4j (OLTP) and Giraph (OLAP), but to see it stand up to yet another vendor’s requirements will be a confidence boost for GA. If you are a vendor, please feel free to join the conversation as your input is crucial to making sure GA meets everyone’s needs.

A few important notes for users:
1. The GremlinKryo serialization format is not guaranteed to be stable from MX to MY. By GA it will be locked.
2. Neo4j-Gremlin’s disk representation is not guaranteed to be stable from MX to MY. By GA it will be locked.
3. Giraph-Gremlin’s Hadoop Writable specification is not guaranteed to be stable from MX to MY. By GA it will be locked.
4. VertexProgram, Memory, Step, SideEffects, etc. hidden and system labels may change between MX and MY. By GA they will be locked.
5. Package and class names might change from MX to MY. By GA they will be locked.

Thank you everyone. Please play and provide feedback. This is the time to get your ideas into TinkerPop3 as once it goes GA, sweeping changes are going to be more difficult.

TinkerPop 3.0.0.M3 Released (A Gremlin Rāga in 7/16 Time)

Monday, October 6th, 2014

TinkerPop 3.0.0.M3 Released (A Gremlin Rāga in 7/16 Time) by Marko Rodriguez.

From the post:

TinkerPop 3.0.0.M3 has been released. This release has numerous core bug-fixes/optimizations/features. We were anxious to release M3 due to some changes in the Process API. These changes should not affect the user, only vendors providing a Gremlin language variant (e.g. Gremlin-Scala, Gremlin-JavaScript, etc.). From what I hear, it “just worked” for Gremlin-Scala so that is good. Here are links to the release:

– Gremlin-Console:
– Gremlin-Server:

Are you going to accept Marko’s anecdotal assurance that it “just worked” for Gremlin-Scala, or will you put this release to the test? 😉

I am sure Marko and others would like to know!

TinkerPop3 3.0.0.M1

Tuesday, August 12th, 2014

TinkerPop3 3.0.0.M1 Released — A Gremlin Raga in 7/16 Time by Marko A. Rodriguez.

From the post:

TinkerPop3 3.0.0.M1 “A Gremlin Rāga in 7/16 Time” is now released and ready for use. (downloads and docs) (changelog)

IMPORTANT: TinkerPop3 requires Java8.

We would like both developers and vendors to play with this release and provide feedback as we move forward towards M2, …, then GA.

  1. Is the API how you like it?
  2. Is it easy to implement the interfaces for your graph engine?
  3. Is the documentation clear?
  4. Are there VertexProgram algorithms that you would like to have?
  5. Are there Gremlin steps that you would like to have?
  6. etc…

For the above, as well as for bugs, the issue tracker is open and ready for submissions:

TinkerPop3 is the culmination of a huge effort from numerous individuals. You can see the developers and vendors that have provided their support through the years.
(the documentation may take time to load due to all the graphics in the single HTML)

If you haven’t looked at the TinkerPop3 docs in a while, take a quick look. Tweets on several sections have recently pointed out very nice documentation.

Gremlin and Visualization with Gephi [Death of Import/Export?]

Wednesday, June 25th, 2014

Gremlin and Visualization with Gephi by Stephen Mallette.

From the post:

We are often asked how to go about graph visualization in TinkerPop. We typically refer folks to Gephi or Cytoscape as the standard desktop data visualization tools. The process of using those tools involves: getting your graph instance, saving it to GraphML (or the like), then importing it to those tools.

TinkerPop3 now does two things to help make that process easier:

  1. A while back we introduced the “subgraph” step which allows you to pop-off a Graph instance from a Traversal, which helps greatly simplify the typical graph visualization process with Gremlin, where you are trying to get a much smaller piece of your large graph to focus the visualization effort.
  2. Today we introduce a new :remote command in the Console. Recall that :remote is used to configure a different context where Gremlin will be evaluated (e.g. Gremlin Server). For visualization, that remote is called “gephi” and it configures the :submit command to take any Graph instance and push it through to the Gephi Streaming API. No more having to import/export files!
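The idea behind the first item, carving a small graph out of a big one before handing it to a visualizer, can be sketched in plain Python. This is only an illustration of the concept: `edge_subgraph` is a made-up helper, the edges are invented, and it has nothing to do with the actual “subgraph” step or the Gephi Streaming API.

```python
def edge_subgraph(edges, keep):
    """Return the vertices and edges induced by the edges that `keep` accepts."""
    kept = [edge for edge in edges if keep(edge)]
    vertices = sorted({v for src, _, dst in kept for v in (src, dst)})
    return vertices, kept


# Hypothetical (source, label, target) triples.
edges = [
    ("alice", "knows", "bob"),
    ("alice", "created", "widget"),
    ("bob", "knows", "carol"),
    ("carol", "created", "gadget"),
]

vertices, knows_edges = edge_subgraph(edges, lambda edge: edge[1] == "knows")
print(vertices)
print(knows_edges)
```

Whatever tool does the drawing, it then only has to cope with the slice you care about rather than the whole graph.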

This rocks!

How do you imagine processing your data when import/export goes away?

Of course, this doesn’t have anything on *nix pipes but it is nice to see good ideas come back around.

The Rise of Gremlitron

Tuesday, June 3rd, 2014

TinkerPop3 RFClease — The Rise of Gremlitron by Marko A. Rodriguez.

From the post:

TinkerPop3’s SNAPSHOT release is now ready for review, comments, and brave souls wishing to do implementations.

There are lots of new things about TinkerPop3 and I would like to take the time to review some of the best parts here:

1. Blueprints, Frames, Pipes, Furnace, and Rexster are no longer terms…
– Blueprints => Gremlin Structure
– Blueprints/Pipes => Gremlin Process
– Frames => Gremlin DSLs
– Furnace => Gremlin OLAP (GraphComputer)
– Rexster => Gremlin Server



Marko has always had a way with images!

In order to appreciate all the changes in this release of Gremlin, you will need to take the test drive. Reading the short descriptions or kicking the wheels is no substitute for trying it against your existing or anticipated graphs.

I would call out the obvious topic map issue, that of changing the traditional names to “Gremlin + (some string).”

I rather doubt anyone is going to hunt down existing email, documentation, notes, presentations, etc. and clean up all the references to Blueprints, Frames, Pipes, Furnace and Rexster. How important is that? Hard to say right now but it is the sort of issue that topic maps were designed to solve.

Could be important in terms of researching prior art, assuming that U.S. patent law continues to deteriorate. I’m thinking about patenting numerical order. Oops! Should not have said that! 😉
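Short of a full topic map, even a lookup table helps when searching old mail and notes. The sketch below records only the renames listed in Marko's post; the `RENAMES` layout and the `search_terms` helper are my own invention, and I have assumed that “Blueprints/Pipes” means both old names point at Gremlin Process.

```python
# Old TinkerPop2 names mapped to the TinkerPop3 names listed in the post.
RENAMES = {
    "Blueprints": ["Gremlin Structure", "Gremlin Process"],
    "Pipes": ["Gremlin Process"],
    "Frames": ["Gremlin DSLs"],
    "Furnace": ["Gremlin OLAP (GraphComputer)"],
    "Rexster": ["Gremlin Server"],
}


def search_terms(term):
    """Return the term plus every old or new name recorded for the same subject."""
    terms = {term}
    for old, new_names in RENAMES.items():
        if term == old:
            terms.update(new_names)
        elif term in new_names:
            terms.add(old)
    return sorted(terms)


print(search_terms("Rexster"))
print(search_terms("Gremlin Process"))
```

A search for the new name then also finds the documents written under the old one, which is the modest version of what a topic map would do for you.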


Powers of Ten – Part II

Monday, June 2nd, 2014

Powers of Ten – Part II by Stephen Mallette.

From the post:

“‘Curiouser and curiouser!’ cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good English); ‘now I’m opening out like the largest telescope that ever was!”
    — Lewis Carroll, Alice’s Adventures in Wonderland

It is sometimes surprising to see just how much data is available. Much like Alice and her sudden increase in height, in Lewis Carroll’s famous story, the upward growth of data can happen quite quickly and the opportunity to produce a multi-billion edge graph becomes immediately present. Luckily, Titan is capable of scaling to accommodate such size and with the right strategies for loading this data, the development efforts can more rapidly shift to the rewards of massive scale graph analytics.

This article represents the second installment in the two part Powers of Ten series that discusses bulk loading data into Titan at varying scales. For purposes of this series, the “scale” is determined by the number of edges to be loaded. As it so happens, the strategies for bulk loading tend to change as the scale increases over powers of ten, which creates a memorable way to categorize different strategies. “Part I” of this series, looked at strategies for loading millions and tens of millions of edges and focused on usage of Gremlin to do so. This part of the series will focus on hundreds of millions and billions of edges and will focus on the usage of Faunus as the loading tool.

Note: By Titan 0.5.0, Faunus will be pulled into the Titan project under the name Titan/Hadoop.

Scaling graph processing to hundreds of millions and billions of edges.

Deeply interesting work but I am left with multiple questions:

  • Hundreds of millions and billions of edges, to load. Any other graph metrics? Traversal for example?
  • Does loading performance scale with more servers? Instead of m2.4xlarge EC2 instances, what is the performance with 8x?
  • What kind of knob tuning was useful with a social network dataset?

I am sure there are other questions but those are the first ones that came to mind.

Powers of Ten – Part I

Saturday, May 31st, 2014

Powers of Ten – Part I by Stephen Mallette.

From the post:

“No, no! The adventures first,’ said the Gryphon in an impatient tone: ‘explanations take such a dreadful time.”
    — Lewis Carroll, Alice’s Adventures in Wonderland

It is often quite simple to envision the benefits of using Titan. Developing complex graph analytics over a multi-billion edge distributed graph represent the adventures that await. Like the Gryphon from Lewis Carroll’s tale, the desire to immediately dive into the adventures can be quite strong. Unfortunately and quite obviously, the benefits of Titan cannot be realized until there is some data present within it. Consider the explanations that follow; they are the strategies by which data is bulk loaded to Titan enabling the adventures to ensue.

There are a number of different variables that might influence the approach to loading data into a graph, but the attribute that provides the best guidance in making a decision is size. For purposes of this article, “size” refers to the estimated number of edges to be loaded into the graph. The strategy used for loading data tends to change in powers of ten, where the strategy for loading 1 million edges is different than the approach for 10 million edges.

Given this neat and memorable way to categorize batch loading strategies, this two-part article outlines each strategy starting with the smallest at 1 million edges or less and continuing in powers of ten up to 1 billion and more. This first part will focus on 1 million and 10 million edges, which generally involves common Gremlin operations. The second part will focus on 100 million and 1 billion edges, which generally involves the use of Faunus.
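The powers-of-ten rule of thumb is simple enough to write down. The sketch below is mine, not Stephen's: the series only says that 1 million and 10 million edges generally call for Gremlin and that 100 million and 1 billion generally call for Faunus, so treating everything below 100 million as Gremlin territory is an assumption, as are the function names.

```python
def order_of_magnitude(edge_count):
    """Return the power of ten of an edge count, e.g. 6 for 1,000,000."""
    if edge_count < 1:
        raise ValueError("edge_count must be a positive integer")
    return len(str(edge_count)) - 1


def loading_tool(edge_count):
    """Pick a bulk loading tool by scale.

    Assumption: the cutoff sits at 100 million edges (10 ** 8); the articles
    name the tools per scale but do not state an exact boundary.
    """
    return "Gremlin" if order_of_magnitude(edge_count) <= 7 else "Faunus"


for edges in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    print(edges, loading_tool(edges))
```

Your own cutoff will depend on hardware and data shape, so treat the boundary as a starting point rather than a rule.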

Great guidance on loading relatively small data sets using Gremlin. Looking forward to seeing the harder tests with 100 million and 1 billion edge sets.

Bigdata and Blueprints

Tuesday, May 27th, 2014

Bigdata and Blueprints

From the webpage:

Blueprints is an open-source property graph model interface useful for writing applications on top of a graph database. Gremlin is a domain specific language for traversing property graphs that comes with an excellent REPL useful for interacting with a Blueprints database. Rexster exposes a Blueprints database as a web service and comes with a web-based workbench application called DogHouse.

To get started with bigdata via Blueprints, Gremlin, and Rexster, start by getting your bigdata server running per the instructions here.

Then, go and download some sample GraphML data. The Tinkerpop Property Graph is a good starting point.

Just in case you aren’t familiar with bigdata(R):

bigdata(R) is a scale-out storage and computing fabric supporting optional transactions, very high concurrency, and very high aggregate IO rates. The bigdata RDF/graph database can load 1B edges in under one hour on a 15 node cluster. Bigdata operates in both a single machine mode (Journal), highly available replication cluster mode (HAJournalServer), and a horizontally sharded cluster mode (BigdataFederation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion edges. The HAJournalServer adds replication, online backup, horizontal scaling of query, and high availability. The federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding and shard-wise ACID updates and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

So, this is a major event for Blueprints.

I first saw this in a tweet by Marko A. Rodriguez.

Plato, Shiva and A Social Graph

Monday, April 21st, 2014

The Social Graph of the Los Alamos National Laboratory by Marko A. Rodriguez.

From the post:

The web is composed of numerous web sites tailored to meet the information, consumption, and social needs of its users. Within many of these sites, references are made to the same platonic “thing” though different facets of the thing are expressed. For example, in the movie industry, there is a movie called John Carter by Disney. While the movie is an abstract concept, it has numerous identities on the web (which are technically referenced by a URI).

Aurelius collaborated with the Digital Library Research and Prototyping Group of the Los Alamos National Laboratory (LANL) to develop EgoSystem atop the distributed graph database Titan. The purpose of this system is best described by the introductory paragraph of the April 2014 publication on EgoSystem.

I heartily commend Marko’s post and the EgoSystem publication for your reading, despite my cautions concerning some of the theoretical aspects of the project.

Statements like:

references are made to the same platonic “thing” though different facets of the thing are expressed.

have always troubled me. In part because it involves a claim, usually by the speaker, to have freed themselves from Plato’s cave such that they and they alone can see things aright. Which consigns the rest of us to be the pitiful lot still confined to the cave.

Which of course leads to Marko’s:

There are two categories of vertices in EgoSystem.

  1. Platonic: Denotes an abstract concept devoid of interpretation.
  2. Identity: Denotes a particular interpretation of a platonic.

Every platonic vertex is of a particular type: a person, institution, artifact, or concept. Next, every platonic has one or more identities as referenced by a URL on the web. The platonic types and the location of their web identities are itemized below. As of EgoSystem 1.0, these are the only sources from which data is aggregated, though extending it to support more services (e.g. Facebook, Quorum, etc.) is feasible given the system’s modular architecture.

A structure where English labels, remarkably enough, are placed on “Platonic” vertices. Not that we would attribute any identity or semantics to a “Platonic” vertex. 😉

Rather than “Platonic” vertices, they are better described as boundary vertices. That is they circumscribe what can be represented in a particular graph, without making claims on a “higher” reality.

I say that not to be pedantic but to illustrate how a “Platonic” vertex prevents us from meaningful merger with graphs with differing “Platonic” vertices.

No doubt Shiva’s[1] other residence, Arzamas-16, could benefit from a similar “alumni” graph but I rather doubt it is going to use English labels for its “Platonic” vertices which:

Denote[…] an abstract concept devoid of interpretation.

If I have no “interpretation,” which I take to mean no properties (key/value pairs), how will I combine social graphs from Los Alamos and Arzamas-16?

I could cheat and secretly look up properties for the alleged “Platonic” nodes and combine them together but then how would you check my work? The end result would be opaque to anyone other than myself.

That isn’t a criticism of using the EgoSystem. I am sure it meets the needs of Los Alamos quite nicely.

However, it can prevent us from capturing the information necessary to expand the boundary of our graph at some future date or merging it with other graphs.
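To make the merging concern concrete, here is a minimal Python sketch, assuming each vertex is just a local label plus a dictionary of key/value properties. The function name, the labels, and the `homepage` key are all hypothetical and have nothing to do with EgoSystem's or Titan's actual schema. It only illustrates that two graphs with different labels can be matched when properties are recorded, and cannot be when the vertices carry none.

```python
def merge_by_properties(graph_a, graph_b, keys):
    """Pair vertices from two graphs that agree on at least one of `keys`.

    Each graph is a dict: local label -> dict of key/value properties.
    Returns a list of (label_a, label_b, matching_keys) tuples.
    """
    matches = []
    for label_a, props_a in graph_a.items():
        for label_b, props_b in graph_b.items():
            shared = sorted(
                k for k in keys
                if k in props_a and k in props_b and props_a[k] == props_b[k]
            )
            if shared:
                matches.append((label_a, label_b, shared))
    return matches


# Hypothetical data: the labels differ, the properties carry the identity.
los_alamos = {
    "physicist-17": {"homepage": "http://example.org/people/17"},
}
arzamas = {
    "fizik-3": {"homepage": "http://example.org/people/17"},
    "fizik-4": {"homepage": "http://example.org/people/99"},
}

# With properties, the match is explicit and can be checked by anyone.
with_properties = merge_by_properties(los_alamos, arzamas, ["homepage"])

# Vertices "devoid of interpretation" (no properties) give nothing to merge on.
without_properties = merge_by_properties({"X": {}}, {"Y": {}}, ["homepage"])
```

The point of returning the matching keys is that the basis for each merge is visible in the result, rather than living in the head of whoever did the lookup.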

From a philosophical standpoint, we should not claim access to Platonic ideals when we are actually recording our views of shadows on the cave wall. Of which, intersections between graphs/shadows are just a subset.

1. Those of you old enough to remember Robert Oppenheimer will recognize the reference.

…Graph Analytics

Wednesday, December 11th, 2013

Big Data in Security – Part III: Graph Analytics by Levi Gundert.

In interview form with Michael Howe and Preetham Raghunanda.

You will find two parts of the exchange particularly interesting:

You mention very large technology companies, obviously Cisco falls into this category as well — how is TRAC using graph analytics to improve Cisco Security products?

Michael: How we currently use graph analytics is an extension of the work we have been doing for some time. We have been pulling data from different sources like telemetry and third-party feeds in order to look at the relationships between them, which previously required a lot of manual work. We would do analysis on one source and analysis on another one and then pull them together. Now because of the benefits of graph technology we can shift that work to a common view of the data and give people the ability to quickly access all the data types with minimal overhead using one tool. Rather than having to query multiple databases or different types of data stores, we have a polyglot store that pulls data in from multiple types of databases to give us a unified view. This allows us two avenues of investigation: one, security investigators now have the ability to rapidly analyze data as it arrives in an ad hoc way (typically used by security response teams) and the response times dramatically drop as they can easily view related information in the correlations. Second are the large-scale data analytics. Folks with traditional machine learning backgrounds can apply algorithms that did not work on previous data stores and now they can apply those algorithms across a well-defined data type – the graph.

For intelligence analysts, being able to pivot quickly across multiple disparate data sets from a visual perspective is crucial to accelerating the process of attribution.

Michael: Absolutely. Graph analytics is enabling a much more agile approach from our research and analysis teams. Previously when something of interest was identified there was an iterative process of query, analyze the results, refine the query, wash, rinse, and repeat. This process moves from taking days or hours down to minutes or seconds. We can quickly identify the known information, but more importantly, we can identify what we don’t know. We have a comprehensive view that enables us to identify data gaps to improve future use cases.

Did you catch the “…to a common view of the data…” caveat in the third sentence of Michael’s first reply?

Not to deny the usefulness of Titan (the graph solution being discussed) but to point out that current graphs require normalization of data.

For Cisco, that is a winning solution.

But then Cisco can use a closed solution based on normalized data.

Importing, analyzing and then returning results to heterogeneous clients could require a different approach.

Or if you have legacy data that spans centuries.

Or even agencies, departments, or work groups.

Boutique Graph Data with Titan

Wednesday, November 27th, 2013

Boutique Graph Data with Titan by Marko A. Rodriguez.

From the post:

Titan is a distributed graph database capable of supporting graphs on the order of 100 billion edges and sustaining on the order of 1 billion transactions a day (see Educating the Planet with Pearson). Software architectures that leverage such Big Graph Data typically have 100s of application servers traversing a distributed graph represented across a multi-machine cluster. These architectures are not common in that perhaps only 1% of applications written today require that level of software/machine power to function. The other 99% of applications may only require a single machine to store and query their data (with a few extra nodes for high availability). Such boutique graph applications, which typically maintain on the order of 100 million edges, are more elegantly served by Titan 0.4.1+. In Titan 0.4.1, the in-memory caches have been advanced to support faster traversals which makes Titan’s single-machine performance comparable to other single machine-oriented graph databases. Moreover, as the application scales beyond the confines of a single machine, simply adding more nodes to the Titan cluster allows boutique graph applications to seamlessly grow to become Big Graph Data applications (see Single Server to Highly Available Cluster).

A short walk on the technical side of Titan.

I would replace “boutique” with “big data” and say Titan allows customers to seamlessly transition from “big data” to “bigger data.”

Having “big data” is like having a large budget under your control.

What matters to the user is the status of claiming to possess it.

Let’s not disillusion them. 😉

Using AWS to Build a Graph-based…

Friday, November 22nd, 2013

Using AWS to Build a Graph-based Product Recommendation System by Andre Fatala and Renato Pedigoni.

From the description:

Magazine Luiza, one of the largest retail chains in Brazil, developed an in-house product recommendation system, built on top of a large knowledge Graph. AWS resources like Amazon EC2, Amazon SQS, Amazon ElastiCache and others made it possible for them to scale from a very small dataset to a huge Cassandra cluster. By improving their big data processing algorithms on their in-house solution built on AWS, they improved their conversion rates on revenue by more than 25 percent compared to market solutions they had used in the past.

Not a lot of technical details but a good success story to repeat if you are pushing graph-based services.

I first saw this in a tweet by Marko A. Rodriguez.

3 Myths about graph query languages…

Tuesday, October 15th, 2013

3 Myths about graph query languages. Busted by Pixy. by Sridhar Ramachandran.

A very short slide deck that leaves you wanting more information.

It got my attention because I didn’t know there were any myths about graph query languages. 😉

I think the sense of “myth” here is more “misunderstanding” or simply “incorrect information.”

Some references that may be helpful while reading these slides:

I must confess that “Myth #3: GQLs [Graph Query Languages] can’t be relational” has always puzzled me.

In part because hypergraphs have been used to model databases for quite some time.

For example:

Making use of arguments from information theory it is shown that a boolean function can represent multivalued dependencies. A method is described by which a hypergraph can be constructed to represent dependencies in a relation. A new normal form called generalized Boyce-Codd normal form is defined. An explicit formula is derived for representing dependencies that would remain in a projection of a relation. A definition of join is given which makes the derivation of many theoretical results easy. Another definition given is that of information in a relation. The information gets conserved whenever lossless decompositions are involved. It is shown that the use of null elements is important in handling data.

Would you believe: Some analytic tools for the design of relational database systems by K. K. Nambiar in 1980?

So far as I know, hypergraphs are a form of graph so it isn’t true that “graphs can only express binary relations/predicates.”

One difference (there are others) is that a hypergraph database doesn’t require derivation of relationships because those relationships are already captured by a hyperedge.

Moreover, a vertex can (whether it “may” or not in a particular hypergraph is another issue) be a member of more than one hyperedge.

Determination of common members becomes a straightforward query as opposed to two or more derivations of associations and then calculation of any intersection.
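A minimal Python sketch of that last point, assuming a hypergraph stored as named hyperedges that each hold a set of member vertices. The edge names, vertex names, and helper functions are hypothetical and are not taken from Pixy or any hypergraph database.

```python
# Each hyperedge records one n-ary relationship as the set of its members.
hyperedges = {
    "authors_of_paper_1": {"alice", "bob", "carol"},
    "members_of_lab_x": {"bob", "carol", "dave"},
    "attendees_of_workshop": {"carol", "erin"},
}


def common_members(edges, *names):
    """Return the vertices that belong to every named hyperedge."""
    return set.intersection(*(edges[name] for name in names))


def hyperedges_of(edges, vertex):
    """Return the names of all hyperedges a vertex is a member of."""
    return sorted(name for name, members in edges.items() if vertex in members)


in_both = common_members(hyperedges, "authors_of_paper_1", "members_of_lab_x")
in_all = common_members(hyperedges, *hyperedges)
carol_edges = hyperedges_of(hyperedges, "carol")
```

Because each relationship is already captured as a set, finding common members is a single set intersection and no association has to be derived first.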

For all of that, the slide deck remains important as notice of a new declarative graph query language (GQL).

TinkerPop 2.4.0 Released (Gremlin Without a Cause)

Friday, August 9th, 2013

TinkerPop 2.4.0 Released (Gremlin Without a Cause) by Marko A. Rodriguez.

From the post:

TinkerPop 2.4.0 has been released under the name “Gremlin without a Cause” (see attached logo). The last release was back in March of 2013, so there are lots of new features/bugfixes/optimizations in the latest 2.4.0 release. Here is the best-of-the-best of each project along with the full release notes.

NOTE: 2.4.0 jars have been deployed to Apache Central Repo and ready for inclusion.

Another offering for your summer holiday enjoyment!

TinkerPop 2.3.0 has been unleashed

Thursday, March 21st, 2013

TinkerPop 2.3.0 has been unleashed by Marko A. Rodriguez.

Release notes for:

Distributed Graph Computing with Gremlin

Thursday, March 7th, 2013

Distributed Graph Computing with Gremlin by Marko A. Rodriguez.

From the post:

The script-step in Faunus’ Gremlin allows for the arbitrary execution of a Gremlin script against all vertices in the Faunus graph. This simple idea has interesting ramifications for Gremlin-based distributed graph computing. For instance, it is possible to evaluate a Gremlin script on every vertex in the source graph (e.g. Titan) in parallel while maintaining data/process locality. This section will discuss the following two use cases.

  • Global graph mutations: parallel update vertices/edges in a Titan cluster given some arbitrary computation.
  • Global graph algorithms: propagate information to arbitrary depths in a Titan cluster in order to compute some algorithm in a parallel fashion.
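The two use cases can be sketched in plain, single-process Python. This is not the Faunus or Gremlin API; the graph layout and the function names (`map_vertices`, `tag_out_degree`, `propagate`) are assumptions made only to show the shape of "run a script on every vertex" and "propagate information to a given depth."

```python
# Hypothetical in-memory graph: vertex id -> record with outgoing neighbors.
graph = {
    "a": {"out": ["b"]},
    "b": {"out": ["c"]},
    "c": {"out": []},
    "d": {"out": ["a"]},
}


def map_vertices(graph, script):
    """Run `script` on a copy of each vertex independently of all others.

    Since no vertex reads another's state, the calls could run in parallel.
    """
    return {vid: script(vid, dict(vertex)) for vid, vertex in graph.items()}


def tag_out_degree(vid, vertex):
    """Example 'script': a global mutation that adds a property per vertex."""
    vertex["out_degree"] = len(vertex["out"])
    return vertex


def propagate(graph, seed, depth):
    """Spread a marker from `seed` along out-edges for `depth` rounds."""
    reached = {seed}
    frontier = {seed}
    for _ in range(depth):
        # Each frontier vertex passes the marker to its out-neighbors.
        frontier = {n for v in frontier for n in graph[v]["out"]} - reached
        reached |= frontier
    return reached


mutated = map_vertices(graph, tag_out_degree)
within_two_hops = propagate(graph, "a", 2)
```

In a distributed setting each round of `propagate` would be a pass over all vertices. Here it is an ordinary loop, which is enough to show why the depth of propagation is a parameter rather than a fixed property of the query.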

Another must read post from Marko A. Rodriguez!

Also a reminder that I need to pull out my Oxford Classical Dictionary to add some material to the mythology graph.

Titan-Android

Wednesday, December 26th, 2012

Titan-Android by David Wu.

From the webpage:

Titan-Android is a port/fork of Titan for the Android platform. It is meant to be a light-weight implementation of a graph database on mobile devices. The port removes HBase and Cassandra support as their usage make little sense on a mobile device (convince me otherwise!). Gremlin is only supported via the Java interface as I have not been able to port groovy successfully. Nevertheless, Titan-Android supports local storage backend via BerkeleyDB and supports the Tinkerpop stack natively.

Just in case there was an Android under the tree!

I first saw this in a tweet by Marko A. Rodriguez.

Big Graph Data on Hortonworks Data Platform

Thursday, December 13th, 2012

Big Graph Data on Hortonworks Data Platform by Marko Rodriguez.

The Hortonworks Data Platform (HDP) conveniently integrates numerous Big Data tools in the Hadoop ecosystem. As such, it provides cluster-oriented storage, processing, monitoring, and data integration services. HDP simplifies the deployment and management of a production Hadoop-based system.

In Hadoop, data is represented as key/value pairs. In HBase, data is represented as a collection of wide rows. These atomic structures makes global data processing (via MapReduce) and row-specific reading/writing (via HBase) simple. However, writing queries is nontrivial if the data has a complex, interconnected structure that needs to be analyzed (see Hadoop joins and HBase joins). Without an appropriate abstraction layer, processing highly structured data is cumbersome. Indeed, choosing the right data representation and associated tools opens up otherwise unimaginable possibilities. One such data representation that naturally captures complex relationships is a graph (or network). This post presents Aurelius‘ Big Graph Data technology suite in concert with Hortonworks Data Platform. Moreover, for a real-world grounding, a GitHub clone is described in this context to help the reader understand how to use these technologies for building scalable, distributed, graph-based systems.
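A small Python sketch of the contrast drawn in that passage, assuming toy "follows" and "watches" relations in the spirit of the GitHub clone mentioned. The data, names, and functions are hypothetical and do not use Hadoop, HBase, or the Aurelius stack. The same question, "which repositories are watched by the people a user follows?", is answered once by joining key/value rows and once by walking adjacency lists.

```python
# Hypothetical key/value rows, one relation per "table".
follows = [("alice", "bob"), ("alice", "carol")]
watches = [("bob", "repo-1"), ("carol", "repo-2"), ("dave", "repo-3")]


def repos_via_join(user):
    """Answer the question by joining the two relations on the shared key."""
    followed = {dst for src, dst in follows if src == user}
    return sorted({repo for who, repo in watches if who in followed})


# The same data as a graph: vertex -> edge label -> list of neighbors.
graph = {}
for src, dst in follows:
    graph.setdefault(src, {}).setdefault("follows", []).append(dst)
for src, dst in watches:
    graph.setdefault(src, {}).setdefault("watches", []).append(dst)


def repos_via_traversal(user):
    """Answer the question as a two-hop walk: follows, then watches."""
    return sorted({
        repo
        for friend in graph.get(user, {}).get("follows", [])
        for repo in graph.get(friend, {}).get("watches", [])
    })


joined = repos_via_join("alice")
walked = repos_via_traversal("alice")
```

Both give the same answer on this toy data. The difference is that each extra hop in the question adds another join in the first form and only another step along existing edges in the second.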

If you like graphs at all or have been looking at graph solutions, you are going to like this post.