Archive for the ‘Graphs’ Category

Loading the Galaxy Network of the “Cosmic Web” into Neo4j

Saturday, April 23rd, 2016

Loading the Galaxy Network of the “Cosmic Web” into Neo4j by Michael Hunger.

Cypher script for loading “Cosmic Web” into Neo4j.
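Since the Cypher itself lives behind the link, here is a rough Python sketch of the same kind of bulk load: nodes from one CSV, links from another, into an adjacency map. The column names (`id`, `name`, `source`, `target`) are my guesses, not those of the actual “Cosmic Web” files; Michael's script does the equivalent with LOAD CSV and CREATE.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for the kind of bulk load the Cypher script performs.
# Column names and galaxy names here are hypothetical.
NODES_CSV = """id,name
1,NGC-1
2,NGC-2
3,NGC-3
"""

LINKS_CSV = """source,target
1,2
2,3
"""

def load_graph(nodes_csv, links_csv):
    # id -> name lookup for the galaxies
    nodes = {row["id"]: row["name"] for row in csv.DictReader(io.StringIO(nodes_csv))}
    # adjacency map: source id -> list of target ids
    adjacency = defaultdict(list)
    for row in csv.DictReader(io.StringIO(links_csv)):
        adjacency[row["source"]].append(row["target"])
    return nodes, dict(adjacency)

nodes, adjacency = load_graph(NODES_CSV, LINKS_CSV)
```

In Neo4j the links step would be a `MATCH` on the two ids plus a `CREATE` of the relationship, but the shape of the work is the same.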

You remember “Cosmic Web”:



Nine Inch Gremlins

Saturday, April 23rd, 2016

Nine Inch Gremlins


Stephen Mallette writes:

On the back of TinkerPop 3.1.2-incubating comes TinkerPop 3.2.0-incubating. Yes – a dual release – an unprecedented and daring move you’ve come to expect and not expect from the TinkerPop clan! Be sure to review the upgrade documentation in full as you may find some changes that introduce some incompatibilities.

The release artifacts can be found at this location:

The online docs can be found here: (user docs) (upgrade docs) (core javadoc) (full javadoc)

The release notes are available here:

The Central Maven repo has sync’d as well:

Another impressive release!

In reading the documentation I discovered that Ketrina Yim is responsible for drawing Gremlin and his TinkerPop friends.

I was relieved to find that Marko was only responsible for the Gremlin/TinkerPop code/prose and not the graphics as well. That would be too much talent for any one person! 😉


Planet TinkerPop [+ 2 New Graph Journals]

Tuesday, April 12th, 2016

Planet TinkerPop

From the webpage:

Planet TinkerPop is a vendor-agnostic, community-driven site aimed at advancing graph technology in general and Apache TinkerPop™ in particular. Graph technology is used to manage, query, and analyze complex information topologies composed of numerous heterogenous relationships and is currently benefiting companies such as Amazon, Google, and Facebook. For all companies to ultimately adopt graph technology, vendor-agnostic graph standards and graph knowledge must be promulgated. For the former, TinkerPop serves as an Apache Software Foundation governed community that develops a standard graph data model (the property graph) and query language (Gremlin). Apache TinkerPop is a widely supported graph computing framework that has been adopted by leading graph system vendors and interfaced with by numerous graph-based applications across various industries. For educating the public on graphs, Planet TinkerPop’s Technology journal publishes articles about TinkerPop-related graph research and development. The Use Cases journal promotes articles on the industrial use of graphs and TinkerPop. The articles are contributed by members of the Apache TinkerPop community and additional contributions are welcomed and strongly encouraged. We hope you enjoy your time learning about graphs here at Planet TinkerPop.

If you are reading about Planet TinkerPop I can skip the usual “graphs are…” introductory comments. 😉

Planet TinkerPop is a welcome addition to the online resources on graphs in general and TinkerPop in particular.

So they aren’t buried in the prose, let me highlight two new journals at Planet TinkerPop:

TinkerPop Technology journal publishes articles about TinkerPop-related graph research and development.

TinkerPop Use Cases journal promotes articles on the industrial use of graphs and TinkerPop.

Both are awaiting your contributions!


PS: I prepended “TinkerPop” to the journal names and suggest an ISSN would be appropriate for both journals.

NSA-grade surveillance software: IBM i2 Analyst’s Notebook (Really?)

Tuesday, April 5th, 2016

I stumbled across Revealed: Denver Police Using NSA-Grade Surveillance Software which had this description of “NSA-grade surveillance software…:”

Intelligence gathered through Analyst’s Notebook is also used in a more active way to guide decision making, including with deliberate targeting of “networks” which could include loose groupings of friends and associates, as well as more explicit social organizations such as gangs, businesses, and potentially political organizations or protest groups. The social mapping done with Analyst’s Notebook is used to select leads, targets or points of intervention for future actions by the user. According to IBM, the i2 software allows the analyst to “use integrated social network analysis capabilities to help identify key individuals and relationships within networks” and “aid the decision-making process and optimize resource utilization for operational activities in network disruption, surveillance or influencing.” Product literature also boasts that Analyst’s Notebook “includes Social Network Analysis capabilities that are designed to deliver increased comprehension of social relationships and structures within networks of interest.”

Analyst’s Notebook is also used to conduct “call chaining” (show who is talking to who) and analyze telephone metadata. A software extension called Pattern Tracer can be used for “quickly identifying potential targets”. In the same vein, the Esri Edition of Analyst’s Notebook integrates powerful geo-spatial mapping, and allows the analyst to conduct “Pattern-of-Life Analysis” against a target. A training video for Analyst’s Notebook Esri Edition demonstrates the deployment of Pattern of Life Analysis in a military setting against an example target who appears to be a stereotyped generic Muslim terrorism suspect:

Perhaps I’m overly immune to IBM marketing pitches but I didn’t see anything in this post that could not be done with Python, R and standard visualization techniques.
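To make that claim concrete, here is what “call chaining” amounts to in plain Python: a contact graph built from call metadata and a k-hop walk outward from a starting number. The numbers are invented and this is my sketch, not IBM's algorithm.

```python
from collections import defaultdict, deque

# Hypothetical call metadata: (caller, callee) pairs.
calls = [
    ("555-0001", "555-0002"),
    ("555-0002", "555-0003"),
    ("555-0003", "555-0004"),
    ("555-0009", "555-0001"),
]

# Build an undirected contact graph: any call links both parties.
graph = defaultdict(set)
for caller, callee in calls:
    graph[caller].add(callee)
    graph[callee].add(caller)

def chain(start, hops):
    """Everyone reachable from `start` within `hops` calls (BFS)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        number, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in graph[number]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}
```

One hop from 555-0001 finds its direct contacts; three hops walks the whole chain. That is the “network targeting” capability, minus the marketing.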

I understand that IBM markets the i2 Analyst’s Notebook (and training too) as:

…deliver[ing] timely, actionable intelligence to help identify, predict, prevent and disrupt criminal, terrorist and fraudulent activities.

to a reported tune of over 2,500 organizations worldwide.

However, you have to bear in mind that the software isn’t delivering that value-add by itself; it takes the analyst, the right data, and the IBM software together. That is, the software is at best only one-third of what is required for meaningful results.

That insight seems to have gotten lost in IBM’s marketing pitch for the i2 Analyst’s Notebook and its use by the Denver police.

But to be fair, I have included, below the horizontal bar, the complete list of features for the i2 Analyst’s Notebook.

Do you see any that can’t be duplicated with standard software?

I don’t.

That’s another reason to object to the Denver Police falling into the clutches of maintenance agreements/training on software that is likely irrelevant to their day to day tasks.

IBM® i2® Analyst’s Notebook® is a visual intelligence analysis environment that can optimize the value of massive amounts of information collected by government agencies and businesses. With an intuitive and contextual design it allows analysts to quickly collate, analyze and visualize data from disparate sources while reducing the time required to discover key information in complex data. IBM i2 Analyst’s Notebook delivers timely, actionable intelligence to help identify, predict, prevent and disrupt criminal, terrorist and fraudulent activities.

i2 Analyst’s Notebook helps organizations to:

Rapidly piece together disparate data

Identify key people, events, connections and patterns

Increase understanding of the structure, hierarchy and method of operation

Simplify the communication of complex data

Capitalize on rapid deployment that delivers productivity gains quickly

Be sure to leave a comment if you see “NSA-grade” capabilities. We would all like to know what those are.

Game of Thrones – Network Analysis

Thursday, March 31st, 2016


You can read the popular account of this network analysis of the Game of Thrones in Mathematicians mapped out every “Game of Thrones” relationship to find the main character by Adam Epstein or, you can try Network of Thrones by Andrew Beveridge and Jie Shan.

There are a number of choices you may want to re-visit if you explore the Game of Thrones as a graph/network, not the least of which are expanding the data beyond volume 3, characterizing the types of “relationships” (edges) found between characters, and deciding how to capture the development of those relationships over time.
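The Beveridge/Shan edges are just proximity counts: two character names appearing within fifteen words of each other. A toy version in plain Python, with invented text and a smaller window:

```python
import re
from collections import Counter

# Invented text and a short window; the paper used a 15-word window
# over the actual novels.
text = ("Jon met Arya near the wall. Later Arya wrote to Sansa. "
        "Jon and Sansa never spoke.")
characters = {"Jon", "Arya", "Sansa"}
WINDOW = 5  # words

words = re.findall(r"\w+", text)
edges = Counter()
for i, word in enumerate(words):
    if word not in characters:
        continue
    # count every distinct character name in the next WINDOW words
    for other in words[i + 1 : i + 1 + WINDOW]:
        if other in characters and other != word:
            edges[tuple(sorted((word, other)))] += 1
```

Note that the last sentence creates a Jon–Sansa edge even though it says they never spoke: proximity, not meaning, makes the edge, which is exactly the relationship-characterization problem worth re-visiting.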

Great work that will hopefully spur others to similar explorations.

Announcing the Structr Knowledge Graph

Friday, March 4th, 2016

Announcing the Structr Knowledge Graph by Alex Morgner.

From the post:

The Structr Knowledge Graph is the new one-stop resource base where all information about and around Structr are connected.

Besides the official manual, you will find Getting Started articles, FAQ, guides and tutorials, as well as links to external resources like StackOverflow questions, GitHub issues, or discussion threads.

The Knowledge Graph isn’t just another static content platform where information is stored once and then outdates over time. It is designed and built as a living structure, being updated not only by the Structr team but also semi-automatically by user activities in the support system.

By using a mixture of manual categorization and natural language processing, information is being extracted from the content origins to update and extend the graph. The SKG will replace our old documentation site.

And of course, the SKG provides an interactive graph browser, full-text search and an article tree.

I was confused when I first saw this because I think of Structr as a knowledge graph, so why the big splash? Then I saw a tweet saying it already held 386 articles, and it suddenly made sense.

This is a knowledge graph about the knowledge graph software known as Structr.

OK, I’m straight now. I think. 😉

With a topic map it would be trivial to distinguish between “Structr Knowledge Graph” in the sense of using the Structr software versus a knowledge graph about Structr, which is also known as the Structr Knowledge Graph.

Momentary cognitive dissonance, well, not so momentary but I wasn’t devoting a lot of effort to it, but not a serious problem.

More serious when the cognitive dissonance is confusing a child’s name in transliterated Arabic with that of a sanctioned target being sought by a U.S. drone.

Graph Encryption: Going Beyond Encrypted Keyword Search [Subject Identity Based Encryption]

Wednesday, March 2nd, 2016

Graph Encryption: Going Beyond Encrypted Keyword Search by Xianrui Meng.

From the post:

Encrypted search has attracted a lot of attention from practitioners and researchers in academia and industry. In previous posts, Seny already described different ways one can search on encrypted data. Here, I would like to discuss search on encrypted graph databases which are gaining a lot of popularity.

1. Graph Databases and Graph Privacy

As today’s data is getting bigger and bigger, traditional relational database management systems (RDBMS) cannot scale to the massive amounts of data generated by end users and organizations. In addition, RDBMSs cannot effectively capture certain data relationships; for example in object-oriented data structures which are used in many applications. Today, NoSQL (Not Only SQL) has emerged as a good alternative to RDBMSs. One of the many advantages of NoSQL systems is that they are capable of storing, processing, and managing large volumes of structured, semi-structured, and even unstructured data. NoSQL databases (e.g., document stores, wide-column stores, key-value (tuple) store, object databases, and graph databases) can provide the scale and availability needed in cloud environments.

In an Internet-connected world, graph databases have become an increasingly significant data model among NoSQL technologies. Social networks (e.g., Facebook, Twitter, Snapchat), protein networks, electrical grid, Web, XML documents, networked systems can all be modeled as graphs. One nice thing about graph databases is that they store the relations between entities (objects) in addition to the entities themselves and their properties. This allows the search engine to navigate both the data and their relationships extremely efficiently. Graph databases rely on the node-link-node relationship, where a node can be a profile or an object and the edge can be any relation defined by the application. Usually, we are interested in the structural characteristics of such graph databases.

What do we mean by the confidentiality of a graph? And how to do we protect it? The problem has been studied by both the security and database communities. For example, in the database and data mining community, many solutions have been proposed based on graph anonymization. The core idea here is to anonymize the nodes and edges in the graph so that re-identification is hard. Although this approach may be efficient, from a security point view it is hard to tell what is achieved. Also, by leveraging auxiliary information, researchers have studied how to attack this kind of approach. On the other hand, cryptographers have some really compelling and provably-secure tools such as ORAM and FHE (mentioned in Seny’s previous posts) that can protect all the information in a graph database. The problem, however, is their performance, which is crucial for databases. In today’s world, efficiency is more than running in polynomial time; we need solutions that run and scale to massive volumes of data. Many real world graph datasets, such as biological networks and social networks, have millions of nodes, some even have billions of nodes and edges. Therefore, besides security, scalability is one of main aspects we have to consider.

2. Graph Encryption

Previous work in encrypted search has focused on how to search encrypted documents, e.g., doing keyword search, conjunctive queries, etc. Graph encryption, on the other hand, focuses on performing graph queries on encrypted graphs rather than keyword search on encrypted documents. In some cases, this makes the problem harder since some graph queries can be extremely complex. Another technical challenge is that the privacy of nodes and edges needs to be protected but also the structure of the graph, which can lead to many interesting research directions.

Graph encryption was introduced by Melissa Chase and Seny in [CK10]. That paper shows how to encrypt graphs so that certain graph queries (e.g., neighborhood, adjacency and focused subgraphs) can be performed (though the paper is more general as it describes structured encryption). Seny and I, together with Kobbi Nissim and George Kollios, followed this up with a paper last year [MKNK15] that showed how to handle more complex graphs queries.

Apologies for the long quote but I thought this topic might be new to some readers. Xianrui goes on to describe a solution for efficient queries over encrypted graphs.
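To give the flavor of structured encryption without the formalism, here is a toy adjacency-query scheme in Python: adjacency lists stored under PRF-derived labels and encrypted with per-node pads, so only a key holder can form a query token. This is my illustration of the shape of the idea, not the actual CK10 construction (and XOR pads are not a real cipher).

```python
import hashlib
import hmac

KEY = b"client-secret-key"  # held by the client only

def prf(tag, data):
    # keyed pseudorandom function (HMAC-SHA256 as a stand-in)
    return hmac.new(KEY, tag + data, hashlib.sha256).digest()

def encrypt_graph(adjacency):
    """Server-side store: opaque label -> encrypted neighbor list."""
    store = {}
    for node, neighbors in adjacency.items():
        label = prf(b"label:", node.encode()).hex()
        pad = prf(b"pad:", node.encode())
        plain = ",".join(neighbors).encode()
        cipher = bytes(p ^ pad[i % len(pad)] for i, p in enumerate(plain))
        store[label] = cipher
    return store

def token(node):
    # Only someone with KEY can compute a query token.
    return prf(b"label:", node.encode()).hex(), prf(b"pad:", node.encode())

def query(store, tok):
    label, pad = tok
    cipher = store[label]
    plain = bytes(c ^ pad[i % len(pad)] for i, c in enumerate(cipher))
    return plain.decode().split(",")

store = encrypt_graph({"alice": ["bob", "carol"], "bob": ["alice"]})
```

The server answers neighborhood queries without ever seeing node names or edge contents; what it does learn (which opaque rows were touched) is the “leakage” the real papers analyze carefully.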

Chase and Kamara remark in Structured Encryption and Controlled Disclosure, CK10:

To address this problem we introduce the notion of structured encryption. A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key. In addition, the query process reveals no useful information about either the query or the data. An important consideration in this context is the efficiency of the query operation on the server side. In fact, in the context of cloud storage, where one often works with massive datasets, even linear time operations can be infeasible. (emphasis in original)

With just a little nudging, their:

A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key.

could be re-stated as:

A subject identity encryption scheme leaves out merging data in such a way that the resulting topic map can only be queried with knowledge of the subject identity merging key.

You may have topics that represent diagnoses such as cancer, AIDS, sexual contacts, but if none of those can be associated with individuals who are also topics in the map, there is no more disclosure than census results for a metropolitan area and a list of the citizens therein.

That is you are missing the critical merging data that would link up (associate) any diagnosis with a given individual.

Multi-property subject identities would make the problem even harder, to say nothing of conferring properties on the basis of supplied properties as part of the merging process.

One major benefit of a subject identity based approach is that without the merging key, any data set, however sensitive the information, is just a data set, until you have the basis for solving its subject identity riddle.
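A sketch of what such a merging key might look like: identity proxies computed as a keyed hash over normalized identity properties, so fragments merge only for someone holding the key. The property names and the normalization rule are invented for the example.

```python
import hashlib
import hmac

MERGE_KEY = b"held-by-the-map-owner-only"  # hypothetical secret

def identity_proxy(properties):
    """Keyed hash over normalized identity properties."""
    normalized = "|".join(
        f"{k}={properties[k].strip().lower()}" for k in sorted(properties)
    )
    return hmac.new(MERGE_KEY, normalized.encode(), hashlib.sha256).hexdigest()

# Two topic fragments describing the same subject, messily.
fragment_a = {"given": "Jane", "family": "Doe", "dob": "1970-01-01"}
fragment_b = {"dob": "1970-01-01 ", "family": "doe", "given": " JANE"}

# Proxies match -- but only a key holder can compute them, so without
# the key the fragments are just unrelated rows.
same = identity_proxy(fragment_a) == identity_proxy(fragment_b)
```

Without MERGE_KEY an attacker holding the map sees only opaque digests, which is the “just a data set” property described above.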

PS: With the usual caveats of not using social security numbers, birth dates and the like as your subject identity properties. At least not in the map proper. I can think of several ways to generate keys for merging that would be resistant to even brute force attacks.

Ping me if you are interested in pursuing that on a data set.

networkD3: D3 JavaScript Network Graphs from R

Monday, February 15th, 2016

networkD3: D3 JavaScript Network Graphs from R by Christopher Gandrud, JJ Allaire, & Kent Russell.

From the post:

This is a port of Christopher Gandrud’s R package d3Network for creating D3 network graphs to the htmlwidgets framework. The htmlwidgets framework greatly simplifies the package’s syntax for exporting the graphs, improves integration with RStudio’s Viewer Pane, RMarkdown, and Shiny web apps. See below for examples.

It currently supports three types of network graphs:

I haven’t compared this to GraphViz but the Sankey diagram option is impressive!

Google Embeds 2D & 3D Plots on Search!

Friday, February 12th, 2016

Jake Vanderplas tweeted:

Whoah… @google now embeds interactive 2D & 3D plots when you search for a function!

Seeing is believing (with controls no less):



or, sin(x+y)


In case you want to know more:

Go to: Calculator & unit converter and select How to graph equations and Geometry Calculator.

If your browser supports WebGL, Google will render 3d graphs.

What functions have you used Google to render?

Gremlin Users – Beware the Double-Underscore!

Wednesday, February 3rd, 2016

A user recently posted this example from the Gremlin documentation:

out()).values('name') [apologies for the line wrap]

which returned:

“No such property: _ for class: Script121”

Marko Rodriguez responded:

Its a double underscore, not a single underscore.

__ vs. _

I mention this to benefit beginning Gremlin users who haven’t developed an underscore stutter but also as a plea for sanity in syntax design.

It is easy to type two successive underscores, but how obvious a double underscore is versus a single underscore depends on local typography.

To say nothing of the fact that what might be obvious to the eyes of a twenty-something may not be as obvious to the eyes of a fifty-something+.

In syntax design, answer the question:

Do you want to be clever or clear?

Twitter Graph Analytics From NodeXL (With Questions)

Friday, January 29th, 2016

I’m sure you have seen this rather impressive Twitter graphic:


And you can see a larger version, with a link to the interactive version here:

Impressive visualization but…, tell me, what can you learn from these tweets about big data?

I mean, visualization is a great tool but if I am not better informed after using the visualization than before, what’s the point?

If you go to the interactive version, you will find lists derived from the data, such as “top 10 vertices, ranked by Betweenness Centrality,” top 10 URLs in the graph and groups in the graph, top domains in the graph and groups in the graph, etc.

None of which is evident from casual inspection of the graph. (Top influencers might be, if I could get the interactive version to resize, but even then only if the gap between #11 and #10 were fairly large.)
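For the curious, betweenness centrality is nothing NodeXL-specific. A brute-force sketch in plain Python on an invented five-node graph (fine for toys; real tools use Brandes’ algorithm for scale):

```python
from collections import deque
from itertools import combinations

# Invented toy graph: B is a bridge, C/D/E form a cycle around it.
graph = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "E"],
    "D": ["B", "E"],
    "E": ["C", "D"],
}

def shortest_paths(src, dst):
    """All shortest paths from src to dst, by BFS over whole paths."""
    best, paths = None, []
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break  # BFS order: everything later is longer
        node = path[-1]
        if node == dst:
            best = len(path)
            paths.append(path)
            continue
        for nxt in graph[node]:
            if nxt not in path:
                queue.append(path + [nxt])
    return paths

def betweenness(v):
    """Fraction of shortest paths between other pairs that pass through v."""
    score = 0.0
    for s, t in combinations(sorted(set(graph) - {v}), 2):
        paths = shortest_paths(s, t)
        score += sum(v in p for p in paths) / len(paths)
    return score
```

On this graph B (the bridge) scores 3.5 and the leaf A scores 0, which matches the intuition that betweenness flags brokers, not just popular nodes.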

Nothing wrong with eye candy but for touting the usefulness of visualization, let’s look for more intuitive visualizations.

I saw this particular version in a tweet by Kirk D. Borne.

A Practical Guide to Graph Databases

Wednesday, January 20th, 2016

A Practical Guide to Graph Databases by Matthias Broecheler.

Slides from Graph Day 2016 @ Austin.

If you notice any of the “trash talking” on social media about graphs and graph databases, you will find slide 15 quite amusing.

Not everyone agrees on the relative position of graph products. 😉

I haven’t seen a video of Matthias’ presentation. If you happen across one, give me a ping. Thanks!

Building Web Apps Using Flask and Neo4j [O’Reilly Market Pricing For Content?]

Saturday, January 16th, 2016

Building Web Apps Using Flask and Neo4j

When I first saw this on one of my incoming feeds today I thought it might be of interest.

When I followed the link, I found an O’Reilly video, which broke out to be:

25:23 free minutes and 133:01 minutes for $59.99.

Rounding down, that works out to about $30/hour for the video.
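For anyone checking the arithmetic (prices as quoted above):

```python
# 133:01 of paid video for $59.99.
minutes = 133 + 1 / 60
price = 59.99

exact_rate = price / (minutes / 60)   # ~$27/hour at the exact duration
rounded_rate = price / (133 // 60)    # rounding down to 2 whole hours: ~$30/hour
```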

When you compare that to other links I saw today:

What is the value proposition that sets the price on an O’Reilly video?

So far as I can tell, pricing for content on the Internet is similar to the pricing of seats on airlines.

Pricing for airline seats is beyond “arbitrary” or “capricious.” More akin to “absurd” and/or “whatever a credulous buyer will pay.”

Speculations on possible pricing models O’Reilly is using?

Suggestions on a viable pricing model for content?

You Up To Improving Traversals/DSLs/OLAP in TinkerPop 3.2.0?

Wednesday, January 13th, 2016

Big ideas for Traversals/DSLs/OLAP in TinkerPop 3.2.0 by Marko A. Rodriguez.

Marko posted a note earlier today that reads in part:

There is currently no active development on TinkerPop 3.2.0, however, in my spare time I’ve been developing (on paper) some new ideas that should make traversals, DSLs, and OLAP even better.

Problem #1: The Builder pattern for TraversalSources is lame. []

Problem #2: It is not natural going from OLTP to OLAP to OLTP to OLAP. []

I mention this because it has been almost seven (7) hours since Marko posted this note and it’s not like he is covered up with responses!

Myself included but I’m not qualified to comment on his new ideas. One or more of you are. Take up the challenge!

TinkerPop, the community and you will be better for it.


…[N]ew “GraphStore” core – Gephi

Monday, January 11th, 2016

Gephi boosts its performance with new “GraphStore” core by Mathieu Bastian.

From the post:

Gephi is a graph visualization and analysis platform – the entire tool revolves around the graph the user is manipulating. All modules (e.g. filter, ranking, layout etc.) touch the graph in some way or another and everything happens in real-time, reflected in the visualization. It’s therefore extremely important to rely on a robust and fast underlying graph structure. As explained in this article we decided in 2013 to rewrite the graph structure and started the GraphStore project. Today, this project is mostly complete and it’s time to look at some of the benefits GraphStore is bringing into Gephi (which its 0.9 release is approaching).

Performance is critical when analyzing graphs. A lot can be done to optimize how graphs are represented and accessed in the code but it remains a hard problem. The first versions of Gephi didn’t always shine in that area as the graphs were using a lot of memory and some operations such as filter were slow on large networks. A lot was learnt though and when the time came to start from scratch we knew what would move the needle. Compared to the previous implementation, GraphStore uses simpler data structures (e.g. more arrays, less maps) and cache-friendly collections to make common graph operations faster. Along the way, we relied on many micro-benchmarks to understand what was expensive and what was not. As often with Java, this can lead to surprises but it’s a necessary process to build a world-class graph library.
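The “more arrays, less maps” point is worth a tiny illustration. A CSR-style (compressed sparse row) layout, mocked up here in Python rather than Gephi’s Java, keeps every neighbor list in one flat array, so neighbor scans walk contiguous memory instead of chasing map entries. (A toy model of the idea, not GraphStore’s actual code.)

```python
# Graph: 0 -> 1,2   1 -> 2   2 -> (none)
# node i's neighbors live in targets[offsets[i]:offsets[i+1]]
offsets = [0, 2, 3, 3]
targets = [1, 2, 2]

def neighbors(node):
    return targets[offsets[node]:offsets[node + 1]]

def degree(node):
    return offsets[node + 1] - offsets[node]
```

Two plain arrays replace a map of lists; degree is a subtraction and a neighbor scan is a slice, which is exactly the kind of cache-friendliness the post is describing.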

What shall we say about the performance numbers?


The tests were run against “two different classic graphs, one small (1.5K nodes, 19K edges) and one medium (83K nodes, 68K edges).”

Less than big data size graphs but isn’t the goal of big data analysis to extract the small portion of relevant data from the big data?


Maybe there should be an axiom about gathering of irrelevant data into a big data pile, only to be excluded again.

Or premature graphification of largely irrelevant data.

Something to think about as you contribute to the further development of this high performing graph library.


Webinar: Image Similarity: Deep Learning and Beyond (January 12th/Register for Recording)

Monday, January 11th, 2016

Webinar: Image Similarity: Deep Learning and Beyond by Dato.

From the webpage:

In this talk, we will extract features from the convolutional networks applied to real estate images to build a similarity graph and then do label propagation on the images to label different images in our dataset.

Recommended for:

  • Data scientists and engineers
  • Developers and technical team managers
  • Technical product managers

What you’ll learn:

  • How to extract features from a convolutional network using GraphLab Create
  • How to build similarity graphs using nearest neighbors
  • How to implement graph algorithms such as PageRank using GraphLab Create

What we’ll cover:

  • Extracting features from convolutional networks
  • Building similarity graphs using nearest neighbors
  • Clustering: kmeans and beyond
  • Graph algorithms: PageRank and label propagation
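If label propagation is new to you, the core loop is short. A toy version in plain Python, with an invented similarity graph and seed labels (GraphLab Create wraps this kind of thing for you):

```python
# Invented similarity graph over five "images": two clusters.
graph = {
    1: [2], 2: [1, 3], 3: [2],   # one cluster
    4: [5], 5: [4],              # another
}
labels = {1: "house", 4: "apartment"}  # seed labels

# Push labels to unlabeled neighbors until nothing changes.
changed = True
while changed:
    changed = False
    for node in sorted(graph):
        if node in labels:
            continue
        neighbor_labels = [labels[n] for n in graph[node] if n in labels]
        if neighbor_labels:
            # adopt the most common label among labeled neighbors
            labels[node] = max(set(neighbor_labels), key=neighbor_labels.count)
            changed = True
```

Two seed labels spread to all five nodes; in the webinar’s setting the graph edges come from nearest-neighbor search over convolutional features rather than being written by hand.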

I had mixed results with webinars in 2015.

Looking forward to this one because of the coverage of similarity graphs.

From a subject identity perspective, how much similarity do you need to be the “same” subject?

If I have two books, one printed within the copyright period and another copy printed after the work came into the public domain, are they the same subject?

For some purposes yes and for other purposes not.

The strings we give web browsers, usually starting with “https://” these days, are crude measures of subject identity, don’t you think?

I say “the strings we give web browsers” because TBL and his cronies, who use popularity as a measure of success, continue their efforts to conflate URI, IRI, and URL into only URL. The simplification doesn’t bother me as much as the attempts to conceal it.

It’s one way to bolster a claim to have always been right: just re-write the records that anyone is likely to remember. I prefer my history with warts and all.

Prismatic Closes December 20th, 2015

Saturday, December 12th, 2015

Writing the next chapter for Prismatic

From the post:

Four years ago, we set out to build a personalized news reader that would change the way people consume content. For many of you, we did just that. But we also learned content distribution is a tough business and we’ve failed to grow at a rate that justifies continuing to support our Prismatic News products.

Beginning December 20th, 2015, our iOS, Android and web news reader products will no longer be available and access to our Interest Graph APIs will be discontinued.

Once the product is shut down, you will no longer have access to the stories you’ve read or saved within Prismatic. We recommend saving anything you want to remember by adding the story to Pocket, emailing it to yourself or copying and pasting the links into another document.

Thanks to all of you who supported us during this journey; we hope you’ve learned lots of interesting things along the way.

The Prismatic Team

December 20, 2015 will be here sooner than you think.

Much could be said about viable Internet business models, economies, etc., but for the moment, concentrate on preserving content that will go dark by December 20, 2015.

Repost and tweet this to your friends and followers.

Yes, I think this is sad. In the short term, however, preservation of content should be the overriding concern.

Best of luck to the Prismatic Team.

Quantum Walks with Gremlin [Graph Day, Austin]

Wednesday, November 25th, 2015

Quantum Walks with Gremlin by Marko A. Rodriguez, Jennifer H. Watkins.


A quantum walk places a traverser into a superposition of both graph location and traversal “spin.” The walk is defined by an initial condition, an evolution determined by a unitary coin/shift-operator, and a measurement based on the sampling of the probability distribution generated from the quantum wavefunction. Simple quantum walks are studied analytically, but for large graph structures with complex topologies, numerical solutions are typically required. For the quantum theorist, the Gremlin graph traversal machine and language can be used for the numerical analysis of quantum walks on such structures. Additionally, for the graph theorist, the adoption of quantum walk principles can transform what are currently side-effect laden traversals into pure, stateless functional flows. This is true even when the constraints of quantum mechanics are not fully respected (e.g. reversible and unitary evolution). In sum, Gremlin allows both types of theorist to leverage each other’s constructs for the advancement of their respective disciplines.

Best not to tackle this new paper on Gremlin and quantum graph walks after a heavy meal. 😉
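If you want to play before tackling the paper, a coined quantum walk is small enough to simulate directly. A toy Hadamard-coined walk on an 8-node cycle in plain Python (my sketch, not the paper’s Gremlin machinery):

```python
import math

N = 8                      # cycle size
H = 1 / math.sqrt(2)       # Hadamard amplitude

# superposition over (position, spin); start at position 0, spin 0
state = {(0, 0): 1 + 0j}

def step(state):
    # Hadamard coin: |0> -> (|0>+|1>)/sqrt2, |1> -> (|0>-|1>)/sqrt2
    coined = {}
    for (pos, spin), amp in state.items():
        coined[(pos, 0)] = coined.get((pos, 0), 0) + H * amp
        sign = 1 if spin == 0 else -1
        coined[(pos, 1)] = coined.get((pos, 1), 0) + sign * H * amp
    # shift: spin 0 walks left, spin 1 walks right (unitary evolution)
    shifted = {}
    for (pos, spin), amp in coined.items():
        new_pos = (pos - 1) % N if spin == 0 else (pos + 1) % N
        shifted[(new_pos, spin)] = shifted.get((new_pos, spin), 0) + amp
    return shifted

for _ in range(5):
    state = step(state)

# measurement: probability of each position
prob = {}
for (pos, _), amp in state.items():
    prob[pos] = prob.get(pos, 0) + abs(amp) ** 2
```

Unlike a classical random walk, the distribution spreads asymmetrically and quadratically faster; the probabilities still sum to one because the evolution is unitary.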

Marko will be presenting at Graph Day, 17 January 2016, Austin, Texas. Great opportunity to hear him speak along with other cutting edge graph folks.

The walk Marko describes is located in a Hilbert space. Understandable because numerical solutions require the use of a metric space.

However, if you are modeling semantics in different universes of discourse, realize that semantics don’t possess metric spaces. Semantics lie outside of metric space, although I concede that many have imposed varying arbitrary metrics on semantics.

For example, if I am mapping the English term for “black,” as in a color, to the term “schwarz” in German, I need a “traverser” that enables the existence of both terms at separate locations, one for each universe in the graph.

You may protest that is overly complex for the representation of synonyms, but consider that “schwarz” occupies a different location in the universe of German and etymology from “black.”

For advertising, subtleties of language may not be useful, but for reading medical or technical works, an “approximate” or “almost right” meaning may be more damaging than helpful.

Who knows? Perhaps quantum computers will come closer to modeling semantics across domains better than any computer to date. Not perfectly but closer.

DegDB (Open Source Distributed Graph Database) [Tackling Who Pays For This Data?]

Tuesday, November 17th, 2015

DegDB (Open Source Distributed Graph Database) (GitHub)

The Design Doc/Ramble reads in part:

Problems With Existing Graph Databases

  • Owned by private companies with no incentive to share.
  • Public databases are used by few people with no incentive to contribute.
  • Large databases can’t fit on one machine and are expensive to traverse.
  • Existing distributed graph databases require all nodes to be trusted.

Incentivizing Hosting of Data

Every request will have either a debit (with attached bitcoin) or credit (with bitcoin promised on delivery) payment system. The server nodes will attempt to estimate how much it will cost to serve the data and if there isn’t enough bitcoin attached, will drop the request. This makes large nodes want to serve as much popular data as possible, because it allows for faster responses as well as not having to pay other nodes for their data. At the same time, little used data will cost more to access due to requiring more hops to find the data and “cold storage” servers can inflate the prices thus making it profitable for them.
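The hosting scheme is easy to prototype in miniature. A hypothetical node’s accept/drop decision, with an invented cost model and made-up prices (nothing here comes from the DegDB code):

```python
HOP_COST = 0.0001  # imaginary BTC per hop, for illustration only

def handle_request(attached_btc, local_data, key, hops_to_data):
    """Serve, forward, or drop based on attached payment vs estimated cost."""
    cost = HOP_COST * (1 + hops_to_data)
    if attached_btc < cost:
        return "dropped"            # not enough bitcoin attached
    if key in local_data:
        return local_data[key]      # popular data served locally, cheaply
    return "forwarded"              # pay upstream nodes out of the payment

local = {"popular-key": "cached-value"}
```

The incentive falls out of the cost estimate: locally cached (popular) data is cheap to serve, while cold data needs more hops and therefore a bigger attached payment, exactly as the design doc argues.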

Incentivizing Creation of Data

Data Creation on Demand

A system for requesting certain data to be curated can be employed. The requestor would place a bid for a certain piece of data to be curated, and after n-sources add the data to the graph and verify its correctness the money would be split between them.
This system could be ripe for abuse by having bots automatically fulfilling every request with random data.

Creators Paid on Usage

This method involves the consumers of the data keeping track of their data sources and upon usage paying them. This is a trust based model and may end up not paying creators anything.

The one “wow” factor of this project is the forethought to put the discussion of “who pays for this data?” up front and center.

We have all seen the failing model that starts with:

For only $35.00 (U.S.) you can view this article for 24 hours.

That makes you feel like you are almost robbing the publisher at that price. (NOT!)

Right. I’m tracking down a citation to make sure a quote or data is correct and I am going to pay $35.00 (U.S.) to have access for 24 hours. Considering that the publishers with those pricing models have already made back their costs of production and publication plus a profit from institutional subscribers (challenge them for the evidence if they deny), a very low micro-payment would be more suitable. Say $00.01 per paragraph or something on that order. Payable out of a deposit with the publisher.

I would amend the Creators Paid on Usage section to have created content unlocked only upon payment (set by the creator). Over time, creators would develop reputations for the value of their data and if you choose to buy from a private seller with no online history, that’s just your bad.

Imagine that for the Paris incident (hypothetical, none of the following is true), I had the school records for half of the people carrying out that attack. Not only do I have the originals but I also have them translated into English, assuming some or all of them are in some other language. I could cast that data (I’m not fond of the poverty of triples) into a graph format and make it known as part of a distributed graph system.

Some of the data, such as the identities of the people for whom I had records, would appear in the graphs of others as “new” data. It is up to the readers of the graph to decide if the data and the conditions for seeing it are acceptable to them.

Data could even carry a public price tag. That is if you want to pay a large enough sum, then the data in question will be opened up for everyone to have access to it.

I don’t know of any micropayment systems that are eating at the foundations of traditional publishers now but there will be many attempts before one eviscerates them one and all.

The choices we face now of “free” (read unpaid for research, writing and publication, which excludes many) versus the “pay-per-view” model that supports early 20th century models of sloth, cronyism and gate-keeping, aren’t the only ones. We need to actively seek out better and more nuanced choices.

Announcing Gephi 0.9 release date

Monday, November 2nd, 2015

Announcing Gephi 0.9 release date by Mathieu Bastian.

From the post:

Gephi has an amazing community of passionate users and developers. In the past few years, they have been very dedicated creating tutorials, developing new plugins or helping out on GitHub. They also have been patiently waiting for a new Gephi release! Today we’re happy to share with you that the wait will come to an end December 20th with the release of Gephi 0.9 for Windows, MacOS X and Linux.

We’re very excited about this upcoming release and developers are hard at work to deliver its roadmap before the end of 2015. This release will resolve a series of compatibility issues as well as improve features and performance.

Our vision for Gephi remains focused on a few fundamentals, which were already outlined in our Manifesto back in 2009. Gephi should be a software for everyone, powerful yet easy to learn. In many ways, we still have the impression that we’ve only scratched the surface and want to continue to focus on making each module of Gephi better. As part of this release, we’ve undertaken one of the most difficult projects we’ve worked on and completely rewrote the core of Gephi. Although not very visible for the end-user, this brings new capabilities, better performance and a level of code quality we can be proud of. This ensures a very solid foundation for the future of this software and paves the way for a future 1.0 version.

Below is an overview of the new features and improvements the 0.9 version will bring.

The list of highlights includes:

  • Java and MacOS compatibility
  • New redeveloped core
  • New Appearance module
  • Timestamp support
  • GEXF 1.3 support
  • Multiple files import
  • Multi-graph support (visualization in a future release)
  • New workspace selection UI
  • Gephi Toolkit release (soon after 0.9)

Enough new features to keep you busy over the long holiday season!


Query the Northwind Database as a Graph Using Gremlin

Wednesday, October 21st, 2015

Query the Northwind Database as a Graph Using Gremlin by Mark Kromer.

From the post:

One of the most popular and interesting topics in the world of NoSQL databases is graph. At DataStax, we have invested in graph computing through the acquisition of Aurelius, the company behind TitanDB, and are especially committed to ensuring the success of the Gremlin graph traversal language. Gremlin is part of the open source Apache TinkerPop graph framework project and is a graph traversal language used by many different graph databases.

I wanted to introduce you to a superb web site that our own Daniel Kuppitz maintains called “SQL2Gremlin”, which I think is a great way to start learning how to query graph databases for those of us who come from the traditional relational database world. It is full of excellent sample SQL queries from the popular public domain RDBMS dataset Northwind and demonstrates how to produce the same results by using Gremlin. For me, learning by example has been a great way to get introduced to graph querying and I think that you’ll find it very useful as well.

I’m only going to walk through a couple of examples here as an intro to what you will find at the full site. But if you are new to graph databases and Gremlin, then I highly encourage you to visit the sql2gremlin site for the rest of the complete samples. There is also a nice example of an interactive visualization / filtering, search tool here that helps visualize the Northwind data set as it has been converted into a graph model.

I’ve worked with (and worked for) Microsoft SQL Server for a very long time. Since Daniel’s examples use T-SQL, we’ll stick with SQL Server for this blog post as an intro to Gremlin and we’ll use the Northwind samples for SQL Server 2014. You can download the entire Northwind sample database here. Load that database into your SQL Server if you wish to follow along.

When I first saw the title to this post,

Query the Northwind Database as a Graph Using Gremlin (emphasis added)

I thought this was something else. A database about the Northwind album.

Little did I suspect that the Northwind Database is a test database for SQL Server 2005 and SQL Server 2008. Yikes!

Still, I thought some of you might have access to such legacy software and so I am pointing you to this post. 😉


Support for SQL Server 2005 ends April 16, 2016 (that’s next April)

Support for SQL Server 2008 ended July 8, 2014. Ouch! You are more than a year into a dangerous place. Upgrade, migrate or get another job. Hard times are coming and blame will be assigned.

Faster Graph Processing [Almost Linear Time Construction Of Spectral Sparsifier For Any Graph]

Tuesday, October 20th, 2015

Constructing Linear-Sized Spectral Sparsification in Almost-Linear Time by Yin Tat Lee, He Sun.


We present the first almost-linear time algorithm for constructing linear-sized spectral sparsification for graphs. This improves all previous constructions of linear-sized spectral sparsification, which require Ω(n^2) time.

A key ingredient in our algorithm is a novel combination of two techniques used in literature for constructing spectral sparsification: Random sampling by effective resistance, and adaptive constructions based on barrier functions.
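Effective-resistance sampling, one of the two techniques the abstract names, can be illustrated naively. The sketch below is nothing like the paper's almost-linear-time construction: it computes a dense Laplacian pseudoinverse in O(n^3) time, which is only practical for toy graphs, but it shows what the sampling probabilities look like.

```python
import numpy as np

# Naive illustration of spectral sparsification by effective-resistance
# sampling (Spielman-Srivastava style). Uses a dense pseudoinverse, so it is
# O(n^3) -- a teaching sketch, not the paper's almost-linear-time algorithm.

def effective_resistances(n, edges):
    """Return the effective resistance of each edge of an unweighted graph."""
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1; L[v, v] += 1
        L[u, v] -= 1; L[v, u] -= 1
    Lp = np.linalg.pinv(L)  # Moore-Penrose pseudoinverse of the Laplacian
    # R_uv = (e_u - e_v)^T L^+ (e_u - e_v)
    return [Lp[u, u] + Lp[v, v] - 2 * Lp[u, v] for u, v in edges]

def sparsify(n, edges, samples, rng):
    """Sample edges with probability proportional to effective resistance."""
    r = np.array(effective_resistances(n, edges))
    p = r / r.sum()
    picks = rng.choice(len(edges), size=samples, p=p)
    # reweight sampled edges so the sparsifier approximates the original graph
    return {edges[i]: 1.0 / (samples * p[i]) for i in picks}
```

On a triangle graph every edge has effective resistance 2/3 (a unit edge in parallel with a two-edge path), so all three edges are equally likely to be sampled.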

Apologies to the paper authors for my liberties with their title: Constructing Linear-Sized Spectral Sparsification in Almost-Linear Time but I wanted to capture eyes that might glaze past their more formal title.

The PR release where I saw this article reads as follows:

In the second paper, Constructing linear-sized spectral sparsification in almost-linear time, Dr He Sun, Lecturer in Computer Science in the University’s Department of Computer Science and Yin Tat Lee, a PhD student from MIT, have presented the first algorithm for constructing linear-sized spectral sparsifiers that runs in almost-linear time.

More and more applications in today’s big data scenario need to deal with graphs of millions of vertices. While traditional algorithms can be applied directly to these massive graphs, they are usually too slow to be practical when the graph contains millions of vertices. Also, storing these massive graphs is very expensive.

Dr He Sun said: “Over the past decade, there have been intensive studies in order to overcome these two bottlenecks. One notable approach is through the intermediate step called spectral sparsification, which is the approximation of any input graph by a very sparse graph that inherits many properties of the input graph. Since most algorithms run faster in sparse graphs, spectral sparsification is used as a key intermediate step in speeding up the runtime of many practical graph algorithms, including finding approximate maximum flows in an undirected graph, and approximately solving linear systems, among many others.”

Using spectral sparsification, the researchers ran many algorithms in a sparse graph, and obtained approximately the correct results as well. This general framework allowed them to speed up the runtime of a wide range of algorithms by an order of magnitude. However, to make the overall approach practical, a key issue was to find faster constructions of spectral sparsification with fewer edges in the resulting sparsifiers. There have been many studies looking at this area in the past decade.

The researchers have proved that, for any graph, they can construct its spectral sparsifier in almost-linear time, and in the output sparsifier every vertex has only a constant number of edges. This result is almost optimal with respect to the time complexity of the algorithm and the number of edges in the spectral sparsifier.

Very heavy sledding in the paper but you don’t have to be able to originate the insight in order to take advantage of the technique.


Congressional PageRank… [How To Avoid Bribery Charges]

Saturday, October 17th, 2015

Congressional PageRank – Analyzing US Congress With Neo4j and Apache Spark by William Lyon.

From the post:

As we saw previously, legis-graph is an open source software project that imports US Congressional data from Govtrack into the Neo4j graph database. This post shows how we can apply graph analytics to US Congressional data to find influential legislators in Congress. Using the Mazerunner open source graph analytics project we are able to use Apache Spark GraphX alongside Neo4j to run the PageRank algorithm on a collaboration graph of US Congress.

While Neo4j is a powerful graph database that allows for efficient OLTP queries and graph traversals using the Cypher query language, it is not optimized for global graph algorithms, such as PageRank. Apache Spark is a distributed in-memory large-scale data processing engine with a graph processing framework called GraphX. GraphX with Apache Spark is very efficient at performing global graph operations, like the PageRank algorithm. By using Spark alongside Neo4j we can enhance our analysis of US Congress using legis-graph.
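The PageRank computation the post runs with Spark GraphX can be sketched in pure Python on a toy graph. The tiny "collaboration graph" below is made up for illustration (legislators pointing to colleagues they co-sponsor bills with); the post's real pipeline uses Mazerunner, Neo4j and GraphX instead.

```python
# Minimal power-iteration PageRank sketch -- the same algorithm the post runs
# at scale via Spark GraphX. The collaboration graph here is hypothetical.

def pagerank(graph, damping=0.85, iterations=50):
    """PageRank over a dict of node -> list of out-links."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            if not outs:  # dangling node: spread its rank evenly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
            else:
                for m in outs:
                    new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

collab = {  # hypothetical co-sponsorship links between legislators
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(collab)
```

"C", which three colleagues link to, ends up with the highest score, while "D", whom nobody links to, ends up with the lowest: exactly the notion of influence the post is after.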

Excellent walk-through to get you started analyzing influence in Congress with modern data analysis tools. Getting a good grip on all these tools will be valuable.

Political scientists, among others, have studied the question of influence in Congress for decades, so if you don’t want to repeat the results of others, begin by consulting the American Political Science Review for prior work in this area.

An article that reports counter-intuitive results is: The Influence of Campaign Contributions on the Legislative Process by Lynda W. Powell.

From the introduction:

Do campaign donors gain disproportionate influence in the legislative process? Perhaps surprisingly, political scientists have struggled to answer this question. Much of the research has not identified an effect of contributions on policy; some political scientists have concluded that money does not matter; and this bottom line has been picked up by reporters and public intellectuals.1 It is essential to answer this question correctly because the result is of great normative importance in a democracy.

It is important to understand why so many studies find no causal link between contributions and policy outcomes. (emphasis added)

Lynda cites much of the existing work on the influence of donations on process, so her work makes a great starting point for further research.

As far as the lack of a “causal link between contributions and policy outcomes,” I think the answer is far simpler than Lynda suspects.

The existence of a quid-pro-quo, the exchange of value for a vote on a particular bill, is the essence of the crime of public bribery. For the details (in the United States), see: 18 U.S. Code § 201 – Bribery of public officials and witnesses

What isn’t public bribery is to donate funds to an office holder on a regular basis, unrelated to any particular vote or act on the part of that official. Think of it as bribery on an installment plan.

When U.S. officials, such as former Secretary of State Hillary Clinton complain of corruption in other governments, they are criticizing quid-pro-quo bribery and not installment plan bribery as it is practiced in the United States.

Regular contributions gain ready access to legislators and, not surprisingly, more votes will go in your favor than random chance would allow.

Regular contributions are more expensive than direct bribes but avoiding the “causal link” is essential for all involved.

CyGraph: Cybersecurity Situational Awareness…

Thursday, October 15th, 2015

CyGraph: Cybersecurity Situational Awareness That’s More Scalable, Flexible & Comprehensive by Steven Noel. (MITRE Corporation, if you can’t tell from the title.)

From the post:

Preventing and reacting to attacks in cyberspace involves a complex and rapidly changing milieu of factors, requiring a flexible architecture for advanced analytics, queries and graph visualization.

Information Overload in Security Analytics

Cyber warfare is conducted in complex environments, with numerous factors contributing to attack success and mission impacts. Network topology, host configurations, vulnerabilities, firewall settings, intrusion detection systems, mission dependencies and many other elements all play important parts.

To go beyond rudimentary assessments of security posture and attack response, organizations need to merge isolated data into higher-level knowledge of network-wide attack vulnerability and mission readiness in the face of cyber threats.

Network environments are always changing, with machines added and removed, patches applied, applications installed, firewall rules changed, etc., all with potential impact on security posture. Intrusion alerts and anti-virus warnings need attention, and even seemingly benign events such as logins, service connections and file share accesses could be associated with adversary activity.

The problem is not lack of information, but rather the ability to assemble disparate pieces of information into an overall analytic picture for situational awareness, optimal courses of action and maintaining mission readiness.

CyGraph: Turning Cybersecurity Information into Knowledge

To address these challenges, researchers at the MITRE Corporation are developing CyGraph, a tool for cyber warfare analytics, visualization and knowledge management.

Graph databases, Neo4j being one of many, can be very useful in managing complex security data.

However, as I mentioned earlier today, one of the primary issues in cybersecurity is patch management, with a full 76% of applications remaining unpatched more than two years after vulnerabilities have been discovered. (Yet Another Flash Advisory (YAFA) [Patch Due 19 October 2015])

If you haven’t taken basic steps on an issue like patch management, as in evaluating and installing patches in a timely manner, a rush to get the latest information is misplaced.

Just in case you are wondering, if you do visit MITRE Corporation, you will find that a search for “CyGraph” comes up empty. Must not be quite to the product stage just yet.

Watch for name conflicts:

and others of course.

Neo4j 2.3 RC1 is out!

Wednesday, October 14th, 2015

I saw a tweet from Michael Hunger saying Neo4j 2.3 RC1 is out.

For development only – check here.

Comment early and often!

Titan Graph DB Performance Tips

Saturday, October 10th, 2015

Titan Graph DB Performance Tips

From the post:

In Hawkular Inventory, we use the Tinkerpop API (version 2 for the time being) to store our inventory model in a graph database. We chose Titan as the storage engine configured to store the data in the Cassandra cluster that is also backing Hawkular Metrics and Alerts. This blog post will guide you through some performance-related lessons with Titan that we learned so far.

Inventory is under heavy development with a lot of redesign and refactoring going on between releases, so we took quite a naive approach to storing and querying data from the graph database. That is, we store entities from our model as vertices in the graph and the relationships between the entities as edges in the graph. Quite simple and a schoolbook example of how it should look.

We did declare a couple of indices in the database on the read-only aspects of the vertices (i.e. a “type” of the entity the vertex corresponds to) but we actually didn’t pay too much attention to the performance. We wanted to have the model right first.

Fast forward a couple of months and of course, the performance started to be a real problem. The Hawkular agent for Wildfly is inserting a non-trivial amount of entities and not only inserting them but also querying them has seen a huge performance degradation compared to the simple examples we were unit testing with (due to number of vertices and edges stored).

The time has come to think about how to squeeze some performance out of Titan as well as how to store the data and query it more intelligently.

Several performance tips but the one that caught my eye and resulted in an order of magnitude performance gain:

3. Mirror Properties on The Edges

This is the single most important optimization we’ve done so far. The rationale is this. To jump from a vertex to another vertex over an edge is a fairly expensive operation. Titan uses the adjacency lists to store the vertices and their edges in wide rows in Cassandra. It uses another adjacency list for edges and their target vertices.

So to go from vertex to vertex, Titan actually has to do 2 queries. It would be much easier if we could avoid that at least in some cases.

The solution here is to copy the values of some (frequently used and, in our case, immutable) properties from the “source” and “target” vertices directly to the edges. This helps especially in the cases where you do some kind of filtering on the target vertices that you instead can do directly on the edges. If there is a high number of edges to go through, this helps tremendously because you greatly reduce the number of times you have to do the “second hop” to the target vertex.
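The payoff of mirroring can be shown with a toy model: a vertex store plus adjacency lists, where each edge record carries a copy of an immutable "type" property of its target vertex. This is a dict-based stand-in for Titan's Cassandra storage layout, not its actual API; the property names are illustrative.

```python
# Toy model of the "mirror properties on the edges" optimization. Filtering
# on the mirrored property touches only the edge list (one lookup) instead of
# hopping to every target vertex (a second lookup per edge).

vertices = {1: {"type": "resource"}, 2: {"type": "metric"}, 3: {"type": "metric"}}
edges = {  # source -> [(target_id, mirrored_target_type)]
    1: [(2, "metric"), (3, "metric")],
}

def metric_neighbors_two_hops(src):
    """Baseline: filter by dereferencing each target vertex (two lookups)."""
    return [t for t, _ in edges[src] if vertices[t]["type"] == "metric"]

def metric_neighbors_mirrored(src):
    """Optimized: filter on the property mirrored onto the edge itself."""
    return [t for t, ttype in edges[src] if ttype == "metric"]
```

Both functions return the same neighbors; the second never consults the vertex store, which is where the post's order-of-magnitude gain comes from when edge counts are high.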

I am curious what is being stored on the vertex that requires a second search to jump to the target vertex?

That is, if you have moved “popular” vertex properties to the edge, why not move other properties of the node there?


Using Graph Structure Record Linkage on Irish Census

Friday, October 9th, 2015

Using Graph Structure Record Linkage on Irish Census Data with Neo4j by Brian Underwood.

From the post:

For just over a year I’ve been obsessed on-and-off with a project ever since I stayed in the town of Skibbereen, Ireland. Taking data from the 1901 and 1911 Irish censuses, I hoped I would be able to find a way to reliably link resident records from the two together to identify the same residents.

Since then I’ve learned a bit about master data management and record linkage and so I thought I would give it another stab.

Here I’d like to talk about how I’ve been matching records based on the local data space around objects to improve my record linkage scoring.

An interesting issue that has currency with intelligence agencies slurping up digital debris at every opportunity. So you have trillions of records. Which ones have you been able to reliably match up?

From a topic map perspective, I could not help but notice that in the 1901 census, the categories for Marriage were:

  • Married
  • Widower
  • Widow
  • Not Married

Whereas the 1911 census records:

  • Married
  • Widower
  • Widow
  • Single

As you know, one of the steps in record linkage is normalization of the data headings and values before you apply the standard techniques to link records together.

In traditional record linkage, the shift from “not married” to “single” is lost in the normalization.
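That normalization step can be sketched as a simple canonicalization table. The category values come straight from the census lists above; the canonical labels are my own choice for illustration. Note how the mapping is exactly where the 1901/1911 wording shift vanishes.

```python
# Sketch of the record-linkage normalization step: both census vocabularies
# collapse to one canonical set, erasing the "Not Married" vs "Single" shift.

CANONICAL = {
    "married": "married",
    "widower": "widowed",
    "widow": "widowed",
    "not married": "single",  # 1901 wording
    "single": "single",       # 1911 wording
}

def normalize_marital_status(raw):
    """Map a raw census marital-status value to its canonical form."""
    return CANONICAL[raw.strip().lower()]
```

A topic map approach would instead keep both surface forms as distinct names for the same subject, so the 1901/1911 difference remains queryable.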

May not be meaningful for your use case but could be important for someone studying shifts in marital relationship language. Or shifts in religious, ethnic, or racist language.

Or for that matter, shifts in the names of database column headers and/or tables. (Like anyone thinks those are stable.)

Pay close attention to how Brian models similarity candidates.

Once you move beyond string equivalent identifiers (TMDM), you are going to be facing the same issues.

Tracking Congressional Whores

Wednesday, September 30th, 2015

Introducing legis-graph – US Congressional Data With Govtrack and Neo4j by William Lyon.

From the post:

Interactions among members of any large organization are naturally a graph, yet the tools we use to collect and analyze data about these organizations often ignore the graphiness of the data and instead map the data into structures (such as relational databases) that make taking advantage of the relationships in the data much more difficult when it comes time to analyze the data. Collaboration networks are a perfect example. So let’s focus on one of the most powerful collaboration networks in the world: the US Congress.

Introducing legis-graph: US Congress in a graph

The goal of legis-graph is to provide a way of converting the data provided by Govtrack into a rich property-graph model that can be inserted into Neo4j for analysis or integrating with other datasets.

The code and instructions are available in this Github repo. The ETL workflow works like this:

  1. A shell script is used to rsync data from Govtrack for a specific Congress (i.e. the 114th Congress). The Govtrack data is a mix of JSON, CSV, and YAML files. It includes information about legislators, committees, votes, bills, and much more.
  2. To extract the pieces of data we are interested in for legis-graph a series of Python scripts are used to extract and transform the data from different formats into a series of flat CSV files.
  3. The third component is a series of Cypher statements that make use of LOAD CSV to efficiently import this data into Neo4j.
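Step 2 of the workflow can be sketched as a small JSON-to-CSV flattener. The field names below are illustrative, not the actual legis-graph schema or Govtrack layout; the real project handles YAML and CSV inputs as well.

```python
import csv
import json

# Hedged sketch of the extract/transform step: flatten Govtrack-style JSON
# into a flat CSV suitable for Cypher's LOAD CSV. Field names are assumed
# for illustration and do not reflect the real legis-graph schema.

def legislators_to_csv(json_path, csv_path):
    """Read a JSON array of legislator records and write a flat CSV."""
    with open(json_path) as f:
        legislators = json.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name", "state", "party"])
        writer.writeheader()
        for leg in legislators:
            writer.writerow({
                "id": leg["id"],
                "name": leg["name"],
                "state": leg["state"],
                "party": leg["party"],
            })
```

Each flat file then maps onto one node label or relationship type in the Cypher import scripts.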

To get started with legis-graph in Neo4j you can follow the instructions here. Alternatively, a Neo4j data store dump is available here for use without having to execute the import scripts. We are currently working to streamline the ETL process so this may change in the future, but any updates will appear in the Github README.

This project began during preparation for a graph database focused Meetup presentation. George Lesica and I wanted to provide an interesting dataset in Neo4j for new users to explore.

Whenever the U.S. Congress is mentioned, I am reminded of the Obi-Wan Kenobi’s line about Mos Eisley:

You will never find a more wretched hive of scum and villainy. We must be cautious.

The data model for William’s graph:


As you can see from the diagram above, the data model is rich and captures quite a bit of information about the actions of legislators, committees, and bills in Congress. Information about what properties are currently included is available here.

A great starting place that can be extended and enriched.

In terms of the data model, note that “subject” is now the title of a bill. Definitely could use some enrichment there.

Another association for the bill, “who_benefits.”

If you are really ambitious, try developing information on what individuals or groups make donations to the legislator on an annual basis.

To clear noise out of the data set, drop everyone who doesn’t contribute annually and even then, any total less than $5,000. Remember that members of congress depend on regular infusions of cash so erratic or one-time donors may get a holiday card but they are not on the ready access list.

The need for annual cash is one reason why episodic movements may make the news but they rarely make a difference. To make a difference requires years of steady funding and grooming of members of congress and improving your access, plus your influence.

Don’t be disappointed if you can “prove” member of congress X is in the pocket of Y or Z organization/corporation and nothing happens. More likely than not, such proof will increase their credibility for fund raising.

As Leonard Cohen says (Everybody Knows):

Everybody knows the dice are loaded, Everybody rolls with their fingers crossed

Neo4j 2.3.0 Milestone 3 Release

Monday, September 28th, 2015

Neo4j 2.3.0 Milestone 3 Release by Andreas Kollegger.

From the post:

Great news for all those Neo4j alpha-developers out there: The third (and final) milestone release of Neo4j 2.3 is now ready for download! Get your copy of Neo4j 2.3.0-M03 here.

Quick clarification: Milestone releases like this one are for development and experimentation only, and not all features are in their finalized form. Click here for the most fully stable version of Neo4j (2.2.5).

So, what cool new features made it into this milestone release of Neo4j? Let’s dive in.

If you want to “kick the tires” on Neo4j 2.3.0 before the production version arrives, now would be a good time.

Andreas covers new features in Neo4j 2.3.0, such as triadic selection, constraints, deletes and others.

As I read Andreas’ explanation of “triadic selection” and its role in recommendations, I started to wonder how effective recommendations are outside of movies, books, fast food, etc.

The longer I thought about it, the harder I was pressed to come up with a buying decision that isn’t influenced by recommendations, either explicit or implicit.

Can a “rational” market exist if we are all more or less like lemmings when it comes to purchasing (and other) decisions?
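The triadic-closure idea behind such recommendations is easy to sketch: recommend the things your neighbors connect to that you don't yet. This is a pure-Python illustration of the concept, not Neo4j's triadic selection operator, and the graph is made up.

```python
from collections import Counter

# Sketch of the triadic-closure recommendation pattern: candidates are
# friends-of-friends, ranked by how many mutual neighbors vouch for them.

def recommend(graph, person):
    """Rank friends-of-friends of `person` by number of mutual neighbors."""
    direct = set(graph[person])
    counts = Counter(
        fof
        for friend in direct
        for fof in graph[friend]
        if fof != person and fof not in direct
    )
    return [name for name, _ in counts.most_common()]
```

With `graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}`, the only recommendation for "a" is "d", vouched for by both "b" and "c".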

You don’t have to decide that question to take Neo4j 2.3.0 for a spin.

Guesstimating the Future

Thursday, September 24th, 2015

I ran across some introductory slides on Neo4j with the line:

Forrester estimates that over 25% of enterprises will be using graph databases by 2017.

Well, Forrester also predicted that tablet sales would overtake laptop sales in 2015: Forrester: Tablet Sales Will Eclipse Laptop Sales by 2015.

You might want to check that prediction against: Laptop sales ‘stronger than ever’ versus tablets – PCR Retail Advisory Board.

The adage “It is difficult to make predictions, especially about the future,” remains appropriate.

Neo4j doesn’t need lemming-like behavior among consumers of technology to make a case for itself.

Compare Neo4j and its query language, Cypher, to your use cases and I think you will agree.