## Archive for the ‘Graphs’ Category

### Graph Encryption: Going Beyond Encrypted Keyword Search [Subject Identity Based Encryption]

Wednesday, March 2nd, 2016

From the post:

Encrypted search has attracted a lot of attention from practitioners and researchers in academia and industry. In previous posts, Seny already described different ways one can search on encrypted data. Here, I would like to discuss search on encrypted graph databases which are gaining a lot of popularity.

1. Graph Databases and Graph Privacy

As today’s data is getting bigger and bigger, traditional relational database management systems (RDBMS) cannot scale to the massive amounts of data generated by end users and organizations. In addition, RDBMSs cannot effectively capture certain data relationships; for example in object-oriented data structures which are used in many applications. Today, NoSQL (Not Only SQL) has emerged as a good alternative to RDBMSs. One of the many advantages of NoSQL systems is that they are capable of storing, processing, and managing large volumes of structured, semi-structured, and even unstructured data. NoSQL databases (e.g., document stores, wide-column stores, key-value (tuple) store, object databases, and graph databases) can provide the scale and availability needed in cloud environments.

In an Internet-connected world, graph databases have become an increasingly significant data model among NoSQL technologies. Social networks (e.g., Facebook, Twitter, Snapchat), protein networks, electrical grids, the Web, XML documents, and networked systems can all be modeled as graphs. One nice thing about graph databases is that they store the relations between entities (objects) in addition to the entities themselves and their properties. This allows the search engine to navigate both the data and their relationships extremely efficiently. Graph databases rely on the node-link-node relationship, where a node can be a profile or an object and an edge can be any relation defined by the application. Usually, we are interested in the structural characteristics of such graph databases.

What do we mean by the confidentiality of a graph? And how do we protect it? The problem has been studied by both the security and database communities. For example, in the database and data mining community, many solutions have been proposed based on graph anonymization. The core idea here is to anonymize the nodes and edges in the graph so that re-identification is hard. Although this approach may be efficient, from a security point of view it is hard to tell what is achieved. Also, by leveraging auxiliary information, researchers have studied how to attack this kind of approach. On the other hand, cryptographers have some really compelling and provably-secure tools such as ORAM and FHE (mentioned in Seny’s previous posts) that can protect all the information in a graph database. The problem, however, is their performance, which is crucial for databases. In today’s world, efficiency is more than running in polynomial time; we need solutions that run and scale to massive volumes of data. Many real world graph datasets, such as biological networks and social networks, have millions of nodes, some even have billions of nodes and edges. Therefore, besides security, scalability is one of the main aspects we have to consider.

2. Graph Encryption

Previous work in encrypted search has focused on how to search encrypted documents, e.g., doing keyword search, conjunctive queries, etc. Graph encryption, on the other hand, focuses on performing graph queries on encrypted graphs rather than keyword search on encrypted documents. In some cases, this makes the problem harder since some graph queries can be extremely complex. Another technical challenge is that not only the privacy of nodes and edges needs to be protected, but also the structure of the graph, which can lead to many interesting research directions.

Graph encryption was introduced by Melissa Chase and Seny in [CK10]. That paper shows how to encrypt graphs so that certain graph queries (e.g., neighborhood, adjacency and focused subgraphs) can be performed (though the paper is more general as it describes structured encryption). Seny and I, together with Kobbi Nissim and George Kollios, followed this up with a paper last year [MKNK15] that showed how to handle more complex graph queries.

Apologies for the long quote but I thought this topic might be new to some readers. Xianrui goes on to describe a solution for efficient queries over encrypted graphs.

Chase and Kamara remark in Structured Encryption and Controlled Disclosure, CK10:

To address this problem we introduce the notion of structured encryption. A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key. In addition, the query process reveals no useful information about either the query or the data. An important consideration in this context is the efficiency of the query operation on the server side. In fact, in the context of cloud storage, where one often works with massive datasets, even linear time operations can be infeasible. (emphasis in original)
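Hedging up front: [CK10]’s actual construction is more general and more careful than anything shown here, but the query-token idea can be sketched in a few lines of Python. The server holds only an opaque index; a token derived from the secret key locates and unlocks one node’s neighbor list and nothing else. All names are illustrative, and the hash-based “cipher” is a stand-in for exposition, not real cryptography.

```python
import hmac, hashlib

def prf(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

def xor_stream(key: bytes, data: bytes) -> bytes:
    # toy stream cipher: keystream from SHA-256(key || counter); illustration only
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

def encrypt_graph(sk: bytes, adj: dict) -> dict:
    """Encrypted index: each adjacency list is stored at an address and under a
    key that can only be derived from the query token for that node."""
    index = {}
    for node, neighbors in adj.items():
        token = prf(sk, node.encode())                   # query token for this node
        addr = hashlib.sha256(token + b"addr").digest()  # where the entry lives
        enc_key = hashlib.sha256(token + b"key").digest()
        index[addr] = xor_stream(enc_key, ",".join(neighbors).encode())
    return index

def query(index: dict, token: bytes):
    addr = hashlib.sha256(token + b"addr").digest()
    enc_key = hashlib.sha256(token + b"key").digest()
    if addr not in index:
        return None
    return xor_stream(enc_key, index[addr]).decode().split(",")

sk = b"secret-key"
enc = encrypt_graph(sk, {"alice": ["bob", "carol"], "bob": ["alice"]})
print(query(enc, prf(sk, b"alice")))  # neighbors recoverable only with a token
```

Without the secret key the server sees only random-looking addresses and payloads; a token answers exactly one neighborhood query, which is the “controlled disclosure” flavor of the scheme.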

With just a little nudging, their:

A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key.

could be re-stated as:

A subject identity encryption scheme leaves out merging data in such a way that the resulting topic map can only be queried with knowledge of the subject identity merging key.

You may have topics that represent diagnoses such as cancer, AIDS, sexual contacts, but if none of those can be associated with individuals who are also topics in the map, there is no more disclosure than census results for a metropolitan area and a list of the citizens therein.

That is, you are missing the critical merging data that would link up (associate) any diagnosis with a given individual.

Multi-property subject identities would make the problem even harder, to say nothing of conferring properties on the basis of supplied properties as part of the merging process.

One major benefit of a subject identity based approach is that without the merging key, any data set, however sensitive the information, is just a data set, until you have the basis for solving its subject identity riddle.
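A toy sketch of that idea, with hypothetical names throughout: identity properties never appear in the published map, only keyed digests of them, so the association between person topics and diagnosis topics is recoverable only by a holder of the merging key.

```python
import hmac, hashlib

def merge_token(merge_key: bytes, identity: str) -> str:
    # keyed digest standing in for a "subject identity merging key" operation
    return hmac.new(merge_key, identity.encode(), hashlib.sha256).hexdigest()

merge_key = b"held-by-authorized-readers-only"

# The published map: diagnoses keyed by opaque tokens, person topics in the clear.
diagnoses = {merge_token(merge_key, "patient:1001"): "diagnosis: X"}
persons = ["patient:1001", "patient:2002"]

def merge(map_key: bytes):
    # only a holder of the merging key can associate persons with diagnoses
    return {p: diagnoses[t] for p in persons
            if (t := merge_token(map_key, p)) in diagnoses}

print(merge(merge_key))     # links one person topic to a diagnosis
print(merge(b"wrong key"))  # empty: no associations recoverable
```

Without `merge_key` the diagnosis topics are just a data set, as described above; the “riddle” is exactly the keyed digest.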

PS: With the usual caveats of not using social security numbers, birth dates and the like as your subject identity properties. At least not in the map proper. I can think of several ways to generate keys for merging that would be resistant to even brute force attacks.

Ping me if you are interested in pursuing that on a data set.

### networkD3: D3 JavaScript Network Graphs from R

Monday, February 15th, 2016

From the post:

This is a port of Christopher Gandrud’s R package d3Network for creating D3 network graphs to the htmlwidgets framework. The htmlwidgets framework greatly simplifies the package’s syntax for exporting the graphs and improves integration with RStudio’s Viewer Pane, RMarkdown, and Shiny web apps. See below for examples.

It currently supports three types of network graphs:

I haven’t compared this to GraphViz but the Sankey diagram option is impressive!

### Google Embeds 2D & 3D Plots on Search!

Friday, February 12th, 2016

Jake Vanderplas tweeted:

Whoah… @google now embeds interactive 2D & 3D plots when you search for a function! https://goo.gl/KOGdBq

Seeing is believing (with controls no less):

sin(x)+cos(y)

or, sin(x+y)

In case you want to know more:

Go to: Calculator & unit converter and select How to graph equations and Geometry Calculator.

If your browser supports WebGL, Google will render 3d graphs.

What functions have you used Google to render?

### Gremlin Users – Beware the Double-Underscore!

Wednesday, February 3rd, 2016

A user recently posted this example from the Gremlin documentation:

```groovy
g.V().hasLabel('person').choose(values('age')).
      option(27, _.in()).option(32, _.out()).values('name')
```

which returned:

“No such property: _ for class: Script121”

Marko Rodriguez responded:

Its a double underscore, not a single underscore.

__ vs. _
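For reference, the working form of that documentation example uses the double-underscore anonymous traversal class. This is Gremlin-Groovy for a TinkerPop console with a graph containing `person` vertices, so treat it as an untested snippet:

```groovy
g.V().hasLabel('person').
  choose(values('age')).
    option(27, __.in()).   // double underscore: the anonymous traversal class
    option(32, __.out()).
  values('name')
```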

I mention this to benefit beginning Gremlin users who haven’t developed an underscore stutter but also as a plea for sanity in syntax design.

It is easy to type two successive underscores, but the obviousness of a double underscore versus a single underscore depends on local typography.

To say nothing of the fact that what might be obvious to the eyes of a twenty-something may not be as obvious to the eyes of a fifty-something+.

In syntax design, answer the question:

Do you want to be clever or clear?

### Twitter Graph Analytics From NodeXL (With Questions)

Friday, January 29th, 2016

I’m sure you have seen this rather impressive Twitter graphic:

And you can see a larger version, with a link to the interactive version here: https://nodexlgraphgallery.org/Pages/Graph.aspx?graphID=61591

Impressive visualization but…, tell me, what can you learn from these tweets about big data?

I mean, visualization is a great tool but if I am not better informed after using the visualization than before, what’s the point?

If you go to the interactive version, you will find lists derived from the data, such as “top 10 vertices, ranked by Betweeness Centrality,” top 10 URLs in the graph and groups in the graph, top domains in the graph and groups in the graph, etc.
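For readers meeting “Betweenness Centrality” for the first time: it counts, for each vertex, how many shortest paths between other vertex pairs pass through it. A minimal pure-Python version (Brandes’ algorithm, unnormalized, and not NodeXL’s implementation) on a toy path graph:

```python
from collections import deque

def betweenness(adj):
    """Brandes' betweenness centrality for an unweighted graph.
    adj: dict node -> list of neighbours (undirected: list both directions)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors
        stack, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:
            v = q.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # accumulate dependencies in reverse BFS order
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# path graph a-b-c: every shortest path between a and c runs through b
g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(g))  # b scores 2.0 (both orderings of the a/c pair), a and c score 0
```

That is the number the “top 10 vertices, ranked by Betweenness Centrality” list is sorting on; it is precisely the kind of quantity that is invisible to casual inspection of the picture.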

None of which is evident from casual inspection of the graph. (Top influencers might be, if I could get the interactive version to resize, but even then it would be difficult unless the step between #11 and #10 was fairly large.)

Nothing wrong with eye candy but for touting the usefulness of visualization, let’s look for more intuitive visualizations.

I saw this particular version in a tweet by Kirk D. Borne.

### A Practical Guide to Graph Databases

Wednesday, January 20th, 2016

Slides from Graph Day 2016 @ Austin.

If you notice any of the “trash talking” on social media about graphs and graph databases, you will find slide 15 quite amusing.

Not everyone agrees on the relative position of graph products. 😉

I haven’t seen a video of Matthias’ presentation. If you happen across one, give me a ping. Thanks!

### Building Web Apps Using Flask and Neo4j [O’Reilly Market Pricing For Content?]

Saturday, January 16th, 2016

Building Web Apps Using Flask and Neo4j

When I first saw this on one of my incoming feeds today I thought it might be of interest.

When I followed the link, I found an O’Reilly video, which broke out to be:

25:23 free minutes and 133:01 minutes for $59.99. Rounding down, that works out to about $30/hour for the video.

When you compare that to other links I saw today:

What is the value proposition that sets the price on an O’Reilly video?

So far as I can tell, pricing for content on the Internet is similar to the pricing of seats on airlines.

Pricing for airline seats is beyond “arbitrary” or “capricious.” More akin to “absurd” and/or “whatever a credulous buyer will pay.”

Speculations on possible pricing models O’Reilly is using?

Suggestions on a viable pricing model for content?

### You Up To Improving Traversals/DSLs/OLAP in TinkerPop 3.2.0?

Wednesday, January 13th, 2016

Big ideas for Traversals/DSLs/OLAP in TinkerPop 3.2.0 by Marko A. Rodriguez.

Marko posted a note earlier today that reads in part:

There is currently no active development on TinkerPop 3.2.0, however, in my spare time I’ve been developing (on paper) some new ideas that should make traversals, DSLs, and OLAP even better.

Problem #1: The Builder pattern for TraversalSources is lame. [https://issues.apache.org/jira/browse/TINKERPOP-971]

Problem #2: It is not natural going from OLTP to OLAP to OLTP to OLAP. [https://issues.apache.org/jira/browse/TINKERPOP-570]

I mention this because it has been almost seven (7) hours since Marko posted this note and it’s not like he is covered up with responses!

Myself included, but I’m not qualified to comment on his new ideas. One or more of you are. Take up the challenge!

TinkerPop, the community and you will be better for it.

Enjoy!

### …[N]ew “GraphStore” core – Gephi

Monday, January 11th, 2016

From the post:

Gephi is a graph visualization and analysis platform – the entire tool revolves around the graph the user is manipulating. All modules (e.g. filter, ranking, layout etc.) touch the graph in some way or another and everything happens in real-time, reflected in the visualization. It’s therefore extremely important to rely on a robust and fast underlying graph structure. As explained in this article we decided in 2013 to rewrite the graph structure and started the GraphStore project. Today, this project is mostly complete and it’s time to look at some of the benefits GraphStore is bringing into Gephi (which is approaching its 0.9 release).

Performance is critical when analyzing graphs. A lot can be done to optimize how graphs are represented and accessed in the code but it remains a hard problem. The first versions of Gephi didn’t always shine in that area as the graphs were using a lot of memory and some operations such as filter were slow on large networks. A lot was learnt though and when the time came to start from scratch we knew what would move the needle. Compared to the previous implementation, GraphStore uses simpler data structures (e.g. more arrays, less maps) and cache-friendly collections to make common graph operations faster. Along the way, we relied on many micro-benchmarks to understand what was expensive and what was not. As often with Java, this can lead to surprises but it’s a necessary process to build a world-class graph library.
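The post doesn’t spell out GraphStore’s internals beyond “more arrays, less maps,” but the idea is easy to illustrate: pack a dict-of-lists adjacency into two flat arrays (the classic CSR layout), so iterating a node’s neighbors becomes a contiguous, cache-friendly slice. A hypothetical sketch, not Gephi code:

```python
def to_csr(adj, nodes):
    """Pack a dict-of-lists adjacency into two flat arrays (CSR layout):
    offsets[i]..offsets[i+1] slices `targets` for node i."""
    index = {n: i for i, n in enumerate(nodes)}
    offsets, targets = [0], []
    for n in nodes:
        targets.extend(index[m] for m in adj[n])
        offsets.append(len(targets))
    return offsets, targets

def neighbors(offsets, targets, nodes, i):
    # one contiguous slice per node instead of a per-node heap object
    return [nodes[t] for t in targets[offsets[i]:offsets[i + 1]]]

nodes = ["a", "b", "c"]
adj = {"a": ["b", "c"], "b": ["c"], "c": []}
offsets, targets = to_csr(adj, nodes)
print(offsets, targets)                        # [0, 2, 3, 3] [1, 2, 2]
print(neighbors(offsets, targets, nodes, 0))   # ['b', 'c']
```

Two flat arrays instead of thousands of small map entries is exactly the kind of change that micro-benchmarks reward on the JVM.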

What shall we say about the performance numbers?

IMPRESSIVE!

The tests were against “two different classic graphs, one small (1.5K nodes, 19K edges) and one medium (83K nodes, 68K edges).”

Less than big data size graphs but isn’t the goal of big data analysis to extract the small portion of relevant data from the big data?

Yes?

Maybe there should be an axiom about gathering of irrelevant data into a big data pile, only to be excluded again.

Or premature graphification of largely irrelevant data.

Something to think about as you contribute to the further development of this high performing graph library.

Enjoy!

### Webinar: Image Similarity: Deep Learning and Beyond (January 12th/Register for Recording)

Monday, January 11th, 2016

From the webpage:

In this talk, we will extract features from the convolutional networks applied to real estate images to build a similarity graph and then do label propagation on the images to label different images in our dataset.

Recommended for:

• Data scientists and engineers
• Developers and technical team managers
• Technical product managers

What you’ll learn:

• How to extract features from a convolutional network using GraphLab Create
• How to build similarity graphs using nearest neighbors
• How to implement graph algorithms such as PageRank using GraphLab Create

What we’ll cover:

• Extracting features from convolutional networks
• Building similarity graphs using nearest neighbors
• Clustering: kmeans and beyond
• Graph algorithms: PageRank and label propagation
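The webinar’s GraphLab Create calls aren’t reproduced here, but “building similarity graphs using nearest neighbors” reduces to something like this brute-force sketch: compute pairwise cosine similarity over feature vectors and keep each item’s top-k most similar items as edges (toy 2-D features, invented for illustration):

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_graph(features, k=2):
    """Adjacency: each item -> its k most similar other items (brute force)."""
    edges = {}
    for i, u in enumerate(features):
        sims = sorted(((cosine(u, v), j) for j, v in enumerate(features) if j != i),
                      reverse=True)
        edges[i] = [j for _, j in sims[:k]]
    return edges

feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(knn_graph(feats, k=1))  # {0: [1], 1: [0], 2: [1]}
```

In the real setting the feature vectors come from a convolutional network rather than being hand-written, and an approximate nearest-neighbor index replaces the O(n²) loop, but the graph being built is the same object.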

I had mixed results with webinars in 2015.

Looking forward to this one because of the coverage of similarity graphs.

From a subject identity perspective, how much similarity do you need to be the “same” subject?

If I have two books, one printed within the copyright period and another copy printed after the work came into the public domain, are they the same subject?

For some purposes yes and for other purposes not.

The strings we give web browsers, usually starting with “https://” these days, are crude measures of subject identity, don’t you think?

I say “the strings we give web browsers” because TBL and his cronies, holding up popularity as their measure of success, continue their efforts to conflate URI, IRI, and URL into only URL: https://url.spec.whatwg.org/. The simplification doesn’t bother me as much as the attempts to conceal it.

It’s one way to bolster a claim to have always been right: just re-write the records that anyone is likely to remember. I prefer my history warts and all.

### Prismatic Closes December 20th, 2015

Saturday, December 12th, 2015

Writing the next chapter for Prismatic

From the post:

Four years ago, we set out to build a personalized news reader that would change the way people consume content. For many of you, we did just that. But we also learned content distribution is a tough business and we’ve failed to grow at a rate that justifies continuing to support our Prismatic News products.

Beginning December 20th, 2015, our iOS, Android and web news reader products will no longer be available and access to our Interest Graph APIs will be discontinued.

Once the product is shut down, you will no longer have access to the stories you’ve read or saved within Prismatic. We recommend saving anything you want to remember by adding the story to Pocket, emailing it to yourself or copying and pasting the links into another document.

Thanks to all of you who supported us during this journey; we hope you’ve learned lots of interesting things along the way.

The Prismatic Team

December 20, 2015 will be here sooner than you think.

Much could be said about viable Internet business models, economies, etc., but for the moment, concentrate on preserving content that will go dark by December 20, 2015.

Repost and tweet this to your friends and followers.

Yes, I think this is sad. In the short term, however, preservation of content should be the overriding concern.

Best of luck to the Prismatic Team.

### Quantum Walks with Gremlin [Graph Day, Austin]

Wednesday, November 25th, 2015

Abstract:

A quantum walk places a traverser into a superposition of both graph location and traversal “spin.” The walk is defined by an initial condition, an evolution determined by a unitary coin/shift-operator, and a measurement based on the sampling of the probability distribution generated from the quantum wavefunction. Simple quantum walks are studied analytically, but for large graph structures with complex topologies, numerical solutions are typically required. For the quantum theorist, the Gremlin graph traversal machine and language can be used for the numerical analysis of quantum walks on such structures. Additionally, for the graph theorist, the adoption of quantum walk principles can transform what are currently side-effect laden traversals into pure, stateless functional flows. This is true even when the constraints of quantum mechanics are not fully respected (e.g. reversible and unitary evolution). In sum, Gremlin allows both types of theorist to leverage each other’s constructs for the advancement of their respective disciplines.
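The abstract’s ingredients (initial condition, coin/shift evolution, measurement of the probability distribution) can be made concrete with the standard textbook example, a discrete-time Hadamard walk on a line. This is a generic simulation sketch, not Gremlin and not Marko’s formulation:

```python
from math import sqrt

def hadamard_walk(steps, n):
    """Discrete-time quantum walk on positions -n..n with a Hadamard coin.
    amp[pos] = [amplitude with coin 'left', amplitude with coin 'right']."""
    h = 1 / sqrt(2)
    amp = {p: [0j, 0j] for p in range(-n, n + 1)}
    amp[0] = [h + 0j, h * 1j]          # symmetric initial coin state at the origin
    for _ in range(steps):
        new = {p: [0j, 0j] for p in amp}
        for p, (l, r) in amp.items():
            # coin (Hadamard), then shift: left component moves left, right moves right
            nl, nr = h * (l + r), h * (l - r)
            if p - 1 in new:
                new[p - 1][0] += nl
            if p + 1 in new:
                new[p + 1][1] += nr
        amp = new
    # measurement: probability of finding the traverser at each position
    return {p: abs(l) ** 2 + abs(r) ** 2 for p, (l, r) in amp.items()}

probs = hadamard_walk(steps=10, n=12)
print(round(sum(probs.values()), 10))  # unitary evolution preserves total probability: 1.0
```

Unlike a classical random walk, the measured distribution is not a Gaussian hump at the origin; most of the probability mass sits in two peaks spreading linearly with the number of steps, which is why numerical simulation is the usual tool for complex topologies.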

Best not to tackle this new paper on Gremlin and quantum graph walks after a heavy meal. 😉

Marko will be presenting at Graph Day, 17 January 2016, Austin, Texas. Great opportunity to hear him speak along with other cutting edge graph folks.

The walk Marko describes is located in a Hilbert space. Understandable because numerical solutions require the use of a metric space.

However, if you are modeling semantics in different universes of discourse, realize that semantics don’t possess metric spaces. Semantics lie outside of metric space, although I concede that many have imposed varying arbitrary metrics on semantics.

For example, if I am mapping the English term “black,” as in a color, to the term “schwarz” in German, I need a “traverser” that enables the existence of both terms at separate locations, one for each universe in the graph.

You may protest that is overly complex for the representation of synonyms, but consider that “schwarz” occupies a different location in the universe of German and etymology than “black” does.

For advertising, subtleties of language may not be useful, but for reading medical or technical works, an “approximate” or “almost right” meaning may be more damaging than helpful.

Who knows? Perhaps quantum computers will come closer to modeling semantics across domains better than any computer to date. Not perfectly but closer.

### DegDB (Open Source Distributed Graph Database) [Tackling Who Pays For This Data?]

Tuesday, November 17th, 2015

The Design Doc/Ramble reads in part:

Problems With Existing Graph Databases

• Owned by private companies with no incentive to share.
• Public databases are used by few people with no incentive to contribute.
• Large databases can’t fit on one machine and are expensive to traverse.
• Existing distributed graph databases require all nodes to be trusted.

Incentivizing Hosting of Data

Every request will have either a debit (with attached bitcoin) or credit (with bitcoin promised on delivery) payment system. The server nodes will attempt to estimate how much it will cost to serve the data and if there isn’t enough bitcoin attached, will drop the request. This makes large nodes want to serve as much popular data as possible, because it allows for faster responses as well as not having to pay other nodes for their data. At the same time, little used data will cost more to access due to requiring more hops to find the data and “cold storage” servers can inflate the prices thus making it profitable for them.
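The debit variant of that rule is simple enough to state as code; the numbers and the flat per-hop pricing are invented for illustration, not part of the DegDB design doc:

```python
def handle_request(attached_btc, hops_estimate, price_per_hop=0.0001):
    """Sketch of the debit model: estimate serving cost, drop if underfunded.
    Returns the action taken and the surplus left to pay downstream nodes."""
    cost = hops_estimate * price_per_hop
    if attached_btc < cost:
        return ("drop", 0.0)
    return ("serve", attached_btc - cost)

print(handle_request(0.001, hops_estimate=3))   # enough attached: served
print(handle_request(0.0001, hops_estimate=3))  # ('drop', 0.0)
```

Popular data sits on big nodes at one hop (cheap to fund), while cold data needs more hops and thus more attached payment, which is the incentive gradient the design doc describes.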

Incentivizing Creation of Data

Data Creation on Demand

A system for requesting certain data to be curated can be employed. The requestor would place a bid for a certain piece of data to be curated, and after n-sources add the data to the graph and verify its correctness the money would be split between them.
This system could be ripe for abuse by having bots automatically fulfilling every request with random data.

Creators Paid on Usage

This method involves the consumers of the data keeping track of their data sources and upon usage paying them. This is a trust based model and may end up not paying creators anything.

The one “wow” factor of this project is the forethought to put the discussion of “who pays for this data?” up front and center.

We have all seen the failing model that starts with:

For only $35.00 (U.S.) you can view this article for 24 hours. That makes you feel like you are almost robbing the publisher at that price. (NOT!)

Right. I’m tracking down a citation to make sure a quote or data is correct and I am going to pay $35.00 (U.S.) to have access for 24 hours.

Considering that the publishers with those pricing models have already made back their costs of production and publication plus a profit from institutional subscribers (challenge them for the evidence if they deny it), a very low micro-payment would be more suitable. Say $0.01 per paragraph or something on that order. Payable out of a deposit with the publisher.

I would amend the Creators Paid on Usage section to have created content unlocked only upon payment (set by the creator). Over time, creators would develop reputations for the value of their data, and if you choose to buy from a private seller with no online history, that’s just your bad luck.

Imagine that for the Paris incident (hypothetical, none of the following is true), I had the school records for half of the people carrying out that attack. Not only do I have the originals but I also have them translated into English, assuming some or all of them are in some other language.

I could cast that data (I’m not fond of the poverty of triples) into a graph format and make it known as part of a distributed graph system. Some of the data, such as the identities of the people for whom I had records, would appear in the graphs of others as “new” data. Up to the readers of the graph to decide if the data and the conditions for seeing it are acceptable to them.

Data could even carry a public price tag. That is, if you want to pay a large enough sum, then the data in question will be opened up for everyone to have access to it.

I don’t know of any micropayment systems that are eating at the foundations of traditional publishers now but there will be many attempts before one eviscerates them one and all.
The choices we face now, of “free” (read: unpaid-for research, writing and publication, which excludes many) versus the “pay-per-view” model that supports early 20th century habits of sloth, cronyism and gate-keeping, aren’t the only ones. We need to actively seek out better and more nuanced choices.

### Announcing Gephi 0.9 release date

Monday, November 2nd, 2015

From the post:

Gephi has an amazing community of passionate users and developers. In the past few years, they have been very dedicated creating tutorials, developing new plugins or helping out on GitHub. They also have been patiently waiting for a new Gephi release! Today we’re happy to share with you that the wait will come to an end December 20th with the release of Gephi 0.9 for Windows, MacOS X and Linux. We’re very excited about this upcoming release and developers are hard at work to deliver its roadmap before the end of 2015. This release will resolve a series of compatibility issues as well as improve features and performance.

Our vision for Gephi remains focused on a few fundamentals, which were already outlined in our Manifesto back in 2009. Gephi should be a software for everyone, powerful yet easy to learn. In many ways, we still have the impression that we’ve only scratched the surface and want to continue to focus on making each module of Gephi better. As part of this release, we’ve undertaken one of the most difficult projects we’ve worked on and completely rewrote the core of Gephi. Although not very visible to the end-user, this brings new capabilities, better performance and a level of code quality we can be proud of. This ensures a very solid foundation for the future of this software and paves the way for a future 1.0 version.

Below is an overview of the new features and improvements the 0.9 version will bring.
The list of highlights includes:

• Java and MacOS compatibility
• New redeveloped core
• New Appearance module
• Timestamp support
• GEXF 1.3 support
• Multiple files import
• Multi-graph support (visualization in a future release)
• New workspace selection UI
• Gephi Toolkit release (soon after 0.9)

Enough new features to keep you busy over the long holiday season!

Enjoy!

### Query the Northwind Database as a Graph Using Gremlin

Wednesday, October 21st, 2015

From the post:

One of the most popular and interesting topics in the world of NoSQL databases is graph. At DataStax, we have invested in graph computing through the acquisition of Aurelius, the company behind TitanDB, and are especially committed to ensuring the success of the Gremlin graph traversal language. Gremlin is part of the open source Apache TinkerPop graph framework project and is a graph traversal language used by many different graph databases.

I wanted to introduce you to a superb web site that our own Daniel Kuppitz maintains called “SQL2Gremlin” (http://sql2gremlin.com) which I think is a great way to start learning how to query graph databases for those of us who come from the traditional relational database world. It is full of excellent sample SQL queries from the popular public domain RDBMS dataset Northwind and demonstrates how to produce the same results by using Gremlin.

For me, learning by example has been a great way to get introduced to graph querying and I think that you’ll find it very useful as well. I’m only going to walk through a couple of examples here as an intro to what you will find at the full site. But if you are new to graph databases and Gremlin, then I highly encourage you to visit the sql2gremlin site for the rest of the complete samples. There is also a nice example of an interactive visualization / filtering, search tool here that helps visualize the Northwind data set as it has been converted into a graph model.
I’ve worked with (and worked for) Microsoft SQL Server for a very long time. Since Daniel’s examples use T-SQL, we’ll stick with SQL Server for this blog post as an intro to Gremlin and we’ll use the Northwind samples for SQL Server 2014. You can download the entire Northwind sample database here. Load that database into your SQL Server if you wish to follow along.

When I first saw the title of this post, Query the Northwind Database as a Graph Using Gremlin (emphasis added), I thought this was something else. A database about the Northwind album. Little did I suspect that the Northwind Database is a test database for SQL Server 2005 and SQL Server 2008.

Yikes!

Still, I thought some of you might have access to such legacy software and so I am pointing you to this post. 😉

PSA:

Support for SQL Server 2005 ends April 16, 2016 (that’s next April).

Support for SQL Server 2008 ended July 8, 2014.

Ouch! You are more than a year into a dangerous place. Upgrade, migrate or get another job. Hard times are coming and blame will be assigned.

### Faster Graph Processing [Almost Linear Time Construction Of Spectral Sparsifier For Any Graph]

Tuesday, October 20th, 2015

Abstract:

We present the first almost-linear time algorithm for constructing linear-sized spectral sparsification for graphs. This improves all previous constructions of linear-sized spectral sparsification, which require $\Omega(n^2)$ time. A key ingredient in our algorithm is a novel combination of two techniques used in the literature for constructing spectral sparsification: random sampling by effective resistance, and adaptive constructions based on barrier functions.

Apologies to the paper’s authors for my liberties with their title, Constructing Linear-Sized Spectral Sparsification in Almost-Linear Time, but I wanted to capture eyes that might glaze past their more formal title.
The PR release where I saw this article reads as follows:

In the second paper, Constructing linear-sized spectral sparsification in almost-linear time, Dr He Sun, Lecturer in Computer Science in the University’s Department of Computer Science, and Yin Tat Lee, a PhD student from MIT, have presented the first algorithm for constructing linear-sized spectral sparsifiers that runs in almost-linear time.

More and more applications from today’s big data scenario need to deal with graphs of millions of vertices. While traditional algorithms can be applied directly to these massive graphs, they are usually too slow to be practical when the graph contains millions of vertices. Storing these massive graphs is also very expensive.

Dr He Sun said: “Over the past decade, there have been intensive studies in order to overcome these two bottlenecks. One notable approach is through the intermediate step called spectral sparsification, which is the approximation of any input graph by a very sparse graph that inherits many properties of the input graph. Since most algorithms run faster in sparse graphs, spectral sparsification is used as a key intermediate step in speeding up the runtime of many practical graph algorithms, including finding approximate maximum flows in an undirected graph, and approximately solving linear systems, among many others.”

Using spectral sparsification, the researchers ran many algorithms in a sparse graph, and obtained approximately the correct results as well. This general framework allowed them to speed up the runtime of a wide range of algorithms by a magnitude. However, to make the overall approach practical, a key issue was to find faster constructions of spectral sparsification with fewer edges in the resulting sparsifiers. There have been many studies looking at this area in the past decade.
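To make “sparsification” concrete, the toy sketch below keeps each edge with probability p and reweights survivors by 1/p, so total edge weight is preserved in expectation. This is plain uniform sampling, emphatically not the paper’s method; their contribution is sampling by effective resistance (plus barrier-function arguments), which is what delivers spectral guarantees with only linearly many edges.

```python
import random

def sparsify_uniform(edges, keep_prob, seed=0):
    """Toy sparsifier: keep each (u, v, weight) edge with probability keep_prob
    and reweight by 1/keep_prob so expected total weight is unchanged."""
    rng = random.Random(seed)
    return [(u, v, w / keep_prob) for (u, v, w) in edges
            if rng.random() < keep_prob]

# a 1000-edge path graph, unit weights
edges = [(i, i + 1, 1.0) for i in range(1000)]
sparse = sparsify_uniform(edges, keep_prob=0.5)
total = sum(w for _, _, w in sparse)
print(len(sparse), round(total))  # roughly half the edges, total weight near 1000
```

Uniform sampling can disconnect bottleneck edges that effective-resistance sampling would almost surely keep; that difference is the whole point of the paper’s weighting.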
The researchers have proved that, for any graph, they can construct in almost-linear time its spectral sparsifier, and in the output sparsifier every vertex has only a constant number of neighbors. This result is almost optimal with respect to the time complexity of the algorithm and the number of edges in the spectral sparsifier.

Very heavy sledding in the paper but you don’t have to be able to originate the insight in order to take advantage of the technique.

Enjoy!

### Congressional PageRank… [How To Avoid Bribery Charges]

Saturday, October 17th, 2015

From the post:

As we saw previously, legis-graph is an open source software project that imports US Congressional data from Govtrack into the Neo4j graph database. This post shows how we can apply graph analytics to US Congressional data to find influential legislators in Congress. Using the Mazerunner open source graph analytics project we are able to use Apache Spark GraphX alongside Neo4j to run the PageRank algorithm on a collaboration graph of US Congress.

While Neo4j is a powerful graph database that allows for efficient OLTP queries and graph traversals using the Cypher query language, it is not optimized for global graph algorithms, such as PageRank. Apache Spark is a distributed in-memory large-scale data processing engine with a graph processing framework called GraphX. GraphX with Apache Spark is very efficient at performing global graph operations, like the PageRank algorithm. By using Spark alongside Neo4j we can enhance our analysis of US Congress using legis-graph.

Excellent walk-through to get you started on analyzing influence in Congress with modern data analysis tools. Getting a good grip on all these tools will be valuable.

Political scientists, among others, have studied the question of influence in Congress for decades, so if you don’t want to repeat the results of others, begin by consulting the American Political Science Review for prior work in this area.
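The PageRank computation the post runs with Spark GraphX can be sketched in a few lines of plain Python: power iteration on a toy directed graph, not the legis-graph schema.

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank over a dict node -> list of out-links."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if not out:                          # dangling node: spread evenly
                for w in nodes:
                    new[w] += damping * rank[v] / n
            else:
                for w in out:
                    new[w] += damping * rank[v] / len(out)
        rank = new
    return rank

# toy "collaboration" graph: a and b both point at c
g = {"a": ["c"], "b": ["c"], "c": ["a"]}
r = pagerank(g)
print(max(r, key=r.get))  # c
```

With damping 0.85, the vertex everyone links to ends up ranked highest, which is the intuition behind using PageRank as an influence score for legislators.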
An article that reports counter-intuitive results is: The Influence of Campaign Contributions on the Legislative Process by Lynda W. Powell.

From the introduction:

Do campaign donors gain disproportionate influence in the legislative process? Perhaps surprisingly, political scientists have struggled to answer this question. Much of the research has not identified an effect of contributions on policy; some political scientists have concluded that money does not matter; and this bottom line has been picked up by reporters and public intellectuals.1 It is essential to answer this question correctly because the result is of great normative importance in a democracy. It is important to understand why so many studies find no causal link between contributions and policy outcomes. (emphasis added)

Powell cites much of the existing work on the influence of donations on process, so her work makes a great starting point for further research.

As far as the lack of a “causal link between contributions and policy outcomes,” I think the answer is far simpler than Powell suspects. The existence of a quid pro quo, the exchange of value for a vote on a particular bill, is the essence of the crime of public bribery. For the details (in the United States), see: 18 U.S. Code § 201 – Bribery of public officials and witnesses.

What isn’t public bribery is to donate funds to an office holder on a regular basis, unrelated to any particular vote or act on the part of that official. Think of it as bribery on an installment plan.

When U.S. officials, such as former Secretary of State Hillary Clinton, complain of corruption in other governments, they are criticizing quid-pro-quo bribery and not installment-plan bribery as it is practiced in the United States. Regular contributions gain ready access to legislators and, not surprisingly, more votes will go in your favor than random chance would allow.
Regular contributions are more expensive than direct bribes but avoiding the “causal link” is essential for all involved.

### CyGraph: Cybersecurity Situational Awareness…

Thursday, October 15th, 2015

CyGraph: Cybersecurity Situational Awareness That’s More Scalable, Flexible & Comprehensive by Steven Noel. (MITRE Corporation, if you can’t tell from the title.)

From the post:

Preventing and reacting to attacks in cyberspace involves a complex and rapidly changing milieu of factors, requiring a flexible architecture for advanced analytics, queries and graph visualization.

Information Overload in Security Analytics

Cyber warfare is conducted in complex environments, with numerous factors contributing to attack success and mission impacts. Network topology, host configurations, vulnerabilities, firewall settings, intrusion detection systems, mission dependencies and many other elements all play important parts. To go beyond rudimentary assessments of security posture and attack response, organizations need to merge isolated data into higher-level knowledge of network-wide attack vulnerability and mission readiness in the face of cyber threats.

Network environments are always changing, with machines added and removed, patches applied, applications installed, firewall rules changed, etc., all with potential impact on security posture. Intrusion alerts and anti-virus warnings need attention, and even seemingly benign events such as logins, service connections and file share accesses could be associated with adversary activity.

The problem is not lack of information, but rather the ability to assemble disparate pieces of information into an overall analytic picture for situational awareness, optimal courses of action and maintaining mission readiness.

CyGraph: Turning Cybersecurity Information into Knowledge

To address these challenges, researchers at the MITRE Corporation are developing CyGraph, a tool for cyber warfare analytics, visualization and knowledge management.
Graph databases, Neo4j being one of many, can be very useful in managing complex security data. However, as I mentioned earlier today, one of the primary issues in cybersecurity is patch management, with a full 76% of applications remaining unpatched more than two years after vulnerabilities have been discovered. (Yet Another Flash Advisory (YAFA) [Patch Due 19 October 2015])

If you haven’t taken basic steps on an issue like patch management, as in evaluating and installing patches in a timely manner, a rush to get the latest information is misplaced.

Just in case you are wondering, if you do visit the MITRE Corporation site, you will find that a search for “CyGraph” comes up empty. Must not be quite to the product stage just yet.

Watch for name conflicts with existing uses of “CyGraph,” of course.

### Neo4j 2.3 RC1 is out!

Wednesday, October 14th, 2015

I saw a tweet from Michael Hunger saying Neo4j 2.3 RC1 is out. For development only – check here.

Comment early and often!

### Titan Graph DB Performance Tips

Saturday, October 10th, 2015

Titan Graph DB Performance Tips

From the post:

In Hawkular Inventory, we use the Tinkerpop API (version 2 for the time being) to store our inventory model in a graph database. We chose Titan as the storage engine, configured to store the data in the Cassandra cluster that is also backing Hawkular Metrics and Alerts. This blog post will guide you through some performance-related lessons with Titan that we learned so far.

Inventory is under heavy development with a lot of redesign and refactoring going on between releases, so we took quite a naive approach to storing and querying data from the graph database. That is, we store entities from our model as vertices in the graph and the relationships between the entities as edges in the graph. Quite simple and a textbook example of how it should look. We did declare a couple of indices in the database on the read-only aspects of the vertices (i.e.
a “type” of the entity the vertex corresponds to) but we actually didn’t pay too much attention to the performance. We wanted to have the model right first.

Fast forward a couple of months and, of course, performance started to be a real problem. The Hawkular agent for Wildfly is inserting a non-trivial number of entities, and both inserting and querying them saw a huge performance degradation compared to the simple examples we were unit testing with (due to the number of vertices and edges stored). The time had come to think about how to squeeze some performance out of Titan as well as how to store the data and query it more intelligently.

Several performance tips follow, but the one that caught my eye and resulted in an order of magnitude performance gain:

3. Mirror Properties on The Edges

This is the single most important optimization we’ve done so far. The rationale is this. To jump from a vertex to another vertex over an edge is a fairly expensive operation. Titan uses adjacency lists to store the vertices and their edges in wide rows in Cassandra. It uses another adjacency list for edges and their target vertices. So to go from vertex to vertex, Titan actually has to do 2 queries. It would be much easier if we could avoid that at least in some cases.

The solution here is to copy the values of some (frequently used and, in our case, immutable) properties from the “source” and “target” vertices directly to the edges. This helps especially in the cases where you do some kind of filtering on the target vertices that you instead can do directly on the edges. If there is a high number of edges to go through, this helps tremendously because you greatly reduce the number of times you have to do the “second hop” to the target vertex.

I am curious what is being stored on the vertex that requires a second search to jump to the target vertex. That is, if you have moved “popular” vertex properties to the edge, why not move other properties of the node there?
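The mirroring trick can be sketched with plain Python dicts standing in for Titan’s vertex and edge storage (the vertex types and property names below are made up for illustration, not Hawkular’s actual schema):

```python
# Sketch of "mirror properties on the edges": filtering on a property
# copied onto the edge avoids the second lookup ("hop") into the
# target vertex's row.

vertices = {
    "v1": {"type": "server"},
    "v2": {"type": "metric"},
    "v3": {"type": "alert"},
}

# Each edge carries a copy of the immutable "type" of its target vertex.
edges = [
    {"out": "v1", "in": "v2", "label": "contains", "target_type": "metric"},
    {"out": "v1", "in": "v3", "label": "contains", "target_type": "alert"},
]

def neighbors_slow(v, wanted_type):
    # One extra vertex lookup per edge traversed (the "second hop").
    return [e["in"] for e in edges
            if e["out"] == v and vertices[e["in"]]["type"] == wanted_type]

def neighbors_fast(v, wanted_type):
    # Filter on the edge itself; no hop to the target vertex.
    return [e["in"] for e in edges
            if e["out"] == v and e["target_type"] == wanted_type]

assert neighbors_slow("v1", "metric") == neighbors_fast("v1", "metric") == ["v2"]
```

Both functions return the same neighbors; the fast one simply never touches `vertices`, which is the saving the post measures when a vertex has many edges.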
Suggestions?

### Using Graph Structure Record Linkage on Irish Census

Friday, October 9th, 2015

From the post:

For just over a year I’ve been obsessed on-and-off with a project ever since I stayed in the town of Skibbereen, Ireland. Taking data from the 1901 and 1911 Irish censuses, I hoped I would be able to find a way to reliably link resident records from the two together to identify the same residents. Since then I’ve learned a bit about master data management and record linkage, and so I thought I would give it another stab. Here I’d like to talk about how I’ve been matching records based on the local data space around objects to improve my record linkage scoring.

An interesting issue that has currency with intelligence agencies slurping up digital debris at every opportunity. So you have trillions of records. Which ones have you been able to reliably match up?

From a topic map perspective, I could not help but notice that in the 1901 census, the categories for marriage were:

• Married
• Widower
• Widow
• Not Married

Whereas the 1911 census records:

• Married
• Widower
• Widow
• Single

As you know, one of the steps in record linkage is normalization of the data headings and values before you apply the standard techniques to link records together. In traditional record linkage, the shift from “not married” to “single” is lost in the normalization. That may not be meaningful for your use case but could be important for someone studying shifts in marital relationship language. Or shifts in religious, ethnic, or racist language. Or, for that matter, shifts in the names of database column headers and/or tables. (Like anyone thinks those are stable.)

Pay close attention to how Brian models similarity candidates. Once you move beyond string-equivalent identifiers (TMDM), you are going to be facing the same issues.
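One way to avoid losing that shift is to normalize without discarding: keep the canonical value for matching and the original value for analysis. A minimal sketch (the census categories are the ones quoted above; the canonical labels are my own choices):

```python
# Record-linkage normalization that keeps the original value alongside
# the canonical one, so the 1901 "Not Married" vs. 1911 "Single" shift
# is not lost. Canonical labels here are illustrative.

CANONICAL = {
    "married": "married",
    "widower": "widowed",
    "widow": "widowed",
    "not married": "unmarried",
    "single": "unmarried",
}

def normalize_marital(raw):
    """Return (canonical, original): link on the first, study the second."""
    return CANONICAL[raw.strip().lower()], raw

rec_1901 = normalize_marital("Not Married")
rec_1911 = normalize_marital("Single")

# The two records link on the canonical value...
assert rec_1901[0] == rec_1911[0] == "unmarried"
# ...but the language shift is still visible in the preserved originals.
assert rec_1901[1] != rec_1911[1]
```

The same pattern applies to column headers and table names: map them to a shared key for linkage, but record what they were actually called.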
### Tracking Congressional Whores

Wednesday, September 30th, 2015

From the post:

Interactions among members of any large organization are naturally a graph, yet the tools we use to collect and analyze data about these organizations often ignore the graphiness of the data and instead map the data into structures (such as relational databases) that make taking advantage of the relationships in the data much more difficult when it comes time to analyze the data. Collaboration networks are a perfect example. So let’s focus on one of the most powerful collaboration networks in the world: the US Congress.

Introducing legis-graph: US Congress in a graph

The goal of legis-graph is to provide a way of converting the data provided by Govtrack into a rich property-graph model that can be inserted into Neo4j for analysis or integrating with other datasets. The code and instructions are available in this Github repo.

The ETL workflow works like this:

1. A shell script is used to rsync data from Govtrack for a specific Congress (i.e. the 114th Congress). The Govtrack data is a mix of JSON, CSV, and YAML files. It includes information about legislators, committees, votes, bills, and much more.

2. To extract the pieces of data we are interested in for legis-graph, a series of Python scripts are used to extract and transform the data from different formats into a series of flat CSV files.

3. The third component is a series of Cypher statements that make use of LOAD CSV to efficiently import this data into Neo4j.

To get started with legis-graph in Neo4j you can follow the instructions here. Alternatively, a Neo4j data store dump is available here for use without having to execute the import scripts. We are currently working to streamline the ETL process so this may change in the future, but any updates will appear in the Github README.

This project began during preparation for a graph database focused Meetup presentation.
George Lesica and I wanted to provide an interesting dataset in Neo4j for new users to explore.

Whenever the U.S. Congress is mentioned, I am reminded of Obi-Wan Kenobi’s line about Mos Eisley:

You will never find a more wretched hive of scum and villainy. We must be cautious.

The data model for William’s graph:

As you can see from the diagram above, the data model is rich and captures quite a bit of information about the actions of legislators, committees, and bills in Congress. Information about what properties are currently included is available here.

A great starting place that can be extended and enriched. In terms of the data model, note that “subject” is now the title of a bill. It definitely could use some enrichment there. Another association for the bill: “who_benefits.”

If you are really ambitious, try developing information on what individuals or groups make donations to the legislator on an annual basis. To clear noise out of the data set, drop everyone who doesn’t contribute annually and, even then, any total less than $5,000. Remember that members of Congress depend on regular infusions of cash, so erratic or one-time donors may get a holiday card but they are not on the ready-access list.

The need for annual cash is one reason why episodic movements may make the news but rarely make a difference. Making a difference requires years of steady funding, grooming members of Congress, and improving your access and influence.

Don’t be disappointed if you can “prove” member of Congress X is in the pocket of organization/corporation Y or Z and nothing happens. More likely than not, such proof will increase their credibility for fundraising.

As Leonard Cohen says (Everybody Knows):

Everybody knows the dice are loaded
Everybody rolls with their fingers crossed

### Neo4j 2.3.0 Milestone 3 Release

Monday, September 28th, 2015

From the post:

Great news for all those Neo4j alpha-developers out there: The third (and final) milestone release of Neo4j 2.3 is now ready for download! Get your copy of Neo4j 2.3.0-M03 here.

Quick clarification: Milestone releases like this one are for development and experimentation only, and not all features are in their finalized form. Click here for the most fully stable version of Neo4j (2.2.5).

So, what cool new features made it into this milestone release of Neo4j? Let’s dive in.

If you want to “kick the tires” on Neo4j 2.3.0 before the production version arrives, now would be a good time.

Andreas covers new features in Neo4j 2.3.0, such as triadic selection, constraints, deletes and others.

As I read Andreas’ explanation of “triadic selection” and its role in recommendations, I started to wonder how effective recommendations are outside of movies, books, fast food, etc.

The longer I thought about it, the harder I was pressed to come up with a buying decision that isn’t influenced by recommendations, either explicit or implicit.

Can a “rational” market exist if we are all more or less like lemmings when it comes to purchasing (and other) decisions?

You don’t have to decide that question to take Neo4j 2.3.0 for a spin.

### Guesstimating the Future

Thursday, September 24th, 2015

I ran across some introductory slides on Neo4j with the line:

Forrester estimates that over 25% of enterprises will be using graph databases by 2017.

Well, Forrester also predicted that tablet sales would over take laptops sales in 2015: Forrester: Tablet Sales Will Eclipse Laptop Sales by 2015.

You might want to check that prediction against: Laptop sales ‘stronger than ever’ versus tablets – PCR Retail Advisory Board.

The adage “It is difficult to make predictions, especially about the future,” remains appropriate.

Neo4j doesn’t need lemming-like behavior among consumers of technology to make a case for itself.

Compare Neo4j and its query language, Cypher, to your use cases and I think you will agree.

### Graphs in the world: Modeling systems as networks

Tuesday, September 15th, 2015

Graphs in the world: Modeling systems as networks by Russell Jurney.

From the post:

Networks of all kinds drive the modern world. You can build a network from nearly any kind of data set, which is probably why network structures characterize some aspects of most phenomena. And yet, many people can’t see the networks underlying different systems. In this post, we’re going to survey a series of networks that model different systems in order to understand different ways networks help us understand the world around us.

We’ll explore how to see, extract, and create value with networks. We’ll look at four examples where I used networks to model different phenomena, starting with startup ecosystems and ending in network-driven marketing.

Loaded with successful graph modeling stories, Russell’s post will make you anxious to find a data set to model as a graph.

Which is a good thing.

Combining two inboxes (Russell’s and his brother’s) works because you can presume that identical email addresses belong to the same user. But what about different email addresses that belong to the same user?
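Once you do have evidence that two addresses belong to the same person (display names, reply chains, manual curation — however you obtain it), merging identities is a classic union-find problem. A minimal sketch, with made-up addresses and alias pairs:

```python
# Union-find over email addresses: given known alias pairs, merge
# addresses into identity clusters before building the graph.
# The addresses and pairs below are invented for illustration.

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def cluster(addresses, alias_pairs):
    """Return a list of sets, one per inferred identity."""
    parent = {a: a for a in addresses}
    for a, b in alias_pairs:
        parent[find(parent, a)] = find(parent, b)
    groups = {}
    for a in addresses:
        groups.setdefault(find(parent, a), set()).add(a)
    return list(groups.values())

addrs = ["rj@work.example", "russ@mail.example", "bob@mail.example"]
pairs = [("rj@work.example", "russ@mail.example")]
print(cluster(addrs, pairs))
# Two clusters: the two aliased addresses merge; bob stands alone.
```

Transitivity comes for free: if A is aliased to B and B to C, all three land in one cluster without an explicit A–C pair.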

For data points that will become nodes in your graph, what “properties” do you see in them that make them separate nodes? Have you captured those properties on those nodes? Ditto for relationships that will become arcs in your graph.

How easy is it for someone other than yourself to combine a graph you make with a graph made by a third person?

Data, whether represented as a graph or not, is nearly always “transparent” to its creator. Beyond modeling, the question for graphs is have you enabled transparency for others?

I first saw this in a tweet by Kirk Borne.

### Scheduling Tasks and Drawing Graphs…

Thursday, September 3rd, 2015

From the post:

When an algorithm developed for one problem domain applies almost verbatim to another, completely unrelated domain, that is the type of insight, beauty and depth that makes computer science a science on its own, and not a branch of something else, namely mathematics, like many professionals educated in the field mistakenly believe. For example, one of the common algorithmic problems during the 60s was the scheduling of tasks on multiprocessor machines. The problem is, you are given a large set of tasks, some of which depend on others, that have to be scheduled for processing on N number of processors in such a way as to maximize processor use. A well-known algorithm for this problem is the Coffman-Graham algorithm. It assumes that there are no circular dependencies between the tasks, as is usually the case when it comes to real world tasks, except in catch-22 situations at some bureaucracies run amok! To do that, the tasks and their dependencies are modeled as a DAG (a directed acyclic graph). In mathematics, this is also known as a partial order: if a task T1 depends on T2, we say that T2 precedes T1, and we write T2 < T1. The ordering is called partial because not all tasks are related in this precedence relation; some are simply independent of each other and can be safely carried out in parallel.
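The setup above can be sketched in a few lines: tasks form a DAG, and each time step runs every ready task, up to N processors. (This is plain list scheduling over the partial order, not the full Coffman-Graham labeling, which also orders the ready tasks more cleverly.)

```python
# Schedule DAG tasks on n_procs processors: at each step, run up to
# n_procs tasks whose dependencies have all completed.

def schedule(deps, n_procs):
    """deps: {task: set of tasks it depends on}; returns list of steps."""
    remaining = {t: set(d) for t, d in deps.items()}
    done, steps = set(), []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("circular dependency")  # no DAG, no schedule
        step = ready[:n_procs]  # at most N processors per time step
        steps.append(step)
        done.update(step)
        for t in step:
            del remaining[t]
    return steps

# T1 depends on T2 (T2 < T1); T3 is independent and runs in parallel.
print(schedule({"T2": set(), "T3": set(), "T1": {"T2"}}, n_procs=2))
# → [['T2', 'T3'], ['T1']]
```

The `d <= done` test is exactly the partial order at work: a task is ready once everything that precedes it has run.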

The post is a 5 minute read and ends beautifully. I promise.

### Structr Restructuring

Tuesday, September 1st, 2015

Over several recent tweets I have seen that Structr is restructuring its websites to better separate structr.com (commercial) from structr.org (open source) communities.

The upcoming Structr release is 2.0 and is set for just before @GraphConnect (Oct. 21, 2015). Another tweet says there are lots of new features (full-text file and URL indexing, SSH/SCP support, a new Files UI, etc.).

That makes October seem further away than it was just moments ago. 😉

### Network motif discovery: A GPU approach [subgraph isomorphism and topics]

Sunday, August 30th, 2015

Network motif discovery: A GPU approach by Wenqing Lin, Xiaokui Xiao, Xing Xie, and Xiao-Li Li.

Abstract:

The identification of network motifs has important applications in numerous domains, such as pattern detection in biological networks and graph analysis in digital circuits. However, mining network motifs is computationally challenging, as it requires enumerating subgraphs from a real-life graph, and computing the frequency of each subgraph in a large number of random graphs. In particular, existing solutions often require days to derive network motifs from biological networks with only a few thousand vertices. To address this problem, this paper presents a novel study on network motif discovery using Graphical Processing Units (GPUs). The basic idea is to employ GPUs to parallelize a large number of subgraph matching tasks in computing subgraph frequencies from random graphs, so as to reduce the overall computation time of network motif discovery. We explore the design space of GPU-based subgraph matching algorithms, with careful analysis of several crucial factors that affect the performance of GPU programs. Based on our analysis, we develop a GPU-based solution that (i) considerably differs from existing CPU-based methods, and (ii) exploits the strengths of GPUs in terms of parallelism while mitigating their limitations in terms of the computation power per GPU core. With extensive experiments on a variety of biological networks, we show that our solution is up to two orders of magnitude faster than the best CPU-based approach, and is around 20 times more cost-effective than the latter, when taking into account the monetary costs of the CPU and GPUs used.

For those of you who need to dodge the paywall: Network motif discovery: A GPU approach

From the introduction (topic map comments interspersed):

Given a graph $G$, a network motif in $G$ is a subgraph $g$ of $G$, such that $g$ appears much more frequently in $G$ than in random graphs whose degree distributions are similar to that of $G$ [1]. The identification of network motifs finds important applications in numerous domains. For example, network motifs are used (i) in system biology to predict protein interactions in biological networks and discover functional sub-units [2], (ii) in electronic engineering to understand the characteristics of circuits [3], and (iii) in brain science to study the functionalities of brain networks [4].

The network motif problem is to discover subgraphs of $G$. In a topic map considered as a graph $G$, by contrast, the subgraph constituting a topic, and its defined isomorphism with other topics, is declarable.

…Roughly speaking, all existing techniques adopt a common two-phase framework as follows:

Subgraph Enumeration: Given a graph $G$ and a parameter $k$, enumerate the subgraphs $g$ of $G$ with $k$ vertices each;

Frequency Estimation:

Rather than enumerating subgraph $g$ of $G$ (a topic map), we need only collect those subgraphs for the isomorphism test.

…To compute the frequency of $g$ in a random graph $G$, however, we need to derive the number of subgraphs of $G$ that are isomorphic to $g$ – this requires a large number of subgraph isomorphism tests [14], which are known to be computationally expensive.

Footnote 14 is a reference to: Practical graph isomorphism, II by Brendan D. McKay, Adolfo Piperno.

See also: nauty and Traces by Brendan McKay and Adolfo Piperno.

nauty and Traces are programs for computing automorphism groups of graphs and digraphs [*]. They can also produce a canonical label. They are written in a portable subset of C, and run on a considerable number of different systems.
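To make the expense of these isomorphism tests concrete, here is a brute-force sketch for small undirected graphs: try every injective mapping of pattern vertices onto graph vertices. The factorial blowup is exactly what nauty’s pruning, and the paper’s GPU parallelism, exist to tame.

```python
# Brute-force subgraph isomorphism for small undirected graphs:
# check whether some injective map of pattern nodes onto graph nodes
# carries every pattern edge to a graph edge. O(n! / (n-k)!) mappings.

from itertools import permutations

def has_subgraph_isomorphic_to(g_edges, g_nodes, p_edges, p_nodes):
    g_edge_set = {frozenset(e) for e in g_edges}  # undirected edges
    for image in permutations(g_nodes, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if all(frozenset((m[a], m[b])) in g_edge_set for a, b in p_edges):
            return True
    return False

triangle = [("x", "y"), ("y", "z"), ("z", "x")]
square = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(has_subgraph_isomorphic_to(square, [1, 2, 3, 4], triangle, ["x", "y", "z"]))
# → False: a 4-cycle contains no triangle.
```

Motif discovery must run tests like this against a large number of random graphs, which is why even a few thousand vertices can take days on a CPU.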

This is where topic map subgraph $g$ isomorphism issue intersects with the more general subgraph isomorphism case.

Any topic can have properties (nee internal occurrences) in addition to those thought to make it “isomorphic” to another topic. Or play roles in associations that are not played by other topics representing the same subject.

Motivated by the deficiency of existing work, we present an in-depth study on efficient solutions for network motif discovery. Instead of focusing on the efficiency of individual subgraph isomorphism tests, we propose to utilize Graphics Processing Units (GPUs) to parallelize a large number of isomorphism tests, in order to reduce the computation time of the frequency estimation phase. This idea is intuitive, and yet, it presents a research challenge since there is no existing algorithm for testing subgraph isomorphisms on GPUs. Furthermore, as shown in Section III, existing CPU-based algorithms for subgraph isomorphism tests cannot be translated into efficient solutions on GPUs, as the characteristics of GPUs make them inherently unsuitable for several key procedures used in CPU-based algorithms.

The punch line is that the authors present a solution that:

…is up to two orders of magnitude faster than the best CPU-based approach, and is around 20 times more cost-effective than the latter, when taking into account the monetary costs of the CPU and GPUs used.

Since topic isomorphism is defined as opposed to discovered, this looks like a fruitful avenue to explore for topic map engine performance.

### TinkerPop3 Promo in One Paragraph

Saturday, August 22nd, 2015

Marko A. Rodriguez tweeted https://news.ycombinator.com/item?id=10104282 as a “single paragraph” explanation of why you should prefer TinkerPop3 over TinkerPop2.

Of course, I didn’t believe the advantages could be contained in a single paragraph but you be the judge:

Check out http://tinkerpop.com. Apache TinkerPop 3.0.0 was released in June 2015 and it is a quantum leap forward. Not only is it now a part of the Apache Software Foundation, but the Gremlin3 query language has advanced significantly since Gremlin2. The language is much cleaner, provides declarative graph pattern matching constructs, and it supports both OLTP graph databases (e.g. Titan, Neo4j, OrientDB) and OLAP graph processors (e.g. Spark, Giraph). With most every graph vendor providing TinkerPop-connectivity, this should make it easier for developers as they don’t have to learn a new query language for each graph system and developers are less prone to experience vendor lock-in as their code (like JDBC/SQL) can just move to another underlying graph system.

Are my choices developer lock-in versus vendor lock-in? That’s a tough call. 😉

Do check out TinkerPop3!

### The Gremlin Graph Traversal Language (slides)

Wednesday, August 19th, 2015

The Gremlin Graph Traversal Language by Marko Rodriguez.

Forty-Five (45) out of fifty (50) slides have working Gremlin code!

Ninety percent (90%) of the slides have code you can enter!

It isn’t as complete as The Gremlin Graph Traversal Machine and Language, but on the other hand, it is a hell of a lot easier to follow along.

Enjoy!