Archive for the ‘Graphs’ Category

GraphSON and TinkerPop systems

Tuesday, August 8th, 2017

Tips for working with GraphSON and TinkerPop systems by Noah Burrell.

From the post:

If you are working with the Apache TinkerPop™ framework for graph computing, you might want to produce, edit, and save graphs, or parts of graphs, outside the graph database. To accomplish this, you might want a standardized format for a graph representation that is both machine- and human-readable. You might want features for easily moving between that format and the graph database itself. You might want to consider using GraphSON.

GraphSON is a JSON-based representation for graphs. It is especially useful to store graphs that are going to be used with TinkerPop™ systems, because Gremlin (the query language for TinkerPop™ graphs) has a GraphSON Reader/Writer that can be used for bulk upload and download in the Gremlin console. Gremlin also has a Reader/Writer for GraphML (XML-based) and Gryo (Kryo-based).

Unfortunately, I could not find any sort of standardized documentation for GraphSON, so I decided to compile a summary of my research into a single document that would help answer all the questions I had when I started working with it.

Bookmark or better yet, copy-n-paste “Vertex Rules and Conventions” to print on one page and then print “Edge Rules and Conventions” on the other.

Could possibly get both on one page but I like larger font sizes. 😉

Type in the “Example GraphSON Structure” to develop finger knowledge of the format.
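If you want to generate rather than type, here is a minimal sketch in Python of what “JSON-based” means in practice: one vertex per line, each line an ordinary JSON object. The field names (id, label, properties, outE, inV) are an assumption on my part, loosely modeled on the adjacency-style layout used with TinkerPop 3, and vertex_line is a hypothetical helper. Check both against Burrell’s rules before relying on them.

```python
import json

def vertex_line(vid, label, properties, out_edges=None):
    """Serialize one vertex as a single JSON line (assumed, simplified layout)."""
    vertex = {
        "id": vid,
        "label": label,
        # Assumption: each property key maps to a list of value objects.
        "properties": {key: [{"value": value}] for key, value in properties.items()},
    }
    if out_edges:
        # Assumption: outgoing edges are grouped by edge label,
        # and "inV" names the target vertex id.
        vertex["outE"] = {
            edge_label: [{"inV": target} for target in targets]
            for edge_label, targets in out_edges.items()
        }
    return json.dumps(vertex, sort_keys=True)

line = vertex_line(1, "person", {"name": "marko"}, {"knows": [2]})
print(line)
parsed = json.loads(line)  # the line round-trips as ordinary JSON
```

Whatever the exact field names turn out to be, the point stands: any JSON library can read and write the format, which is what makes it editable outside the graph database.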

Watch for future posts from Noah Burrell. This is useful.

It’s more than just overlap: Text As Graph

Wednesday, August 2nd, 2017

It’s more than just overlap: Text As Graph – Refining our notion of what text really is—this time for sure! by Ronald Haentjens Dekker and David J. Birnbaum.


The XML tree paradigm has several well-known limitations for document modeling and processing. Some of these have received a lot of attention (especially overlap), and some have received less (e.g., discontinuity, simultaneity, transposition, white space as crypto-overlap). Many of these have work-arounds, also well known, but—as is implicit in the term “work-around”—these work-arounds have disadvantages. Because they get the job done, however, and because XML has a large user community with diverse levels of technological expertise, it is difficult to overcome inertia and move to a technology that might offer a more comprehensive fit with the full range of document structures with which researchers need to interact both intellectually and programmatically. A high-level analysis of why XML has the limitations it has can enable us to explore how an alternative model of Text as Graph (TAG) might address these types of structures and tasks in a more natural and idiomatic way than is available within an XML paradigm.

Hyperedges, texts and XML, what more could you need? 😉

This paper merits a deep read and testing by everyone interested in serious text modeling.

You can’t read the text but here is a hypergraph visualization of an excerpt from Lewis Carroll’s “The hunting of the Snark:”

The New Testament, the Hebrew Bible, to say nothing of the Rabbinic commentaries on the Hebrew Bible and centuries of commentary on other texts could profit from this approach.

Put your text to the test and share how to advance this technique!

Neo4j 3.3.0-alpha02 (Graphs For Schemas?)

Friday, June 30th, 2017

Neo4j 3.3.0-alpha02

A bit late (release was 06/15/2017) but give Neo4j 3.3.0-alpha02 a spin over the weekend.

From the post:

Detailed Changes and Docs

For the complete list of all changes, please see the changelog. Look for 3.3 Developer manual here, and 3.3 Operations manual here.

Neo4j is one of the graph engines a friend wants to use for analysis/modeling of the ODF 1.2 schema. The traditional indented list is only one tree visualization out of the four major ones.

(From: Trees & Graphs by Nathalie Henry Riche, Microsoft Research)

Riche’s presentation covers a number of other ways to visualize trees and if you relax the “tree” requirement for display, interesting graph visualizations that may give insight into a schema design.

The slides are part of the materials for CSE512 Data Visualization (Winter 2014), so references for visualizing trees and graphs need to be updated. Check the course resources link for more visualization resources.

Trillion-Edge Graphs – Dodging Cost and the NSA

Tuesday, May 30th, 2017

Mosaic: processing a trillion-edge graph on a single machine by Adrian Colyer.

From the post:

Mosaic: Processing a trillion-edge graph on a single machine Maass et al., EuroSys’17

Unless your graph is bigger than Facebook’s, you can process it on a single machine.

With the inception of the internet, large-scale graphs comprising web graphs or social networks have become common. For example, Facebook recently reported their largest social graph comprises 1.4 billion vertices and 1 trillion edges. To process such graphs, they ran a distributed graph processing engine, Giraph, on 200 machines. But, with Mosaic, we are able to process large graphs, even proportional to Facebook’s graph, on a single machine.

In this case it’s quite a special machine – with Intel Xeon Phi coprocessors and NVMe storage. But it’s really not that expensive – the Xeon Phi used in the paper costs around $549, and a 1.2TB Intel SSD 750 costs around $750. How much do large distributed clusters cost in comparison? Especially when using expensive interconnects and large amounts of RAM.

So Mosaic costs less, but it also consistently outperforms other state-of-the-art out of core (secondary storage) engines by 3.2x-58.6x, and shows comparable performance to distributed graph engines. At one trillion edge scale, Mosaic can run an iteration of PageRank in 21 minutes (after paying a fairly hefty one-off set-up cost).

(And remember, if you have a less-than-a-trillion edges scale problem, say just a few billion edges, you can do an awful lot with just a single thread too!).

Another advantage of the single machine design, is a much simpler approach to fault tolerance:

… handling fault tolerance is as simple as checkpointing the intermediate stale data (i.e., vertex array). Further, the read-only vertex array for the current iteration can be written to disk parallel to the graph processing; it only requires a barrier on each superstep. Recovery is also trivial; processing can resume with the last checkpoint of the vertex array.

There’s a lot to this paper. Perhaps the two most central aspects are design sympathy for modern hardware, and the Hilbert-ordered tiling scheme used to divide up the work. So I’m going to concentrate mostly on those in the space available.

A publicly accessible version of the paper: Mosaic: Processing a trillion-edge graph on a single machine. Presentation slides.
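To get a feel for Hilbert ordering before reading the paper, here is a minimal sketch in Python. It is not Mosaic’s tile format, only the standard textbook conversion from grid coordinates to a position along a Hilbert curve, applied to a hypothetical 4 x 4 grid of tiles. The function name hilbert_index is mine.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its
    position along a Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so each sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order the 16 tiles of a 4 x 4 grid along the curve.
tiles = sorted(
    ((x, y) for x in range(4) for y in range(4)),
    key=lambda t: hilbert_index(4, t[0], t[1]),
)
print(tiles)
```

Consecutive tiles in that ordering are always neighbors in the grid, which is the locality property that makes the ordering attractive for dividing up an adjacency matrix.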

Definitely a paper for near the top of my reading list!

Shallow but broad graphs (think telephone surveillance data) are all the rage but how would relatively narrow but deep graphs fare when being processed by Mosaic?

Using top-end but not uncommon hardware may enable your processing requirements to escape the notice of the NSA. Another benefit to commodity hardware.


Network analysis of Game of Thrones family ties [A Timeless Network?]

Monday, May 15th, 2017

Network analysis of Game of Thrones family ties by Shirin Glander.

From the post:

In this post, I am exploring network analysis techniques in a family network of major characters from Game of Thrones.

Not surprisingly, we learn that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones; they also connect many of the storylines and are central parts of the narrative.

The basis for this network is Kaggle’s Game of Throne dataset (character-deaths.csv). Because most family relationships were missing in that dataset, I added the missing information in part by hand (based on A Wiki of Ice and Fire) and by scraping information from the Game of Thrones wiki. You can find the full code for how I generated the network on my Github page.

Glander improves network data for the Game of Thrones and walks you through the use of R to analyze that network.

It’s useful work and will repay close study.

Network analysis can be used with all social groups: activists, bankers, hackers, members of Congress (U.S.), terrorists, etc.

But just as Ned Stark has no relationship with dire wolves when the story begins, networks of social groups develop, change, evolve if you will, over time.

Moreover, events, interactions, involving one or more members of the network, occur in time sequence. A social network that fails to capture those events and their sequencing, from one or more points of view, is a highly constrained network.

A useful network as Glander demonstrates but one cannot answer simple questions about the order in which characters gained knowledge that a particular character hurled another character from a very high window.

If I were investigating say a leak of NSA cybertools, time sequencing like that would be one of my top priorities.


Network datasets (@Ognyanova)

Tuesday, May 9th, 2017

Network datasets by Katherine Ognyanova.

From the post:

Since I started posting network tutorials on this site, people will occasionally write to ask me about the included example datasets. I also get e-mails from people asking where they might find network data to use for a project or in teaching. Seems like a good idea to post a quick reply here.

The datasets included in my tutorials are mostly synthetic (or trimmed and heavily manipulated) in order to illustrate various visualization aspects in a manageable way. Feel free to use those datasets (citing or linking to the source is appreciated), but keep in mind that they are artificially generated and not the result of actual data collection. When I do use empirical data, the download files include documentation (if the data is collected by me) or clearly point to the source (if the data was collected by someone else).

If you are looking for network data, large or small, there are a number of excellent open online repositories that you can take a look at. Below is a short list (feel free to e-mail me if you have other good links, and I will add them here).

Links to ten (10) collections of network datasets, plus suggestions on software for collecting and analyzing social network data.

Consider following her: @Ognyanova. See her website for additional resources.

AI Brain Scans

Monday, March 13th, 2017

‘AI brain scans’ reveal what happens inside machine learning

The ResNet architecture is used for building deep neural networks for computer vision and image recognition. The image shown here is the forward (inference) pass of the ResNet 50 layer network used to classify images after being trained using the Graphcore neural network graph library

Credit Graphcore / Matt Fyles

The image is great eye candy, but if you want to see images annotated with information, check out: Inside an AI ‘brain’ – What does machine learning look like? (Graphcore)

From the product overview:

Poplar™ is a scalable graph programming framework targeting Intelligent Processing Unit (IPU) accelerated servers and IPU accelerated server clusters, designed to meet the growing needs of both advanced research teams and commercial deployment in the enterprise. It’s not a new language, it’s a C++ framework which abstracts the graph-based machine learning development process from the underlying graph processing IPU hardware.

Poplar includes a comprehensive, open source set of Poplar graph libraries for machine learning. In essence, this means existing user applications written in standard machine learning frameworks, like Tensorflow and MXNet, will work out of the box on an IPU. It will also be a natural basis for future machine intelligence programming paradigms which extend beyond tensor-centric deep learning. Poplar has a full set of debugging and analysis tools to help tune performance and a C++ and Python interface for application development if required.

The IPU-Appliance for the Cloud is due out in 2017. I have looked at Graphcore but came up dry on the Poplar graph libraries and/or an emulator for the IPU.

Perhaps those will both appear later in 2017.

Optimized hardware for graph calculations sounds promising but rapidly processing nodes that may or may not represent the same subject seems like a defect waiting to make itself known.

Many approaches rapidly process uncertain big data but being no more ignorant than your competition is hardly a selling point.

JanusGraph (Linux Foundation Graph Player Rides Into Town)

Wednesday, February 22nd, 2017


From the homepage:

JanusGraph is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
JanusGraph is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.

In addition, JanusGraph provides the following features:

You can clone JanusGraph from GitHub.
Read the JanusGraph documentation and join the users or developers mailing lists.

Follow the Getting Started with JanusGraph guide for a step-by-step introduction.

Supported by Google, IBM and Hortonworks, among others.

Three good reasons to pay attention to JanusGraph early and often.


Clinton/Podesta Emails – Towards A More Complete Graph (Part 3) New Dump!

Sunday, October 30th, 2016

As you may recall from Clinton/Podesta Emails – Towards A More Complete Graph (Part 2), I didn’t check whether “|” occurred in the subject lines of the extracted emails, so when I tried to create node lists using “|” as a separator, it failed.

That happens. More than many people are willing to admit.

In the meantime, a new dump of emails has arrived, so I created the new DKIM-incomplete-podesta-1-22.txt.gz file. Which means picking a new separator to use for the resulting file.

Advice: Check your proposed separator against the data file before using it. I forgot, you shouldn’t.

My new separator? |/|

Which I checked against the file to make sure there would be no conflicts.

The sed commands to remove < and > are the same as in Part 2.

Sigh, back to failure land, again.

Just as one sample:

awk 'FS="|/|" { print $7}'

where the input line is:

9991 00013434.eml|/|False|/|2015-11-21 17:15:25-05:00|/|Eryn Sepp|/|John Podesta|/|Re: Nov 30 / Future Plans / Etc.!|/|


Future Plans

I also checked that with gawk and nawk, with the same result.

For some unknown (to me) reason, all three are treating the first “/” in field 6 (by my count) as a separator, along with the second “/” in that field.

To test that theory, what do you think { print $8 } will return?

You’re right!


So with the “|/|” separator, I’m going to have at least 9 fields, perhaps more, depending on whether “/” characters occur in the subject line.


That’s not going to work.
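One possible explanation, offered as an assumption rather than a diagnosis: POSIX awk treats any FS longer than a single character as a regular expression, and inside a regular expression “|” means alternation, so “|/|” is not read as three literal characters. A plain string split has no such ambiguity. A minimal sketch in Python, using the sample line above:

```python
# Sample record from above, with "|/|" as the intended literal separator.
record = (
    "9991 00013434.eml|/|False|/|2015-11-21 17:15:25-05:00|/|Eryn Sepp|/|"
    "John Podesta|/|Re: Nov 30 / Future Plans / Etc.!|/|"
)

# str.split treats its argument as a literal string, never as a regex.
fields = record.split("|/|")

print(len(fields))  # field count, including the empty trailing field
print(fields[5])    # the subject, with its "/" characters intact
```

In awk the equivalent would presumably be to escape the pipes, something like -F'[|]/[|]', though I have not run that against this file.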

OK, so I toss the 10+ MB DKIM-complete-podesta-1-22.txt.gz into Emacs, whose regex treatment I trust, and change “|/|” to “@@@@@” and save that file as DKIM-complete-podesta-1-22-03.txt.

Another sanity check, which got us into all this trouble last time:

awk 'FS="@@@@@" { print $7}' podesta-1-22-03.txt | grep @ | wc -l

returns 36504, which plus the 16 files I culled as failures, equals 36520, the number of files in the Podesta 1-22 release.

Recall that all message-ids contain an @ sign, so getting the correct answer on the number of files gives us confidence the file is ready for further processing.
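The same sanity check can be sketched in Python. The function name and the two sample records are made up for illustration; only the “@@@@@” separator and the field position come from the post.

```python
def count_with_message_id(lines, sep="@@@@@", field=7):
    """Count records whose given field (1-based, as in awk) contains an '@'."""
    count = 0
    for line in lines:
        parts = line.rstrip("\n").split(sep)
        if len(parts) >= field and "@" in parts[field - 1]:
            count += 1
    return count

# Two made-up records: one with a plausible message-id, one without.
sample = [
    "1 a.eml@@@@@True@@@@@2015-01-01@@@@@A@@@@@B@@@@@Hello@@@@@abc@example.org",
    "2 b.eml@@@@@False@@@@@2015-01-02@@@@@C@@@@@D@@@@@No id here@@@@@",
]
matches = count_with_message_id(sample)
print(matches)
```

To run it on the real data, pass an open file handle instead of the sample list; a file object yields its lines one at a time, so the 10+ MB file is never held in memory at once.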

Apologies for it taking this much prose to go so little a distance.

Our fields (numbered for reference) are:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject – 6| Message-Id – 7

Our first node for the node list (Clinton/Podesta Emails – Towards A More Complete Graph (Part 1)) was to capture the emails themselves.

Using Message-Id (field 7) as the identifier and Subject (field 6) as its label.

We are about to encounter another problem but let’s walk through it.

An example of what we are expecting:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg;"Knox Knotes";
CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg;"Re: Tomorrow";

We have the Message-Id with a closing “;”, followed by the Subject, surrounded in double quote marks and also terminated by a “;”.

FYI: Mixing single and double quotes in awk is a real pain. I struggled with it but then was reminded I can declare variables:

-v dq='"'

which allows me to do this:

awk -v dq='"' 'FS="@@@@@" { print $7 ";" dq $6 dq ";"}' podesta-1-22-03.txt

The awk variable trick will save you considerable puzzling over escape sequences and the like.
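For comparison, the quoting problem mostly disappears in a language with more than one string delimiter. A minimal sketch in Python; node_line is a hypothetical helper, and it does nothing about subjects that themselves contain a double quote or a semicolon.

```python
def node_line(message_id, subject):
    """Format one node-list row: the id, then the quoted label, each closed by ';'."""
    return f'{message_id};"{subject}";'

row = node_line("CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg", "Knox Knotes")
print(row)
```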

Ah, now we are to the problem I mentioned above.

In the part 1 post I mentioned that while:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg;”Knox Knotes”;;”Re: Tomorrow”;


but having:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg;”Knox Knotes”;;;”Re: Tomorrow”;;

with Wikileaks links is more convenient for readers.

As you may recall, the last two lines read:

9998 00022160.eml@@@@@False@@@@@2015-06-23 23:01:55-05:00@@@@@Jerome Tatar Tatar Jerome Knotes@@@@@CAC9z1zL9vdT+9FN7ea96r
9999 00013746.eml@@@@@False@@@@@2015-04-03 01:14:56-04:00@@@@@Eryn Sepp Podesta

Which means in addition to printing Message-Id and Subject as fields one and two, we need to split ID on the space and use the result to create the URL back to Wikileaks.
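Roughly, that step looks like the following Python sketch. BASE_URL is a placeholder and not the real Wikileaks URL form, the sample record and its message-id are made up, and email_node is a hypothetical helper.

```python
BASE_URL = "https://example.org/emailid/"  # placeholder, NOT the real Wikileaks URL form

def email_node(record, sep="@@@@@", base_url=BASE_URL):
    """Build an email node row: message-id; "subject"; link;"""
    fields = record.split(sep)
    # Field 1 looks like "9981 00045326.eml"; the part before the space is the link id.
    leak_id, _filename = fields[0].split(" ", 1)
    subject, message_id = fields[5], fields[6]
    return f'{message_id};"{subject}";{base_url}{leak_id};'

# Made-up record in the seven-field layout described above.
record = (
    "9981 00045326.eml@@@@@False@@@@@2015-07-24 12:58:16-04:00@@@@@Oren Shur"
    "@@@@@John Podesta@@@@@FW: July IA Poll Results@@@@@msg-1@example.org"
)
node = email_node(record)
print(node)
```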

It’s late so I am going to leave you with DKIM-incomplete-podesta-1-22.txt.gz. This is complete save for 16 files that failed to parse. Will repost tomorrow with those included.

I have the first node file script working and that will form the basis for the creation of the edge lists.

PS: Look forward to running awk files tomorrow. It makes a number of things easier.

Clinton/Podesta Emails – Towards A More Complete Graph (Part 2)

Friday, October 28th, 2016

I assume you are starting with DKIM-complete-podesta-1-18.txt.gz.

If you are starting with another source, you will need different instructions. 😉

First, remember from Clinton/Podesta Emails – Towards A More Complete Graph (Part 1) that I wanted to delete all the < and > signs from the text.

That’s easy enough (uncompress the text first):

sed 's/<//g' DKIM-complete-podesta-1-18.txt > DKIM-complete-podesta-1-18-01.txt

followed by:

sed 's/>//g' DKIM-complete-podesta-1-18-01.txt > DKIM-complete-podesta-1-18-02.txt

Here’s where we started:

001 00032251.eml|False|2010-10-06 18:29:52-04:00|Joshua Dorner <>|”‘'” <>|[big campaign] Follow-up Materials from Background Briefing on the Chamber’s Foreign Funding, fyi|<A28459BA2B4D5D49BED0238513058A7F012ADC1EF58F>
002 00032146.eml|True|2015-04-14 18:19:46-04:00|Josh Schwerin <>|hrcrapid <>|=?UTF-8?Q?NYT=3A_Hillary_Clinton=E2=80=99s_Chipotle_Order=3A_Above_Avera?= =?UTF-8?Q?ge?=|<>

Here’s the result after the first two sed scripts:

001 00032251.eml|False|2010-10-06 18:29:52-04:00|Joshua Dorner|”‘'”|[big campaign] Follow-up Materials from Background Briefing on the Chamber’s Foreign Funding, fyi|
002 00032146.eml|True|2015-04-14 18:19:46-04:00|Josh Schwerin|hrcrapid|=?UTF-8?Q?NYT=3A_Hillary_Clinton=E2=80=99s_Chipotle_Order=3A_Above_Avera?= =?UTF-8?Q?ge?=|

BTW, I increment the numbers of my result files, DKIM-complete-podesta-1-18-01.txt, DKIM-complete-podesta-1-18-02.txt, because when I don’t, I run different sed commands on the same original file, expecting a cumulative result.

That’s spelled – disappointment and wasted effort looking for problems that aren’t problems. Number your result files.
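If you would rather do both deletions in one pass, here is a minimal sketch in Python that has the same effect as the two sed commands combined. The function name is mine.

```python
def strip_angle_brackets(text):
    """Remove every '<' and '>' in a single pass."""
    return text.translate(str.maketrans("", "", "<>"))

cleaned = strip_angle_brackets("Josh Schwerin <>|hrcrapid <>|<>")
print(repr(cleaned))
```

One pass also means one result file, which removes one opportunity to run a command against the wrong input.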

The nodes and edges mentioned in Clinton/Podesta Emails – Towards A More Complete Graph (Part 1):


  • Emails, message-id is ID and subject is label, make Wikileaks id into link
  • From/To, email addresses are ID and name is label
  • True/False, true/false as ID, True/False as labels
  • Date, truncated to 2015-07-24 (example), date as id and label


  • To/From – Edges with message-id (source) email-address (target) to/from (label)
  • Verify – Edges with message-id (source) true/false (target) verify (label)
  • Date – Edges with message-id (source) – date (target) date (label)

Am I missing anything? The longer I look at problems like this the more likely my thinking/modeling will change.

What follows is very crude, command line creation of the node and edge files. Something more elaborate could be scripted/written in any number of languages.
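As one example of “something more elaborate,” here is a minimal sketch in Python that turns a single record into the node and edge rows listed above. It assumes exactly seven “|”-separated fields with one sender and one recipient, and it uses names where the model calls for email addresses. The record and its message-id are made up, and rows_for is a hypothetical helper.

```python
def rows_for(record, sep="|"):
    """Return (nodes, edges) for one record, following the model listed above."""
    # Unpacking raises ValueError if the record does not have exactly 7 fields.
    _id, verified, date, sender, recipient, subject, message_id = record.split(sep)
    day = date[:10]          # "2015-07-24 12:58:16-04:00" -> "2015-07-24"
    flag = verified.lower()  # "False" -> "false"
    nodes = [
        (message_id, subject),   # the email itself
        (sender, sender),        # names stand in for addresses in this sketch
        (recipient, recipient),
        (flag, verified),
        (day, day),
    ]
    edges = [
        (message_id, sender, "from"),
        (message_id, recipient, "to"),
        (message_id, flag, "verify"),
        (message_id, day, "date"),
    ]
    return nodes, edges

record = ("9981 00045326.eml|False|2015-07-24 12:58:16-04:00|Oren Shur|"
          "John Podesta|FW: July IA Poll Results|msg-1@example.org")
nodes, edges = rows_for(record)
for edge in edges:
    print(";".join(edge) + ";")
```

The strict unpacking is deliberate: a record with too many or too few fields fails loudly instead of quietly shifting a subject into the message-id column.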

Our fields (numbered for reference) are:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject – 6| Message-Id – 7

You don’t have to take my word for it, try this:

awk 'FS="|" { print $7}' DKIM-complete-podesta-1-18-02.txt

The output prints to the console. Those all look like message-ids to me, well, with the exception of the one that reads ” 24 September.”

How much dirty data do you think is in the message-id field?

A crude starting assumption is that any message-id field without the “@” character is dirty.

Let’s try:

awk 'FS="|" { print $7}' DKIM-complete-podesta-1-18-02.txt | grep -v @ | wc -l

Which means we are going to extract the 7th field, search (grep) over those results for the “@” sign, where the -v switch means only print lines that DO NOT match, and we will count those lines with wc -l.

Ready? Press return.

I get 594 “dirty” message-ids.

Here is a sampling:

Rolling Stone
MSNBC, Jeff Weaver interview on Sanders health care plan and his Wall St. ad
Texas Tribune
20160120 Burlington, IA Organizing Event
start spreadin’ the news…..
Building Trades Union (Keystone XL)
Rubio Hits HRC On Cuba, Russia
Sourcing Story
Drivers Licenses
Day 1
H1N1 Flu Shot Available

Those look an awful lot like subject lines to me. You?

I suspect those subject lines contained the separator character “|” before we extracted the data from the .eml files.
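A minimal sketch in Python of how that would happen, using two made-up records: one clean, one whose subject contains a “|”. The extra separator pushes every later field one position to the right, so what awk calls $7 becomes a piece of the subject line.

```python
# Made-up records in the seven-field layout; the second subject contains a "|".
clean = "1 a.eml|True|2015-01-01|A|B|Weekly report|id-1@example.org"
dirty = "2 b.eml|True|2015-01-02|A|B|Rolling Stone | Politics|id-2@example.org"

seventh = []
for record in (clean, dirty):
    fields = record.split("|")
    seventh.append(fields[6])  # awk's $7 is index 6 here
    print(len(fields), repr(fields[6]))
```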

I’ve tried to repair the existing files but the cleaner solution is to return to the extraction script and the original email files.

More on that tomorrow!

Clinton/Podesta Emails – Towards A More Complete Graph (Part 1)

Friday, October 28th, 2016

Gephi is a great tool, but it’s only as good as its input.

The Gephi 8.2 email importer (missing in Gephi 9.*) is lossy, informationally speaking, as I have mentioned before.

Here’s a sample from the verification results on podesta-release-1-18:

9981 00045326.eml|False|2015-07-24 12:58:16-04:00|Oren Shur |John Podesta , Robby Mook , Joel Benenson , “Margolis, Jim” , Mandy Grunwald , David Binder , Teddy Goff , Jennifer Palmieri , Kristina Schake , Christina Reynolds , Katie Connolly , “Kaye, Anson” , Peter Brodnitz , “Rimel, John” , David Dixon , Rich Davis , Marlon Marshall , Michael Halle , Matt Paul , Elan Kriegel , Jake Sullivan |FW: July IA Poll Results|<>

The Gephi 8.2 mail importer fails to create a node representing an email message.

I propose we cure that failure by taking the last field, here:

and the next to last field:

FW: July IA Poll Results

and putting them as id and label, respectively in a node list:; “FW: July IA Poll Results”;

As part of the transformation, we need to remove the < and > signs around the message ID, then add a ; to mark the end of the ID field and put double quotes (" ") around the subject to use it as a label. Then close the second field with another ;.

While we are talking about nodes, all the email addresses change from:

Oren Shur

to:; “Oren Shur”;

which are ID and label of the node list, respectively.

I could remove the < and > characters as part of the extraction script but will use sed at the command line instead.

Reminder: Always work on a copy of your data, never the original.

Then we need to create an edge list, one that represents the relationships between the email (as node) to the sender and receivers of the email (also nodes). For this first iteration, I’m going to use labels on the edges to distinguish between senders and receivers.

Assuming my first row of the edges file reads:

Source; Target; Role (I did not use “Type” because I suspect that is a controlled term for Gephi.)

Then the first few edges would read:;>; from;;>; to;;; to;;; to;;; to;

As you can see, this is going to be a “busy” graph! 😉

Filtering is going to play an important role in exploring this graph, so let’s add nodes that will help with that process.

I propose we add to the node list:

true; True
false; False

as id and labels.

Which means for the edge list we can have:; true; verify;

Do you have an opinion on the order, source/target for true/false?

Thinking this will enable us to filter nodes that have not been verified or to include only those that have failed verification.

For experimental purposes, I think we need to rework the date field:

2015-07-24 12:58:16-04:00

I would truncate that to:


and add such truncated dates to the node list:

2015-07-24; 2015-07-24;

as ID and label, respectively.

Then for the edge list:; 2015-07-24; date;

Reasoning that we can filter to include/exclude nodes based on dates, which if you add enough email boxes, could help visualize the reaction to and propagation of emails.
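A minimal sketch in Python of the truncation and of a date-range filter built on it. The function names and the sample timestamps are made up for illustration, apart from the one shown above.

```python
from datetime import date

def day_of(timestamp):
    """Truncate '2015-07-24 12:58:16-04:00' to a date object for 2015-07-24."""
    return date.fromisoformat(timestamp[:10])

def in_range(timestamp, start, end):
    """True if the timestamp's day falls within [start, end], inclusive."""
    return start <= day_of(timestamp) <= end

stamps = ["2015-07-24 12:58:16-04:00", "2015-11-21 17:15:25-05:00"]
july = [s for s in stamps if in_range(s, date(2015, 7, 1), date(2015, 7, 31))]
print(july)
```

Truncating to the day discards the time zone offset, so an email sent near midnight lands on the sender’s local date, which seems acceptable for a first experiment.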

Even assuming named participants in these emails have “deleted” their inboxes, there are always automatic backups. It’s just a question of persistence before the rest of this network can be fleshed out.

Oh, one last thing. You have probably noticed the Wikileaks “ID” that forms part of the filename?

9981 00045326.eml

The first part forms the end of a URL to link to the original post at Wikileaks.

Thus, in this example, 9981 becomes:

The general form being:

For the convenience of readers/users, I want to modify my earlier proposal for the email node list entry from:; “FW: July IA Poll Results”;

to:; “FW: July IA Poll Results”;;

Where the third field is “link.”

I am eliding over lots of relationships and subjects but I’m not reluctant to throw it all away and start over.

Your investment in a model isn’t lost by tossing the model, you learn something with every model you build.

Scripting underway, a post on that experience and the node/edge lists to follow later today.

Podesta/Clinton Emails: Filtering by Email Address (Pimping Bill Clinton)

Thursday, October 27th, 2016

The Bill Clinton, Inc. story reminds me of:

Although I steadfastly resist imagining either Bill or Hillary in that video. Just won’t go there!

Where a graph display can make a difference is that instead of just the one email/memo from Bill’s pimp, we can rapidly survey all of the emails in which he appears, in any role.


I ran that on Gephi 8.2 against podesta-release-1-18 but the results were:

Nodes 0, Edges 0.

Hmmm, there is something missing, possibly the CSV file?

I checked and podesta-release-1-18 has 393 emails where appears.

Could try to find the “right” way to filter on email addresses but for now, let’s take a dirty short-cut.

I created a directory to hold all emails with and ran into all manner of difficulties because the file names are plagued with spaces!

So much so that I unintentionally (read “by mistake”) saved all the posts from podesta-release-1-18 to a different folder than the ones from podesta-release-19.


Well, but there is a happy outcome and an illustration of yet another Gephi capability.

I built the first graph from the posts from podesta-release-1-18 and then, with that graph open, imported the posts from podesta-release-19 and appended those results to the open graph.

How cool is that!

Imagine doing that across data sets, assuming you paid close attention to identifiers, etc.

Sorry, back to the graphs, here is the random layout once the graphs were combined:


Applying the Yifan Hu network layout:


I ran network statistics on network diameter and applied colors based on betweenness:


And finally, adjusted the font and turned on the labels:


I have spent a fair amount of time just moving stuff about but imagine if you could interactively explore the emails, creating and trashing networks based on to:, from:, cc:, dates, content, etc.

The limits of Gephi imports were a major source of pain today.

I’m dodging those tomorrow in favor of creating node and adjacency tables with scripts.

PS: Don’t John Podesta and Doug Band look like two pimps in a pod? 😉

PPS: If you haven’t read the pimping Bill Clinton memo. (I think it has some other official title.)

No Frills Gephi (8.2) Import of Clinton/Podesta Emails (1-18)

Wednesday, October 26th, 2016

Using Gephi 8.2, you can create graphs of the Clinton/Podesta emails based on terms in subject lines or the body of the emails. You can interactively work with all 30K+ (as of today) emails and extract networks based on terms in the posts. No programming required. (Networks based on terms will appear tomorrow.)

If you have Gephi 8.2 (I can’t find the import spigot in 9.0 or 9.1), you can import the Clinton/Podesta Emails (1-18) for analysis as a network.

To save you the trouble of regressing to Gephi 8.2, I performed a no frills/default import and exported that file as podesta-1-18-network.gephi.gz.

Download and uncompress podesta-1-18-network.gephi.gz, then you can pick up at timemark 3.49.

Open the file (your location may differ):


Obligatory hair-ball graph visualization. 😉


Considerably less appealing than Jennifer Golbeck’s but be patient!

First step, Layout -> Yifan Hu. My results:


Second step, Network Diameter statistics (right side, run).

No visible impact on the graph, but now you can change the color and size of nodes in the graph. That is, they have attributes on which you can base the assignment of color and size.

Tutorial gotcha: Not one of Jennifer’s tutorials but I was watching a Gephi tutorial that skipped the part about running statistics on the graph prior to assignment of color and size. Or I just didn’t hear it. The menu options appear in documentation but you can’t access them unless and until you run network statistics or have attributes for the assignment of color and size. Run statistics first!

Next, assign colors based on betweenness centrality:


The densest node is John Podesta, but if you remove his node, rerun the network statistics and re-layout the graph, here is part of what results:


A no frills import of 31,819 emails results in a graph of 3,235 nodes and 11,831 edges.

That’s because nodes and edges combine (merge, to you topic map readers) when they have the same identifier or, for edges, are between the same nodes.

Subject to correction, when that combining/merging occurs, the properties on the respective nodes/edges are accumulated.
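To make the idea concrete, here is a minimal sketch in Python of merging on identifier with accumulated properties. It is not Gephi’s implementation, only an illustration of the behavior as I understand it; merge_nodes and the sample rows are made up.

```python
from collections import defaultdict

def merge_nodes(rows):
    """Merge (id, properties) rows: rows sharing an id become one node,
    and every distinct property value seen for that id is kept."""
    merged = defaultdict(lambda: defaultdict(set))
    for node_id, props in rows:
        for key, value in props.items():
            merged[node_id][key].add(value)
    # Convert the sets to sorted lists for stable, readable output.
    return {
        node_id: {key: sorted(values) for key, values in props.items()}
        for node_id, props in merged.items()
    }

rows = [
    ("jp", {"label": "John Podesta"}),
    ("jp", {"label": "Podesta, John"}),
    ("os", {"label": "Oren Shur"}),
]
graph_nodes = merge_nodes(rows)
print(graph_nodes)
```

Three input rows become two nodes, which is the same effect, on a small scale, as 31,819 emails collapsing into a few thousand nodes.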

Topic mappers already realize there are important subjects missing, some 31,819 of them. That is, the emails themselves don’t appear as nodes in the network by default.

Ian Robinson, Jim Webber & Emil Eifrem illustrate this lossy modeling in Graph Databases this way:


Modeling emails without the emails is rather lossy. 😉

Other nodes/subjects we might want:

  • Multiple to: emails – Is who was also addressed important?
  • Multiple cc: emails – Same question as with to:.
  • Date sent as properties? So evolution of network/emails can be modeled.
  • Capture “reply-to” for relationships between emails?

Other modeling concerns?

Bear in mind that we can suppress a large amount of the detail so you can interactively explore the graph and only zoom into/display data after finding interesting patterns.

Some helpful links: The email collection as bulk download, thanks to Michael Best, @NatSecGeek. Where you can grab a copy of Gephi 8.2.

Hair Ball Graphs

Friday, August 26th, 2016

An example of a non-useful “hair ball” graph visualization:


That image is labeled as “standard layout” at a site that offers this cohesion adapted layout alternative:


The full-size image is quite impressive.

If you were attempting to visualize vulnerabilities, which one would you pick?

Simit: A Language for Physical Simulation

Sunday, August 14th, 2016

Simit: A Language for Physical Simulation by Fredrik Kjolstad, et al.


With existing programming tools, writing high-performance simulation code is labor intensive and requires sacrificing readability and portability. The alternative is to prototype simulations in a high-level language like Matlab, thereby sacrificing performance. The Matlab programming model naturally describes the behavior of an entire physical system using the language of linear algebra. However, simulations also manipulate individual geometric elements, which are best represented using linked data structures like meshes. Translating between the linked data structures and linear algebra comes at significant cost, both to the programmer and to the machine. High-performance implementations avoid the cost by rephrasing the computation in terms of linked or index data structures, leaving the code complicated and monolithic, often increasing its size by an order of magnitude.

In this article, we present Simit, a new language for physical simulations that lets the programmer view the system both as a linked data structure in the form of a hypergraph and as a set of global vectors, matrices, and tensors depending on what is convenient at any given time. Simit provides a novel assembly construct that makes it conceptually easy and computationally efficient to move between the two abstractions. Using the information provided by the assembly construct, the compiler generates efficient in-place computation on the graph. We demonstrate that Simit is easy to use: a Simit program is typically shorter than a Matlab program; that it is high performance: a Simit program running sequentially on a CPU performs comparably to hand-optimized simulations; and that it is portable: Simit programs can be compiled for GPUs with no change to the program, delivering 4 to 20× speedups over our optimized CPU code.

Very deep sledding ahead but consider the contributions:

Simit is the first system that allows the development of physics code that is simultaneously:

Concise. The Simit language has Matlab-like syntax that lets algorithms be implemented in a compact, readable form that closely mirrors their mathematical expression. In addition, Simit matrices assembled from hypergraphs are indexed by hypergraph elements like vertices and edges rather than by raw integers, significantly simplifying indexing code and eliminating bugs.

Expressive. The Simit language consists of linear algebra operations augmented with control flow that let developers implement a wide range of algorithms ranging from finite elements for deformable bodies to cloth simulations and more. Moreover, the powerful hypergraph abstraction allows easy specification of complex geometric data structures.

Fast. The Simit compiler produces high-performance executable code comparable to that of hand-optimized end-to-end libraries and tools, as validated against the state-of-the-art SOFA [Faure et al. 2007] and Vega [Sin et al. 2013] real-time simulation frameworks. Simulations can now be written as easily as a traditional prototype and yet run as fast as a high-performance implementation without manual optimization.

Performance Portable. A Simit program can be compiled to both CPUs and GPUs with no additional programmer effort, while generating efficient code for each architecture. Where Simit delivers performance comparable to hand-optimized CPU code on the same processor, the same simple Simit program delivers roughly an order of magnitude higher performance on a modern GPU in our benchmarks, with no changes to the program.

Interoperable. Simit hypergraphs and program execution are exposed as C++ APIs, so developers can seamlessly integrate with existing C++ programs, algorithms, and libraries.
(emphasis in original)
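To get a feel for the "assembly" idea before wading into the paper, here is a rough illustration in plain Python. This is not Simit syntax and not the authors' code. It assumes a hypothetical chain of springs and builds a global stiffness matrix, indexed by vertex name rather than raw integers, from per-edge contributions.

```python
from collections import defaultdict

def assemble(springs):
    """Build a global matrix from per-edge local contributions.

    springs: iterable of (vertex_a, vertex_b, stiffness)
    Returns a nested dict K where K[u][v] is indexed by vertex name.
    """
    K = defaultdict(lambda: defaultdict(float))
    for a, b, k in springs:
        # Scatter the local 2x2 element matrix [[k, -k], [-k, k]] into K.
        K[a][a] += k
        K[b][b] += k
        K[a][b] -= k
        K[b][a] -= k
    return K

# Hypothetical three-point chain: p0 --2.0-- p1 --3.0-- p2
springs = [("p0", "p1", 2.0), ("p1", "p2", 3.0)]
K = assemble(springs)
print(K["p1"]["p1"])  # 5.0, both springs touch p1
```

The graph view (edges with a stiffness) and the linear algebra view (the matrix K) describe the same system, which is the pairing the abstract is talking about.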

Additional resources:

Getting Started

Simit mailing list

Source code (MIT license)


Node XL (641 Pins)

Friday, August 5th, 2016

Node XL

Just a quick sample:


That’s only a sample; another 629 await your viewing (perhaps more by the time you read this post).

I have a Pinterest account but this is the first set of pins I have chosen to follow.

Suggestions of similar visualization boards at Pinterest?


OnionRunner, ElasticSearch & Maltego

Wednesday, August 3rd, 2016

OnionRunner, ElasticSearch & Maltego by Adam Maxwell.

From the post:

Last week Justin Seitz over at released OnionRunner which is basically a python wrapper (because Python is awesome) for the OnionScan tool (

At the bottom of Justin’s blog post he wrote this:

For bonus points you can also push those JSON files into Elasticsearch (or modify to do so on the fly) and analyze the results using Kibana!

Always being up for a challenge I’ve done just that. The script outputs each scan result as a json file, you have two options for loading this into ElasticSearch. You can either load your results after you’ve run a scan or you can load them into ElasticSearch as a scan runs. Now this might sound scary but it’s not, lets tackle each option separately.

A great enhancement to Justin’s original OnionRunner!
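For the "load after the scan" option, a minimal sketch of the payload side in Python, using only the standard library. This is not Adam's script. The index name `onionscan` and the document fields are hypothetical, and I am assuming a recent Elasticsearch where the bulk action line needs only `_index` (older versions also expected a `_type`).

```python
import json

def bulk_body(docs, index):
    """Serialize documents as newline-delimited JSON for the _bulk endpoint."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document line
    return "\n".join(lines) + "\n"  # the bulk format requires a trailing newline

# Hypothetical scan results; real OnionScan output has many more fields.
scans = [
    {"hiddenService": "example1.onion", "webDetected": True},
    {"hiddenService": "example2.onion", "webDetected": False},
]
body = bulk_body(scans, "onionscan")
print(body)
```

The resulting string would be POSTed to the cluster's `/_bulk` endpoint with the `application/x-ndjson` content type.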

You will need a version of Maltego to perform the visualization as described. Not a bad idea to become familiar with Maltego in general.

Data is just data, until it is analyzed.


Visualizing your Titan graph database:…

Friday, June 17th, 2016

Visualizing your Titan graph database: An update by Marco Liberati.

From the post:

Last summer, we wrote a blog with our five simple steps to visualizing your Titan graph database with KeyLines. Since then TinkerPop has emerged from the Apache Incubator program with TinkerPop3, and the Titan team have released v1.0 of their graph database:

  • TinkerPop3 is the latest major reincarnation of the graph project, pulling together the multiple ventures into a single united ecosystem.
  • Titan 1.0 is the first stable release of the Titan graph database, based on the TinkerPop3 stack.

We thought it was about time we updated our five-step process, so here’s:

Not exactly five (5) steps because you have to acquire a KeyLines trial key, etc.

A great endorsement of the much improved installation process for TinkerPop3 and Titan 1.0.


Incubate No Longer! TinkerPop™!

Monday, May 23rd, 2016

The Apache Software Foundation Announces Apache® TinkerPop™ as a Top-Level Project

From the post:

The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today that Apache® TinkerPop™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project’s community and products have been well-governed under the ASF’s meritocratic process and principles.

Apache TinkerPop is a graph computing framework that provides developers the tools required to build modern graph applications in any application domain and at any scale.

“Graph databases and mainstream interest in graph applications have seen tremendous growth in recent years,” said Stephen Mallette, Vice President of Apache TinkerPop. “Since its inception in 2009, TinkerPop has been helping to promote that growth with its Open Source graph technology stack. We are excited to now do this same work as a top-level project within the Apache Software Foundation.”

As a graph computing framework for both real-time, transactional graph databases (OLTP) and batch analytic graph processors (OLAP), TinkerPop is useful for working with small graphs that fit within the confines of a single machine, as well as massive graphs that can only exist partitioned and distributed across a multi-machine compute cluster.

TinkerPop unifies these highly varied graph system models, giving developers less to learn, faster time to development, and less risk associated with both scaling their system and avoiding vendor lock-in.

In addition to that good news, the announcement also answers the inevitable question about scaling:

Apache TinkerPop is in use at organizations such as DataStax and IBM, among many others. is currently using TinkerPop and Gremlin to process its order fullfillment graph which contains approximately one trillion edges. (emphasis added)

A trillion edges. Unless you are a stealth Amazon, TinkerPop™ will scale for you.

Congratulations to the TinkerPop™ community!

MOOGI – The Film Discovery Engine

Wednesday, May 11th, 2016

MOOGI – The Film Discovery Engine

Not the most recent movie I have seen but under genre I entered:

movies about B.C.

Thinking that it would return (rather quickly):

One Million Years B.C. (1966)

Possibly just load on this alpha site but after a couple of minutes, I just reloaded the homepage.

Using “keyword,” just typing “B.C.” brought up a pick list where One Million Years B.C. (1966) was eighth in the list. Without any visible delay.

The keyword categories are interesting and many.

Learned a new word, canuxploitation! There is an entire site devoted to Canadian B-movies, i.e., Canuxploitation! – Your Complete Guide to Canadian B-Film.

You will recognize most of the other keywords.

If not, check the New York Times or the Washington Post and include the term plus “member of congress.” You will get several stories that will flesh out the meaning of “erotic,” “female nudity,” “drugs,” “prostitution,” “monster,” “hotel,” “adultery” and the like.

If search isn’t your strong point, try the “explore” option. You can search for movies “similar to” some named movie.

Just for grins, I typed in:

The Dirty Dozen. When I saw it during its first release, it had been given a “condemned” rating by the Catholic movie rating service. Had no redeeming qualities at all. No one should see it.

I miss those lists because they were great guides to what movies to go see! 😉

One of five (5) results was The Dirty Dozen: The Deadly Mission (1987).

When I chose that movie, the system failed so I closed out the window and tried again. The previously quick response now takes a good bit of time; I suspect load or alpha quality. (I will revisit fairly soon and update this report.)

In terms of aesthetics, they really should lose the hand in the background moving around with a remote control. Adds nothing to the experience other than annoyance.

The site is powered by Mindmaps, which means you are going to find Apache TinkerPop under the hood.


Panama Papers Import Scripts for Neo4j and Docker

Tuesday, May 10th, 2016

Panama Papers Import Scripts for Neo4j and Docker by Michael Hunger.

Michael’s import scripts enable you, too, to explore and visualize a subset of the Panama Papers data.

Thanks Michael!

Visual Searching with Google – One Example – Neo4j – Raspberry Pi

Tuesday, April 26th, 2016

Just to show I don’t spend too much time thinking of ways to gnaw on the ankles of Süddeutsche Zeitung (SZ), the hoarders of the Panama Papers, here is my experience with visual searching with Google today.

I saw this image on Twitter:


I assumed that cutting the “clutter” from around the cluster might produce a better result. Besides, the plastic separators looked (to me) to be standard and not custom made.

Here is my cropped image for searching:


Google responded this looks like: “water.” 😉

OK, so I tried cropping it more just to show the ports, thinking that might turn up similar port arrangements, here’s that image:


Google says: “machinery.” With a number of amusing “similar” images.

BTW, when I tried the full image, the first one, Google says: “electronics.”

OK, so much for Google image searching. What if I try?

Searching on neo4j cluster and raspberry pi (the most likely suspect), my first “hit” had this image:


Same height as the search image.

My seventh “hit” has this image:


Same height and logo as the search image. That’s Stefan Armbruster next to the cluster. (He does presentations on building the cluster, but I have yet to find a video of one of those presentations.)

My eighth “hit”:


Common wiring color (networking cable), height.

Definitely Raspberry Pi but I wasn’t able to uncover further details.

Very interested in seeing a video of Stefan putting one of these together!

Loading the Galaxy Network of the “Cosmic Web” into Neo4j

Saturday, April 23rd, 2016

Loading the Galaxy Network of the “Cosmic Web” into Neo4j by Michael Hunger.

Cypher script for loading “Cosmic Web” into Neo4j.

You remember “Cosmic Web:”



Nine Inch Gremlins

Saturday, April 23rd, 2016

Nine Inch Gremlins


Stephen Mallette writes:

On the back of TinkerPop 3.1.2-incubating comes TinkerPop 3.2.0-incubating. Yes – a dual release – an unprecedented and daring move you’ve come to expect and not expect from the TinkerPop clan! Be sure to review the upgrade documentation in full as you may find some changes that introduce some incompatibilities.

The release artifacts can be found at this location:

The online docs can be found here: (user docs) (upgrade docs) (core javadoc) (full javadoc)

The release notes are available here:

The Central Maven repo has sync’d as well:

Another impressive release!

In reading the documentation I discovered that Ketrina Yim is responsible for drawing Gremlin and his TinkerPop friends.

I was relieved to find that Marko was only responsible for the Gremlin/TinkerPop code/prose and not the graphics as well. That would be too much talent for any one person! 😉


Planet TinkerPop [+ 2 New Graph Journals]

Tuesday, April 12th, 2016

Planet TinkerPop

From the webpage:

Planet TinkerPop is a vendor-agnostic, community-driven site aimed at advancing graph technology in general and Apache TinkerPop™ in particular. Graph technology is used to manage, query, and analyze complex information topologies composed of numerous heterogenous relationships and is currently benefiting companies such as Amazon, Google, and Facebook. For all companies to ultimately adopt graph technology, vendor-agnostic graph standards and graph knowledge must be promulgated. For the former, TinkerPop serves as an Apache Software Foundation governed community that develops a standard graph data model (the property graph) and query language (Gremlin). Apache TinkerPop is a widely supported graph computing framework that has been adopted by leading graph system vendors and interfaced with by numerous graph-based applications across various industries. For educating the public on graphs, Planet TinkerPop’s Technology journal publishes articles about TinkerPop-related graph research and development. The Use Cases journal promotes articles on the industrial use of graphs and TinkerPop. The articles are contributed by members of the Apache TinkerPop community and additional contributions are welcomed and strongly encouraged. We hope you enjoy your time learning about graphs here at Planet TinkerPop.

If you are reading about Planet TinkerPop I can skip the usual “graphs are…” introductory comments. 😉

Planet TinkerPop is a welcome addition to the online resources on graphs in general and TinkerPop in particular.

So they aren’t buried in the prose, let me highlight two new journals at Planet TinkerPop:

TinkerPop Technology journal  publishes articles about TinkerPop-related graph research and development.

TinkerPop Use Cases journal  promotes articles on the industrial use of graphs and TinkerPop.

Both are awaiting your contributions!


PS: I prepended “TinkerPop” to the journal names and suggest an ISSN would be appropriate for both journals.

NSA-grade surveillance software: IBM i2 Analyst’s Notebook (Really?)

Tuesday, April 5th, 2016

I stumbled across Revealed: Denver Police Using NSA-Grade Surveillance Software which had this description of “NSA-grade surveillance software…:”

Intelligence gathered through Analyst’s Notebook is also used in a more active way to guide decision making, including with deliberate targeting of “networks” which could include loose groupings of friends and associates, as well as more explicit social organizations such as gangs, businesses, and potentially political organizations or protest groups. The social mapping done with Analyst’s Notebook is used to select leads, targets or points of intervention for future actions by the user. According to IBM, the i2 software allows the analyst to “use integrated social network analysis capabilities to help identify key individuals and relationships within networks” and “aid the decision-making process and optimize resource utilization for operational activities in network disruption, surveillance or influencing.” Product literature also boasts that Analyst’s Notebook “includes Social Network Analysis capabilities that are designed to deliver increased comprehension of social relationships and structures within networks of interest.”

Analyst’s Notebook is also used to conduct “call chaining” (show who is talking to who) and analyze telephone metadata. A software extension called Pattern Tracer can be used for “quickly identifying potential targets”. In the same vein, the Esri Edition of Analyst’s Notebook integrates powerful geo-spatial mapping, and allows the analyst to conduct “Pattern-of-Life Analysis” against a target. A training video for Analyst’s Notebook Esri Edition demonstrates the deployment of Pattern of Life Analysis in a military setting against an example target who appears to be a stereotyped generic Muslim terrorism suspect:

Perhaps I’m overly immune to IBM marketing pitches but I didn’t see anything in this post that could not be done with Python, R and standard visualization techniques.
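As a case in point, "call chaining" is a breadth-first search over call records. A minimal sketch in standard-library Python follows. The phone numbers are made up, the `call_chain` function is my own naming, and nothing here reflects how Analyst's Notebook works internally.

```python
from collections import defaultdict, deque

def call_chain(calls, start, max_hops):
    """Return {number: hops} for everyone within max_hops calls of start."""
    contacts = defaultdict(set)
    for caller, callee in calls:
        contacts[caller].add(callee)
        contacts[callee].add(caller)  # treat a call as an undirected link

    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for other in contacts[current]:
            if other not in hops:
                hops[other] = hops[current] + 1
                queue.append(other)
    return hops

# Hypothetical call metadata: (caller, callee) pairs.
calls = [
    ("555-0100", "555-0101"),
    ("555-0101", "555-0102"),
    ("555-0102", "555-0103"),
]
chain = call_chain(calls, "555-0100", max_hops=2)
print(chain)
```

Two hops out from 555-0100 reaches 555-0101 and 555-0102 but not 555-0103. Feed the result to any standard graph layout and you have the picture.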

I understand that IBM markets the i2 Analyst’s Notebook (and training too) as:

…deliver[ing] timely, actionable intelligence to help identify, predict, prevent and disrupt criminal, terrorist and fraudulent activities.

to a reported tune of over 2,500 organizations worldwide.

However, you have to bear in mind that the software alone isn’t delivering that value-add; the analyst plus the right data plus the IBM software is. That is, the software is at best only one third of what is required for meaningful results.

That insight seems to have gotten lost in IBM’s marketing pitch for the i2 Analyst’s Notebook and its use by the Denver police.

But to be fair, I have included below the horizontal bar, the complete list of features for the i2 Analyst’s Notebook.

Do you see any that can’t be duplicated with standard software?

I don’t.

That’s another reason to object to the Denver Police falling into the clutches of maintenance agreements/training on software that is likely irrelevant to their day to day tasks.

IBM® i2® Analyst’s Notebook® is a visual intelligence analysis environment that can optimize the value of massive amounts of information collected by government agencies and businesses. With an intuitive and contextual design it allows analysts to quickly collate, analyze and visualize data from disparate sources while reducing the time required to discover key information in complex data. IBM i2 Analyst’s Notebook delivers timely, actionable intelligence to help identify, predict, prevent and disrupt criminal, terrorist and fraudulent activities.

i2 Analyst’s Notebook helps organizations to:

Rapidly piece together disparate data

Identify key people, events, connections and patterns

Increase understanding of the structure, hierarchy and method of operation

Simplify the communication of complex data

Capitalize on rapid deployment that delivers productivity gains quickly

Be sure to leave a comment if you see “NSA-grade” capabilities. We would all like to know what those are.

Game of Thrones – Network Analysis

Thursday, March 31st, 2016


You can read the popular account of this network analysis of the Game of Thrones in Mathematicians mapped out every “Game of Thrones” relationship to find the main character by Adam Epstein or, you can try Network of Thrones by Andrew Beveridge and Jie Shan.

There are a number of choices you may want to re-visit if you explore the Game of Thrones as a graph/network, not the least of which is expanding the data beyond volume 3, characterizing the type of “relationships” (edges) found between characters and how you would capture the time aspect of the development of the “relationships” you do find.

Great work that will hopefully spur others to similar explorations.

Announcing the Structr Knowledge Graph

Friday, March 4th, 2016

Announcing the Structr Knowledge Graph by Alex Morgner.

From the post:

The Structr Knowledge Graph is the new one-stop resource base where all information about and around Structr are connected.

Besides the official manual, you will find Getting Started articles, FAQ, guides and tutorials, as well as links to external resources like StackOverflow questions, GitHub issues, or discussion threads.

The Knowledge Graph isn’t just another static content platform where information is stored once and then outdates over time. It is designed and built as a living structure, being updated not only by the Structr team but also semi-automatically by user activities in the support system.

By using a mixture of manual categorization and natural language processing, information is being extracted from the content origins to update and extend the graph. The SKG will replace our old documentation site

And of course, the SKG provides an interactive graph browser, full-text search and an article tree.

I was confused when I first saw this because I think of Structr as a knowledge graph so why the big splash? Then I saw a tweet saying 386 articles on and it suddenly made sense.

This is a knowledge graph about the knowledge graph software known as Structr.

OK, I’m straight now. I think. 😉

With a topic map it would be trivial to distinguish between “Structr Knowledge Graph” in the sense of using the Structr software versus a knowledge graph about Structr, which is also known as the Structr Knowledge Graph.

Momentary cognitive dissonance, well, not so momentary but I wasn’t devoting a lot of effort to it, but not a serious problem.

More serious when the cognitive dissonance is confusing a child’s name in transliterated Arabic with that of a sanctioned target being sought by a U.S. drone.

Graph Encryption: Going Beyond Encrypted Keyword Search [Subject Identity Based Encryption]

Wednesday, March 2nd, 2016

Graph Encryption: Going Beyond Encrypted Keyword Search by Xianrui Meng.

From the post:

Encrypted search has attracted a lot of attention from practitioners and researchers in academia and industry. In previous posts, Seny already described different ways one can search on encrypted data. Here, I would like to discuss search on encrypted graph databases which are gaining a lot of popularity.

1. Graph Databases and Graph Privacy

As today’s data is getting bigger and bigger, traditional relational database management systems (RDBMS) cannot scale to the massive amounts of data generated by end users and organizations. In addition, RDBMSs cannot effectively capture certain data relationships; for example in object-oriented data structures which are used in many applications. Today, NoSQL (Not Only SQL) has emerged as a good alternative to RDBMSs. One of the many advantages of NoSQL systems is that they are capable of storing, processing, and managing large volumes of structured, semi-structured, and even unstructured data. NoSQL databases (e.g., document stores, wide-column stores, key-value (tuple) store, object databases, and graph databases) can provide the scale and availability needed in cloud environments.

In an Internet-connected world, graph database have become an increasingly significant data model among NoSQL technologies. Social networks (e.g., Facebook, Twitter, Snapchat), protein networks, electrical grid, Web, XML documents, networked systems can all be modeled as graphs. One nice thing about graph databases is that they store the relations between entities (objects) in addition to the entities themselves and their properties. This allows the search engine to navigate both the data and their relationships extremely efficiently. Graph databases rely on the node-link-node relationship, where a node can be a profile or an object and the edge can be any relation defined by the application. Usually, we are interested in the structural characteristics of such a graph databases.

What do we mean by the confidentiality of a graph? And how to do we protect it? The problem has been studied by both the security and database communities. For example, in the database and data mining community, many solutions have been proposed based on graph anonymization. The core idea here is to anonymize the nodes and edges in the graph so that re-identification is hard. Although this approach may be efficient, from a security point view it is hard to tell what is achieved. Also, by leveraging auxiliary information, researchers have studied how to attack this kind of approach. On the other hand, cryptographers have some really compelling and provably-secure tools such as ORAM and FHE (mentioned in Seny’s previous posts) that can protect all the information in a graph database. The problem, however, is their performance, which is crucial for databases. In today’s world, efficiency is more than running in polynomial time; we need solutions that run and scale to massive volumes of data. Many real world graph datasets, such as biological networks and social networks, have millions of nodes, some even have billions of nodes and edges. Therefore, besides security, scalability is one of main aspects we have to consider.

2. Graph Encryption

Previous work in encrypted search has focused on how to search encrypted documents, e.g., doing keyword search, conjunctive queries, etc. Graph encryption, on the other hand, focuses on performing graph queries on encrypted graphs rather than keyword search on encrypted documents. In some cases, this makes the problem harder since some graph queries can be extremely complex. Another technical challenge is that the privacy of nodes and edges needs to be protected but also the structure of the graph, which can lead to many interesting research directions.

Graph encryption was introduced by Melissa Chase and Seny in [CK10]. That paper shows how to encrypt graphs so that certain graph queries (e.g., neighborhood, adjacency and focused subgraphs) can be performed (though the paper is more general as it describes structured encryption). Seny and I, together with Kobbi Nissim and George Kollios, followed this up with a paper last year [MKNK15] that showed how to handle more complex graphs queries.

Apologies for the long quote but I thought this topic might be new to some readers. Xianrui goes on to describe a solution for efficient queries over encrypted graphs.

Chase and Kamara remark in Structured Encryption and Controlled Disclosure, CK10:

To address this problem we introduce the notion of structured encryption. A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key. In addition, the query process reveals no useful information about either the query or the data. An important consideration in this context is the efficiency of the query operation on the server side. In fact, in the context of cloud storage, where one often works with massive datasets, even linear time operations can be infeasible. (emphasis in original)

With just a little nudging, their:

A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key.

could be re-stated as:

A subject identity encryption scheme leaves out merging data in such a way that the resulting topic map can only be queried with knowledge of the subject identity merging key.

You may have topics that represent diagnoses such as cancer, AIDS, sexual contacts, but if none of those can be associated with individuals who are also topics in the map, there is no more disclosure than census results for a metropolitan area and a list of the citizens therein.

That is, you are missing the critical merging data that would link up (associate) any diagnosis with a given individual.

Multi-property subject identities would make the problem even harder, to say nothing of conferring properties on the basis of supplied properties as part of the merging process.

One major benefit of a subject identity based approach is that without the merging key, any data set, however sensitive the information, is just a data set, until you have the basis for solving its subject identity riddle.
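A rough sketch of the idea in Python, with loud caveats: this is my illustration, not the structured encryption scheme of CK10, and it swaps in a keyed hash (HMAC-SHA256) as one possible way to derive merging identifiers. The key, names, registry numbers and diagnosis labels are all hypothetical, and any real resistance to attack depends on the key staying secret and being long and random.

```python
import hashlib
import hmac

def subject_id(merging_key: bytes, *properties: str) -> str:
    """Derive an opaque identifier from identity properties and a secret key."""
    message = "\x1f".join(properties).encode("utf-8")
    return hmac.new(merging_key, message, hashlib.sha256).hexdigest()

def merge(people, records, merging_key):
    """Associate records with people; only works with the right merging key."""
    lookup = {subject_id(merging_key, name, reg): name for name, reg in people}
    return [(lookup[sid], label) for sid, label in records if sid in lookup]

key = b"hypothetical-merging-key"
people = [("Jane Doe", "reg-0042"), ("John Roe", "reg-0043")]

# The map proper stores only opaque identifiers next to each diagnosis.
records = [(subject_id(key, "Jane Doe", "reg-0042"), "diagnosis-A")]

print(merge(people, records, key))           # [('Jane Doe', 'diagnosis-A')]
print(merge(people, records, b"wrong-key"))  # []
```

Without the key, the records and the list of people sit side by side with nothing to connect them, much like the census results and the list of citizens.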

PS: With the usual caveats of not using social security numbers, birth dates and the like as your subject identity properties. At least not in the map proper. I can think of several ways to generate keys for merging that would be resistant to even brute force attacks.

Ping me if you are interested in pursuing that on a data set.

networkD3: D3 JavaScript Network Graphs from R

Monday, February 15th, 2016

networkD3: D3 JavaScript Network Graphs from R by Christopher Gandrud, JJ Allaire, & Kent Russell.

From the post:

This is a port of Christopher Gandrud’s R package d3Network for creating D3 network graphs to the htmlwidgets framework. The htmlwidgets framework greatly simplifies the package’s syntax for exporting the graphs, improves integration with RStudio’s Viewer Pane, RMarkdown, and Shiny web apps. See below for examples.

It currently supports three types of network graphs:

I haven’t compared this to GraphViz but the Sankey diagram option is impressive!