Archive for the ‘Networks’ Category

Network analysis of Game of Thrones family ties [A Timeless Network?]

Monday, May 15th, 2017

Network analysis of Game of Thrones family ties by Shirin Glander.

From the post:

In this post, I am exploring network analysis techniques in a family network of major characters from Game of Thrones.

Not surprisingly, we learn that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones; they also connect many of the storylines and are central parts of the narrative.

The basis for this network is Kaggle’s Game of Throne dataset (character-deaths.csv). Because most family relationships were missing in that dataset, I added the missing information in part by hand (based on A Wiki of Ice and Fire) and by scraping information from the Game of Thrones wiki. You can find the full code for how I generated the network on my Github page.

Glander improves network data for the Game of Thrones and walks you through the use of R to analyze that network.

It’s useful work and will repay close study.

Network analysis can used with all social groups, activists, bankers, hackers, members of Congress (U.S.), terrorists, etc.

But just as Ned Stark has no relationship with dire wolves when the story begins, networks of social groups develop, change, evolve if you will, over time.

Moreover, events, interactions, involving one or more members of the network, occur in time sequence. A social network that fails to capture those events and their sequencing, from one or more points of view, is a highly constrained network.

A useful network as Glander demonstrates but one cannot answer simple questions about the order in which characters gained knowledge that a particular character hurled another character from a very high window.

If I were investigating say a leak of NSA cybertools, time sequencing like that would be one of my top priorities.


Network datasets (@Ognyanova)

Tuesday, May 9th, 2017

Network datasets by Katherine Ognyanova.

From the post:

Since I started posting network tutorials on this site, people will occasionally write to ask me about the included example datasets. I also get e-mails from people asking where they might find network data to use for a project or in teaching. Seems like a good idea to post a quick reply here.

The datasets included in my tutorials are mostly synthetic (or trimmed and heavily manipulated) in order to illustrate various visualization aspects in a manageable way. Feel free to use those datasets (citing or linking to the source is appreciated), but keep in mind that they are artificially generated and not the result of actual data collection. When I do use empirical data, the download files include documentation (if the data is collected by me) or clearly point to the source (if the data was collected by someone else).

If you are looking for network data, large or small, there are a number of excellent open online repositories that you can take a look at. Below is a short list (feel free to e-mail me if you have other good links, and I will add them here).

Links to ten (10) collections of network datasets, plus suggestions on software for collecting and analyzing social network data.

Considering following her: @Ognyanova. See her website, for additional resources.

No Frills Gephi (8.2) Import of Clinton/Podesta Emails (1-18)

Wednesday, October 26th, 2016

Using Gephi 8.2, you can create graphs of the Clinton/Podesta emails based on terms in subject lines or the body of the emails. You can interactively work with all 30K+ (as of today) emails and extract networks based on terms in the posts. No programming required. (Networks based on terms will appear tomorrow.)

If you have Gephi 8.2 (I can’t find the import spigot in 9.0 or 9.1), you can import the Clinton/Podesta Emails (1-18) for analysis as a network.

To save you the trouble of regressing to Gephi 8.2, I performed a no frills/default import and exported that file as podesta-1-18-network.gephi.gz.

Download and uncompress podesta-1-18-network.gephi.gz, then you can pickup at timemark 3.49.

Open the file (your location may differ):


Obligatory hair-ball graph visualization. 😉


Considerably less appealing that Jennifer Golbeck’s but be patient!

First step, Layout -> Yifan Hu. My results:


Second step, Network Diameter statistics (right side, run).

No visible impact on the graph but, now you can change the color and size of nodes in the graph. That is they have attributes on which you can base the assignment of color and size.

Tutorial gotcha: Not one of Jennifer’s tutorials but I was watching a Gephi tutorial that skipped the part about running statistics on the graph prior to assignment of color and size. Or I just didn’t hear it. The menu options appear in documentation but you can’t access them unless and until you run network statistics or have attributes for the assignment of color and size. Run statistics first!

Next, assign colors based on betweenness centrality:


The densest node is John Podesta, but if you remove his node, rerun the network statistics and re-layout the graph, here is part of what results:


A no frills import of 31,819 emails results in a graph of 3235 nodes and 11,831 edges.

That’s because nodes and edges combine (merge to you topic map readers) when they have the same identifier or for edges are between the same nodes.

Subject to correction, when that combining/merging occurs, the properties on the respective nodes/edges are accumulated.

Topic mappers already realize there are important subjects missing, some 31,819 of them. That is the emails themselves don’t by default appear as nodes in the network.

Ian Robinson, Jim Webber & Emil Eifrem illustrate this lossy modeling in Graph Databases this way:


Modeling emails without the emails is rather lossy. 😉

Other nodes/subjects we might want:

  • Multiple to: emails – Is who was also addressed important?
  • Multiple cc: emails – Same question as with to:.
  • Date sent as properties? So evolution of network/emails can be modeled.
  • Capture “reply-to” for relationships between emails?

Other modeling concerns?

Bear in mind that we can suppress a large amount of the detail so you can interactively explore the graph and only zoom into/display data after finding interesting patterns.

Some helpful links: The email collection as bulk download, thanks to Michael Best, @NatSecGeek. Where you can grab a copy of Gephi 8.2.

Network structure and resilience of Mafia syndicates

Wednesday, May 4th, 2016

Network structure and resilience of Mafia syndicates by Santa Agrestea, Salvatore Catanesea, Pasquale De Meoc, Emilio Ferrara, Giacomo Fiumaraa.


In this paper we present the results of our study of Sicilian Mafia organizations using social network analysis. The study investigates the network structure of a Mafia syndicate, describing its evolution and highlighting its plasticity to membership-targeting interventions and its resilience to disruption caused by police operations. We analyze two different datasets dealing with Mafia gangs that were built by examining different digital trails and judicial documents that span a period of ten years. The first dataset includes the phone contacts among suspected individuals, and the second captures the relationships among individuals actively involved in various criminal offenses. Our report illustrates the limits of traditional investigative methods like wiretapping. Criminals high up in the organization hierarchy do not occupy the most central positions in the criminal network, and oftentimes do not appear in the reconstructed criminal network at all. However, we also suggest possible strategies of intervention. We show that, although criminal networks (i.e., the network encoding mobsters and crime relationships) are extremely resilient to different kinds of attacks, contact networks (i.e., the network reporting suspects and reciprocated phone calls) are much more vulnerable, and their analysis can yield extremely valuable insights.

Studying the vulnerabilities identified here may help you strengthen your own networks against similar analysis.

To give you the perspective of the authors:

Due to its normative structure as well as strong ties with finance, entrepreneurs and politicians, Mafia has now risen to prominence as a worldwide criminal organization by controlling many illegal activities like the trade of cocaine, money laundering or illegal military weapon trafficking [4].

They say that as though it is a bad thing. As Neal Stephenson says in Snow Crash, the Mafia is just another franchise. 😉

Understanding the model others expect enables you to expose a model that doesn’t match their expectations.

Think of it as hiding in plain sight.

NSA Grade – Network Visualization with Gephi

Sunday, April 10th, 2016

Network Visualization with Gephi by Katya Ognyanova.

It’s not possible to cover Gephi in sixteen (16) pages but you will wear out more than one printed copy of these sixteen (16) pages as you become experienced with Gephi.

This version is from a Gephi workshop at Sunbelt 2016.

Katya‘s homepage offers a wealth of network visualization posts and extensive use of R.

Follow her at @Ognyanova.

PS: Gephi equals or exceeds visualization capabilities in use by the NSA, depending upon your skill as an analyst and the quality of the available data.

Game of Thrones – Network Analysis

Thursday, March 31st, 2016


You can read the popular account of this network analysis of the Game of Thrones in Mathematicians mapped out every “Game of Thrones” relationship to find the main character by Adam Epstein or, you can try Network of Thrones by Andrew Beveridge and Jie Shan.

There are a number of choices you may want to re-visit if you explore the Game of Thrones as a graph/network, not the least of which is expanding the data beyond volume 3, characterizing the type of “relationships” (edges) found between characters and how you would capture the time aspect of the development of the “relationships” you do find.

Great work that will hopefully spur others to similar explorations.

Visualizing the Clinton Email Network in R

Wednesday, February 24th, 2016

Visualizing the Clinton Email Network in R by Bob Rudis.

From the post:

This isn’t a post about politics. I do have opinions about the now infamous e-mail server (which will no doubt come out here), but when the WSJ folks made it possible to search the Clinton email releases I though it would be fun to get the data into R to show how well the igraph and ggnetwork packages could work together, and also show how to use svgPanZoom to make it a bit easier to poke around the resulting hairball network.

NOTE: There are a couple “Assignment” blocks in here. My Elements of Data Science students are no doubt following the blog by now so those are meant for you 🙂 Other intrepid readers can ignore them.

A great walk through on importing, analyzing, and visualizing any email archive, not just Hillary’s.

You will quickly find that “…connecting the dots…” isn’t as useful as the intelligence community would have you believe.

Yes, yes! There is a call to Papa John’s! Oh, that’s not a code name, that’s a pizza place. (Even suspected terrorists have to eat.)

Great to have the dots. Great to have connections. Not so great if that is all that you have.

I found a number of other interesting posts at Bob’s blog:

Including: Dairy-free Parker House Rolls! I bake fairly often so am susceptible to this sort of posting. Looks very good!

networkD3: D3 JavaScript Network Graphs from R

Monday, February 15th, 2016

networkD3: D3 JavaScript Network Graphs from R by Christopher Gandrud, JJ Allaire, & Kent Russell.

From the post:

This is a port of Christopher Gandrud’s R package d3Network for creating D3 network graphs to the htmlwidgets framework. The htmlwidgets framework greatly simplifies the package’s syntax for exporting the graphs, improves integration with RStudio’s Viewer Pane, RMarkdown, and Shiny web apps. See below for examples.

It currently supports three types of network graphs:

I haven’t compared this to GraphViz but the Sankey diagram option is impressive!

The Social-Network Illusion That Tricks Your Mind – (Terrorism As Majority Illusion)

Friday, December 25th, 2015

The Social-Network Illusion That Tricks Your Mind

From the post:

One of the curious things about social networks is the way that some messages, pictures, or ideas can spread like wildfire while others that seem just as catchy or interesting barely register at all. The content itself cannot be the source of this difference. Instead, there must be some property of the network that changes to allow some ideas to spread but not others.

Today, we get an insight into why this happens thanks to the work of Kristina Lerman and pals at the University of Southern California. These people have discovered an extraordinary illusion associated with social networks which can play tricks on the mind and explain everything from why some ideas become popular quickly to how risky or antisocial behavior can spread so easily.

Network scientists have known about the paradoxical nature of social networks for some time. The most famous example is the friendship paradox: on average your friends will have more friends than you do.

This comes about because the distribution of friends on social networks follows a power law. So while most people will have a small number of friends, a few individuals have huge numbers of friends. And these people skew the average.

Here’s an analogy. If you measure the height of all your male friends. you’ll find that the average is about 170 centimeters. If you are male, on average, your friends will be about the same height as you are. Indeed, the mathematical notion of “average” is a good way to capture the nature of this data.

But imagine that one of your friends was much taller than you—say, one kilometer or 10 kilometers tall. This person would dramatically skew the average, which would make your friends taller than you, on average. In this case, the “average” is a poor way to capture this data set.

If that has you interested, see:

The Majority Illusion in Social Networks by Kristina Lerman, Xiaoran Yan, Xin-Zeng Wu.


Social behaviors are often contagious, spreading through a population as individuals imitate the decisions and choices of others. A variety of global phenomena, from innovation adoption to the emergence of social norms and political movements, arise as a result of people following a simple local rule, such as copy what others are doing. However, individuals often lack global knowledge of the behaviors of others and must estimate them from the observations of their friends’ behaviors. In some cases, the structure of the underlying social network can dramatically skew an individual’s local observations, making a behavior appear far more common locally than it is globally. We trace the origins of this phenomenon, which we call “the majority illusion,” to the friendship paradox in social networks. As a result of this paradox, a behavior that is globally rare may be systematically overrepresented in the local neighborhoods of many people, i.e., among their friends. Thus, the “majority illusion” may facilitate the spread of social contagions in networks and also explain why systematic biases in social perceptions, for example, of risky behavior, arise. Using synthetic and real-world networks, we explore how the “majority illusion” depends on network structure and develop a statistical model to calculate its magnitude in a network.

Research has not reached the stage of enabling the manipulation of public opinion to reflect the true rarity of terrorist activity in the West.

That being the case, being factually correct that Western fear of terrorism is a majority illusion isn’t as profitable as product tying to that illusion.

‘Outsiders’ Crack A 50-Year-Old Math Problem

Saturday, December 5th, 2015

‘Outsiders’ Crack A 50-Year-Old Math Problem by Erica Klarreich.

From the post:

In 2008, Daniel Spielman told his Yale University colleague Gil Kalai about a computer science problem he was working on, concerning how to “sparsify” a network so that it has fewer connections between nodes but still preserves the essential features of the original network.

Network sparsification has applications in data compression and efficient computation, but Spielman’s particular problem suggested something different to Kalai. It seemed connected to the famous Kadison-Singer problem, a question about the foundations of quantum physics that had remained unsolved for almost 50 years.

Over the decades, the Kadison-Singer problem had wormed its way into a dozen distant areas of mathematics and engineering, but no one seemed to be able to crack it. The question “defied the best efforts of some of the most talented mathematicians of the last 50 years,” wrote Peter Casazza and Janet Tremain of the University of Missouri in Columbia, in a 2014 survey article.

As a computer scientist, Spielman knew little of quantum mechanics or the Kadison-Singer problem’s allied mathematical field, called C*-algebras. But when Kalai, whose main institution is the Hebrew University of Jerusalem, described one of the problem’s many equivalent formulations, Spielman realized that he himself might be in the perfect position to solve it. “It seemed so natural, so central to the kinds of things I think about,” he said. “I thought, ‘I’ve got to be able to prove that.’” He guessed that the problem might take him a few weeks.

Instead, it took him five years. In 2013, working with his postdoc Adam Marcus, now at Princeton University, and his graduate student Nikhil Srivastava, now at the University of California, Berkeley, Spielman finally succeeded. Word spread quickly through the mathematics community that one of the paramount problems in C*-algebras and a host of other fields had been solved by three outsiders — computer scientists who had barely a nodding acquaintance with the disciplines at the heart of the problem.

Why all the excitement?

The proof of the Kadison-Singer problem implies that all the constructions in its dozen incarnations can, in principle, be carried out—quantum knowledge can be extended to full quantum systems, networks can be decomposed into electrically similar ones, matrices can be broken into simpler chunks. The proof won’t change what quantum physicists do, but it could have applications in signal processing, since it implies that collections of vectors used to digitize signals can be broken down into smaller frames that can be processed faster. The theorem “has potential to affect some important engineering problems,” Casazza said.

Just so you know, the same people who are saying it will be years before practical results emerge from this breakthrough are the same ones who assumed the answer to this problem was negative. 😉

I’m not saying techniques based on this work will be in JavaScript libraries next year but without trying, they never will be.


I first saw this in a post by Lars Marius Garshol

Why you should understand (a little) about TCP

Sunday, November 22nd, 2015

Why you should understand (a little) about TCP by Julia Evans.

From the post:

This isn’t about understanding everything about TCP or reading through TCP/IP Illustrated. It’s about how a little bit of TCP knowledge is essential. Here’s why.

When I was at the Recurse Center, I wrote a TCP stack in Python (and wrote about what happens if you write a TCP stack in Python). This was a fun learning experience, and I thought that was all.

A year later, at work, someone mentioned on Slack “hey I’m publishing messages to NSQ and it’s taking 40ms each time”. I’d already been thinking about this problem on and off for a week, and hadn’t gotten anywhere.

A little background: NSQ is a queue that you send to messages to. The way you publish a message is to make an HTTP request on localhost. It really should not take 40 milliseconds to send a HTTP request to localhost. Something was terribly wrong. The NSQ daemon wasn’t under high CPU load, it wasn’t using a lot of memory, it didn’t seem to be a garbage collection pause. Help.

Then I remembered an article I’d read a week before called In search of performance – how we shaved 200ms off every POST request. In that article, they talk about why every one of target=”_blank” their POST requests were taking 200 extra milliseconds. That’s.. weird. Here’s the key paragraph from the post

Julia’s posts are generally useful and entertaining to read and this one is no exception.

As Julia demonstrates in this post, TCP isn’t as low-level as you might think. 😉

The other lesson to draw here is the greater your general knowledge of how things work, the more likely you can fix (or cause) problems with a minimal amount of effort.

Learn a little TCP with Julia and keep bookmarked deeper resources should the need arise.

A Certain Tendency Of The Database Community

Tuesday, October 27th, 2015

A Certain Tendency Of The Database Community by Christopher Meiklejohn.

From the post:


We posit that striving for distributed systems that provide “single system image” semantics is fundamentally flawed and at odds with how systems operate in the physical world. We realize the database as an optimization of this system: a required, essential optimization in practice that facilitates central data placement and ease of access to participants in a system. We motivate a new model of computation that is designed to address the problems of computation over “eventually consistent” information in a large-scale distributed system.

Eventual Consistency

When we think about the world we live in, we do not usually say it is eventually consistent, for this is a term usually applied to computing systems, made up of multiple machines, that have to operate with shared information.

Eventual consistency is a consistency model for replicated, shared state. A consistency model is a contract between an application developer and a system that application will run on. A contract between a developer and a system states the following: given the developer follows the rules defined by the system, certain outcomes from the system are guaranteed. This makes it possible for developers to build successful applications, for without this contract, applications would have no guarantee that the actions they perform would have a correct outcome.

(italics in original)

A very accessible and great read on “eventual consistency.”

Christopher points out that any “state” of knowledge is a snapshot under a given set of constraints:

For instance, if the leading researchers on breast cancer were to document the state-of-the-art in a book, as the document is being written it would no longer reflect the state-of-the-art. The collective knowledge of this group is always changing, and as long as we continue to rewrite the document it will only be approaching the combined knowledge of the group. We can think of this somewhat formally: if we had a way to view the group’s knowledge as a omniscient observer and we represent that knowledge as a linear function, the recorded text would be asymptotic to function of the sum of global knowledge.

He concludes with this question:

…Can we build computational abstractions that allow devices to communicate peer-to-peer, acknowledging the true source of truth for a particular piece of information and scale to the amount of information that exists, not only between all computers in a planetary-scale distributed system, but all entities in the universe[?]

I’m not sure about “all entities in the universe,” or even a “planetary-scale distributed system,” but we do know that Netware Directory Services (NDS) (now eDirectory) was a replicated, distributed, sharded database with eventual convergence that was written in 1993.

We have had the computational abstractions for a replicated, distributed, sharded database with eventual convergence for a number of years.

I would adjust Christopher’s “true source of truth,” for “source of truth as defined by users,” to avoid the one-world-truth position that crippled the Semantic Web even before FOL and RDF syntax arrived.

Network motif discovery: A GPU approach [subgraph isomorphism and topics]

Sunday, August 30th, 2015

Network motif discovery: A GPU approach by Wenqing Lin ; Xiaokui Xiao ; Xing Xie ; Xiao-Li Li.


The identification of network motifs has important applications in numerous domains, such as pattern detection in biological networks and graph analysis in digital circuits. However, mining network motifs is computationally challenging, as it requires enumerating subgraphs from a real-life graph, and computing the frequency of each subgraph in a large number of random graphs. In particular, existing solutions often require days to derive network motifs from biological networks with only a few thousand vertices. To address this problem, this paper presents a novel study on network motif discovery using Graphical Processing Units (GPUs). The basic idea is to employ GPUs to parallelize a large number of subgraph matching tasks in computing subgraph frequencies from random graphs, so as to reduce the overall computation time of network motif discovery. We explore the design space of GPU-based subgraph matching algorithms, with careful analysis of several crucial factors that affect the performance of GPU programs. Based on our analysis, we develop a GPU-based solution that (i) considerably differs from existing CPU-based methods, and (ii) exploits the strengths of GPUs in terms of parallelism while mitigating their limitations in terms of the computation power per GPU core. With extensive experiments on a variety of biological networks, we show that our solution is up to two orders of magnitude faster than the best CPU-based approach, and is around 20 times more cost-effective than the latter, when taking into account the monetary costs of the CPU and GPUs used.

For those of you who need to dodge the paywall: Network motif discovery: A GPU approach

From the introduction (topic map comments interspersed):

Given a graph G, a network motif in G is a subgraph g of G, such that g appears much more frequently in G than in random graphs whose degree distributions are similar to that of G [1]. The identification of network motifs finds important applications in numerous domains. For example, network motifs are used (i) in system biology to predict protein interactions in biological networks and discover functional sub-units [2], (ii) in electronic engineering to understand the characteristics of circuits [3], and (iii) in brain science to study the functionalities of brain networks [4].

Unlike a topic map considered as graph G, the network motif problem is to discover subgraphs of G. In a topic map, the subgraph constituting a topic and its defined isomophism with other topics, is declarable.

…Roughly speaking, all existing techniques adopt a common two-phase framework as follows:

Subgraph Enumeration: Given a graph G and a parameter k, enumerate the subgraphs g of G with k vertices each;

Frequency Estimation:

Rather than enumerating subgraph g of G (a topic map), we need only collect those subgraphs for the isomorphism test.

…To compute the frequency of g in a random graph G, however, we need to derive the number of subgraphs of G that are isomorphic to g – this requires a large number of subgraph isomorphism tests [14], which are known to be computationally expensive.

Footnote 14 is a reference to: Practical graph isomorphism, II by Brendan D. McKay, Adolfo Piperno.

See also: nauty and Traces by Brendan McKay and Adolfo Piperno.

nauty and Traces are programs for computing automorphism groups of graphs and digraphs [*]. They can also produce a canonical label. They are written in a portable subset of C, and run on a considerable number of different systems.

This is where topic map subgraph g isomorphism issue intersects with the more general subgraph isomorphism case.

Any topic can have properties (nee internal occurrences) in addition to those thought to make it “isomorphic” to another topic. Or play roles in associations that are not played by other topics representing the same subject.

Motivated by the deficiency of existing work, we present an in-depth study on efficient solutions for network motif discovery. Instead of focusing on the efficiency of individual subgraph isomorphism tests, we propose to utilize Graphics Processing Units (GPUs) to parallelize a large number of isomorphism tests, in order to reduce the computation time of the frequency estimation phase. This idea is intuitive, and yet, it presents a research challenge since there is no existing algorithm for testing subgraph isomorphisms on GPUs. Furthermore, as shown in Section III, existing CPU-based algorithms for subgraph isomorphism tests cannot be translated into efficient solutions on GPUs, as the characteristics of GPUs make them inherently unsuitable for several key procedures used in CPU-based algorithms.

The punch line is that the authors present a solution that:

…is up to two orders of magnitude faster than the best CPU-based approach, and is around 20 times more cost-effective than the latter, when taking into account the monetary costs of the CPU and GPUs used.

Since topic isomophism is defined as opposed to discovered, this looks like a fruitful avenue to explore for topic map engine performance.

Dynamical Systems on Networks: A Tutorial

Thursday, May 14th, 2015

Dynamical Systems on Networks: A Tutorial by Mason A. Porter and James P. Gleeson.


We give a tutorial for the study of dynamical systems on networks. We focus especially on “simple” situations that are tractable analytically, because they can be very insightful and provide useful springboards for the study of more complicated scenarios. We briefly motivate why examining dynamical systems on networks is interesting and important, and we then give several fascinating examples and discuss some theoretical results. We also briefly discuss dynamical systems on dynamical (i.e., time-dependent) networks, overview software implementations, and give an outlook on the field.

At thirty-nine (39) pages and two hundred and sixty-three references, the authors leave the reader with an overview of the field and the tools to go further.

I am intrigued by the closer by the authors:

Finally, many networks are multiplex (i.e., include multiple types of edges) or have other multilayer features [16, 136]. The existence of multiple layers over which dynamics can occur and the possibility of both structural and dynamical correlations between layers offers another rich set of opportunities to study dynamical systems on networks. The investigation of dynamical systems on multilayer networks is only in its infancy, and this area is also loaded with a rich set of problems [16, 136, 144, 205].

Topic maps can have multiple type of edges and multiple layers.

For further reading on those topics see:

The structure and dynamics of multilayer networks by S. Boccaletti, G. Bianconi, R. Criado, C.I. del Genio, J. Gómez-Gardeñes, M. Romance, I. Sendiña-Nadal, Z. Wang, M. Zanin.


In the past years, network theory has successfully characterized the interaction among the constituents of a variety of complex systems, ranging from biological to technological, and social systems. However, up until recently, attention was almost exclusively given to networks in which all components were treated on equivalent footing, while neglecting all the extra information about the temporal- or context-related properties of the interactions under study. Only in the last years, taking advantage of the enhanced resolution in real data sets, network scientists have directed their interest to the multiplex character of real-world systems, and explicitly considered the time-varying and multilayer nature of networks. We offer here a comprehensive review on both structural and dynamical organization of graphs made of diverse relationships (layers) between its constituents, and cover several relevant issues, from a full redefinition of the basic structural measures, to understanding how the multilayer nature of the network affects processes and dynamics.

Multilayer Networks by Mikko Kivelä, Alexandre Arenas, Marc Barthelemy, James P. Gleeson, Yamir Moreno, Mason A. Porter.


In most natural and engineered systems, a set of entities interact with each other in complicated patterns that can encompass multiple types of relationships, change in time, and include other types of complications. Such systems include multiple subsystems and layers of connectivity, and it is important to take such “multilayer” features into account to try to improve our understanding of complex systems. Consequently, it is necessary to generalize “traditional” network theory by developing (and validating) a framework and associated tools to study multilayer systems in a comprehensive fashion. The origins of such efforts date back several decades and arose in multiple disciplines, and now the study of multilayer networks has become one of the most important directions in network science. In this paper, we discuss the history of multilayer networks (and related concepts) and review the exploding body of work on such networks. To unify the disparate terminology in the large body of recent work, we discuss a general framework for multilayer networks, construct a dictionary of terminology to relate the numerous existing concepts to each other, and provide a thorough discussion that compares, contrasts, and translates between related notions such as multilayer networks, multiplex networks, interdependent networks, networks of networks, and many others. We also survey and discuss existing data sets that can be represented as multilayer networks. We review attempts to generalize single-layer-network diagnostics to multilayer networks. We also discuss the rapidly expanding research on multilayer-network models and notions like community structure, connected components, tensor decompositions, and various types of dynamical processes on multilayer networks. We conclude with a summary and an outlook.

This may have been where we collectively went wrong in marketing topic maps. Yes, yes it is true that topic maps could do multilayer networks but network theory has made $billions with an overly simplistic model that bears little resemblance to reality.

As computation resources improve and closer to reality models, at least somewhat closer, become popular, something between simplistic networks and the full generality of topic maps could be successful.

Hunting Changes in Complex Networks (Changes in Networks)

Thursday, April 16th, 2015

While writing the Methods for visualizing dynamic networks post, I remembered a technique that the authors didn’t discuss.

What if only one node in a complex network was different? That is all of the other nodes and edges remained fixed while one node and it edges changed? How easy would that be to visualize?

If that sounds like an odd use case, it’s not. In fact, the discovery of Pluto in the 1930’s was made using a blink comparator exactly for that purpose.


This is Cyrus Tombaugh using a blink comparator which shows the viewer two images, quickly alternating between them. The images are of the same parts of the night sky and anything that has changed with be quickly noticed by the human eye.


Select the star field image to get a larger view and the gif will animate as though seen through a blink comparator. Do you see Pluto? (These are images of the original discovery plates.)

If not, see these with Pluto marked by a large arrow in each one.

This wonderful material on Pluto came from: Beyond the Planets – the discovery of Pluto

All of that was to interest you in reading: GrepNova: A tool for amateur supernova hunting by Frank Dominic.

From the article:

This paper presents GrepNova, a software package which assists amateur supernova hunters by allowing new observations of galaxies to be compared against historical library images in a highly automated fashion. As each new observation is imported, GrepNova automatically identifies a suitable comparison image and rotates it into a common orientation with the new image. The pair can then be blinked on the computer’s display to allow a rapid visual search to be made for stars in outburst. GrepNova has been in use by Tom Boles at his observatory in Coddenham, Suffolk since 2005 August, where it has assisted in the discovery of 50 supernovae up to 2011 October.

That’s right, these folks are searching for supernovas in other galaxies, each of which consists of millions of stars, far denser than most contact networks.

The download information for GrepNova has changed since the article was published:

I don’t have any phone metadata to try the experiment on but with a graph of contacts, the usual contacts will simply be background and new contacts will jump off the screen at you.

A great illustration of why prior searching techniques remain relevant to modern “information busy” visualizations.

Methods for visualizing dynamic networks (Parts 1 and 2)

Thursday, April 16th, 2015

Methods for visualizing dynamic networks Part 1

Methods for visualizing dynamic networks Part 2

From part 1:

The challenge of visualizing the evolution of connected data through time has kept academics and data scientists busy for years. Finding a way to convey the added complexity of a temporal element without overwhelming the end user with it is not easy.

Whilst building the KeyLines Time Bar – our component for visualizing dynamic networks – we spent a great deal of time appraising the existing temporal visualization options available.

In this blog post, we’ve collated some of the most popular ways of visualizing dynamic graphs through time. Next week, we’ll share some of the more creative and unusual options.

Not a comprehensive survey but eight (8) ways to visualize dynamic networks that you will find interesting.

Others that you would add to this list?

Newly Discovered Networks among Different Diseases…

Monday, February 9th, 2015

Newly Discovered Networks among Different Diseases Reveal Hidden Connections by Veronique Greenwood and Quanta Magazine.

From the post:

Stefan Thurner is a physicist, not a biologist. But not long ago, the Austrian national health insurance clearinghouse asked Thurner and his colleagues at the Medical University of Vienna to examine some data for them. The data, it turned out, were the anonymized medical claims records—every diagnosis made, every treatment given—of most of the nation, which numbers some 8 million people. The question was whether the same standard of care could continue if, as had recently happened in Greece, a third of the funding evaporated. But Thurner thought there were other, deeper questions that the data could answer as well.

In a recent paper in the New Journal of Physics, Thurner and his colleagues Peter Klimek and Anna Chmiel started by looking at the prevalence of 1,055 diseases in the overall population. They ran statistical analyses to uncover the risk of having two diseases together, identifying pairs of diseases for which the percentage of people who had both was higher than would be expected if the diseases were uncorrelated—in other words, a patient who had one disease was more likely than the average person to have the other. They applied statistical corrections to reduce the risk of drawing false connections between very rare and very common diseases, as any errors in diagnosis will get magnified in such an analysis. Finally, the team displayed their results as a network in which the diseases are nodes that connect to one another when they tend to occur together.

The style of analysis has uncovered some unexpected links. In another paper, published on the scientific preprint site, Thurner’s team confirmed a controversial connection between diabetes and Parkinson’s disease, as well as unique patterns in the timing of when diabetics develop high blood pressure. The paper in the New Journal of Physics generated additional connections that they hope to investigate further.

Every medical claim for almost eight (8) million people would make a very dense graph. Yes?

When you look at the original papers, notice that the researchers did not create a graph that held all their data. In the New Journal of Physics paper, only the diseases appear to demonstrate their clustering and the patients not at all. In the paper, another means is used to demonstrate the risk of specific diseases and the two types (DM1, DM2) of diabetes.

I think the lesson here is that despite being “network” data, that isn’t determinative for presentation or analysis of data.

Graph-Tool – Update!

Wednesday, February 4th, 2015

Graph-Tool – Update!

From the webpage:

Graph-tool is an efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks). Contrary to most other python modules with similar functionality, the core data structures and algorithms are implemented in C++, making extensive use of template metaprogramming, based heavily on the Boost Graph Library. This confers it a level of performance that is comparable (both in memory usage and computation time) to that of a pure C/C++ library.

A new version of graph-tool is available for downloading!

A graph based analysis of the 2016 U.S. Budget is going to require all the speed you can muster!


Stellar Navigation Using Network Analysis

Sunday, December 7th, 2014

Stellar Navigation Using Network Analysis by Caleb Jones.

To give you an idea of where this post ends up:

From the post:

This has been the funnest and most challenging network analysis and visualization I have done to date. As I've mentioned before, I am a huge space fan. One of my early childhood fantasies was the idea of flying instantly throughout the universe exploring all the different planets, stars, nebulae, black holes, galaxies, etc. The idea of a (possibly) infinite universe with inexhaustible discoveries to be made has kept my interest and fascination my whole life. I identify with the sentiment expressed by Carl Sagan in his book Pale Blue Dot:

In the last ten thousand years, an instant in our long history, we’ve abandoned the nomadic life. For all its material advantages, the sedentary life has left us edgy, unfulfilled. The open road still softly calls like a nearly forgotten song of childhood. Your own life, or your band’s, or even your species’ might be owed to a restless few—drawn, by a craving they can hardly articulate or understand, to undiscovered lands and new worlds.

Herman Melville, in Moby Dick, spoke for wanderers in all epochs and meridians: “I am tormented with an everlasting itch for things remote. I love to sail forbidden seas…”

Maybe it’s a little early. Maybe the time is not quite yet. But those other worlds— promising untold opportunities—beckon.

Silently, they orbit the Sun, waiting.

Fair warning: If you aren’t already a space enthusiast, this project may well turn you into one!

Distance and relative location are only two (2) facts that are known for stars within eight (8) light-years. What other facts or resources would you connect to the stars in these networks?

Visualizing Website Pathing With Network Graphs

Monday, September 8th, 2014

Visualizing Website Pathing With Network Graphs by Randy Zwitch.

From the post:

Last week, version 1.4 of RSiteCatalyst was released, and now it’s possible to get site pathing information directly within R. Now, it’s easy to create impressive looking network graphs from your Adobe Analytics data using RSiteCatalyst and d3Network. In this blog post, I will cover simple and force-directed network graphs, which show the pairwise representation between pages. In a follow-up blog post, I will show how to visualize longer paths using Sankey diagrams, also from the d3Network package.

Great technical details and examples but also worth the read for:

I’m not going to lie, all three of these diagrams are hard to interpret. Like wordclouds, network graphs can often be visually interesting, yet difficult to ascertain any concrete information. Network graphs also have the tendency to reinforce what you already know (you or someone you know designed your website, you should already have a feel for its structure!).

Randy does spot some patterns but working out what those patterns “mean” remain for further investigation.

Hairball graph visualizations can be a starting point for the hard work that extracts actionable intelligence.

Stanford Large Network Dataset Collection

Saturday, July 26th, 2014

Stanford Large Network Dataset Collection by Jure Leskovec.

From the webpage:

SNAP networks are also availalbe from UF Sparse Matrix collection. Visualizations of SNAP networks by Tim Davis.

If you need software to go with these datasets, consider Stanford Network Analysis Platform (SNAP)

Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consists of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.

The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.

A Python interface is available for SNAP.

I first saw this at: Stanford Releases Large Network Datasets by Ryan Swanstrom.

Network Data (And Merging Graphs)

Wednesday, June 11th, 2014

Network Data by Mark Newman.

From the webpage:

This page contains links to some network data sets I’ve compiled over the years. All of these are free for scientific use to the best of my knowledge, meaning that the original authors have already made the data freely available, or that I have consulted the authors and received permission to the post the data here, or that the data are mine. If you make use of any of these data, please cite the original sources.

The data sets are in GML format. For a description of GML see here. GML can be read by many network analysis packages, including Gephi and Cytoscape. I’ve written a simple parser in C that will read the files into a data structure. It’s available here. There are many features of GML not supported by this parser, but it will read the files in this repository just fine. There is a Python parser for GML available as part of the NetworkX package here and another in the igraph package, which can be used from C, Python, or R. If you know of or develop other software (Java, C++, Perl, R, Matlab, etc.) that reads GML, let me know.

I count sixteen (16) data sets and seven (7) collections of data sets.

Reminded me of a tweet I saw today:

Glimpse Conference

It’s used to be the social graph, then the interest-graph. Now, w/ social shopping it’s all about the taste graph. (emphasis added)

That’s three very common graphs and we all belong to networks or have interests that could be represented as still others.

After all the labor that goes into the composition of a graph, Mr. Normalization Graph would say we have to re-normalize these graphs to use them together.

That sounds like a bad plan. To me, reduplicating work that has already been done is always a bad plan.

If we could merge nodes and edges of two or more graphs together, then we can leverage the prior work on both graphs.

Not to mention that after merging, the unified graph could be searched, visualized and explored with less capable graph software and techniques.

Something to keep in mind.

I first saw this in a tweet by Steven Strogatz.

Community detection in networks:…

Thursday, June 5th, 2014

Community detection in networks: structural clusters versus ground truth by Darko Hric, Richard K. Darst, and, Santo Fortunato.


Algorithms to find communities in networks rely just on structural information and search for cohesive subsets of nodes. On the other hand, most scholars implicitly or explicitly assume that structural communities represent groups of nodes with similar (non-topological) properties or functions. This hypothesis could not be verified, so far, because of the lack of network datasets with ground truth information on the classification of the nodes. We show that traditional community detection methods fail to find the ground truth clusters in many large networks. Our results show that there is a marked separation between structural and annotated clusters, in line with recent findings. That means that either our current modeling of community structure has to be substantially modified, or that annotated clusters may not be recoverable from topology alone.

Deeply interesting work if you are trying to detect “subjects” by clustering nodes in a network.

I would heed the warning that typology may not accurately represent hidden information.

Beyond this particular case, I would test any assumption that some known factor represents an unknown factor(s) for any data set. Better than the results surprise you than your client.

I first saw this in a tweet by Brian Keegan.

PS: As you already know, “ground truth” depends upon your point of view. Don’t risk your work on the basis of someone else’s “ground truth.”

Community Detection in Graphs — a Casual Tour

Tuesday, May 20th, 2014

Community Detection in Graphs — a Casual Tour by Jeremy Kun.

From the post:

Graphs are among the most interesting and useful objects in mathematics. Any situation or idea that can be described by objects with connections is a graph, and one of the most prominent examples of a real-world graph that one can come up with is a social network.

Recall, if you aren’t already familiar with this blog’s gentle introduction to graphs, that a graph G is defined by a set of vertices V, and a set of edges E, each of which connects two vertices. For this post the edges will be undirected, meaning connections between vertices are symmetric.

One of the most common topics to talk about for graphs is the notion of a community. But what does one actually mean by that word? It’s easy to give an informal definition: a subset of vertices C such that there are many more edges between vertices in C than from vertices in C to vertices in V - C (the complement of C). Try to make this notion precise, however, and you open a door to a world of difficult problems and open research questions. Indeed, nobody has yet come to a conclusive and useful definition of what it means to be a community. In this post we’ll see why this is such a hard problem, and we’ll see that it mostly has to do with the word “useful.” In future posts we plan to cover some techniques that have found widespread success in practice, but this post is intended to impress upon the reader how difficult the problem is.

Thinking that for some purposes, communities of nodes could well be a subject in a topic map. But we would have to be able to find them. And Jeremy says that’s a hard problem.

Looking forward to more posts on communities in graphs from Jeremy.

High-Performance Browser Networking

Monday, May 12th, 2014

High-Performance Browser Networking by Ilya Grigorik.

From the foreword:

In High Performance Browser Networking, Ilya explains many whys of networking: Why latency is the performance bottleneck. Why TCP isn’t always the best transport mechanism and UDP might be your better choice. Why reusing connections is a critical optimization. He then goes even further by providing specific actions for improving networking performance. Want to reduce latency? Terminate sessions at a server closer to the client. Want to increase connection reuse? Enable connection keep-alive. The combination of understanding what to do and why it matters turns this knowledge into action.

Ilya explains the foundation of networking and builds on that to introduce the latest advances in protocols and browsers. The benefits of HTTP 2.0 are explained. XHR is reviewed and its limitations motivate the introduction of Cross-Origin Resource Sharing. Server-Sent Events, WebSockets, and WebRTC are also covered, bringing us up to date on the latest in browser networking.

Viewing the foundation and latest advances in networking from the perspective of performance is what ties the book together. Performance is the context that helps us see the why of networking and translate that into how it affects our website and our users. It transforms abstract specifications into tools that we can wield to optimize our websites and create the best user experience possible. That’s important. That’s why you should read this book.

Network latency may be responsible for a non-responsive app but can you guess who the user is going to blame?

Right in one, the app!

“Not my fault” isn’t a line item on any bank deposit form.

You or someone on your team needs to be tasked with performance, including reading High-Performance Browser Networking.

I first saw this in a tweet by Jonas Bonér

Self organising hypothesis networks

Saturday, May 10th, 2014

Self organising hypothesis networks: a new approach for representing and structuring SAR knowledge by Thierry Hanser, et al. (Journal of Cheminformatics 2014, 6:21)



Combining different sources of knowledge to build improved structure activity relationship models is not easy owing to the variety of knowledge formats and the absence of a common framework to interoperate between learning techniques. Most of the current approaches address this problem by using consensus models that operate at the prediction level. We explore the possibility to directly combine these sources at the knowledge level, with the aim to harvest potentially increased synergy at an earlier stage. Our goal is to design a general methodology to facilitate knowledge discovery and produce accurate and interpretable models.


To combine models at the knowledge level, we propose to decouple the learning phase from the knowledge application phase using a pivot representation (lingua franca) based on the concept of hypothesis. A hypothesis is a simple and interpretable knowledge unit. Regardless of its origin, knowledge is broken down into a collection of hypotheses. These hypotheses are subsequently organised into hierarchical network. This unification permits to combine different sources of knowledge into a common formalised framework. The approach allows us to create a synergistic system between different forms of knowledge and new algorithms can be applied to leverage this unified model. This first article focuses on the general principle of the Self Organising Hypothesis Network (SOHN) approach in the context of binary classification problems along with an illustrative application to the prediction of mutagenicity.


It is possible to represent knowledge in the unified form of a hypothesis network allowing interpretable predictions with performances comparable to mainstream machine learning techniques. This new approach offers the potential to combine knowledge from different sources into a common framework in which high level reasoning and meta-learning can be applied; these latter perspectives will be explored in future work.

One interesting feature of this publication is a graphic abstract:


Assuming one could control the length of the graphic abstracts, that would be an interesting feature for conference papers.

What should be the icon for repeating old news before getting to the new stuff? 😉

Among a number of good points in this paper, see in particular:

  • Distinction between SOHN and “a Galois lattice used in Formal Concept
    Analysis [19] (FCA)” (at page 10).
  • Discussion of the transparency of this approach at page 21.

In a very real sense, announcing an answer to a medical question may be welcome, but it isn’t very informative. Nor will it enable others to advance the medical arts.

Other domains where answers are important but how you arrived at an answer is equally important if not more so?

Experimental CS – Networks

Friday, May 2nd, 2014

Design and analysis of experiments in networks: Reducing bias from interference by Dean Eckles, Brian Karrer, and, Johan Ugander.


Estimating the effects of interventions in networks is complicated when the units are interacting, such that the outcomes for one unit may depend on the treatment assignment and behavior of many or all other units (i.e., there is interference). When most or all units are in a single connected component, it is impossible to directly experimentally compare outcomes under two or more global treatment assignments since the network can only be observed under a single assignment. Familiar formalism, experimental designs, and analysis methods assume the absence of these interactions, and result in biased estimators of causal effects of interest. While some assumptions can lead to unbiased estimators, these assumptions are generally unrealistic, and we focus this work on realistic assumptions. Thus, in this work, we evaluate methods for designing and analyzing randomized experiments that aim to reduce this bias and thereby reduce overall error. In design, we consider the ability to perform random assignment to treatments that is correlated in the network, such as through graph cluster randomization. In analysis, we consider incorporating information about the treatment assignment of network neighbors. We prove sufficient conditions for bias reduction through both design and analysis in the presence of potentially global interference. Through simulations of the entire process of experimentation in networks, we measure the performance of these methods under varied network structure and varied social behaviors, finding substantial bias and error reductions. These improvements are largest for networks with more clustering and data generating processes with both stronger direct effects of the treatment and stronger interactions between units.

Deep sledding but that is to be expected as CS matures and abandons simplistic models, such as non-interaction between units in a network.

While I was reading the abstract, it occurred to me that merges that precipitate other merges could be said to cause interaction between topics.

Since the authors found error reduction in networks with as few as 1,000 vertices, you should not wait until you are building very large topic maps to take this paper into account.

Network Analysis and the Law:…

Tuesday, March 25th, 2014

Network Analysis and the Law: Measuring the Legal Importance of Precedents at the U.S. Supreme Court by James H. Fowler, et al.


We construct the complete network of 26,681 majority opinions written by the U.S. Supreme Court and the cases that cite them from 1791 to 2005. We describe a method for using the patterns in citations within and across cases to create importance scores that identify the most legally relevant precedents in the network of Supreme Court law at any given point in time. Our measures are superior to existing network-based alternatives and, for example, offer information regarding case importance not evident in simple citation counts. We also demonstrate the validity of our measures by showing that they are strongly correlated with the future citation behavior of state courts, the U.S. Courts of Appeals, and the U.S. Supreme Court. In so doing, we show that network analysis is a viable way of measuring how central a case is to law at the Court and suggest that it can be used to measure other legal concepts.

Danny Bickson pointed this paper out in: Spotlight: Ravel Law – introducing graph analytics to law research.

Interesting paper but remember that models are just that, models. Subsets of a more complex reality.

For example, I don’t know of any models of the Supreme Court (U.S.) that claim to be able to predict The switch in time that saved nine. If you don’t know the story, it makes really interesting reading. I won’t spoil the surprise but you will come away feeling the law is less “fixed” than you may have otherwise thought.

I commend this paper to you but if you need of legal advice, it’s best to consult an attorney and not an model.

Mapping Twitter Topic Networks:…

Thursday, February 20th, 2014

Mapping Twitter Topic Networks: From Polarized Crowds to Community Clusters by Marc A. Smith, Lee Rainie, Ben Shneiderman and Itai Himelboim.

From the post:

Conversations on Twitter create networks with identifiable contours as people reply to and mention one another in their tweets. These conversational structures differ, depending on the subject and the people driving the conversation. Six structures are regularly observed: divided, unified, fragmented, clustered, and inward and outward hub and spoke structures. These are created as individuals choose whom to reply to or mention in their Twitter messages and the structures tell a story about the nature of the conversation.

If a topic is political, it is common to see two separate, polarized crowds take shape. They form two distinct discussion groups that mostly do not interact with each other. Frequently these are recognizably liberal or conservative groups. The participants within each separate group commonly mention very different collections of website URLs and use distinct hashtags and words. The split is clearly evident in many highly controversial discussions: people in clusters that we identified as liberal used URLs for mainstream news websites, while groups we identified as conservative used links to conservative news websites and commentary sources. At the center of each group are discussion leaders, the prominent people who are widely replied to or mentioned in the discussion. In polarized discussions, each group links to a different set of influential people or organizations that can be found at the center of each conversation cluster.

While these polarized crowds are common in political conversations on Twitter, it is important to remember that the people who take the time to post and talk about political issues on Twitter are a special group. Unlike many other Twitter members, they pay attention to issues, politicians, and political news, so their conversations are not representative of the views of the full Twitterverse. Moreover, Twitter users are only 18% of internet users and 14% of the overall adult population. Their demographic profile is not reflective of the full population. Additionally, other work by the Pew Research Center has shown that tweeters’ reactions to events are often at odds with overall public opinion— sometimes being more liberal, but not always. Finally, forthcoming survey findings from Pew Research will explore the relatively modest size of the social networking population who exchange political content in their network.

Great study on political networks but all the more interesting for introducing an element of sanity into discussions about Twitter.

At a minimum, Twitter having 18% of all Internet users and 14% of the overall adult population casts serious doubt on metrics using Twitter to rate software popularity. (“It’s all we have” is a pretty lame excuse for using bad metrics.)

Not to say it isn’t important to mine Twitter data for what content it holds but at the same time to remember Twitter isn’t the world.

I first saw this at Mapping Twitter Topic Networks: From Polarized Crowds to Community Clusters by FullTextReports.


Wednesday, January 8th, 2014

BIIIG : Enabling Business Intelligence with Integrated Instance Graphs by André Petermann, Martin Junghanns, Robert Müller, Erhard Rahm.


We propose a new graph-based framework for business intelligence called BIIIG supporting the flexible evaluation of relationships between data instances. It builds on the broad availability of interconnected objects in existing business information systems. Our approach extracts such interconnected data from multiple sources and integrates them into an integrated instance graph. To support specific analytic goals, we extract subgraphs from this integrated instance graph representing executed business activities with all their data traces and involved master data. We provide an overview of the BIIIG approach and describe its main steps. We also present initial results from an evaluation with real ERP data.

Very interesting paper because on one hand it talks about merging data from heterogeneous data sets and at the same time claims to be using Neo4j.

In case you didn’t know, Neo4j enforces normalization and doesn’t have a concept of merging nodes. (True, Cypher has a “merge” operator but it doesn’t “merge” nodes in any meaningful sense of the word. Either a node is matched or a new node is created. Not how I interpret “merge.”)

It took more than one read but in puzzling over:

For integrated objects we can merge the properties from the sources. For the example in Fig. 2, we can combine employees objects with CIT.employees.erp_empl_number = ERP.EmplyeeTable.number and merge their properties from both sources (name, degree, dob, address, phone).

I realized the authors were producing a series of graphs where only the final version of the graph has the “merged” nodes. If you notice, the nodes are created first and then populated with associations, which resolves the question of using different pointers from the original sources.

The authors also point out that Neo4j cannot manage sets of graphs. I had overlooked that point. That is a fairly severe limitation.

Do spend some time at the Database Group Leipzig. There are several other recent papers that look very interesting.