Graph analysis is becoming a key component of data science. Many things can be modeled as graphs, and social networks are one of the most obvious examples.
In this post, I am going to show how you can visualize your own LinkedIn graph, using the LinkedIn API and Gephi, a very nice piece of software for working on this type of data. If you don’t have it yet, just go to http://gephi.org/ and download it now!
My objective is simply to look at my connections (the “nodes” or “vertices” of the graph), see how they relate to each other (the “edges”) and find clusters of strongly connected users (“communities”). This somewhat emulates what is already available in the InMaps data product, but, hey, it is cool to do it ourselves, no?
The first thing to do for this graph analysis is to be able to query LinkedIn via its API. You really don’t want to get the data by hand… The API uses the OAuth authentication protocol, which lets an application make queries on behalf of a user. So go to https://www.linkedin.com/secure/developer and register a new application. Fill in the form as required, and in the OAuth part, use this redirect URL for instance:
Great introduction to Gephi!
As a bonus, reinforces the lesson that ETL isn’t required to re-use data.
ETL may be required in some cases but in a world of data APIs those are getting fewer and fewer.
Think of it this way: Non-ETL data access means someone else is paying for maintenance, backups, hardware, etc.
How much of your IT budget is supporting duplicated data?
This is the first article about the future Gephi 0.9 version. Our objective is to prepare the ground for a future 1.0 release and focus on solving some of the most difficult problems. It all starts with the core of Gephi and today we’re giving a preview of the upcoming changes in that area. In fact, we’re rewriting the core modules from scratch to improve performance and stability and to add new features. The core modules represent and store the graph and attributes in memory so that they are available to the rest of the application. Rewriting Gephi’s core is like replacing the engine of a truck and involves adapting a lot of interconnected pieces. Gephi’s current graph structure engine was designed in 2009 and hasn’t changed much over multiple releases. Although it works, it doesn’t have the level of quality we want for Gephi 1.0 and needs to be overhauled. The aim is to complete the new implementation and integrate it in the 0.9 version.
I’ve used the Google Maps API to visualize a relatively large network collected from Steam Community members. The data is collected from public player profiles that Valve exposes through their Steam Web API. For each player, their links to friends and to the Steam Groups they belong to are collected. This creates a social network which can be visualized using Gephi.
…
The graph consists of 212,600 nodes and 4,045,203 edges. Before filtering outliers and low/high degree nodes, there are approximately 800,000 groups and over 11 million users.
The Gephi Blueprints plugin allows a user to import graph-data from any graph database that implements the Tinkerpop Blueprints generic graph API. Out of the box, the plugin provides support for TinkerGraph, Neo4j, OrientDB, Dex and RexterGraph. Additionally, it also provides support for the FluxGraph temporal graph database.
Excellent!
Not to mention having a short list of interesting graph software to boot!
Today Stan Nikolov, who just finished his master’s at MIT studying information diffusion networks, walked us through one particular theoretical model of information diffusion which tries to predict under what conditions an idea stops spreading based on a network’s structure (from the popular Easley and Kleinberg Network book). Stan also gathered a huge amount of Twitter data, processed it using Pig scripts, and graphed the results using Gephi. The video lecture below shows you some great visualizations of the spreading behavior of the data!
(video omitted)
The slides in his Lecture Notes let you see the Pig scripts in more detail.
Another deeply awesome lecture from Marti’s class on Twitter and big data.
Also an example of the level of analysis that a Twitter stream will need to withstand to avoid “imperial entanglements.”
A couple of weeks ago, I came across Gephi, a desktop application for visualising networks.
And quite by chance, a day or two later I was asked about any tools I knew of that could visualise and help analyse social network activity around an OU course… which I take as a reasonable justification for exploring exactly what Gephi can do.
So, after a few false starts, here’s what I’ve learned so far…
First up, we need to get some graph data – netvizz – facebook to gephi suggests that the netvizz facebook app can be used to grab a copy of your Facebook network in a format that Gephi understands, so I installed the app, downloaded my network file, and then uninstalled the app… (can’t be too careful).
Once Gephi is launched (and updated, if it’s a new download – you’ll see an updates prompt in the status bar along the bottom of the Gephi window, right hand side) Open… the network file you downloaded.
If you like part 1 as an introduction to Gephi, be sure to take in:
In Getting Started With Gephi Network Visualisation App – My Facebook Network, Part I I described how to get up and running with the Gephi network visualisation tool using social graph data pulled out of my Facebook account. In this post, I’ll explore some of the tools that Gephi provides for exploring a network in a more structured way.
If you aren’t familiar with Gephi, and if you haven’t read Part I of this series, I suggest you do so now…
…done that…?
Okay, so where do we begin? As before, I’m going to start with a fresh worksheet, and load my Facebook network data, downloaded via the netvizz app, into Gephi, but as an undirected graph this time! So far, so exactly the same as last time. Just to give me some pointers over the graph, I’m going to set the node size to be proportional to the degree of each node (that is, the number of people each person is connected to).
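To make "node size proportional to degree" concrete, here is a minimal sketch in plain Python. It is not how Gephi does it internally; the edge list, the function names and the 10 to 50 size range are all assumptions made up for illustration.

```python
from collections import Counter

def node_degrees(edges):
    """Count the degree of every node in an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)

def sizes_from_degree(degrees, min_size=10.0, max_size=50.0):
    """Map each degree linearly onto [min_size, max_size] (assumed range)."""
    lo, hi = min(degrees.values()), max(degrees.values())
    span = (hi - lo) or 1  # avoid division by zero when all degrees are equal
    return {
        node: min_size + (deg - lo) / span * (max_size - min_size)
        for node, deg in degrees.items()
    }

# Hypothetical friendship edges, standing in for a downloaded network file.
edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
degrees = node_degrees(edges)
sizes = sizes_from_degree(degrees)
```

The best-connected person ("ann" here) gets the largest marker and the least connected the smallest, which is the visual pointer the quoted post is after.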
You will find lots more to explore with Gephi but this should give you a good start.
When I started running some years ago, I bought a Garmin Forerunner 405. It’s a nifty little device that tracks GPS coordinates while you are running. After a run, the device can be synchronized by uploading your data to the Garmin Connect website. Based upon the tracked time and GPS coordinates, the Garmin Connect website provides you with a detailed overview of your run, including distance, average pace, elevation loss/gain and lap splits. It also visualizes your run, by overlaying the tracked course on Bing and/or Google maps. Pretty cool! One of my last runs can be found here.
Apart from simple aggregations such as total distance and average speed, the Garmin Connect website provides little or no support for gaining deeper insights into all of my runs. As I often run the same course, it would be interesting to calculate my average pace at specific locations. By combining the data of all of my courses, I could deduce frequently encountered locations. Finally, could there be a correlation between my average pace and my distance from home? In order to come up with answers to these questions, I will import my running data into a Neo4J Spatial datastore. Neo4J Spatial extends the Neo4J Graph Database with the necessary tools and utilities to store and query spatial data in your graph models. For visualizing my running data, I will make use of Gephi, an open-source visualization and manipulation tool that allows users to interactively browse and explore graphs.
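The pace versus distance-from-home question can be sketched without any database at all. The snippet below is plain Python, not Neo4J Spatial; the sample format `(lat, lon, pace)`, the 1 km bucket width and the function names are assumptions for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def pace_by_distance_from_home(home, samples, bucket_km=1.0):
    """Average pace (min/km) per distance-from-home bucket.

    samples is an iterable of (lat, lon, pace_min_per_km) tuples.
    """
    totals = {}
    for lat, lon, pace in samples:
        bucket = int(haversine_km(home[0], home[1], lat, lon) // bucket_km)
        pace_sum, count = totals.get(bucket, (0.0, 0))
        totals[bucket] = (pace_sum + pace, count + 1)
    return {bucket: s / c for bucket, (s, c) in totals.items()}

# Made-up track points around a made-up home location.
home = (0.0, 0.0)
samples = [(0.0, 0.001, 5.0), (0.0, 0.002, 6.0), (0.0, 0.02, 7.0)]
averages = pace_by_distance_from_home(home, samples)
```

A spatial datastore earns its keep once the questions become "which points lie within this region", but bucketing like this is enough to see whether pace drifts with distance.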
Suggestion: If you want to know where you go and/or how you spend your time, try tracking both for a week. Faithfully record how you spend your time, reading, commuting, TV, exercise, work, etc., in say 30 minute intervals. Also keep track of your physical location. Don’t try to be overly precise, use big buckets. And no peeking as to how the week is shaping up. I think you will be surprised at how your week shapes up.
Neo4j is a powerful, award-winning graph database written in Java. It can store billions of nodes and relationships and allows very fast query/traversal. Today we are releasing a new version of the Neo4j Plugin supporting the latest 1.5 version of Neo4j. In Gephi, go to Tools > Plugins to install the plug-in.
The plugin lets you visualize a graph stored in a Neo4j database and play with it. Features include full import, traversal, filter, export and lazy loading.
Warning: A real time sink!
Seriously, very cool plugin that will enhance your use of Neo4j!
Last week, the Neo4J plugin for Gephi was released. Gephi is an open-source visualization and manipulation tool that allows users to interactively browse and explore graphs. The graphs themselves can be loaded through a variety of file formats. Thanks to Martin Škurla, it is now possible to load and lazily explore graphs that are stored in a Neo4J data store.
In one of my previous articles, I explained how Neo4J and the Tinkerpop framework can be used to load and query RDF triples. The newly released Neo4J plugin now allows you to visually browse these RDF triples and perform some fancier operations such as finding patterns and executing social network analysis algorithms from within Gephi itself. Tinkerpop’s Sail implementation also supports the notion of RDF Schema inferencing. Inferencing is the process where new (RDF) data is automatically deduced from existing (RDF) data through reasoning. Unfortunately, the Sail reasoner cannot easily be integrated within Gephi, as the Gephi plugin grabs a lock on the Neo4J store and no RDF data can be added, except through the plugin itself.
Being able to visualize the RDF Schema reasoning process and graphically indicate which RDF triples were added manually and which RDF data was automatically inferred would be nice to have. To implement this feature, we should be able to push graph changes from Tinkerpop and Neo4J to Gephi. Luckily, the Gephi graph streaming plugin allows us to do just that. In the rest of this article, I will detail how to set up the required Gephi environment and how we can stream (inferred) RDF data from Neo4J to Gephi.
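For a feel of what "pushing graph changes" might look like, here is a rough Python sketch that turns one RDF triple into streaming events. The `an` (add node) and `ae` (add edge) keys follow the JSON event format as I recall it from the Gephi graph streaming documentation, so check them against the plugin before relying on this. The `inferred` attribute, the edge id scheme and the function names are my own assumptions, and nothing is actually sent to Gephi.

```python
import json

def add_node_event(node_id, label):
    """One 'add node' event as a JSON line (assumed event format)."""
    return json.dumps({"an": {node_id: {"label": label}}})

def add_edge_event(edge_id, source, target, label, inferred):
    """One 'add edge' event; 'inferred' is a hypothetical custom attribute."""
    attributes = {
        "source": source,
        "target": target,
        "directed": True,
        "label": label,
        "inferred": inferred,
    }
    return json.dumps({"ae": {edge_id: attributes}})

def triple_to_events(subject, predicate, obj, inferred=False):
    """Turn one RDF triple into two node events and one edge event."""
    edge_id = f"{subject}|{predicate}|{obj}"  # hypothetical id scheme
    return [
        add_node_event(subject, subject),
        add_node_event(obj, obj),
        add_edge_event(edge_id, subject, obj, predicate, inferred),
    ]

# A made-up inferred triple.
events = triple_to_events("ex:Tom", "rdf:type", "ex:Animal", inferred=True)
```

In a running setup each line would go to the streaming plugin's HTTP endpoint, which is not shown here. Carrying the manual/inferred distinction as an edge attribute is one way to let Gephi colour the two kinds of triples differently.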
Visual is good!
Visual display and exploration of graphs is better!
Visual display and exploration of Neo4j data stores from within Gephi is the best!
Dave concludes:
With just a few lines of code we are able to stream (inferred) RDF triples to Gephi and make use of its powerful visualization and analysis tools to explore and inspect our datasets. As always, the complete source code can be found on the Datablend public GitHub repository. Make sure to surf the internet to find some other nice Gephi streaming examples, the coolest one probably being the visualization of the Egyptian revolution on Twitter.
ForceAtlas2 (paper) +appendices by Mathieu Jacomy, Sebastien Heymann, Tommaso Venturini, and Mathieu Bastian.
Abstract:
ForceAtlas2 is a force vector algorithm proposed in the Gephi software, appreciated for its simplicity and for the readability of the networks it helps to visualize. This paper presents its distinctive features, its energy-model and the way it optimizes the “speed versus precision” approximation to allow quick convergence. We also claim that ForceAtlas2 is handy because the force vector principle is unaffected by optimizations, offering a smooth and accurate experience to users.
I knew I had to cite this paper when I read:
These earliest Gephi users were not fully satisfied with existing spatialization tools. We worked on empirical improvements and that’s how we created the first version of our own algorithm, ForceAtlas. Its particularity was a degree-dependant repulsion force that causes less visual cluttering. Since then we steadily added some features while trying to keep in touch with users’ needs. ForceAtlas2 is the result of this long process: a simple and straightforward algorithm, made to be useful for experts and profanes. (footnotes omitted, emphasis added)
Profanes. I like that! Well, rather I like the literacy that enables a writer to use that in a technical paper.
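For the profanes among us, a small Python sketch of what a degree-dependent repulsion force can look like. The formula, repulsion proportional to (deg + 1) of both nodes divided by their distance, is how I recall it from the ForceAtlas2 paper; the function name, the `k_r` default and the 2D tuple positions are assumptions, and this is not Gephi's code.

```python
import math

def repulsion(pos_a, pos_b, deg_a, deg_b, k_r=1.0):
    """Magnitude of a degree-dependent repulsion between two nodes.

    F = k_r * (deg_a + 1) * (deg_b + 1) / distance
    """
    distance = math.hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1])
    if distance == 0:
        return 0.0  # coincident nodes: skip rather than divide by zero
    return k_r * (deg_a + 1) * (deg_b + 1) / distance

# Same distance (5 units), very different degrees.
leaf_pair = repulsion((0, 0), (3, 4), deg_a=1, deg_b=2)
hub_pair = repulsion((0, 0), (3, 4), deg_a=9, deg_b=9)
```

Two hubs push each other apart much harder than two poorly connected nodes at the same distance, which is what keeps leaves close to their hub and reduces clutter.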
Work is underway on a new visualization API for Gephi. If you are interested in writing graph visualization software, here’s your opportunity to make a difference.
Though I never intended it, some posts of mine from a few years back dealing with 26 tools for large-scale graph visualization have been some of the most popular on this site. Indeed, my recommendation for Cytoscape for viewing large-scale graphs ranks within the top 5 posts all time on this site.
When that analysis was done in January 2008 my company was in the midst of needing to process the large UMBEL vocabulary, which now consists of 28,000 concepts. Like anything else, need drives research and demand, and after reviewing many graphing programs, we chose Cytoscape, then provided some ongoing guidelines in its use for semantic Web purposes. We have continued to use it productively in the intervening years.
As with any tool, one reviews and picks the best at the time of need. Most recently, however, with growing customer usage of large ontologies and the development of our own structOntology editing and managing framework, we have begun to butt up against the limitations of large-scale graph and network analysis. With this post, we announce our new favorite tool for semantic Web network and graph analysis, Gephi, and explain its use and showcase a current example.
Times change and sometimes software choices do as well.
This is a case in point that reviews the current limitations of Cytoscape, the good points of Gephi, its needed improvements and pointers to more resources on Gephi. Can’t ask for much more.
Cezary Bartosiak and Rafał Kasprzyk just released the Complex Generators plugin, introducing many long-awaited scientific generators. These generators are extremely useful for scientists, as they help to simulate various real networks. Scientists can test their models and algorithms on well-studied graph examples. For instance, the Watts-Strogatz generator creates networks as described by Duncan Watts in his Six Degrees book.
It had been some time since I last had some fun with Gephi, so today I told myself: why not try visualizing the whole Gene Ontology and see what happens?
First of all I had to generate the corresponding file in GEXF format containing all the terms and relationships belonging to the ontology.
For that I wrote a small program (GenerateGexfGo.java) which uses Bio4j for terms/relationships info retrieval and a couple of XML Gexf wrapper classes from the GitHub project Era7BioinfoXML.
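If you have not met a Watts-Strogatz network before, the idea fits in a few lines of Python. This is a sketch of the usual textbook construction (ring lattice, then random rewiring), not the plugin's implementation; the function name and parameters are my own.

```python
import random

def watts_strogatz(n, k, p, seed=None):
    """Small-world graph: n nodes on a ring, each joined to its k nearest
    neighbours (k even), then every lattice edge rewired with probability p."""
    if k % 2 or not 0 < k < n:
        raise ValueError("k must be even and satisfy 0 < k < n")
    rng = random.Random(seed)
    edges = set()
    # Step 1: regular ring lattice.
    for i in range(n):
        for j in range(1, k // 2 + 1):
            edges.add(frozenset((i, (i + j) % n)))
    # Step 2: rewire the far end of each lattice edge with probability p.
    for j in range(1, k // 2 + 1):
        for i in range(n):
            if rng.random() < p:
                # New targets: no self-loops, no duplicate edges.
                candidates = [t for t in range(n)
                              if t != i and frozenset((i, t)) not in edges]
                if candidates:
                    edges.remove(frozenset((i, (i + j) % n)))
                    edges.add(frozenset((i, rng.choice(candidates))))
    return sorted(tuple(sorted(e)) for e in edges)

lattice = watts_strogatz(10, 4, 0.0)           # p = 0: pure ring lattice
rewired = watts_strogatz(10, 4, 1.0, seed=1)   # p = 1: every edge rewired
```

With p = 0 you get the regular lattice and with p = 1 something close to a random graph. The interesting small-world behaviour sits at small values of p in between.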
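The original program is Java built on Bio4j, which I won't try to reproduce. As a stand-in, here is a minimal Python sketch that writes a small GEXF 1.2 document with the standard library. The two Gene Ontology terms are sample data, and the function name and input shapes are assumptions.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"

def build_gexf(terms, relationships):
    """Return a GEXF document as a string.

    terms: {term_id: label}
    relationships: list of (source_id, target_id, kind) tuples
    """
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(root, "graph",
                          {"mode": "static", "defaultedgetype": "directed"})
    nodes = ET.SubElement(graph, "nodes")
    for term_id, label in terms.items():
        ET.SubElement(nodes, "node", {"id": term_id, "label": label})
    edges = ET.SubElement(graph, "edges")
    for number, (source, target, kind) in enumerate(relationships):
        ET.SubElement(edges, "edge", {"id": str(number), "source": source,
                                      "target": target, "label": kind})
    return ET.tostring(root, encoding="unicode")

# Two sample terms and one is_a relationship between them.
terms = {"GO:0008150": "biological_process", "GO:0009987": "cellular process"}
relationships = [("GO:0009987", "GO:0008150", "is_a")]
xml_text = build_gexf(terms, relationships)
```

Writing `xml_text` to a file with a `.gexf` extension should give something Gephi can open, though I have only sketched the structure here, not tested it against the whole ontology.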
This looks like fun!
And a good way to look at an important data set, that could benefit from a topic map.
It’s events like this that make me wish I were on the West Coast.
Even so, there are a number of resources listed for those of us who cannot attend.
From the website:
The next Gephi Workshop will be on Wednesday, March 23rd at 1PM at the IC classroom in Green Library.
I’ll occasionally be able to provide two-hour workshops on the basics of using Gephi, the network analysis package with which I’ve made the images and videos below. The workshops will focus on:
getting graph data into Gephi using .gexf, .csv and database connections
running Filters, Analytics and Layouts on the data
optimization of Gephi for large datasets
overview of layout algorithms and strategies for their use
In the following posts I’m finally keeping my promise to explore in earnest the use of Gephi’s dynamic timeline feature for visualising Twitter-based discussions as they unfolded in real time. A few months ago, Jean posted a first glimpse of our then still very experimental data on Twitter dynamics, with a string of caveats attached – and I followed up on this a little while later with some background on the Gawk scripts we’re using to generate timeline data in GEXF format from our trusty Twapperkeeper archives (note that I’ve updated one of the scripts in that post, to make the process case-insensitive). Building on those posts, here I’ll outline the entire process and show some practical results (disclaimer: actual dynamic animations will follow in part two, tomorrow – first we’re focussing on laying the groundwork).
While there is an obvious time component to tweets, is there an implied relevancy based on time for other information as well?
Tactical information should be displayed to ground level commanders and be sans longer term planning data, while for command headquarters, tactical information is just clutter on the display.
Text analysis and graph visualization on the Wikileaks Cablegate dataset.
We propose to present a complete workflow of textual data analysis, from acquisition to visual exploration of a complex network. Through the presentation of a simple piece of software specifically developed for this talk, we will cover a set of productive and widely used software tools and libraries in text analysis, then introduce some features of Gephi, an open-source network visualization & analysis package, using the data collected and transformed with cablegate-semnet.
The purpose of the Graph Streaming API project, run by André Panisson, is to build a unified framework for streaming graph objects. Gephi’s data structure and visualization engine have been built with the idea that a graph is not static and might change continuously. By connecting Gephi with external data sources, we leverage its power to visualize and monitor complex systems or enterprise data in real time. Moreover, the idea of streaming graph data goes beyond Gephi, and a unified and standardized API could bring interoperability with other available tools for graph and network analysis, which could then work together in a distributed and cooperative fashion.
There are times when no comment seems adequate. This is one of those times.
Read the post, play with the code, follow the work (and support it!).