Archive for the ‘Gephi’ Category

Clinton/Podesta Emails – Towards A More Complete Graph (Part 2)

Friday, October 28th, 2016

I assume you are starting with DKIM-complete-podesta-1-18.txt.gz.

If you are starting with another source, you will need different instructions. 😉

First, remember from Clinton/Podesta Emails – Towards A More Complete Graph (Part 1) that I wanted to delete all the < and > signs from the text.

That’s easy enough (uncompress the text first):

sed 's/<//g' DKIM-complete-podesta-1-18.txt > DKIM-complete-podesta-1-18-01.txt

followed by:

sed 's/>//g' DKIM-complete-podesta-1-18-01.txt > DKIM-complete-podesta-1-18-02.txt
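Both substitutions can also be made in a single pass with a bracket expression, which leaves one less intermediate file to track. A minimal sketch, assuming every < and > in the file should go, as described above. The function name strip_angle_brackets is my own label, not part of any tool:

```shell
# Remove every "<" and ">" from standard input in one pass.
# strip_angle_brackets is a hypothetical helper name.
strip_angle_brackets() {
    sed 's/[<>]//g'
}

# Usage (still writing to a new, numbered result file):
# strip_angle_brackets < DKIM-complete-podesta-1-18.txt > DKIM-complete-podesta-1-18-02.txt
```

The output should match what the two separate sed commands produce.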

Here’s where we started:

001 00032251.eml|False|2010-10-06 18:29:52-04:00|Joshua Dorner <jdorner@americanprogress.org>|”‘bigcampaign@googlegroups.com'” <bigcampaign@googlegroups.com>|[big campaign] Follow-up Materials from Background Briefing on the Chamber’s Foreign Funding, fyi|<A28459BA2B4D5D49BED0238513058A7F012ADC1EF58F @CAPMAILBOX.americanprogresscenter.org>
002 00032146.eml|True|2015-04-14 18:19:46-04:00|Josh Schwerin <jschwerin@hillaryclinton.com>|hrcrapid <HRCrapid@googlegroups.com>|=?UTF-8?Q?NYT=3A_Hillary_Clinton=E2=80=99s_Chipotle_Order=3A_Above_Avera?= =?UTF-8?Q?ge?=|<CAPrY+5KJ=NG+Vs-khDVpe-L=bP5=qvPcZTS5FDam5LixueQsKA@mail.gmail.com>

Here’s the result after the first two sed scripts:

001 00032251.eml|False|2010-10-06 18:29:52-04:00|Joshua Dorner jdorner@americanprogress.org|”‘bigcampaign@googlegroups.com'” bigcampaign@googlegroups.com|[big campaign] Follow-up Materials from Background Briefing on the Chamber’s Foreign Funding, fyi|A28459BA2B4D5D49BED0238513058A7F012ADC1EF58F @CAPMAILBOX.americanprogresscenter.org
002 00032146.eml|True|2015-04-14 18:19:46-04:00|Josh Schwerin jschwerin@hillaryclinton.com|hrcrapid HRCrapid@googlegroups.com|=?UTF-8?Q?NYT=3A_Hillary_Clinton=E2=80=99s_Chipotle_Order=3A_Above_Avera?= =?UTF-8?Q?ge?=|CAPrY+5KJ=NG+Vs-khDVpe-L=bP5=qvPcZTS5FDam5LixueQsKA@mail.gmail.com

BTW, I increment the numbers of my result files, DKIM-complete-podesta-1-18-01.txt, DKIM-complete-podesta-1-18-02.txt, because when I don’t, I run different sed commands on the same original file, expecting a cumulative result.

That’s spelled – disappointment and wasted effort looking for problems that aren’t problems. Number your result files.

The nodes and edges mentioned in Clinton/Podesta Emails – Towards A More Complete Graph (Part 1):

Nodes

  • Emails, message-id is ID and subject is label, make Wikileaks id into link
  • From/To, email addresses are ID and name is label
  • True/False, true/false as ID, True/False as labels
  • Date, truncated to 2015-07-24 (example), date as id and label

Edges

  • To/From – Edges with message-id (source) email-address (target) to/from (label)
  • Verify – Edges with message-id (source) true/false (target) verify (label)
  • Date – Edges with message-id (source) – date (target) date (label)

Am I missing anything? The longer I look at problems like this the more likely my thinking/modeling will change.

What follows is very crude, command line creation of the node and edge files. Something more elaborate could be scripted/written in any number of languages.

Our fields (numbered for reference) are:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject – 6| Message-Id – 7

You don’t have to take my word for it, try this:

awk -F'|' '{ print $7 }' DKIM-complete-podesta-1-18-02.txt

The output prints to the console. Those all look like message-ids to me, well, with the exception of the one that reads “24 September.”

How much dirty data do you think is in the message-id field?

A crude starting assumption is that any message-id field without the “@” character is dirty.

Let’s try:

awk -F'|' '{ print $7 }' DKIM-complete-podesta-1-18-02.txt | grep -v @ | wc -l

Which means we are going to extract the 7th field, search (grep) over those results for the “@” sign, where the -v switch means only print lines that DO NOT match, and we will count those lines with wc -l.

Ready? Press return.

I get 594 “dirty” message-ids.

Here is a sampling:


Rolling Stone
TheHill
MSNBC, Jeff Weaver interview on Sanders health care plan and his Wall St. ad
Texas Tribune
20160120 Burlington, IA Organizing Event
start spreadin’ the news…..
Building Trades Union (Keystone XL)
Rubio Hits HRC On Cuba, Russia
2.19.2016
Sourcing Story
Drivers Licenses
Day 1
H1N1 Flu Shot Available

Those look an awful lot like subject lines to me. You?

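I suspect those subject lines contained the separator character “|” before we extracted the data from the .eml files. An extra “|” shifts every later field one position to the right, so field 7 ends up holding the tail of the subject instead of the message-id.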
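One way to test that suspicion is to look for records that do not split into exactly seven fields. A rough sketch, assuming one record per line. The function name bad_field_count is made up for illustration:

```shell
# Print the field count and the full text of every record that does not
# split into exactly seven "|"-separated fields.
# bad_field_count is a hypothetical helper name.
bad_field_count() {
    awk -F'|' 'NF != 7 { print NF ": " $0 }'
}

# Usage: count the suspect records
# bad_field_count < DKIM-complete-podesta-1-18-02.txt | wc -l
```

If that count is close to the number of “dirty” message-ids, stray separators are the likely cause. Records broken across lines would show up here too, so treat the number as a hint, not proof.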

I’ve tried to repair the existing files but the cleaner solution is to return to the extraction script and the original email files.

More on that tomorrow!

Clinton/Podesta Emails – Towards A More Complete Graph (Part 1)

Friday, October 28th, 2016

Gephi is a great tool, but it’s only as good as its input.

The Gephi 0.8.2 email importer (missing in Gephi 0.9.*) is lossy, informationally speaking, as I have mentioned before.

Here’s a sample from the verification results on podesta-release-1-18:

9981 00045326.eml|False|2015-07-24 12:58:16-04:00|Oren Shur |John Podesta , Robby Mook , Joel Benenson , “Margolis, Jim” , Mandy Grunwald , David Binder , Teddy Goff , Jennifer Palmieri , Kristina Schake , Christina Reynolds , Katie Connolly , “Kaye, Anson” , Peter Brodnitz , “Rimel, John” , David Dixon , Rich Davis , Marlon Marshall , Michael Halle , Matt Paul , Elan Kriegel , Jake Sullivan |FW: July IA Poll Results|<168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com>

The Gephi 0.8.2 mail importer fails to create a node representing an email message.

I propose we cure that failure by taking the last field, here:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com

and the next to last field:

FW: July IA Poll Results

and putting them as id and label, respectively in a node list:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; "FW: July IA Poll Results";

As part of the transformation, we need to remove the < and > signs around the message ID, add a ; to mark the end of the ID field, and put double quotes (" ") around the subject to use it as a label. Then close the second field with another ;.

While we are talking about nodes, all the email addresses change from:

Oren Shur <oshur@hillaryclinton.com>

to:

oshur@hillaryclinton.com; "Oren Shur";

which are ID and label of the node list, respectively.

I could remove the < and > characters as part of the extraction script but will use sed at the command line instead.

Reminder: Always work on a copy of your data, never the original.

Then we need to create an edge list, one that represents the relationships between the email (as node) to the sender and receivers of the email (also nodes). For this first iteration, I’m going to use labels on the edges to distinguish between senders and receivers.

Assuming my first row of the edges file reads:

Source; Target; Role (I did not use “Type” because I suspect that is a controlled term for Gephi.)

Then the first few edges would read:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; oshur@hillaryclinton.com; from;
168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; john.podesta@gmail.com; to;
168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; re47@hillaryclinton.com; to;
168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; jbenenson@bsgco.com; to;
168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; Jim.Margolis@gmmb.com; to;
….

As you can see, this is going to be a “busy” graph! 😉
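Edge lines of that shape can be generated straight from the verification records. What follows is only a sketch under assumptions: it expects the original layout with angle brackets still around each address, exactly seven “|”-separated fields per record, and the Source; Target; Role; layout proposed above. The name to_from_edges is hypothetical:

```shell
# Emit "message-id; address; role;" edge lines for each input record.
# Assumes addresses are still wrapped in < > and records have seven fields.
# to_from_edges is a hypothetical helper name.
to_from_edges() {
    awk -F'|' '
    # Print one edge per <address> found in the string s.
    function edges(id, s, role) {
        while (match(s, /<[^<>]+>/)) {
            print id "; " substr(s, RSTART + 1, RLENGTH - 2) "; " role ";"
            s = substr(s, RSTART + RLENGTH)
        }
    }
    NF == 7 {
        id = $7
        gsub(/[<>]/, "", id)    # bare message-id, no angle brackets
        edges(id, $4, "from")
        edges(id, $5, "to")
    }'
}
```

Records with the wrong number of fields are skipped silently, so compare the number of input records with the number of distinct message-ids in the output.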

Filtering is going to play an important role in exploring this graph, so let’s add nodes that will help with that process.

I propose we add to the node list:

true; True
false; False

as id and labels.

Which means for the edge list we can have:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; true; verify;

Do you have an opinion on the order, source/target for true/false?

My thinking is that this will let us filter out nodes that have not been verified, or include only those that failed verification.

For experimental purposes, I think we need to rework the date field:

2015-07-24 12:58:16-04:00

I would truncate that to:

2015-07-24

and add such truncated dates to the node list:

2015-07-24; 2015-07-24;

as ID and label, respectively.

Then for the edge list:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; 2015-07-24; date;

My reasoning is that we can filter to include or exclude nodes based on dates, which, if you add enough email boxes, could help visualize the reaction to and propagation of emails.
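The date truncation amounts to keeping the first ten characters of field 3. A small sketch, again assuming seven-field records in the layout shown earlier. The names date_nodes and date_edges are hypothetical:

```shell
# Node list entries: one "date; date;" line per distinct truncated date.
# date_nodes is a hypothetical helper name.
date_nodes() {
    awk -F'|' 'NF == 7 { d = substr($3, 1, 10); print d "; " d ";" }' | sort -u
}

# Edge list entries: "message-id; date; date;" for every record.
# date_edges is a hypothetical helper name.
date_edges() {
    awk -F'|' 'NF == 7 {
        id = $7
        gsub(/[<>]/, "", id)    # bare message-id, no angle brackets
        print id "; " substr($3, 1, 10) "; date;"
    }'
}
```

The sort -u step keeps each date from appearing more than once in the node list.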

Even assuming named participants in these emails have “deleted” their inboxes, there are always automatic backups. It’s just a question of persistence before the rest of this network can be fleshed out.

Oh, one last thing. You have probably noticed the Wikileaks “ID” that forms part of the filename?

9981 00045326.eml

The first part forms the end of a URL to link to the original post at Wikileaks.

Thus, in this example, 9981 becomes:

https://wikileaks.org/podesta-emails/emailid/9981

The general form being:

https://wikileaks.org/podesta-emails/emailid/(Wikileaks-ID)

For the convenience of readers/users, I want to modify my earlier proposal for the email node list entry from:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; "FW: July IA Poll Results";

to:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; "FW: July IA Poll Results"; https://wikileaks.org/podesta-emails/emailid/9981;

Where the third field is “link.”
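Here is one possible way to produce that three-field node entry from a verification record. It is a sketch, not the finished script: it assumes seven-field records, takes the Wikileaks ID to be the first token of field 1, and doubles any double quotes inside the subject, which is the usual CSV convention but something I have not checked against Gephi's importer. The name email_nodes is hypothetical:

```shell
# Emit "message-id; "subject"; link;" node lines for each input record.
# email_nodes is a hypothetical helper name.
email_nodes() {
    awk -F'|' 'NF == 7 {
        split($1, name, " ")        # name[1] is the Wikileaks ID, e.g. 9981
        id = $7
        gsub(/[<>]/, "", id)        # bare message-id, no angle brackets
        label = $6
        gsub(/"/, "\"\"", label)    # double embedded quotes (assumed convention)
        printf "%s; \"%s\"; https://wikileaks.org/podesta-emails/emailid/%s;\n", id, label, name[1]
    }'
}
```

Subjects that contain the “|” separator fail the seven-field test and are skipped, so they still need separate handling.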

I am eliding lots of relationships and subjects, but I’m not reluctant to throw it all away and start over.

Your investment in a model isn’t lost by tossing the model, you learn something with every model you build.

Scripting underway, a post on that experience and the node/edge lists to follow later today.

Podesta/Clinton Emails: Filtering by Email Address (Pimping Bill Clinton)

Thursday, October 27th, 2016

The Bill Clinton, Inc. story reminds me of:

Although I steadfastly resist imagining either Bill or Hillary in that video. Just won’t go there!

Where a graph display can make a difference is that instead of just the one email/memo from Bill’s pimp, we can rapidly survey all of the emails in which he appears, in any role.

import-spigot-email-filter-460

I ran that on Gephi 0.8.2 against podesta-release-1-18 but the results were:

Nodes 0, Edges 0.

Hmmm, there is something missing, possibly the CSV file?

I checked and podesta-release-1-18 has 393 emails where doug@presidentclinton.com appears.

Could try to find the “right” way to filter on email addresses but for now, let’s take a dirty short-cut.

I created a directory to hold all emails with doug@presidentclinton.com and ran into all manner of difficulties because the file names are plagued with spaces!

So much so that I unintentionally (read “by mistake”) saved all the doug@presidentclinton.com posts from podesta-release-1-18 to a different folder than the ones from podesta-release-19.

🙁
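File names with spaces are manageable as long as every variable is quoted. A minimal sketch of the copy step, assuming the emails sit in one directory as .eml files. The function name copy_matching and the destination directory in the usage line are made up for illustration:

```shell
# Copy every .eml file in src that mentions addr into dest.
# Quoting "$f" keeps file names with spaces in one piece.
# copy_matching is a hypothetical helper name.
copy_matching() {
    addr=$1 src=$2 dest=$3
    mkdir -p "$dest"
    for f in "$src"/*.eml; do
        [ -f "$f" ] || continue             # skip when the glob matches nothing
        if grep -q -i -F -- "$addr" "$f"; then
            cp -- "$f" "$dest"/
        fi
    done
}

# Usage (doug-1-18 is a hypothetical destination directory):
# copy_matching doug@presidentclinton.com podesta-release-1-18 doug-1-18
```

Passing the destination as an argument also makes it harder to save two releases into different folders by accident.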

Well, but there is a happy outcome and an illustration of yet another Gephi capability.

I built the first graph from the doug@presidentclinton.com posts from podesta-release-1-18 and then, with that graph open, imported the doug@presidentclinton.com posts from podesta-release-19 and appended those results to the open graph.

How cool is that!

Imagine doing that across data sets, assuming you paid close attention to identifiers, etc.

Sorry, back to the graphs, here is the random layout once the graphs were combined:

doug-default-460

Applying the Yifan Hu network layout:

doug-network-yifan-hu-460

I ran network statistics on network diameter and applied colors based on betweenness:

doug-network-centrality-460

And finally, adjusted the font and turned on the labels:

doub-network-labels-460

I have spent a fair amount of time just moving stuff about but imagine if you could interactively explore the emails, creating and trashing networks based on to:, from:, cc:, dates, content, etc.

The limits of Gephi imports were a major source of pain today.

I’m dodging those tomorrow in favor of creating node and adjacency tables with scripts.

PS: Don’t John Podesta and Doug Band look like two pimps in a pod? 😉

PPS: In case you haven’t read it, see the pimping Bill Clinton memo. (I think it has some other official title.)

No Frills Gephi (0.8.2) Import of Clinton/Podesta Emails (1-18)

Wednesday, October 26th, 2016

Using Gephi 0.8.2, you can create graphs of the Clinton/Podesta emails based on terms in subject lines or the body of the emails. You can interactively work with all 30K+ (as of today) emails and extract networks based on terms in the posts. No programming required. (Networks based on terms will appear tomorrow.)

If you have Gephi 0.8.2 (I can’t find the import spigot in 0.9.0 or 0.9.1), you can import the Clinton/Podesta Emails (1-18) for analysis as a network.

To save you the trouble of regressing to Gephi 0.8.2, I performed a no frills/default import and exported that file as podesta-1-18-network.gephi.gz.

Download and uncompress podesta-1-18-network.gephi.gz, then you can pick up at timemark 3.49.

Open the file (your location may differ):

gephi-podesta-open-460

Obligatory hair-ball graph visualization. 😉

gephi-first-look-460

Considerably less appealing than Jennifer Golbeck’s but be patient!

First step, Layout -> Yifan Hu. My results:

yifan-hu-first-460

Second step, Network Diameter statistics (right side, run).

No visible impact on the graph, but now you can change the color and size of nodes in the graph. That is, the nodes now have attributes on which you can base the assignment of color and size.

Tutorial gotcha: Not one of Jennifer’s tutorials but I was watching a Gephi tutorial that skipped the part about running statistics on the graph prior to assignment of color and size. Or I just didn’t hear it. The menu options appear in documentation but you can’t access them unless and until you run network statistics or have attributes for the assignment of color and size. Run statistics first!

Next, assign colors based on betweenness centrality:

gephi-betweenness-460

The densest node is John Podesta, but if you remove his node, rerun the network statistics and re-layout the graph, here is part of what results:

delete-central-node-460

A no frills import of 31,819 emails results in a graph of 3235 nodes and 11,831 edges.

That’s because nodes and edges combine (merge to you topic map readers) when they have the same identifier or, for edges, when they connect the same pair of nodes.

Subject to correction, when that combining/merging occurs, the properties on the respective nodes/edges are accumulated.

Topic mappers already realize there are important subjects missing, some 31,819 of them. That is, the emails themselves don’t appear as nodes in the network by default.

Ian Robinson, Jim Webber & Emil Eifrem illustrate this lossy modeling in Graph Databases this way:

graph-databases-lossy-460

Modeling emails without the emails is rather lossy. 😉

Other nodes/subjects we might want:

  • Multiple to: emails – Is who was also addressed important?
  • Multiple cc: emails – Same question as with to:.
  • Date sent as properties? So evolution of network/emails can be modeled.
  • Capture “reply-to” for relationships between emails?

Other modeling concerns?

Bear in mind that we can suppress a large amount of the detail so you can interactively explore the graph and only zoom into/display data after finding interesting patterns.

Some helpful links:

https://archive.org/details/PodestaEmailszipped: The email collection as bulk download, thanks to Michael Best, @NatSecGeek.

https://github.com/gephi/gephi/releases: Where you can grab a copy of Gephi 0.8.2.

NSA Grade – Network Visualization with Gephi

Sunday, April 10th, 2016

Network Visualization with Gephi by Katya Ognyanova.

It’s not possible to cover Gephi in sixteen (16) pages but you will wear out more than one printed copy of these sixteen (16) pages as you become experienced with Gephi.

This version is from a Gephi workshop at Sunbelt 2016.

Katya’s homepage offers a wealth of network visualization posts and extensive use of R.

Follow her at @Ognyanova.

PS: Gephi equals or exceeds visualization capabilities in use by the NSA, depending upon your skill as an analyst and the quality of the available data.

…[N]ew “GraphStore” core – Gephi

Monday, January 11th, 2016

Gephi boosts its performance with new “GraphStore” core by Mathieu Bastian.

From the post:

Gephi is a graph visualization and analysis platform – the entire tool revolves around the graph the user is manipulating. All modules (e.g. filter, ranking, layout etc.) touch the graph in some way or another and everything happens in real-time, reflected in the visualization. It’s therefore extremely important to rely on a robust and fast underlying graph structure. As explained in this article we decided in 2013 to rewrite the graph structure and started the GraphStore project. Today, this project is mostly complete and it’s time to look at some of the benefits GraphStore is bringing into Gephi (which its 0.9 release is approaching).

Performance is critical when analyzing graphs. A lot can be done to optimize how graphs are represented and accessed in the code but it remains a hard problem. The first versions of Gephi didn’t always shine in that area as the graphs were using a lot of memory and some operations such as filter were slow on large networks. A lot was learnt though and when the time came to start from scratch we knew what would move the needle. Compared to the previous implementation, GraphStore uses simpler data structures (e.g. more arrays, less maps) and cache-friendly collections to make common graph operations faster. Along the way, we relied on many micro-benchmarks to understand what was expensive and what was not. As often with Java, this can lead to surprises but it’s a necessary process to build a world-class graph library.

What shall we say about the performance numbers?

IMPRESSIVE!

The tests were against “two different classic graphs, one small (1.5K nodes, 19K edges) and one medium (83K nodes, 68K edges).”

Less than big data size graphs but isn’t the goal of big data analysis to extract the small portion of relevant data from the big data?

Yes?

Maybe there should be an axiom about gathering of irrelevant data into a big data pile, only to be excluded again.

Or premature graphification of largely irrelevant data.

Something to think about as you contribute to the further development of this high performing graph library.

Enjoy!

Announcing Gephi 0.9 release date

Monday, November 2nd, 2015

Announcing Gephi 0.9 release date by Mathieu Bastian.

From the post:

Gephi has an amazing community of passionate users and developers. In the past few years, they have been very dedicated creating tutorials, developing new plugins or helping out on GitHub. They also have been patiently waiting for a new Gephi release! Today we’re happy to share with you that the wait will come to an end December 20th with the release of Gephi 0.9 for Windows, MacOS X and Linux.

We’re very excited about this upcoming release and developers are hard at work to deliver its roadmap before the end of 2015. This release will resolve a serie of compatibility issues as well as improve features and performance.

Our vision for Gephi remains focused on a few fundamentals, which were already outlined in our Manifesto back in 2009. Gephi should be a software for everyone, powerful yet easy to learn. In many ways, we still have the impression that we’ve only scratched the surface and want to continue to focus on making each module of Gephi better. As part of this release, we’ve undertaken one of the most difficult project we’ve worked on and completely rewrote the core of Gephi. Although not very visible for the end-user, this brings new capabilities, better performance and a level of code quality we can be proud of. This ensure a very solid foundation for the future of this software and paves the way for a future 1.0 version.

Below is an overview of the new features and improvements the 0.9 version will bring.

The list of highlights includes:

  • Java and MacOS compatibility
  • New redeveloped core
  • New Appearance module
  • Timestamp support
  • GEXF 1.3 support
  • Multiple files import
  • Multi-graph support (visualization in a future release)
  • New workspace selection UI
  • Gephi Toolkit release (soon after 0.9)

Enough new features to keep you busy over the long holiday season!

Enjoy!

Wikipedia in Python, Gephi, and Neo4j

Thursday, January 8th, 2015

Wikipedia in Python, Gephi, and Neo4j: Vizualizing relationships in Wikipedia by Matt Krzus.

From the introduction:


We have had a bit of a stretch here where we used Wikipedia for a good number of things. From Doc2Vec to experimenting with word2vec layers in deep RNNs, here are a few of those cool visualization tools we’ve used along the way.

Cool things you will find in this post:

  • Building relationship links between Categories and Subcategories
  • Visualization with Networkx (think Betweenness Centrality and PageRank)
  • Neo4j and Cypher (the author thinks avoiding the Giraph learning curve is a plus, I leave that for you to decide)
  • Visualization with Gephi

Enjoy!

Gremlin and Visualization with Gephi [Death of Import/Export?]

Wednesday, June 25th, 2014

Gremlin and Visualization with Gephi by Stephen Mallette.

From the post:

We are often asked how to go about graph visualization in TinkerPop. We typically refer folks to Gephi or Cytoscape as the standard desktop data visualization tools. The process of using those tools involves: getting your graph instance, saving it to GraphML (or the like) then importing it to those tools

TinkerPop3 now does two things to help make that process easier:

  1. A while back we introduced the “subgraph” step which allows you to pop-off a Graph instance from a Traversal, which help greatly simplify the typical graph visualization process with Gremlin, where you are trying to get a much smaller piece of your large graph to focus the visualization effort.
  2. Today we introduce a new :remote command in the Console. Recall that :remote is used to configure a different context where Gremlin will be evaluated (e.g. Gremlin Server). For visualization, that remote is called “gephi” and it configures the :submit command to take any Graph instance and push it through to the Gephi Streaming API. No more having to import/export files!

This rocks!

How do you imagine processing your data when import/export goes away?

Of course, this doesn’t have anything on *nix pipes but it is nice to see good ideas come back around.

Gephi Upgrade – Neo4j 2.0.1 Support

Tuesday, March 4th, 2014

Gephi Upgrade

From the webpage:

This plugin adds support for Neo4j graph database. You can open Neo4j 2.0.1 database directory and manipulate with graph as any other Gephi graph. You can also export any graph into Neo4j database, you can filter import or export and you can use debugging as well as lazy loading support.

That’s welcome news!

Easier than Excel:…

Wednesday, September 25th, 2013

Easier than Excel: Social Network Analysis of DocGraph with Gephi by Janos G. Hajagos and Fred Trotter. (PDF)

From the session description:

The DocGraph dataset was released at Strata RX 2012. The dataset is the result of FOI request to CMS by healthcare data activist Fred Trotter (co-presenter). The dataset is minimal where each row consists of just three numbers: 2 healthcare provider identifiers and a weighting factor. By combining these three numbers with other publicly available information sources novel conclusions can be made about delivery of healthcare to Medicare members. As an example of this approach see: http://tripleweeds.tumblr.com/post/42989348374/visualizing-the-docgraph-for-wyoming-medicare-providers

The DocGraph dataset consists of over 49,685,810 relationships between 940,492 different Medicare providers. Analyzing the complete dataset is too big for traditional tools but useful subsets of the larger dataset can be analyzed with Gephi. Gephi is a opensource tool to visually explore and analyze graphs. This tutorial will teach participants how to use Gephi for social network analysis on the DocGraph dataset.

Outline of the tutorial:

Part 1: DocGraph and the network data model (30% of the time)

  • The DocGraph dataset
  • The raw data
  • Helper data (NPI associated data)
  • The graph / network data model
  • Nodes versus edges
  • How graph models are integral to social networking
  • Other Healthcare graph data sets

Part 2: Using Gephi to perform analysis (70% of the time)

  • Basic usage of Gephi
  • Saving and reading the GraphML format
  • Laying out edges and nodes of a graph
  • Navigating and exploring the graph
  • Generating graph metrics on the network
  • Filtering a subset of the graph
  • Producing the final output of the graph

Links from the last slide:

http://strata.oreilly.com/2012/11/docgraph-open-social-doctor-data.html (information)

https://github.com/jhajagos/DocGraph (code)

http://notonlydev.com/docgraph-data (open source $1 covers bandwidth fees)

https://groups.google.com/forum/#!forum/docgraph (mailing list)

Just in case you don’t have it bookmarked already: Gephi.

The type of workshop that makes an entire conference seem like lagniappe.

Just sorry I will have to appreciate it from afar.

Work through this one carefully. You will acquire useful skills doing so.

Force Atlas 3D:…

Friday, June 28th, 2013

3d graph

Force Atlas 3D: New plugin to visualize your graphs in 3D with Gephi by Clement Levallois.

From the post:

Hi, Just released today a plugin to visualize your networks in 3D with Gephi: Force Atlas 3D. Find it here, but you can install it directly from within Gephi, by following these instructions.

Your 2D networks are now visualized in the 3D space. Effects of depth and perspective make it easier to perceive the structure of your network.

“Which node is most central” can get a new answer, visually: nodes “nested” inside the network are surely interesting to look at.

This plugin was written on top of the Force Atlas 2 plugin, developed by Mathieu Jacomy et al. and that you can find installed by default in Gephi already. Thanks to them for this great work!

I think you will find this quite impressive!

Visualizing your LinkedIn graph using Gephi (Parts 1 & 2)

Sunday, May 19th, 2013

Visualizing your LinkedIn graph using Gephi – Part 1

&

Visualizing your LinkedIn graph using Gephi – Part 2

by Thomas Cabrol.

From part 1:

Graph analysis becomes a key component of data science. A lot of things can be modeled as graphs, but social networks are really one of the most obvious examples.

In this post, I am going to show how one could visualize its own LinkedIn graph, using the LinkedIn API and Gephi, a very nice software for working on this type of data. If you don’t have it yet, just go to http://gephi.org/ and download it now !

My objective is to simply look at my connections (the “nodes” or “vertices” of the graph), see how they relate to each other (the “edges”) and find clusters of strongly connected users (“communities”). This is somewhat emulating what is available already in the InMaps data product, but, hey, this is cool to do it by ourselves, no ?

The first thing to do for running this graph analysis is to be able to query LinkedIn via its API. You really don’t want to get the data by hand… The API uses the oauth authentification protocol, which will let an application make queries on behalf of a user. So go to https://www.linkedin.com/secure/developer and register a new application. Fill the form as required, and in the OAuth part, use this redirect URL for instance:

Great introduction to Gephi!

As a bonus, reinforces the lesson that ETL isn’t required to re-use data.

ETL may be required in some cases but in a world of data APIs those are getting fewer and fewer.

Think of it this way: Non-ETL data access means someone else is paying for maintenance, backups, hardware, etc.

How much of your IT budget is supporting duplicated data?

Rebuilding Gephi’s core for the 0.9 version

Tuesday, March 5th, 2013

Rebuilding Gephi’s core for the 0.9 version by Mathieu Bastian.

From the post:

This is the first article about the future Gephi 0.9 version. Our objective is to prepare the ground for a future 1.0 release and focus on solving some of the most difficult problems. It all starts with the core of Gephi and we’re giving today a preview of the upcoming changes in that area. In fact, we’re rewriting the core modules from scratch to improve performance, stability and add new features. The core modules represent and store the graph and attributes in memory so it’s available to the rest of the application. Rewriting Gephi’s core is like replacing the engine of a truck and involves adapting a lot of interconnected pieces. Gephi’s current graph structure engine was designed in 2009 and didn’t change much in multiple releases. Although it’s working, it doesn’t have the level of quality we want for Gephi 1.0 and needs to be overhauled. The aim is to complete the new implementation and integrate it in the 0.9 version.

Deeply interesting work!

To follow, consider subscribing to: gephi-dev — List for core developers.

Large Steam network visualization with Google Maps + Gephi

Wednesday, November 21st, 2012

Large Steam network visualization with Google Maps + Gephi

From the post:

I’ve used Google Maps API to visualize a relatively large network collected from Steam Community members. The data is collected from public player profiles that Valve reveals through their Steam Web API. For each player their links to friends and links to Steam Groups they belong are collected. This creates a social network which can be visualized using Gephi.

Graph consists of 212600 nodes and 4045203 edges. Before filtering outliers and low/high degree nodes there are approximately 800 000 groups and over 11 million users.

Very impressive visualization.

Enjoy!

“Drug Deal” Network Analysis with Gephi (Tutorial)

Friday, November 9th, 2012

“Drug Deal” Network Analysis with Gephi (Tutorial) by A. J. Hirst.

A.J. reviews Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi, suggests that you read it before continuing, and then reviews how to use Gephi to converse with the drug dealer data set.

Good tutorial on Gephi and just as good on “conversing” with the data.

Gephi Blueprints plugin

Wednesday, November 7th, 2012

Gephi Blueprints plugin by Davy Suvee.

From the homepage:

The Gephi Blueprints plugin allows a user to import graph-data from any graph database that implements the Tinkerpop Blueprints generic graph API. Out of the box, the plugin provides support for TinkerGraph, Neo4j, OrientDB, Dex and RexterGraph. Additionally, it also provides support for the FluxGraph temporal graph database.

Excellent!

Not to mention having a short list of interesting graph software to boot!

Information Diffusion on Twitter by @snikolov

Friday, October 26th, 2012

Information Diffusion on Twitter by @snikolov by Marti Hearst.

From the post:

Today Stan Nikolov, who just finished his masters at MIT in studying information diffusion networks, walked us through one particular theoretical model of information diffusion which tries to predict under what conditions an idea stops spreading based on a network’s structure (from the popular Easley and Kleinberg Network book). Stan also gathered a huge amount of Twitter data, processed it using Pig scripts, and graphed the results using Gephi. The video lecture below shows you some great visualizations of the spreading behavior of the data!

(video omitted)

The slides in his Lecture Notes let you see the Pig scripts in more detail.

Another deeply awesome lecture from Marti’s class on Twitter and big data.

Also an example of the level of analysis that a Twitter stream will need to withstand to avoid “imperial entanglements.”

Twitter Results Recipe with Gephi Garnish

Tuesday, October 2nd, 2012

Grabbing Twitter Search Results into Google Refine And Exporting Conversations into Gephi by Tony Hirst.

From the post:

How can we get a quick snapshot of who’s talking to whom on Twitter in the context of a particular hashtag?

What follows is a detailed recipe with the answer to that question.
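Tony's recipe runs through Google Refine. If you would rather see the core idea in a few lines, here is a rough sketch in Python instead (my substitution, not his method): pull @mentions out of tweet text and write a weighted edge list. The (author, text) input format and the sample tweets are made up for illustration, and I am assuming a CSV with Source, Target and Weight columns is what you want to hand to Gephi's spreadsheet import.

```python
import csv
import io
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Count (author, mentioned user) pairs across an iterable of (author, text) tuples."""
    edges = Counter()
    for author, text in tweets:
        for target in MENTION.findall(text):
            if target.lower() != author.lower():  # ignore self-mentions
                edges[(author.lower(), target.lower())] += 1
    return edges


def to_edge_csv(edges):
    """Serialize weighted edges as CSV text with Source, Target, Weight columns."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buf.getvalue()


# Hypothetical sample data standing in for hashtag search results.
sample = [
    ("alice", "@bob have you seen this? #gephi"),
    ("bob", "@alice yes, cc @carol #gephi"),
    ("alice", "@bob thanks #gephi"),
]
edges = mention_edges(sample)
print(to_edge_csv(edges))
```

Lower-casing the names keeps "Bob" and "bob" from showing up as two nodes, at the cost of losing the original capitalization.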

NodeGL: An online interactive viewer for NodeXL graphs uploaded to Google Spreadsheet

Friday, March 30th, 2012

NodeGL: An online interactive viewer for NodeXL graphs uploaded to Google Spreadsheet.

Martin Hawksey writes:

Recently Tony (Hirst) tipped me off about a new viewer for Gephi graphs. Developed by Raphaël Velt, it uses JavaScript to parse Gephi .gexf files and output the result on an HTML5 canvas. The code for the viewer is on github, available under an MIT license if you want to download and remash; I’ve also put an instance here if you want to play. Looking for a solution to render NodeXL data from a Google Spreadsheet in a similar way, here is some background on the development of NodeGL, an online viewer of NodeXL graphs hosted on Google Spreadsheets.

Introduction to NodeGL.

Getting Started With The Gephi…

Saturday, January 21st, 2012

Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I by Tony Hirst.

From the post:

A couple of weeks ago, I came across Gephi, a desktop application for visualising networks.

And quite by chance, a day or two after I was asked about any tools I knew of that could visualise and help analyse social network activity around an OU course… which I take as a reasonable justification for exploring exactly what Gephi can do 🙂

So, after a few false starts, here’s what I’ve learned so far…

First up, we need to get some graph data – netvizz – facebook to gephi suggests that the netvizz facebook app can be used to grab a copy of your Facebook network in a format that Gephi understands, so I installed the app, downloaded my network file, and then uninstalled the app… (can’t be too careful 😉)

Once Gephi is launched (and updated, if it’s a new download – you’ll see an updates prompt in the status bar along the bottom of the Gephi window, right hand side) Open… the network file you downloaded.

If you like part 1 as an introduction to Gephi, be sure to take in:

Getting Started With Gephi Network Visualisation App – My Facebook Network, Part II: Basic Filters

which starts out:

In Getting Started With Gephi Network Visualisation App – My Facebook Network, Part I I described how to get up and running with the Gephi network visualisation tool using social graph data pulled out of my Facebook account. In this post, I’ll explore some of the tools that Gephi provides for exploring a network in a more structured way.

If you aren’t familiar with Gephi, and if you haven’t read Part I of this series, I suggest you do so now…

…done that…?

Okay, so where do we begin? As before, I’m going to start with a fresh worksheet, and load my Facebook network data, downloaded via the netvizz app, into Gephi, but as an undirected graph this time! So far, so exactly the same as last time. Just to give me some pointers over the graph, I’m going to set the node size to be proportional to the degree of each node (that is, the number of people each person is connected to).

You will find lots more to explore with Gephi but this should give you a good start.
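The "node size proportional to degree" step from Part II is easy to try outside Gephi as well. This is only a sketch: the edge list is invented, and the linear rescaling between a minimum and maximum size is my assumption about a sensible mapping, not a description of Gephi's internals.

```python
from collections import defaultdict


def degrees(edges):
    """Return the number of distinct neighbours per node in an undirected edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # skip self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(adj) for node, adj in neighbours.items()}


def node_sizes(deg, min_size=10.0, max_size=50.0):
    """Linearly rescale degrees into the range [min_size, max_size]."""
    lo, hi = min(deg.values()), max(deg.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all degrees are equal
    return {
        node: min_size + (d - lo) / span * (max_size - min_size)
        for node, d in deg.items()
    }


# Hypothetical friendship edges.
friends = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
deg = degrees(friends)
print(deg)
print(node_sizes(deg))
```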

Running along the graph using Neo4J Spatial and Gephi

Thursday, January 5th, 2012

Running along the graph using Neo4J Spatial and Gephi

Just to whet your appetite:

When I started running some years ago, I bought a Garmin Forerunner 405. It’s a nifty little device that tracks GPS coordinates while you are running. After a run, the device can be synchronized by uploading your data to the Garmin Connect website. Based upon the tracked time and GPS coordinates, the Garmin Connect website provides you with a detailed overview of your run, including distance, average pace, elevation loss/gain and lap splits. It also visualizes your run, by overlaying the tracked course on Bing and/or Google maps. Pretty cool! One of my last runs can be found here.

Apart from simple aggregations such as total distance and average speed, the Garmin Connect website provides little or no support to gain deeper insights into all of my runs. As I often run the same course, it would be interesting to calculate my average pace at specific locations. When combining the data of all of my courses, I could deduce frequently encountered locations. Finally, could there be a correlation between my average pace and my distance from home? In order to come up with answers to these questions, I will import my running data into a Neo4J Spatial datastore. Neo4J Spatial extends the Neo4J Graph Database with the necessary tools and utilities to store and query spatial data in your graph models. For visualizing my running data, I will make use of Gephi, an open-source visualization and manipulation tool that allows users to interactively browse and explore graphs.

Suggestion: If you want to know where you go and/or how you spend your time, try tracking both for a week. Faithfully record how you spend your time, reading, commuting, TV, exercise, work, etc., in say 30 minute intervals. Also keep track of your physical location. Don’t try to be overly precise, use big buckets. And no peeking as to how the week is shaping up. I think you will be surprised at how your week shapes up.

Gephi: Graph Streaming API

Tuesday, November 22nd, 2011

Gephi: Graph Streaming API

Matt O’Donnell, @mdbod, wanted more information on the graph streaming API for Gephi, then tweeted the URL you see above.

I have collaborated with Matt before. It is like working with a caffeinated fire hose. 😉

Seriously, Matt does extremely good work, ranging from biblical languages, linguistics and markup languages to NLP and beyond.

Looking forward to him working on topic maps and related areas.
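To give a flavor of what the graph streaming API traffics in, here is a small sketch that builds "add node" and "add edge" events as JSON, one event per line. The "an" and "ae" keys follow my recollection of the streaming format documented for the Gephi plugin; treat them, and the helper names (which are mine), as assumptions to check against the documentation. The sketch only builds the event strings and does not talk to a running Gephi.

```python
import json


def add_node(node_id, **attrs):
    """Build an 'add node' event, e.g. {"an": {"A": {"label": "Node A"}}}."""
    return json.dumps({"an": {node_id: attrs}})


def add_edge(edge_id, source, target, directed=False, **attrs):
    """Build an 'add edge' event carrying source, target and direction."""
    body = {"source": source, "target": target, "directed": directed}
    body.update(attrs)
    return json.dumps({"ae": {edge_id: body}})


events = [
    add_node("A", label="Node A"),
    add_node("B", label="Node B"),
    add_edge("AB", "A", "B"),
]
for line in events:
    print(line)
```

Keeping each event on its own line means a consumer can apply changes as they arrive instead of waiting for a complete document.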

Gephi adds Neo4j graph database support

Monday, November 21st, 2011

Gephi adds Neo4j graph database support (screencast)

From the webpage:

Neo4j is a powerful, award-winning graph database written in Java. It can store billions of nodes and relationships and allows very fast query/traversal. We release today a new version of the Neo4j Plugin supporting the latest 1.5 version of Neo4j. In Gephi, go to Tools > Plugins to install the plug-in.

The plugin lets you visualize a graph stored in a Neo4j database and play with it. Features include full import, traversal, filter, export and lazy loading.

Warning: A real time sink! 😉

Seriously, very cool plugin that will enhance your use of Neo4j!

Enjoy!

Visualizing RDF Schema inferencing through Neo4J, Tinkerpop, Sail and Gephi

Monday, November 21st, 2011

Visualizing RDF Schema inferencing through Neo4J, Tinkerpop, Sail and Gephi by Davy Suvee.

From the post:

Last week, the Neo4J plugin for Gephi was released. Gephi is an open-source visualization and manipulation tool that allows users to interactively browse and explore graphs. The graphs themselves can be loaded through a variety of file formats. Thanks to Martin Škurla, it is now possible to load and lazily explore graphs that are stored in a Neo4J data store.

In one of my previous articles, I explained how Neo4J and the Tinkerpop framework can be used to load and query RDF triples. The newly released Neo4J plugin now allows you to visually browse these RDF triples and perform some more fancy operations such as finding patterns and executing social network analysis algorithms from within Gephi itself. Tinkerpop’s Sail implementation also supports the notion of RDF Schema inferencing. Inferencing is the process where new (RDF) data is automatically deduced from existing (RDF) data through reasoning. Unfortunately, the Sail reasoner cannot easily be integrated within Gephi, as the Gephi plugin grabs a lock on the Neo4J store and no RDF data can be added, except through the plugin itself.

Being able to visualize the RDF Schema reasoning process and graphically indicate which RDF triples were added manually and which RDF data was automatically inferred would be nice to have. To implement this feature, we should be able to push graph changes from Tinkerpop and Neo4J to Gephi. Luckily, the Gephi graph streaming plugin allows us to do just that. In the rest of this article, I will detail how to set up the required Gephi environment and how we can stream (inferred) RDF data from Neo4J to Gephi.

Visual is good!

Visual display and exploration of graphs is better!

Visual display and exploration of Neo4j data stores from within Gephi is the best!

Davy concludes:

With just a few lines of code we are able to stream (inferred) RDF triples to Gephi and make use of its powerful visualization and analysis tools to explore and inspect our datasets. As always, the complete source code can be found on the Datablend public GitHub repository. Make sure to surf the internet to find some other nice Gephi streaming examples, the coolest one probably being the visualization of the Egyptian revolution on Twitter.

Other suggestions for Gephi streaming examples?

ForceAtlas2

Friday, October 21st, 2011

ForceAtlas2 (paper) +appendices by Mathieu Jacomy, Sebastien Heymann, Tommaso Venturini, and Mathieu Bastian.

Abstract:

ForceAtlas2 is a force vector algorithm proposed in the Gephi software, appreciated for its simplicity and for the readability of the networks it helps to visualize. This paper presents its distinctive features, its energy-model and the way it optimizes the “speed versus precision” approximation to allow quick convergence. We also claim that ForceAtlas2 is handy because the force vector principle is unaffected by optimizations, offering a smooth and accurate experience to users.

I knew I had to cite this paper when I read:

These earliest Gephi users were not fully satisfied with existing spatialization tools. We worked on empirical improvements and that’s how we created the first version of our own algorithm, ForceAtlas. Its particularity was a degree-dependent repulsion force that causes less visual cluttering. Since then we steadily added some features while trying to keep in touch with users’ needs. ForceAtlas2 is the result of this long process: a simple and straightforward algorithm, made to be useful for experts and profanes. (footnotes omitted, emphasis added)

Profanes. I like that! Well, rather I like the literacy that enables a writer to use that in a technical paper.

Highly recommended paper.
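The degree-dependent repulsion is simple enough to sketch. As I read the paper, the repulsion between two nodes scales with (deg + 1) of each node and falls off with distance; the function below is my own illustration of that form for two nodes in the plane, not code from Gephi, and k_r is just a scaling constant I chose to default to 1.

```python
import math


def repulsion(pos_a, pos_b, deg_a, deg_b, k_r=1.0):
    """Force pushing node a away from node b.

    Magnitude: k_r * (deg_a + 1) * (deg_b + 1) / distance.
    Returns an (fx, fy) vector pointing from b towards a.
    """
    dx = pos_a[0] - pos_b[0]
    dy = pos_a[1] - pos_b[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return (0.0, 0.0)  # coincident nodes: direction undefined, so apply no force
    magnitude = k_r * (deg_a + 1) * (deg_b + 1) / dist
    return (magnitude * dx / dist, magnitude * dy / dist)


# Two leaves barely push each other; a hub pushes much harder at the same distance.
print(repulsion((1.0, 0.0), (0.0, 0.0), deg_a=0, deg_b=0))
print(repulsion((1.0, 0.0), (0.0, 0.0), deg_a=3, deg_b=1))
```

The "+ 1" keeps nodes of degree zero from having no repulsion at all, and the degree weighting is what pushes hubs apart while letting poorly connected nodes sit close to the hub they hang off.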

Introducing Gephi 0.7

Saturday, August 13th, 2011

Introducing Gephi 0.7

Yes, I know that Gephi 0.8 is out in alpha release but this video is worth viewing, even though it is about the “old” version.

From the description:

The video highlights the following features:

  • grouping: Group nodes into clusters and navigate in multi-level graphs.
  • multi-level layout: Very fast layout algorithm that coarsens the graph to reduce computation.
  • interaction: Highlight neighbors and interact directly with the visualization when using tools.
  • partitioning: Use data attributes to colorize partitions and communities.
  • ranking: Use degree, metrics or data attributes to set nodes/edges’ color and size.
  • metrics: Run various algorithms in one click and get an HTML report page.
  • data laboratory: Data table view with search feature.
  • dynamics: Use Timeline to explore dynamic graphs.
  • filtering: Dynamic queries, create and combine a large set of filters.
  • auto update: The application updates itself, both its core and its plugins.
  • vectorial preview: Switch to the preview tab to put the final touch before exporting in SVG or PDF.

Gephi News: new Visualization API

Friday, August 12th, 2011

Gephi News: new Visualization API

Work is underway on a new visualization API for Gephi. If you are interested in writing graph visualization software, here’s your opportunity to make a difference.

A New Best Friend: Gephi for Large-scale Networks

Tuesday, August 9th, 2011

A New Best Friend: Gephi for Large-scale Networks

Though I never intended it, some posts of mine from a few years back dealing with 26 tools for large-scale graph visualization have been some of the most popular on this site. Indeed, my recommendation for Cytoscape for viewing large-scale graphs ranks within the top 5 posts all time on this site.

When that analysis was done in January 2008 my company was in the midst of needing to process the large UMBEL vocabulary, which now consists of 28,000 concepts. Like anything else, need drives research and demand, and after reviewing many graphing programs, we chose Cytoscape, then provided some ongoing guidelines in its use for semantic Web purposes. We have continued to use it productively in the intervening years.

Like for any tool, one reviews and picks the best at the time of need. Most recently, however, with growing customer usage of large ontologies and the development of our own structOntology editing and managing framework, we have begun to butt up against the limitations of large-scale graph and network analysis. With this post, we announce our new favorite tool for semantic Web network and graph analysis — Gephi — and explain its use and showcase a current example.

Times change and sometimes software choices do as well.

This is a case in point that reviews the current limitations of Cytoscape, the good points of Gephi, its needed improvements and pointers to more resources on Gephi. Can’t ask for much more.

Scientific graphs Generators plugin

Thursday, April 28th, 2011

Scientific graphs Generators plugin

A new plugin for Gephi, described as:

Cezary Bartosiak and Rafał Kasprzyk just released the Complex Generators plugin, introducing many awaited scientific generators. These generators are extremely useful for scientists, as they help to simulate various real networks. They can test their models and algorithms on well-studied graph examples. For instance, the Watts-Strogatz generator creates networks as described by Duncan Watts in his Six Degrees book.

The plugin contains the following generators:

  • Balanced Tree
  • Barabasi Albert
  • Barabasi Albert Generalized
  • Barabasi Albert Simplified A
  • Barabasi Albert Simplified B
  • Erdos Renyi Gnm
  • Erdos Renyi Gnp
  • Kleinberg
  • Watts Strogatz Alpha
  • Watts Strogatz Beta
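If you are curious what one of these generators does under the hood, the Erdos Renyi Gnp model is the easiest to sketch: take n nodes and include each possible undirected edge independently with probability p. The code below is my own minimal version of that textbook definition, not the plugin's implementation; the function name and the seed parameter are mine.

```python
import itertools
import random


def gnp(n, p, seed=None):
    """Return the edge list of a G(n, p) random graph on nodes 0..n-1.

    Each of the n*(n-1)/2 possible undirected edges is kept with probability p.
    """
    rng = random.Random(seed)  # private generator so a seed gives repeatable graphs
    return [
        (u, v)
        for u, v in itertools.combinations(range(n), 2)
        if rng.random() < p
    ]


edges = gnp(10, 0.3, seed=42)
print(len(edges), "edges out of a possible", 10 * 9 // 2)
```

Passing a seed is handy when you want to rerun a layout or metric on exactly the same random graph.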