Archive for the ‘Networks’ Category

Visualizing your LinkedIn graph using Gephi (Parts 1 & 2)

Sunday, May 19th, 2013

Visualizing your LinkedIn graph using Gephi – Part 1

&

Visualizing your LinkedIn graph using Gephi – Part 2

by Thomas Cabrol.

From part 1:

Graph analysis becomes a key component of data science. A lot of things can be modeled as graphs, but social networks are really one of the most obvious examples.

In this post, I am going to show how one could visualize its own LinkedIn graph, using the LinkedIn API and Gephi, a very nice software for working on this type of data. If you don’t have it yet, just go to http://gephi.org/ and download it now !

My objective is to simply look at my connections (the “nodes” or “vertices” of the graph), see how they relate to each other (the “edges”) and find clusters of strongly connected users (“communities”). This is somewhat emulating what is available already in the InMaps data product, but, hey, this is cool to do it by ourselves, no ?

The first thing to do for running this graph analysis is to be able to query LinkedIn via its API. You really don’t want to get the data by hand… The API uses the OAuth authentication protocol, which will let an application make queries on behalf of a user. So go to https://www.linkedin.com/secure/developer and register a new application. Fill the form as required, and in the OAuth part, use this redirect URL for instance:

Great introduction to Gephi!

As a bonus, reinforces the lesson that ETL isn’t required to re-use data.

ETL may be required in some cases but in a world of data APIs those are getting fewer and fewer.

Think of it this way: Non-ETL data access means someone else is paying for maintenance, backups, hardware, etc.

How much of your IT budget is supporting duplicated data?

Graph Representation – Edge List

Saturday, May 18th, 2013

Graph Representation – Edge List

From the post:

An Edge List is a form of representation for a graph. It maintains a list of all the edges in the graph. For each edge, it keeps track of the 2 connecting vertices as well as the weight between them.

Followed by C++ code as an example.

A hypergraph would require tracking an arbitrary number of connected nodes per edge, often 3 or more, rather than exactly 2.
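The original post uses C++; the sketch below shows the same idea in Python instead. The `Edge` class, the sample edges, and the helper names are illustrative assumptions, not code from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One entry of the edge list: two endpoints plus a weight."""
    u: str
    v: str
    weight: float


# The whole graph is nothing more than this list (hypothetical sample data).
edges = [
    Edge("a", "b", 1.0),
    Edge("b", "c", 2.5),
    Edge("a", "c", 4.0),
]


def vertices(edge_list):
    """Collect every vertex mentioned by at least one edge."""
    seen = set()
    for edge in edge_list:
        seen.add(edge.u)
        seen.add(edge.v)
    return seen


def neighbors(edge_list, vertex):
    """Treat edges as undirected and scan the full list: O(E) per query."""
    found = []
    for edge in edge_list:
        if edge.u == vertex:
            found.append((edge.v, edge.weight))
        elif edge.v == vertex:
            found.append((edge.u, edge.weight))
    return found


def total_weight(edge_list):
    """Sum of all edge weights."""
    return sum(edge.weight for edge in edge_list)
```

The trade-off is visible in `neighbors`: an edge list is compact and easy to sort by weight, but every adjacency query scans all the edges.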

HeadStart for Planet Earth [Titan]

Tuesday, May 14th, 2013

Educating the Planet with Pearson by Marko A. Rodriguez.

From the post:

Pearson is striving to accomplish the ambitious goal of providing an education to anyone, anywhere on the planet. New data processing technologies and theories in education are moving much of the learning experience into the digital space — into massive open online courses (MOOCs). Two years ago Pearson contacted Aurelius about applying graph theory and network science to this burgeoning space. A prototype proved promising in that it added novel, automated intelligence to the online education experience. However, at the time, there did not exist scalable, open-source graph database technology in the market. It was then that Titan was forged in order to meet the requirement of representing all universities, students, their resources, courses, etc. within a single, unified graph. Moreover, beyond representation, the graph needed to be able to support sub-second, complex graph traversals (i.e. queries) while sustaining at least 1 billion transactions a day. Pearson asked Aurelius a simple question: “Can Titan be used to educate the planet?” This post is Aurelius’ answer.

Liking the graph approach in general and Titan in particular does not make me any more comfortable with some aspects of this posting.

You don’t need to spin up a very large Cassandra database on Amazon to see the problems.

Consider the number of concepts for educating the world, some 9,000 if the chart is to be credited.

Suggested Upper Merged Ontology (SUMO) has “~25,000 terms and ~80,000 axioms when all domain ontologies are combined.”

And those SUMO totals are before you get into the weeds of any particular subject, discipline or course material.

Or the subset of concepts and facts represented in DBpedia:

The English version of the DBpedia knowledge base currently describes 3.77 million things, out of which 2.35 million are classified in a consistent Ontology, including 764,000 persons, 573,000 places (including 387,000 populated places), 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games), 192,000 organizations (including 45,000 companies and 42,000 educational institutions), 202,000 species and 5,500 diseases.

In addition, we provide localized versions of DBpedia in 111 languages. All these versions together describe 20.8 million things, out of which 10.5 million overlap (are interlinked) with concepts from the English DBpedia. The full DBpedia data set features labels and abstracts for 10.3 million unique things in up to 111 different languages; 8.0 million links to images and 24.4 million HTML links to external web pages; 27.2 million data links into external RDF data sets, 55.8 million links to Wikipedia categories, and 8.2 million YAGO categories. The dataset consists of 1.89 billion pieces of information (RDF triples) out of which 400 million were extracted from the English edition of Wikipedia, 1.46 billion were extracted from other language editions, and about 27 million are data links to external RDF data sets. The Datasets page provides more information about the overall structure of the dataset. Dataset Statistics provides detailed statistics about 22 of the 111 localized versions.

I don’t know if the 9,000 concepts cited in the post would be sufficient for a world wide HeadStart program in multiple languages.

Moreover, why would any sane person want a single unified graph to represent course delivery from Zaire to the United States?

How is a single unified graph going to deal with the diversity of educational institutions around the world? A diversity that I take as a good thing.

It sounds like Pearson is offering a unified view of education.

My suggestion is to consider the value of your own diversity before passing on that offer.

Motif Simplification…[Simplifying Graphs]

Monday, May 13th, 2013

Motif Simplification: Improving Network Visualization Readability with Fan, Connector, and Clique Glyphs by Cody Dunne and Ben Shneiderman.

Abstract:

Analyzing networks involves understanding the complex relationships between entities, as well as any attributes they may have. The widely used node-link diagrams excel at this task, but many are difficult to extract meaning from because of the inherent complexity of the relationships and limited screen space. To help address this problem we introduce a technique called motif simplification, in which common patterns of nodes and links are replaced with compact and meaningful glyphs. Well-designed glyphs have several benefits: they (1) require less screen space and layout effort, (2) are easier to understand in the context of the network, (3) can reveal otherwise hidden relationships, and (4) preserve as much underlying information as possible. We tackle three frequently occurring and high-payoff motifs: fans of nodes with a single neighbor, connectors that link a set of anchor nodes, and cliques of completely connected nodes. We contribute design guidelines for motif glyphs; example glyphs for the fan, connector, and clique motifs; algorithms for detecting these motifs; a free and open source reference implementation; and results from a controlled study of 36 participants that demonstrates the effectiveness of motif simplification.

When I read “replace,” “aggregation,” etc., I automatically think about merging in topic maps. ;-)

After replacing “common patterns of nodes and links” I may still be interested in the original content of those nodes and links.

Or I may wish to partially unpack them based on some property in the original content.
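As a rough illustration of the fan motif from the abstract (nodes whose single neighbor is a shared head node), here is a minimal Python sketch. It is not the authors' detection algorithm or reference implementation; the function name, the `min_leaves` threshold, and the sample graph are assumptions. The original leaf nodes are kept in the result so that a glyph could later be unpacked.

```python
from collections import defaultdict


def find_fans(edges, min_leaves=2):
    """Group degree-1 nodes by their only neighbor (the fan "head").

    Returns {head: [leaf, ...]} for heads with at least min_leaves leaves.
    The leaves are kept, so a glyph built from a fan can be unpacked again.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops in this sketch
        adjacency[u].add(v)
        adjacency[v].add(u)

    fans = defaultdict(list)
    for node, nbrs in adjacency.items():
        if len(nbrs) == 1:
            (head,) = nbrs
            fans[head].append(node)

    return {
        head: sorted(leaves)
        for head, leaves in fans.items()
        if len(leaves) >= min_leaves
    }


# Hypothetical sample graph: "hub" and "x" each carry a fan of leaves.
sample_edges = [
    ("hub", "a"), ("hub", "b"), ("hub", "c"),
    ("hub", "x"), ("x", "y"), ("x", "z"),
]
fans = find_fans(sample_edges)
```

Each entry of `fans` could be drawn as a single glyph on its head node, with the stored leaf list available for partial or full unpacking.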

Definitely a paper for a slow, deep read.

Not to mention research on the motifs in graph representations of your topic maps.

I first saw this in Visualization Papers at CHI 2013 by Enrico Bertini.

Guess: The Graph Exploration System

Sunday, May 12th, 2013

Guess: The Graph Exploration System

From the webpage:

GUESS is an exploratory data analysis and visualization tool for graphs and networks. The system contains a domain-specific embedded language called Gython (an extension of Python, or more specifically Jython) which supports the operators and syntactic sugar necessary for working on graph structures in an intuitive manner. An interactive interpreter binds the text that you type in the interpreter to the objects being visualized for more useful integration. GUESS also offers a visualization front end that supports the export of static images and dynamic movies.

Graph movies? Cool!

If you could catch a graph in an unguarded moment, what would you want to capture in a movie?

See also: Sourceforge – Guess.

Bond Percolation in GraphLab

Sunday, May 12th, 2013

Bond Percolation in GraphLab by Danny Bickson.

From the post:

I was asked by Prof. Scott Kirkpatrick to help and implement bond percolation in GraphLab. It is an oldie but goldie problem which is closely related to the connected components problem.

Here is an explanation about bond percolation from Wikipedia:

A representative question (and the source of the name) is as follows. Assume that some liquid is poured on top of some porous material. Will the liquid be able to make its way from hole to hole and reach the bottom? This physical question is modelled mathematically as a three-dimensional network of n × n × n vertices, usually called “sites”, in which the edge or “bonds” between each two neighbors may be open (allowing the liquid through) with probability p, or closed with probability 1 – p, and they are assumed to be independent. Therefore, for a given p, what is the probability that an open path exists from the top to the bottom? The behavior for large n is of primary interest. This problem, called now bond percolation, was introduced in the mathematics literature by Broadbent & Hammersley (1957), and has been studied intensively by mathematicians and physicists since.

Percolation Graph

In social networks, Danny notes this algorithm is used to find groups of friends.
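The link to connected components can be seen in a small simulation. The sketch below is plain Python, not Danny's GraphLab implementation, and it simplifies the setting to a two-dimensional n × n grid rather than the three-dimensional one in the Wikipedia description. All function names and parameters are assumptions for illustration.

```python
import random


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def percolates(n, p, rng):
    """True if open bonds join the top row to the bottom row of an n x n grid.

    Each bond between neighboring sites is open independently with
    probability p. Two virtual sites stand in for "top" and "bottom".
    """
    top, bottom = n * n, n * n + 1
    parent = list(range(n * n + 2))

    def union(a, b):
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b

    for row in range(n):
        for col in range(n):
            site = row * n + col
            if row == 0:
                union(site, top)
            if row == n - 1:
                union(site, bottom)
            # Bond to the right-hand neighbor.
            if col + 1 < n and rng.random() < p:
                union(site, site + 1)
            # Bond to the neighbor below.
            if row + 1 < n and rng.random() < p:
                union(site, site + n)

    return find(parent, top) == find(parent, bottom)


def estimate(n, p, trials, seed=0):
    """Fraction of random grids that percolate (Monte Carlo estimate)."""
    rng = random.Random(seed)
    hits = sum(percolates(n, p, rng) for _ in range(trials))
    return hits / trials
```

Calling `estimate(n, p, trials)` for a range of `p` values gives a rough picture of how the chance of a top-to-bottom path changes as more bonds open.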

Similar mazes appear in puzzle books.

My curiosity is about finding groups of subject identity properties.

A couple of other percolation resources of interest:

Percolation Exercises by Eric Mueller.

PercoVis (Mac), visualization of percolation by Daniel B. Larremore.

Medusa: Simplified Graph Processing on GPUs

Saturday, May 11th, 2013

Medusa: Simplified Graph Processing on GPUs by Jianlong Zhong, Bingsheng He.

Abstract:

Graphs are the de facto data structures for many applications, and efficient graph processing is a must for the application performance. Recently, the graphics processing unit (GPU) has been adopted to accelerate various graph processing algorithms such as BFS and shortest path. However, it is difficult to write correct and efficient GPU programs and even more difficult for graph processing due to the irregularities of graph structures. To simplify graph processing on GPUs, we propose a programming framework called Medusa which enables developers to leverage the capabilities of GPUs by writing sequential C/C++ code. Medusa offers a small set of user-defined APIs, and embraces a runtime system to automatically execute those APIs in parallel on the GPU. We develop a series of graph-centric optimizations based on the architecture features of GPU for efficiency. Additionally, Medusa is extended to execute on multiple GPUs within a machine. Our experiments show that (1) Medusa greatly simplifies implementation of GPGPU programs for graph processing, with much fewer lines of source code written by developers; (2) The optimization techniques significantly improve the performance of the runtime system, making its performance comparable with or better than the manually tuned GPU graph operations.

Just in case you are interested in high performance graph processing. ;-)

DELTACON: A Principled Massive-Graph Similarity Function

Saturday, April 20th, 2013

DELTACON: A Principled Massive-Graph Similarity Function by Danai Koutra, Joshua T. Vogelstein, Christos Faloutsos.

Abstract:

How much did a network change since yesterday? How different is the wiring between Bob’s brain (a left-handed male) and Alice’s brain (a right-handed female)? Graph similarity with known node correspondence, i.e. the detection of changes in the connectivity of graphs, arises in numerous settings. In this work, we formally state the axioms and desired properties of the graph similarity functions, and evaluate when state-of-the-art methods fail to detect crucial connectivity changes in graphs. We propose DeltaCon, a principled, intuitive, and scalable algorithm that assesses the similarity between two graphs on the same nodes (e.g. employees of a company, customers of a mobile carrier). Experiments on various synthetic and real graphs showcase the advantages of our method over existing similarity measures. Finally, we employ DeltaCon to real applications: (a) we classify people to groups of high and low creativity based on their brain connectivity graphs, and (b) do temporal anomaly detection in the who-emails-whom Enron graph.

How different is your current topic map from a prior version?

Could be an interesting marketing ploy to colorize the distinct portions of the graph.

Not to mention using “similarity” to mean the same subject for some purposes. Group subjects come to mind.

And for other types of analysis.

Fast Collaborative Graph Exploration

Thursday, April 18th, 2013

Fast Collaborative Graph Exploration by Dariusz Dereniowski, Yann Disser, Adrian Kosowski, Dominik Pajak, Przemyslaw Uznanski.

Abstract:

We study the following scenario of online graph exploration. A team of $k$ agents is initially located at a distinguished vertex $r$ of an undirected graph. At every time step, each agent can traverse an edge of the graph. All vertices have unique identifiers, and upon entering a vertex, an agent obtains the list of identifiers of all its neighbors. We ask how many time steps are required to complete exploration, i.e., to make sure that every vertex has been visited by some agent. We consider two communication models: one in which all agents have global knowledge of the state of the exploration, and one in which agents may only exchange information when simultaneously located at the same vertex. As our main result, we provide the first strategy which performs exploration of a graph with $n$ vertices at a distance of at most $D$ from $r$ in time $O(D)$, using a team of agents of polynomial size $k = D n^{1+ \epsilon} < n^{2+\epsilon}$, for any $\epsilon > 0$. Our strategy works in the local communication model, without knowledge of global parameters such as $n$ or $D$. We also obtain almost-tight bounds on the asymptotic relation between exploration time and team size, for large $k$. For any constant $c>1$, we show that in the global communication model, a team of $k = D n^c$ agents can always complete exploration in $D(1+ \frac{1}{c-1} +o(1))$ time steps, whereas at least $D(1+ \frac{1}{c} -o(1))$ steps are sometimes required. In the local communication model, $D(1+ \frac{2}{c-1} +o(1))$ steps always suffice to complete exploration, and at least $D(1+ \frac{2}{c} -o(1))$ steps are sometimes required. This shows a clear separation between the global and local communication models.

Heavy going but seems important for graph exploration performance.

See also the special case of exploring trees under related work.

Another possibility for exploring overlapping markup. Each agent has an independent view of one part of the markup trees.

A New Perspective on Vertex Connectivity

Wednesday, April 17th, 2013

A New Perspective on Vertex Connectivity by Keren Censor-Hillel, Mohsen Ghaffari, Fabian Kuhn.

Abstract:

Edge connectivity and vertex connectivity are two fundamental concepts in graph theory. Although by now there is a good understanding of the structure of graphs based on their edge connectivity, our knowledge in the case of vertex connectivity is much more limited. An essential tool in capturing edge connectivity are edge-disjoint spanning trees. The famous results of Tutte and Nash-Williams show that a graph with edge connectivity $\lambda$ contains $\lfloor \lambda/2 \rfloor$ edge-disjoint spanning trees. We present connected dominating set (CDS) partition and packing as tools that are analogous to edge-disjoint spanning trees and that help us to better grasp the structure of graphs based on their vertex connectivity. The objective of the CDS partition problem is to partition the nodes of a graph into as many connected dominating sets as possible. The CDS packing problem is the corresponding fractional relaxation, where CDSs are allowed to overlap as long as this is compensated by assigning appropriate weights. CDS partition and CDS packing can be viewed as the counterparts of the well-studied edge-disjoint spanning trees, focusing on vertex disjointedness rather than edge disjointness.

We constructively show that every $k$-vertex-connected graph with $n$ nodes has a CDS packing of size $\Omega(k/\log n)$ and a CDS partition of size $\Omega(k/\log^5 n)$. We prove that the $\Omega(k/\log n)$ CDS packing bound is existentially optimal.

Using CDS packing, we show that if vertices of a $k$-vertex-connected graph are independently sampled with probability $p$, then the graph induced by the sampled vertices has vertex connectivity $\tilde{\Omega}(kp^2)$. Moreover, using our $\Omega(k/\log n)$ CDS packing, we get a store-and-forward broadcast algorithm with optimal throughput in the networking model where in each round, each node can send one bounded-size message to all its neighbors.

Just in case you are interested in cutting edge (sorry) graph research.

Users can assure each other they are using the most popular graph software or they can be using the most powerful graph software.

I know which one I would choose.

How about you?

Deploying Graph Algorithms on GPUs: an Adaptive Solution

Sunday, April 7th, 2013

Deploying Graph Algorithms on GPUs: an Adaptive Solution by Da Li and Michela Becchi. (27th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013)

Abstract:

Thanks to their massive computational power and their SIMT computational model, Graphics Processing Units (GPUs) have been successfully used to accelerate a wide variety of regular applications (linear algebra, stencil computations, image processing and bioinformatics algorithms, among others). However, many established and emerging problems are based on irregular data structures, such as graphs. Examples can be drawn from different application domains: networking, social networking, machine learning, electrical circuit modeling, discrete event simulation, compilers, and computational sciences. It has been shown that irregular applications based on large graphs do exhibit runtime parallelism; moreover, the amount of available parallelism tends to increase with the size of the datasets. In this work, we explore an implementation space for deploying a variety of graph algorithms on GPUs. We show that the dynamic nature of the parallelism that can be extracted from graph algorithms makes it impossible to find an optimal solution. We propose a runtime system able to dynamically transition between different implementations with minimal overhead, and investigate heuristic decisions applicable across algorithms and datasets. Our evaluation is performed on two graph algorithms: breadth-first search and single-source shortest paths. We believe that our proposed mechanisms can be extended and applied to other graph algorithms that exhibit similar computational patterns.

A development that may surprise some graph software vendors: there are “no optimal solution[s] across graph problems and datasets” for graph algorithms on GPUs.

This paper points towards an adaptive technique that may prove to be “resilient to the irregularity and heterogeneity of real world graphs.”

I first saw this in a tweet by Stefano Bertolo.

2nd GraphLab workshop [Early Bird Registration]

Wednesday, April 3rd, 2013

The 2nd GraphLab workshop is coming up! by Danny Bickson.

Danny also says there is a 30% discount if you email him: danny.bickson@gmail.com. Don’t know when that runs out but worth a try.

From the post:

Following the great success of the first GraphLab workshop, we have started to organize this year event, in July at the bay area. To remind you, last year we wanted to organize a 15-20 people event, which eventually got a participation of 300+ researchers from 100+ companies.

The main aim of this year workshop is to bring together top researchers from academia, as well as top data scientists from industry with the special focus of large scale machine learning on sparse graphs.

The event will take place Monday July 1st, 2013 in San Francisco. Early bird registration is now open!

Preliminary agenda.

Definitely one to have on your calendar!

Embedding Pubmed, Graphviz and a remote image in #LaTeX

Sunday, March 31st, 2013

Embedding Pubmed, Graphviz and a remote image in #LaTeX by Pierre Lindenbaum.

Pierre demonstrates how to use:

\newcommand{name}[num]{definition}

to load a remote picture, embed a Graphviz result and retrieve a PubMed record for inclusion in a LaTeX document.

From the LaTeX Macro page Pierre cites:

The num argument in square brackets is optional and specifies the number of arguments the new command takes (up to 9 are possible). If missing it defaults to 0, i.e. no argument allowed.

I caught myself wondering about that argument.

The graphviz command looks particularly interesting for topic map illustrations.

Permission Resolution with Neo4j – Part 1

Saturday, March 30th, 2013

Permission Resolution with Neo4j – Part 1 by Max De Marzi.

From the post:

People produce a lot of content. Messages, text files, spreadsheets, presentations, reports, financials, etc, the list goes on. Usually organizations want to have a repository of all this content centralized somewhere (just in case a laptop breaks, gets lost or stolen for example). This leads to some kind of grouping and permission structure. You don’t want employees seeing each other’s HR records, unless they work for HR, same for Payroll, or unreleased quarterly numbers, etc. As this data grows it no longer becomes easy to simply navigate and a search engine is required to make sense of it all.

But what if your search engine returns 1000 results for a query and the user doing the search is supposed to only have access to see 4 things? How do you handle this? Check the user permissions on each file realtime? Slow. Pre-calculate all document permissions for a user on login? Slow and what if new documents are created or permissions change between logins? Does the system scale at 1M documents, 10M documents, 100M documents?

Search is one example of a need to restrict viewing results but browsing raises the same issues. Or display of information alongside other information.

As I recall, Netware 4.1 (other versions as well no doubt) had the capability for a sysadmin to create sub-sysadmins, say for accounting or HR, that could hide information from the sysadmin. That was prior to search being commonly available.

What other schemes for securing search results are out there?

Titan 0.3.0 Released

Friday, March 29th, 2013

Titan 0.3.0 Released

From the webpage:

Titan 0.3.0 has been released and is ready for download. This release provides a complete performance-driven redesign of many core components. Furthermore, the primary outward facing feature is advanced indexing. The new indexing features are itemized below:

  • Geo: Search for elements using shape primitives within a 2D plane.
  • Full-text: Search elements for matching string and text properties.
  • Numeric range: Search for elements with numeric property values using intervals.
  • Edge: Edges can be indexed as well as vertices.

The Titan tutorial demonstrates the new capabilities.

This should keep you busy over the weekend!

MapEquation.org

Tuesday, March 26th, 2013

MapEquation.org by Daniel Edler and Martin Rosvall.

From the “about” page:

What do we do?

We develop mathematics, algorithms and software to simplify and highlight important structures in complex systems.

What are our goals?

To navigate and understand big data like we navigate and understand the real world by maps.

Suggest you start with the Apps.

Very impressive and has data available for loading.

You can also upload your own data.

Spend some time with Code and Publications as well.

I first saw this in a tweet by Chris@SocialTexture.

Under the Hood: Building out the infrastructure for Graph Search

Monday, March 25th, 2013

Under the Hood: Building out the infrastructure for Graph Search by Sriram Sankar, Soren Lassen, and Mike Curtiss.

From the post:

In the early days, Facebook was as much about meeting new people as keeping in touch with people you already knew at your college. Over time, Facebook became more about maintaining connections. Graph Search takes us back to our roots and helps people make new connections–this time with people, places, and interests.

With this history comes several old search systems that we had to unify in order to build Graph Search. At first, the old search on Facebook (called PPS) was keyword based–the searcher entered keywords and the search engine produced a results page that was personalized and could be filtered to focus on specific kinds of entities such as people, pages, places, groups, etc.

Entertaining overview of the development of the graph solution for Facebook.

Moreover, reassurance if you are worried about “scaling” for your graph application. ;-)

I first saw this at: This Week’s Links by Trevor Landau.

Graph Processing DevRoom 2013 edition

Saturday, March 23rd, 2013

Graph Processing DevRoom 2013 edition

Twelve slide decks from the Graph Processing workshop within FOSDEM.

Enjoy!

BitcoinVisualizer

Tuesday, March 19th, 2013

BitcoinVisualizer by John Russell.

From the webpage:

Block Viewer visualizes the Bitcoin block chain by building an ownership network on top of the underlying transaction network and presents a web-enabled user interface to display the visualization results.

Great mapping exercise!

Imagine what could be done tracking all banking transfers.

Before you object that banking transfer monitoring would require a search warrant, remember that Richard Nixon could not be prosecuted for treason because the evidence was the result of an illegal wiretap.

Take this mapping as a reminder to use cash whenever possible.

Demo: http://www.blockviewer.com/#30203900

I first saw this in a tweet by Max De Marzi.

Permission Resolution With Neo4j – Part 1

Monday, March 18th, 2013

Permission Resolution With Neo4j – Part 1 by Max De Marzi.

From the post:

People produce a lot of content. Messages, text files, spreadsheets, presentations, reports, financials, etc, the list goes on. Usually organizations want to have a repository of all this content centralized somewhere (just in case a laptop breaks, gets lost or stolen for example). This leads to some kind of grouping and permission structure. You don’t want employees seeing each other’s HR records, unless they work for HR, same for Payroll, or unreleased quarterly numbers, etc. As this data grows it no longer becomes easy to simply navigate and a search engine is required to make sense of it all.

But what if your search engine returns 1000 results for a query and the user doing the search is supposed to only have access to see 4 things? How do you handle this? Check the user permissions on each file realtime? Slow. Pre-calculate all document permissions for a user on login? Slow and what if new documents are created or permissions change between logins? Does the system scale at 1M documents, 10M documents, 100M documents?

Max addresses the scaling issue by checking only the results from a search. So to that extent, the size of the document store becomes irrelevant.

At least if you have a smallish number of results from the search.
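To make the idea concrete, here is a minimal Python sketch of checking permissions only for the documents a search returns. It is not Max's Neo4j code. The three dictionaries, which stand in for group membership, folder hierarchy, and read grants, along with all names and sample data, are assumptions for illustration, and the folder hierarchy is assumed to be a tree with no cycles.

```python
def can_read(user, doc, groups_of, parent_of, readers_of):
    """Walk from the document up its folder chain looking for a read grant.

    groups_of:  user -> set of groups the user belongs to
    parent_of:  document or folder -> containing folder (absent at the root)
    readers_of: document or folder -> set of users/groups allowed to read
    """
    principals = {user} | groups_of.get(user, set())
    node = doc
    while node is not None:
        if principals & readers_of.get(node, set()):
            return True
        node = parent_of.get(node)
    return False


def filter_results(user, results, groups_of, parent_of, readers_of):
    """Keep only the search hits the user may see, preserving rank order."""
    return [
        doc for doc in results
        if can_read(user, doc, groups_of, parent_of, readers_of)
    ]


# Hypothetical sample data.
groups_of = {"alice": {"hr"}}
parent_of = {"salary.xls": "hr-folder", "hr-folder": "root", "plan.doc": "root"}
readers_of = {"hr-folder": {"hr"}, "plan.doc": {"bob"}}

hits = ["salary.xls", "plan.doc"]
visible_to_alice = filter_results("alice", hits, groups_of, parent_of, readers_of)
visible_to_bob = filter_results("bob", hits, groups_of, parent_of, readers_of)
```

The work done is proportional to the number of hits times the folder depth, which is why the total number of stored documents drops out of the cost.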

I haven’t seen part 2 but another scale tactic would be to limit access to indexes by permissions. Segregating human resources, accounting, etc.

Looking forward to where Max takes this one.

Linkurious [free beta]

Sunday, March 17th, 2013

Linkurious

From the homepage:

CONNECT

Our Open Source backend indexes your graph so you can connect with it on Linkurious and get started in minutes. When it is done, launch the web application of Linkurious.

SEARCH

Typing any keyword in the search bar brings up all the related data in one step. We provide a console for advanced queries so you can be as broad or as specific as you want.

EXPLORE

By focusing on the items related to your search, visualizing and exploring your graph has never been easier. Dig further in any direction using the connected nodes and make sense of your data.

A couple of other resources:

How it works, and

Graph Visualization options and latest developments

will be of interest.

I haven’t signed up, yet, but the slides make a good point that what graph visualization you need depends, unsurprisingly, on your use case.

I first saw this in a tweet by David W. Allen.

JSNetworkX

Wednesday, March 13th, 2013

JSNetworkX

A port of the NetworkX graph library to JavaScript

JSNetworkX is a port of the popular Python graph library NetworkX. Let’s describe it with their words:

NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and function of complex networks.

With NetworkX you can load and store networks in standard and nonstandard data formats, generate many types of random and classic networks, analyze network structure, build network models, design new network algorithms, draw networks, and much more.

Github.

Wiki.

Looks like an easy way to include graph representations of topic maps in a web page.

I suspect you will be seeing more of this in the not too distant future.

I first saw this in a tweet by Christophe Viau.

Inferring Social Rank in…

Wednesday, March 13th, 2013

Inferring Social Rank in an Old Assyrian Trade Network by David Bamman, Adam Anderson, Noah A. Smith.

Abstract:

We present work in jointly inferring the unique individuals as well as their social rank within a collection of letters from an Old Assyrian trade colony in Kültepe, Turkey, settled by merchants from the ancient city of Assur for approximately 200 years between 1950-1750 BCE, the height of the Middle Bronze Age. Using a probabilistic latent-variable model, we leverage pairwise social differences between names in cuneiform tablets to infer a single underlying social order that best explains the data we observe. Evaluating our output with published judgments by domain experts suggests that our method may be used for building informed hypotheses that are driven by data, and that may offer promising avenues for directed research by Assyriologists.

An example of how digitization of ancient texts enables research other than text searching.

Inferring identity and social rank may be instructive for creation of topic maps from both ancient and modern data sources.

I first saw this in a tweet by Stefano Bertolo.

GraphLab: A Distributed Abstraction…

Monday, March 11th, 2013

GraphLab: A Distributed Abstraction for Machine Learning in the Cloud by Carlos Guestrin. (video)

Take away line: “How does a vertex think?”

Deeply impressive presentation and performance numbers!

Resources:

GraphLab 2.1: http://graphlab.org

GraphChi 0.1: http://graphchi.org

Slides from the talk.

This needs to be very high on your graph reading/following list.

I first saw this at: SFBayACM talk: GraphLab framework for Machine Learning in the Cloud.

NetflixGraph

Saturday, March 9th, 2013

NetflixGraph: Compact in-memory representation of directed graph data by Drew Koszewnik.

From the post:

Your memory footprint just shrank

NetflixGraph is a compact in-memory data structure used to represent directed graph data. You can use NetflixGraph to vastly reduce the size of your application’s memory footprint, potentially by an order of magnitude or more. If your application is I/O bound, you may be able to remove that bottleneck by holding your entire dataset in RAM. You’ll likely be very surprised by how little memory is actually required to represent your data.

NetflixGraph provides an API to translate your data into a graph format, compress that data in memory, then serialize the compressed in-memory representation of the data so that it may be easily transported across your infrastructure.
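
To get a feel for why a compact layout saves memory, here is a small Python sketch of one common technique: packing a directed graph into two flat integer arrays (a compressed sparse row layout) instead of one object per node and edge. NetflixGraph itself is a Java library and this does not show its API or its compression scheme; `build_compact_graph` and `out_neighbors` are names I made up, and nodes are assumed to be numbered 0 to n-1.

```python
from array import array


def build_compact_graph(num_nodes, edges):
    """Pack a directed edge list into two flat integer arrays (CSR layout).

    targets[offsets[v]:offsets[v + 1]] holds the out-neighbors of node v.
    """
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    # offsets[v] is where node v's neighbor slice starts in `targets`.
    offsets = array("I", [0] * (num_nodes + 1))
    for v in range(num_nodes):
        offsets[v + 1] = offsets[v] + out_degree[v]

    targets = array("I", [0] * len(edges))
    next_slot = list(offsets[:num_nodes])  # next free position per source node
    for src, dst in edges:
        targets[next_slot[src]] = dst
        next_slot[src] += 1
    return offsets, targets


def out_neighbors(offsets, targets, v):
    """Return the out-neighbors of node v as a list."""
    return list(targets[offsets[v]:offsets[v + 1]])


# Hypothetical edge list, for illustration only.
edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
offsets, targets = build_compact_graph(4, edges)
print(out_neighbors(offsets, targets, 0))
```

Because both arrays are plain buffers of integers, they can also be written out as bytes (for example with `targets.tobytes()`) and shipped elsewhere, which is the same general idea as serializing the compressed representation.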

Definitely a high priority for the coming weekend!

Graph Partitioning and Expanders (April 2013)

Saturday, March 9th, 2013

Graph Partitioning and Expanders by Professor Luca Trevisan.

From the description:

In this research-oriented graduate course, we will study algorithms for graph partitioning and clustering, constructions of expander graphs, and analysis of random walks. These are three topics that build on the same mathematical background and that have several important connections: for example it is possible to find graph clusters via random walks, and it is possible to use the linear programming approach to graph partitioning as a way to study random walks.

We will study spectral graph theory, which explains how certain combinatorial properties of graphs are related to the eigenvalues and eigenvectors of the adjacency matrix, and we will use it to describe and analyze spectral algorithms for graph partitioning and clustering. Spectral graph theory will recur as an important tool in the rest of the course. We will also discuss other approaches to graph partitioning via linear programming and semidefinite programming. Then we will study constructions of expander graphs, which are graphs with very strong pseudorandomness properties, which are useful in many applications, including in cryptography, in complexity theory, in algorithms and data structures, and in coding theory. Finally, we will study the mixing time of random walks, a problem that comes up in several applications, including the analysis of the convergence time of certain randomized algorithms, such as the Metropolis algorithm.
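
For a taste of what a spectral partitioning algorithm looks like, here is a minimal Python sketch that splits a small graph by the sign of an approximate Fiedler vector. The choices are mine, not the course's: it uses the unnormalized Laplacian L = D - A rather than the adjacency matrix directly, plain power iteration instead of a real eigensolver, and a made-up test graph of two triangles joined by one edge.

```python
def fiedler_partition(n, edges, iterations=300):
    """Split an undirected graph in two using the sign of the Fiedler vector.

    The Fiedler vector is the eigenvector of the Laplacian L = D - A for the
    second-smallest eigenvalue. It is approximated here by power iteration on
    M = c*I - L, projecting out the all-ones vector at every step.
    """
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    degree = [len(adj) for adj in neighbors]
    c = 2 * max(degree)  # every Laplacian eigenvalue is at most 2 * max degree

    x = [float(i) for i in range(n)]  # arbitrary non-constant starting vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]  # remove the component along all-ones
        # y = (c*I - L) x, where (L x)[v] = degree[v]*x[v] - sum of neighbor values
        y = [
            (c - degree[v]) * x[v] + sum(x[u] for u in neighbors[v])
            for v in range(n)
        ]
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]

    side_a = [v for v in range(n) if x[v] < 0]
    side_b = [v for v in range(n) if x[v] >= 0]
    return side_a, side_b


# Two triangles (0,1,2) and (3,4,5) joined by the single edge (2,3).
barbell = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
side_a, side_b = fiedler_partition(6, barbell)
print(side_a, side_b)
```

On this graph the cut falls on the bridging edge, so each triangle ends up on its own side.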

Workload

about 8 hours per week

Prerequisites

linear algebra, discrete probability, and algorithms

The Instructor

Luca Trevisan is a professor of computer science at Stanford University. Before joining Stanford in 2010, Luca taught at Columbia University and at the University of California, Berkeley.

Luca’s research is in theoretical computer science, and he has worked on average-case complexity theory, pseudorandomness and derandomization, hardness of approximation, probabilistically checkable proofs, and approximation algorithms. In the past three years he has been working on spectral graph theory and its applications to graph algorithms.

Luca received the STOC’97 Danny Lewin award, the 2000 Oberwolfach Prize, and the 2000 Sloan Fellowship. He was an invited speaker at the 2006 International Congress of Mathematicians in Madrid.

Not for the faint of heart!

But on the other hand, if you want to be on the cutting edge of graph development….

neo4j: Make properties relationships [Associations As First Class Citizens?]

Friday, March 8th, 2013

neo4j: Make properties relationships by Mark Needham.

From the post:

I spent some of the weekend working my way through Jim, Ian & Emil’s book ‘Graph Databases’ and one of the things that they emphasise is that graphs allow us to make relationships first class citizens in our model.

Looking back on a couple of the graphs that I modelled last year I realise that I didn’t quite get this and although the graphs I modelled had some relationships a lot of the time I was defining things as properties on nodes.

While it’s fine to do this I think we lose some of the power of a graph and it’s not necessarily obvious what we’ve lost until we model a property as a relationship and see what possibilities open up.

For example in my football graph I wanted to record the date of matches and initially stored this as a property on the match before realising that modelling it as a relationship might open up some interesting queries.
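
To make the difference concrete, here is a small Python sketch (plain dictionaries, not Neo4j or Cypher) of the same question asked both ways: which other matches were played on the same day? The match data, the relationship name `ON_DAY`, and the function names are all hypothetical and not taken from Mark's post.

```python
from collections import defaultdict

# Option 1: the date is a property stored on each match node (made-up data).
matches = {
    "match1": {"home": "Team A", "away": "Team B", "date": "2013-03-02"},
    "match2": {"home": "Team C", "away": "Team D", "date": "2013-03-02"},
    "match3": {"home": "Team E", "away": "Team F", "date": "2013-03-03"},
}


def same_day_by_property(matches, match_id):
    """Scan every match and compare date properties."""
    date = matches[match_id]["date"]
    return sorted(
        m for m, props in matches.items()
        if props["date"] == date and m != match_id
    )


# Option 2: each date is its own node; matches point to it with ON_DAY.
def build_date_relationships(matches):
    on_day = {}                  # match -> date node (the ON_DAY relationship)
    played = defaultdict(list)   # date node -> matches (same relationship, reversed)
    for m, props in matches.items():
        on_day[m] = props["date"]
        played[props["date"]].append(m)
    return on_day, played


def same_day_by_relationship(on_day, played, match_id):
    """Hop match -> date node -> other matches, without scanning all matches."""
    return sorted(m for m in played[on_day[match_id]] if m != match_id)


on_day, played = build_date_relationships(matches)
print(same_day_by_property(matches, "match1"))
print(same_day_by_relationship(on_day, played, "match1"))
```

Both approaches give the same answer, but in the second the date is something you can traverse from, which is where the extra queries come from.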

Reading Mark’s post illustrates the power of using associations to model “properties” in topic maps.

In Neo4j, relationships are first class citizens.

Unfortunately, we can’t say the same for associations in topic maps.

You may recall that associations in a topic map are restricted in the information they can carry.

If you want to add a name to an association, for example, you have to reify the association with a topic. Which means you have the association and a topic for the association, representing the same subject.

Not to mention a lot of machinery overhead for something fairly simple.
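
A rough Python sketch of that overhead, using a deliberately simplified model of my own (these classes are not the TMDM, and the topics and names are invented): the association has no slot for a name, so a second object has to be created just to carry one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Topic:
    identifier: str
    names: List[str] = field(default_factory=list)


@dataclass
class Association:
    type: Topic
    roles: Dict[str, Topic]          # role type identifier -> role player
    reifier: Optional[Topic] = None  # an association cannot hold a name itself


# Hypothetical topics, for illustration only.
person = Topic("alice", ["Alice"])
company = Topic("acme", ["Acme Corp"])
employment = Topic("employment", ["Employment"])

works_for = Association(employment, {"employee": person, "employer": company})

# To name this particular association, a second construct is needed:
# a topic that reifies it and carries the name on its behalf.
works_for.reifier = Topic("alice-at-acme", ["Alice's job at Acme"])

print(works_for.reifier.names)
```

Two objects, one subject.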

I am aware that the TMDM and XTM were fashioned to follow the original version of ISO 13250, which is the origin of reification in topic maps.

However, simply because all buggies had whips at one point is no reason to design cars with whip holders.

The time has come to revisit reification and, in my view, revise both the TMDM and XTM to remove it.

And to make associations and occurrences first class citizens in both the TMDM and XTM.

Comments/suggestions?

Social Graphs and Applied NoSQL Solutions [Merging Graphs?]

Wednesday, March 6th, 2013

Social Graphs and Applied NoSQL Solutions by John L. Myers.

From the post:

Recent postings have been more about the “theory” behind the wonderful world of NoSQL and less about how to implement a solution with a NoSQL platform. Well it’s about time that I changed that. This posting will be about how the graph structure and graph databases in particular can be an excellent “applied solution” of NoSQL technologies.

When Facebook released its Graph Search, the general public finally got a look at what the “backend” of Facebook looked like or its possible uses … For many the consumer to consumer (c2c) version of Facebook’s long available business-to-business and business-to-consumer offerings was a bit more of the “creepy” vs. the “cool” of the social media content. However, I think it will have the impact of opening people’s eyes on how their content can and probably should be used for search and other analytical purposes.

With graph structures, unlike tabular structures such as row and column data schemas, you look at the relationships between the nodes (i.e. customers, products, locations, etc.) as opposed to looking at the attributes of a particular object. For someone like me, who has long advocated that we should look at how people, places and things interact with each other versus how their “demographics” (i.e. size, shape, income, etc.) make us “guess” how they interact with each other. In my opinion, demographics and now firmographics have been used as “substitutes” for how people and organizations behave. While this can be effective in the aggregate, as we move toward a “bucket of one” treatment model for customers or clients, for instance, we need to move away from using demographics/firmographics as a primary analysis tool.

Let’s say that graph databases become as popular as SQL databases. You can’t scratch an enterprise without finding a graph database.

And they are all as different from each other as the typical SQL database is today.

How do you go about merging graph databases?

Can you merge graph databases and still retain the characteristics of each one separately?

If graph databases become as popular as they should, those are going to be real questions in the not too distant future.
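
One possible shape of an answer, sketched in Python under assumptions of my own: each graph is just a list of (subject, relation, object) edges, someone supplies a `same_as` mapping saying which identifiers denote the same subject, and every merged edge remembers which sources asserted it. The function names, source names, and identifiers are all hypothetical.

```python
def merge_graphs(named_graphs, same_as=None):
    """Merge several edge lists into one graph, remembering each edge's sources.

    named_graphs: {source_name: [(node, relation, node), ...]}
    same_as: optional {local_node_id: canonical_id} mapping that says which
             identifiers in different sources denote the same subject.
    """
    same_as = same_as or {}
    merged = {}  # (subject, relation, object) -> set of source names
    for source, edges in named_graphs.items():
        for subj, rel, obj in edges:
            key = (same_as.get(subj, subj), rel, same_as.get(obj, obj))
            merged.setdefault(key, set()).add(source)
    return merged


def project(merged, source):
    """Recover the (canonicalized) view of one original graph from the merge."""
    return sorted(edge for edge, sources in merged.items() if source in sources)


# Two made-up graphs that identify the same customer differently.
crm = [("cust:42", "bought", "sku:7")]
support = [("user:jdoe", "reported", "ticket:9"), ("user:jdoe", "bought", "sku:7")]
same_as = {"user:jdoe": "cust:42"}

merged = merge_graphs({"crm": crm, "support": support}, same_as)
print(len(merged), project(merged, "crm"))
```

The hard part, of course, is where the `same_as` mapping comes from. The sketch only shows that once subjects are identified, merging and keeping per-source views are not in conflict.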

Rebuilding Gephi’s core for the 0.9 version

Tuesday, March 5th, 2013

Rebuilding Gephi’s core for the 0.9 version by Mathieu Bastian.

From the post:

This is the first article about the future Gephi 0.9 version. Our objective is to prepare the ground for a future 1.0 release and focus on solving some of the most difficult problems. It all starts with the core of Gephi and we’re giving today a preview of the upcoming changes in that area. In fact, we’re rewriting the core modules from scratch to improve performance, stability and add new features. The core modules represent and store the graph and attributes in memory so it’s available to the rest of the application. Rewriting Gephi’s core is like replacing the engine of a truck and involves adapting a lot of interconnected pieces. Gephi’s current graph structure engine was designed in 2009 and didn’t change much in multiple releases. Although it’s working, it doesn’t have the level of quality we want for Gephi 1.0 and needs to be overhauled. The aim is to complete the new implementation and integrate it in the 0.9 version.

Deeply interesting work!

To follow, consider subscribing to: gephi-dev — List for core developers.

“Do Bees” / “Don’t Bees” and Neo4j

Tuesday, March 5th, 2013

According to Michael Hunger in a Neo4j Google Groups message, the Neo4j team is drowning in its own success!

Now there’s a problem to have!

“Do Bees” for Neo4j will:

…ask questions on Stack Overflow that are related to:

Please tag your questions with “neo4j” and “cypher”, “gremlin” or “spring data neo4j” accordingly. See the current list:

http://stackoverflow.com/questions/tagged/neo4j

Currently questions on SO are answered quickly by a group of very active people which we hope you will join. We try to chime in as often as possible (especially with unanswered questions).

So PLEASE post your questions there on Stack Overflow, we will start asking individuals to move their questions to that platform and if they don’t manage it, move them ourselves.

We will also monitor this badge: http://stackoverflow.com/badges/1785/neo4j and award cool stuff for people that make it there.

This google group shall return to its initial goals of having broader discussions about graph topics, modeling, architectures, roadmap, announcements, cypher evolution, open source etc. So we would love everyone who has questions or problems in these areas to reach out and start a conversation.

Hope for your understanding to make more breathing room in this group and more interesting discussions in the future while keeping an interactive FAQ around Neo4j going on SO with quick feedback loops and turnaround times.

The Neo4j community will be healthier if we are all “Do Bees” so I won’t cover the alternative.

If you don’t know “Do Bees” / “Don’t Bees,” see: Romper Room.

See you at Stack Overflow!