Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 12, 2012

Creating a Neo4j graph of Wikipedia links

Filed under: Graphipedia,Graphs,Neo4j — Patrick Durusau @ 5:10 pm

Creating a Neo4j graph of Wikipedia links by Mirko Nasato.

From the post:

I started looking at Neo4j and thought: I need to write a simple but non-trivial application to really try it out. Something with lots of nodes and relationships. I need to find a large dataset that I can import into a graph database.

To my delight, I found that Wikipedia provides database dumps for download. That serves my purpose beautifully: I can represent each wiki page as a node, and the links between pages as relationships.

So I wrote some code to parse the Wikipedia XML dump and extract links from each article body, and some other code to import everything into a Neo4j store. It’s now on github, as project Graphipedia.

I ended up with a graph database containing 9,006,704 nodes (pages, titles only) and 82,537,500 relationships (links). The whole database takes up 3.8G on disc, of which 650M is a Lucene index (on page titles).

With this wealth of well-connected data at my disposal, I can now do some interesting stuff. For one, I can simply open the database with the Neoclipse tool, find a page by title and visualise all links to/from that page. Here’s an example with the Neo4j page at its centre.

This is simply awesome!

What experiments would you suggest?

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress