Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 12, 2012

Creating a Neo4j graph of Wikipedia links

Filed under: Graphipedia,Graphs,Neo4j — Patrick Durusau @ 5:10 pm

Creating a Neo4j graph of Wikipedia links by Mirko Nasato.

From the post:

I started looking at Neo4j and thought: I need to write a simple but non-trivial application to really try it out. Something with lots of nodes and relationships. I need to find a large dataset that I can import into a graph database.

To my delight, I found that Wikipedia provides database dumps for download. That serves my purpose beautifully: I can represent each wiki page as a node, and the links between pages as relationships.

So I wrote some code to parse the Wikipedia XML dump and extract links from each article body, and some other code to import everything into a Neo4j store. It’s now on github, as project Graphipedia.

I ended up with a graph database containing 9,006,704 nodes (pages, titles only) and 82,537,500 relationships (links). The whole database takes up 3.8G on disc, of which 650M is a Lucene index (on page titles).

With this wealth of well-connected data at my disposal, I can now do some interesting stuff. For one, I can simply open the database with the Neoclipse tool, find a page by title and visualise all links to/from that page. Here’s an example with the Neo4j page at its centre.
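The parsing step described in the post can be sketched in a few lines. This is not Graphipedia's code and it does not touch Neo4j. It is a minimal in-memory illustration of "pages as nodes, links as relationships" and of looking up all links to and from a page by title. The names extract_links, build_graph and SAMPLE_PAGES are hypothetical, the sample wikitext is made up, and skipping namespaced targets such as File: is a simplifying assumption of mine, not something the post states.

```python
import re
from collections import defaultdict

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# only the target page title is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext):
    """Return distinct link targets from one article body, in order of first appearance."""
    targets = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Assumption: ignore namespaced targets (File:, Category:, ...).
        if target and ":" not in target and target not in targets:
            targets.append(target)
    return targets


def build_graph(pages):
    """pages maps title -> wikitext; returns (outgoing, incoming) adjacency maps."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for title, text in pages.items():
        for target in extract_links(text):
            outgoing[title].add(target)   # relationship: title -> target
            incoming[target].add(title)   # reverse lookup: who links here
    return outgoing, incoming


# Hypothetical sample data standing in for the Wikipedia XML dump.
SAMPLE_PAGES = {
    "Neo4j": "A [[Graph database]] written in "
             "[[Java (programming language)|Java]]. [[File:Logo.png]]",
    "Graph database": "Examples include [[Neo4j]]. "
                      "See [[Graph theory#History|history]].",
}

outgoing, incoming = build_graph(SAMPLE_PAGES)
print("links from Neo4j:", sorted(outgoing["Neo4j"]))
print("links to Neo4j:", sorted(incoming["Neo4j"]))
```

At the scale quoted above (roughly 9 million nodes and 82 million relationships) plain dictionaries like these would be a poor fit, which is the reason for loading the links into a graph store with a title index instead.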

This is simply awesome!

What experiments would you suggest?
