Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 8, 2014

Data Loading Neo4j Graphs From SQL Sources [+ 98 others]

Filed under: Mule,Neo4j — Patrick Durusau @ 3:48 pm

Data Loading Neo4j Graphs From SQL Sources by Richard Donovan.

From the post:

Neo4j’s powerful graph database can be used for analytics, recommendation engines, social graphs and many more applications.

In the following example we demonstrate in a few steps how you can load Neo4j from your legacy relational SQL source.

You can download Mule Studio from; http://www.mulesoft.org/download-mule-esb-community-edition

A short post on using Mule to load SQL data into Neo4j.

More importantly, Mule has ninety-eight (98) other connectors (99 including Neo4j), opening up a world of data sources for Neo4j.

See the Mule documentation for details.
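To make the SQL-to-graph idea concrete without Mule, here is a minimal sketch in plain Python. It is not the method from Richard's post: the `person` and `knows` tables, the `Person` label, and the `KNOWS` relationship type are all hypothetical names chosen for illustration. Rows become nodes and a join table becomes relationships, emitted as Cypher text.

```python
import sqlite3


def cypher_literal(value):
    """Render a Python value as a Cypher literal (strings are quoted and escaped)."""
    if isinstance(value, str):
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return str(value)


def rows_to_cypher(conn):
    """Turn the hypothetical person/knows tables into Cypher CREATE clauses."""
    statements = []
    for pid, name in conn.execute("SELECT id, name FROM person ORDER BY id"):
        statements.append(
            "CREATE (p%d:Person {id: %d, name: %s})" % (pid, pid, cypher_literal(name))
        )
    # Each row of the join table becomes one relationship between two nodes.
    for src, dst in conn.execute("SELECT src, dst FROM knows ORDER BY src, dst"):
        statements.append("CREATE (p%d)-[:KNOWS]->(p%d)" % (src, dst))
    return statements


# A throwaway in-memory database standing in for the "legacy" SQL source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE knows (src INTEGER, dst INTEGER);
    INSERT INTO person VALUES (1, 'Ann');
    INSERT INTO person VALUES (2, 'Bob O''Neil');
    INSERT INTO knows VALUES (1, 2);
""")
statements = rows_to_cypher(conn)
print("\n".join(statements))
```

The clauses share identifiers (`p1`, `p2`), so they are meant to be sent as one Cypher statement rather than one at a time.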

January 4, 2014

Generate Cypher Queries with R

Filed under: Cypher,Graphs,Neo4j,R — Patrick Durusau @ 5:00 pm

Generate Cypher Queries with R by Nicole White.

From the post:

Lately I have been using R to generate Cypher queries and dump them line-by-line to a text file. I accomplish this through the sink(), cat(), and paste() functions. The sink function makes it so any console output is sent to the given text file; the cat function prints things to the console; and the paste function concatenates strings.

For my movie recommendations graph gist, I generated my Cypher queries by looping through the CSV file containing the movie ratings and concatenating strings as appropriate. The CSV was first loaded into a data frame called data, of which a snippet is shown below:

You do remember that the Neo4j GraphGist December Challenge ends January 31st, 2014? Yes?

Auto-generation will help avoid careless keystroke errors.

And serve to scale up from gist size to something more challenging.
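Nicole works in R with sink(), cat(), and paste(). As a rough Python analogue, and only a sketch, the same loop-and-concatenate approach might look like the following. The column names (`user`, `title`, `stars`), the labels, and the `RATED` relationship type are assumptions, not taken from her gist.

```python
import csv
import io


def escape(text):
    """Escape backslashes and single quotes for a single-quoted Cypher string."""
    return text.replace("\\", "\\\\").replace("'", "\\'")


def rating_to_cypher(row):
    """Build one Cypher statement for a single rating row (hypothetical columns)."""
    return (
        "MERGE (u:User {name: '%s'}) "
        "MERGE (m:Movie {title: '%s'}) "
        "CREATE (u)-[:RATED {stars: %d}]->(m);"
        % (escape(row["user"]), escape(row["title"]), int(row["stars"]))
    )


def write_queries(csv_file, out_file):
    """Read ratings from csv_file; write one Cypher statement per line to out_file."""
    count = 0
    for row in csv.DictReader(csv_file):
        out_file.write(rating_to_cypher(row) + "\n")
        count += 1
    return count


# In-memory stand-ins for the CSV input and the text file of queries.
sample = io.StringIO("user,title,stars\nNicole,Alien,5\nRik,Heat,4\n")
output = io.StringIO()
written = write_queries(sample, output)
lines = output.getvalue().splitlines()
print(output.getvalue(), end="")
```

Swapping the StringIO objects for real files opened with `open()` gives the "dump them line-by-line to a text file" behavior.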

December 19, 2013

Fascinating food networks, in neo4j

Filed under: Food,Graphs,Neo4j — Patrick Durusau @ 7:38 pm

Fascinating food networks, in neo4j by Rik Van Bruggen.

From the post:

When you’re passionate about graphs like I am, you start to see them everywhere. And as we are getting closer to the food-heavy season of the year, it’s perhaps no coincidence that this graph I will be introducing in this blogpost – is about food.

A couple of weeks ago, when I woke up early (!) Sunday morning to get “pistolets” and croissants for my family from our local bakery, I immediately took notice when I saw a graph behind the bakery counter. It was a “foodpairing” graph, sponsored by the people of Puratos – a wholesale provider of bakery products, grains, etc. So I get home and start googling, and before you know it I find some terribly interesting research by Yong-Yeol (YY) Ahn, featured in a Wired article, and in Scientific American, and in Nature. This researcher had done some fascinating work in understanding all 57k recipes from Epicurious, Allrecipes and Menupan, their composing ingredients and ingredient categories, their origin and – perhaps most fascinating of all – their chemical compounds.

Rik walks you through acquiring some of these datasets, cleaning them up and then loading the datasets into Neo4j.

My only suggestion is that before you start browsing the dataset that you have cookies and milk within easy reach. 😉

indexing in Neo4j – an overview

Filed under: Graphs,Indexing,Neo4j — Patrick Durusau @ 7:13 pm

indexing in Neo4j – an overview by Stefan Armbruster.

From the post:

Neo4j as a graph database features indexing as the preferred way to find start points for graph traversals. Over the years multiple different indexing approaches have been added. The goal of this article is to give an overview on this to avoid confusion esp. for those who just recently got started with Neo4j.

A graph database using a property graph model stores its data in nodes, relationships and properties. In Neo4j 2.0 this model was amended with labels.

A very nice summary of the indexing mechanisms in Neo4j.

After all, if you write something down and then can’t find it, what good is it?

Enjoy!

December 16, 2013

Neo4j – Labels and Regression

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 3:43 pm

Yes, I am using labels in Neo4j but only because I am the only user of this data set. If I paint myself into a semantic corner, it will be my fault and not poor design.

In any event, I ran into an odd limitation on labels that may be of general interest.

My script was dying because my label read: “expert-Validation.”

Thinking the Neo4j documentation should have the answer, I consulted:

3.4.1 Label names:

Any non-empty unicode string can be used as a label name. In Cypher, you may need to use the backtick (`) syntax to avoid clashes with Cypher identifier rules. By convention, labels are written with CamelCase notation, with the first letter in upper case. For instance, User or CarOwner.

OK, so that’s encouraging, maybe I have run afoul of mathematical syntax or something.

Welllll, not quite.

8.3 Identifiers (under Cypher):

Identifier names are case sensitive, and can contain underscores and alphanumeric characters (a-z, 0-9), but must start with a letter. If other characters are needed, you can quote the identifier using backquote (`) signs.

The same rules apply to property names.
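The identifier rule quoted above can be sketched as a small helper that decides when a label needs backticks. This is an illustration of the documented rule only, not Neo4j's own parser. Treating "letter" as ASCII-only is an assumption, and names that themselves contain a backtick are rejected rather than escaped.

```python
import re

# Unquoted form per the quoted rule: starts with a letter, followed by
# letters, digits, or underscores. Restricting to ASCII is an assumption.
_PLAIN_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*\Z")


def quote_identifier(name):
    """Return name unchanged if it is a plain identifier, else wrap it in backticks."""
    if not name:
        raise ValueError("identifier must be non-empty")
    if "`" in name:
        # Escaping embedded backticks is deliberately out of scope for this sketch.
        raise ValueError("backticks inside names are not handled here")
    if _PLAIN_IDENTIFIER.match(name):
        return name
    return "`" + name + "`"


for label in ["CarOwner", "expert-Validation", "2fast"]:
    print(label, "->", quote_identifier(label))
```

The hyphen in “expert-Validation” is what fails the plain-identifier test, which is consistent with the script dying until the label is quoted.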

Sherman, set the WayBack Machine for 1986, we want to watch Charles Goldfarb write the name character provisions of ISO 8879:1986:

4.173 lower-case letters: Character class composed of the 26 unaccented small letters from “a” through “z”.

4.326 upper-case letters: Character class composed of the 26 capital letters from “A” through “Z”.

4.175 lower-case name start characters: Character class consisting of each lower-case name start character assigned by the concrete reference syntax.

4.328 upper-case name start characters: Character class consisting of upper-case forms of the corresponding lower-case name start characters.

4.94 digits: Character class composed of the 10 Arabic numerals from “0” to “9”.

4.174 lower-case name characters: Character class consisting of each lower-case name character assigned by the concrete reference syntax.

4.327 upper-case name characters: Character class consisting of upper-case forms of the corresponding lower-case name characters.

I had the honor of knowing many of the contributors to the SGML standard, including its author, Charles Goldfarb.

But that was 1986. The Unicode project formally started two years later.

Twenty-seven years after the SGML standard, we have returned to name start characters and name characters (those not escaped by a “backtick”).

Is Unicode support really that uncommon in graph databases?

December 12, 2013

Saint Nicolas brought me a new Batch Importer!!!

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:30 pm

Saint Nicolas brought me a new Batch Importer!!! by Rik Van Bruggen.

From the post:

After my previous blogpost about import strategies, the inimitable Michael Hunger decided to take my pros/cons to heart and created a new version of the batch importer – which is now even updated to the very last GA version of neo4j 2.0. Previously you actually needed to use Maven to build the importer – which I did not have/know, and therefore never used it. But now, it’s supposed to be as easy as download zip-file, unzip, run – so I of course HAD to test it out. Here’s what happened.

It’s amazing how unreasonable some users can be. Imagine, wanting a simple way to import data into a database. I tell you, IT has been far too easy on users over the years. 😉

If you want your software to be used, making your software more user friendly is a good idea.

As a data point, consider the recent W3C interest in CSV. At the other end of the spectrum from SWRL, wouldn’t you say?

Although I do hope we all remember that CSV was not invented at the W3C. (See RFC4180 for the most common features of CSV files.)

What are you going to import into your Neo4j 2.0.0 database?

December 11, 2013

Neo4j 2.0 GA – Graphs for Everyone

Filed under: Graphs,Neo4j — Patrick Durusau @ 9:05 pm

Neo4j 2.0 GA – Graphs for Everyone by Andreas Kollegger.

From the post:

A dozen years ago, we created a graph database because we needed it. We focused on performance, reliability and scalability, cementing a foundation for graph databases with the 0.x series, then expanding the features with the 1.x series. Today, we announce the first of the 2.x series of Neo4j and a commitment to take graph databases further to the mainstream.

Neo4j 2.0 has been brewing since early 2013, with almost a year of intense engineering effort producing the most significant change to graph databases since the term was invented. What makes this version of Neo4j so special? Two things: the power of a purpose-built graph query language, and a tool designed to let that language flow from your fingertips. Neo4j 2.0 is the graph database we dreamed about over a dozen years ago. And it’s available today!

Download Neo4j 2.0.

I’m not overly impressed with normalization.

After all, normalization is actually an abnormal condition, one you rarely encounter outside a relational database.

That being the case, why do we shoehorn non-normalized data into normalized form?

Granted, with older technology, normalization made things possible that weren’t otherwise possible.

My question is why, several decades later, we are still shoehorning data into normalized forms?

Comments?

Neo4j Cypher Refcard 2.0

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 5:54 pm

Neo4j Cypher Refcard 2.0

From the webpage:

Key principles and capabilities of Cypher are as follows:

  • Cypher matches patterns of nodes and relationships in the graph, to extract information or modify the data.
  • Cypher has the concept of identifiers which denote named, bound elements and parameters.
  • Cypher can create, update, and remove nodes, relationships, labels, and properties.
  • Cypher manages indexes and constraints.

You can try Cypher snippets live in the Neo4j Console at console.neo4j.org or read the full Cypher documentation at docs.neo4j.org. For live graph models using Cypher check out GraphGist.

If you plan on entering the Neo4j GraphGist December Challenge, you are probably going to need this Refcard.

I first saw this in a tweet by Peter Neubauer.

December 8, 2013

Neo4j GraphGist December Challenge

Filed under: Contest,Graphs,Neo4j — Patrick Durusau @ 5:50 pm

Neo4j GraphGist December Challenge

Meetup Slides say: Deadline for entry is January 31st (2014). I mention that because the webpage still says Dec 31, 2013.

From the webpage:

This time we want you to look into these 10 categories and provide us with really easy to understand and still insightful Graph Use-Cases: Do not take the example keywords literally, you know your domain much better than we do!

  • Education – Schools, Universities, Courses, Planning, Management etc
  • Finance – Loans, Risks, Fraud
  • Life Science – Biology, Genetics, Drug research, Medicine, Doctors, Referrals
  • Manufacturing – production line management, supply chain, parts list, product lines
  • Sports – Football, Baseball, Olympics, Public Sports
  • Resources – Energy Market, Consumption, Resource exploration, Green Energy, Climate Modeling
  • Retail – Recommendations, Product categories, Price Management, Seasons, Collections
  • Telecommunication – Infrastructure, Authorization, Planning, Impact
  • Transport – Shipping, Logistics, Flights, Cruises, Road/Train optimizations, Schedules
  • Advanced Graph Gists – for those of you that run outside of the competition anyway, give your best 🙂

Prizes:

We want to offer in each of our 10 categories Amazon gift-cards valued:

  1. Winner: 300 USD
  2. Second: 150 USD
  3. Third: 50 USD
  4. Every participant gets a special GraphGist t-shirt too.

In addition to the resources at the webpage, you may find AsciiDoc Cheatsheet helpful.

The meetup video where the GraphGist was announced.

Easy to understand graph use cases should not be too difficult.

Easy to solve graph use cases, that may be another matter. 😉

December 5, 2013

Geoff (update)

Filed under: Cypher,Geoff,Neo4j — Patrick Durusau @ 1:20 pm

Geoff

My prior post on Geoff pointed to a page about Geoff that appears to no longer exist. I have updated that post to point to the new location.

The current description reads:

Geoff is a text-based interchange format for Neo4j graph data that should be instantly readable to anyone familiar with Cypher, on which its syntax is based.

December 2, 2013

GraphGist Challenge December (5 Dec. 2013, Thursday)

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:10 pm

GraphGist Challenge December Organizers: Peter Neubauer and Michael Hunger.

From the meeting announcement:

Thursday, December 5, 2013

It is 7PM CET, 10AM PT.

We’ll talk about our new GraphGist challenge to create the best graph domain models ever and how to model them as a graph gist.

After a quick presentation, we go into demo mode and show hands on how these gists are created, formatted and published.

We’re ready for your questions, comments and feedback.

Preliminary slide deck

We’ll add the hangout link as the time approaches.

Join and RSVP!

That should be 1 PM, Thursday, December 5, 2013, on the East Coast of the US.

If you are unfamiliar with graphgists, check out the GraphGist Wiki.

Entries from the first GraphGist Challenge, details on graphgists, etc.

Graphs for everything, ranging from chess to airports to Harry Potter!

December 1, 2013

Ordered Container

Filed under: Graphs,Neo4j — Patrick Durusau @ 8:43 pm

Ordered Container by Johannes Mockenhaupt.

From the post:

Mark Needham – via an interesting blogpost – made me go back to finish my pondering on how to model an ordered container. By which I basically mean an ordered list in a graph. Trimming down what he describes to the problem of containment and order, I use the example of songs that are part of an album and part of a playlist, both of which are ordered. Actually, I was modeling that anyway. So just doing the old NEXT relationships on songs to order them won’t work, since the unanswered question would be “who’s NEXT is it anyway?”. The album’s, the playlist’s? Or from another container that will be added in the future?

But why should the NEXT relationship go on the song in the first place? The song doesn’t care. Both the containment and the order are concerns of the album and playlist – the containers. So let them handle it. But how? Have HAS relationships from the container to the songs with position properties on the relationships? Awkward and not very pretty to query. Nor very graphy. So the position can’t be on the song node and it can’t be in the relationship … guess we need more nodes! Let’s extract the ordering into separate nodes:

Interesting but what puzzles me about the “next” relationship/edge is that I have to traverse all of the “next” relationships in order to reach say the fifth song on an album or playlist.

I would treat the order of the songs as a separate node, perhaps AlbumOrder, which contains an ordered list of the songs as they appear on the album. Each song (represented by a separate node) can have an albumOrder relationship with the AlbumOrder node.

From any song that appeared on the album, I can follow its albumOrder (or playList1 or playList2) relationship to discover its place in that album or list. Moreover, when searching for another song, I can scan the list and jump directly to any song on the album without unnecessary edge traversal (assuming the song name and node ID have been captured by the list).
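A minimal in-memory sketch of that suggestion follows. It is plain Python rather than Neo4j, and the `OrderNode` class, its method names, and the song IDs are hypothetical. It also assumes a song appears at most once in any one container.

```python
class OrderNode:
    """Hypothetical 'AlbumOrder'-style node holding an ordered list of song ids."""

    def __init__(self, name, song_ids):
        self.name = name
        self.song_ids = list(song_ids)
        # Index from song id to zero-based position, so lookups do not walk the list.
        self._position = {sid: i for i, sid in enumerate(self.song_ids)}

    def position_of(self, song_id):
        """One-based position of a song in this container."""
        return self._position[song_id] + 1

    def song_at(self, position):
        """Song id at a one-based position, reached directly rather than via NEXT hops."""
        if position < 1:
            raise IndexError("positions start at 1")
        return self.song_ids[position - 1]

    def next_after(self, song_id):
        """The following song in this container, or None at the end."""
        i = self._position[song_id] + 1
        return self.song_ids[i] if i < len(self.song_ids) else None


album = OrderNode("AlbumOrder", ["s1", "s2", "s3", "s4", "s5"])
playlist = OrderNode("playList1", ["s5", "s1"])
print(album.song_at(5), album.position_of("s1"), playlist.position_of("s1"))
```

The same song sits at different positions in the album and the playlist, and each container answers “whose NEXT is it?” for itself.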

Suggestions/comments?

PS: True that my solution leaves the subject of the relationship of the songs implicit, but if all I am going to say is “next,” that hardly seems worth an edge.

November 30, 2013

Neo4j: What is a node?

Filed under: Graphs,Neo4j — Patrick Durusau @ 1:42 pm

Neo4j: What is a node? by Mark Needham.

From the post:

One of the first things I needed to learn when I started using Neo4j was how to model my domain using nodes and relationships and it wasn’t initially obvious to me what things should be nodes.

Luckily Ian Robinson showed me a mini-algorithm which I found helpful for getting started. The steps are as follows:

  1. Write out the questions you want to ask
  2. Highlight/underline the nouns
  3. Those are your nodes!

This is reasonably similar to the way that we work out what our objects should be when we’re doing OO modelling and I thought I’d give it a try on some of the data sets that I’ve worked with recently:

  • Female friends of friends that somebody could go out with
  • Goals scored by Arsenal players in a particular season
  • Colleagues who have similar skills to me
  • Episodes of a TV program that a particular actor appeared in
  • Customers who would be affected if a piece of equipment went in for repair

If you’re like me and aren’t that great at English grammar we can always cheat and get NLTK to help us out:

Pay particular attention to Mark’s use of NLTK to extract likely nodes from data.

As far as I can tell, Neo4j does not support half-edges, that is, edges with only one node. Half-edges would support the use case where the player of a role (in topic map parlance) is unknown.

We know that Mary is married, for example, but we don’t know the name of her husband.

We want to assign properties to the marriage edge (association), such as who reported that Mary was married, but there is no edge to carry those properties.

Any graph databases to suggest that support half-edges? (Computational Geometry Algorithms Library (CGAL) supports half-edges but isn’t a graph database.)
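Absent half-edges, one possible workaround is to reify the association as a node of its own, so its properties have somewhere to live even when a role player is missing. The sketch below is plain Python, not a Neo4j API. The `SPOUSE_IN` relationship type and the `reported_by` value are invented for illustration.

```python
nodes = {}   # node id -> property dict
edges = []   # (source id, relationship type, target id)


def add_node(node_id, **props):
    """Store a node with arbitrary properties."""
    nodes[node_id] = props


def add_edge(source, rel_type, target):
    """Add an ordinary two-ended edge; both ends must already exist."""
    if source not in nodes or target not in nodes:
        raise KeyError("both ends of an edge must be existing nodes")
    edges.append((source, rel_type, target))


def role_players(association_id):
    """Known role players of a reified association; unknown players are simply absent."""
    return [src for src, _rel, dst in edges if dst == association_id]


add_node("mary", name="Mary")
# The marriage is a node, so it can carry properties with only one known spouse.
add_node("marriage1", kind="Marriage", reported_by="made-up source")
add_edge("mary", "SPOUSE_IN", "marriage1")

print(role_players("marriage1"), nodes["marriage1"]["reported_by"])
```

This trades the single marriage edge for a node plus one edge per known role player, which is closer to how topic map associations are structured.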

November 28, 2013

Quick Start with Neo4J…

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 7:20 pm

Quick Start with Neo4J using YOUR Twitter Data by John Berryman.

From the post:

When learning a new technology it’s best to have a toy problem in mind so that you’re not just reimplementing another glorified “Hello World” project. Also, if you need lots of data, it’s best to pull in a fun data set that you already have some familiarity with. This allows you to lean upon already established intuition of the data set so that you can more quickly make use of the technology. (And as an aside, this just why we so regularly use the StackExchange SciFi data set when presenting our new ideas about Solr.)

When approaching a graph database technology like Neo4J, if you’re as avid of a Twitter user as I am then POOF you already have the best possible data set for becoming familiar with the technology — your own Social network. And this blog post will help you download and setup Neo4J, set up a Twitter app (needed to access the Twitter API), pull down your social network as well as any other social network you might be interested in. At that point we’ll interrogate the network using the Neo4J and the Cypher syntax. Let’s go!

What? Not a single mention of Euler, bridges, or claims about graphs, rather than Atlas, holding up the celestial sphere! Could this really be about Neo4j?

In a word: Yes!

In fact, it is one of the better introductions to Neo4j I have ever seen.

I like historical material but when you have seen dozens if not hundreds of presentations/slides repeating the same basic information, you start to worry about there being a Power-Point Slide Shortage. 😉

No danger of that with John’s post!

Following the instructions took a while in my case, mostly because I was called away to cook a pork loin (it’s a holiday here), plus rolls, etc., right as I got the authentication tokens. :-( Then I had issues with a prior version of Neo4j that was already running. I installed via an installer and it had written a start script in rc4.d.

The latest version conflicts with the running older version and refuses to start without any meaningful error message. But, ps -ef | grep neo4j found the problem. Renaming the script while root, etc., fixed it. Do need to delete the older version at some point.

After all that, it was a piece of cake. John’s script works as promised.

I don’t know how to break this to John but now he is following but not being followed by neo4j, peterneubauer (Neo4j hotshot), and markhneedham (Neo4j hotshot). (As of 28 Nov. 2013, your results may vary.)

On the use of labels, you may be interested in the discussion at: RFC Blueprints3 and Vertex.getLabel()

Using strings as labels leads to conflicts between labels with the same strings but different semantics.

If you are happy with a modest graph or are willing to police the use of labels it may work for you. On the other hand, it may not.

PS: I am over 11,500 nodes at this point and counting.

November 22, 2013

A names backbone:…

Filed under: Biology,Graphs,Merging,Neo4j,Topic Maps — Patrick Durusau @ 6:02 pm

A names backbone: a graph of taxonomy by Nicky Nicolson.

At first glance a taxonomy paper but as you look deeper:

Slide 34: Concepts layer: taxonomy as a graph

  • Names are nodes
  • Typed, directed relationships represent synonymy and taxonomic placement
  • Evidence for taxonomic assertions provided as references
  • …and again, standards-based import / export using TCS

Slide 35 shows a synonym_of relationship between two name nodes.

Slide 36 shows evidence attached to placement at one node and for the synonym_of link.

Slide 37 shows reuse of nodes to support “different taxonomic opinions.”

Slide 39 Persistent identification of concepts

We can re-create a sub-graph representing a concept at a particular point in time using:

  1. Name ID
  2. Classification
  3. State

Users can link to a stable state of a concept

We can provide a feed of what has changed since

I mention this item in part because Peter Neubauer (Neo4j) suggested in an email that rather than “merging” nodes that subject sameness (my term, not his) could be represented as a relationship between nodes.

Much in the same way that synonym_of was represented in these slides.

And I like the documentation of the reason for synonymy.

The internal data format of Neo4j makes “merging” in the sense of creating one node to replace two or more other nodes impractical.

Perhaps replacing nodes with other nodes has practical limits?

Is “virtual merging” in your topic map future?
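As a rough sketch of what “virtual merging” could mean in practice: leave the nodes alone, record sameness as relationships, and compute the merged view on demand. The function name and node IDs below are hypothetical, and the union-find grouping is my choice of technique, not something from the slides or from Peter's email.

```python
def merged_views(node_ids, same_as_pairs):
    """Group nodes connected by same-as relationships, without altering any node."""
    parent = {n: n for n in node_ids}

    def find(n):
        # Follow parent links to the group representative, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for n in node_ids:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(group) for group in groups.values())


# "a", "b", and "c" are declared the same subject by two relationships; "d" stands alone.
views = merged_views(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
print(views)
```

Because the sameness relationships stay in the graph, each can carry its own documentation of why the nodes were judged the same, much like the evidence attached to synonym_of.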

Harry Potter (Neo4j GraphGist)

Filed under: Graphs,Literature,Neo4j,Time — Patrick Durusau @ 4:08 pm

Harry Potter (Neo4j GraphGist)

From the webpage:

v0 of this graph models some of Harrys friends, enemies and their parents. Also have some pets and a few killings. The obvious relation missing is the one between Harry Potter and Voldemort- it took us 7 books to figure that one out, so you’ll have to wait till I add more data 🙂

Great start on a graph representation of Harry Potter!

But the graph model has a different perspective than Harry or others in the book series had.

Harry Potter model

I’m a Harry Potter fan. When Harry Potter and the Philosopher’s Stone starts, Harry doesn’t know Ron Weasley, Hermione Granger, Voldemort, or Hedwig.

The graph presents the vantage point of an omniscient observer, who knows facts the rest of us waited seven (7) volumes to discover.

A useful point of view, but it doesn’t show how knowledge and events unfolded to the characters in the story.

We lose any tension over whether Harry will choose Cho Chang or Ginny Weasley.

And certainly the outcomes for Albus Dumbledore and Severus Snape lose their rich texture.

If you object that I am confusing a novel with a graph, are you saying a graph cannot represent the development of information over time?*

That’s a fairly serious shortcoming for any information representation technique.

In stock trading, for example, when I “knew” that your shaving lotion causes “purple pustules spelling PIMP” to break out on a user’s face would be critically important.

Did I know before or after I unloaded my shares in your company? 😉

A silly example, but it illustrates that “when” we know information can be very important.

Not to mention that “static” data is only an illusion of our information systems. Or rather information systems that don’t allow for tracking changing information.

Is your information system one of those?


* I’m in the camp that thinks graphs can represent the development of information over time. Depends on your use case whether you need the extra machinery that enables time-based views.

The granularity of time requirements varies when you are talking about Harry Potter versus the Divine Comedy versus leaks from the current White House.

In topic maps, the range of validity for an association was called its “scope.” Scope and time need more than one or two other posts.
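One lightweight form of that “extra machinery” is to give every edge a validity interval and filter the graph as of a point in time. The sketch below assumes a half-open interval (start inclusive, end exclusive, `None` meaning still valid). The time points are arbitrary ordinals and the sample edges are made up for illustration.

```python
def edges_as_of(edges, when):
    """Return (source, relation, target) triples whose validity interval covers `when`."""
    visible = []
    for source, relation, target, start, end in edges:
        # Half-open interval: valid from `start` up to, but not including, `end`.
        if start <= when and (end is None or when < end):
            visible.append((source, relation, target))
    return visible


# Made-up timeline: each edge carries (valid_from, valid_until).
story = [
    ("Harry", "LIVES_AT", "Privet Drive", 0, 3),
    ("Harry", "KNOWS", "Hedwig", 1, None),
    ("Harry", "KNOWS", "Ron", 2, None),
]

for t in (0, 2, 4):
    print(t, edges_as_of(story, t))
```

The omniscient view is simply the graph with the time filter removed.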

November 21, 2013

Neo4j 2.0.0-RC1 – Final preparations

Filed under: Graphs,Neo4j — Patrick Durusau @ 1:12 pm

Neo4j 2.0.0-RC1 – Final preparations by Andreas Kollegger.

From the post:

WARNING: This release is not compatible with earlier 2.0.0 milestones. See details below.

The next major version of Neo4j has been under development for almost a year now, methodically elaborated and refined into a solid foundation. Neo4j 2.0 is now feature-complete. We’re pleased to announce the first Release Candidate build is available today.

With that feature-completeness in mind, let’s see what’s on offer…

Andreas summarizes a number of new features for Cypher and concludes with:

To be clear: DO NOT USE THIS RELEASE WITH EXISTING DATA

Consider yourself warned!

Only bug fixes will be addressed between now and the GA release of Neo4j 2.0.

Now would be a good time to grab this release and go bug hunting.

November 20, 2013

Storm, Neo4j and Python:…

Filed under: Graphs,Neo4j,Python,Storm — Patrick Durusau @ 4:26 pm

Storm, Neo4j and Python: Real-Time Stream Computation on Graphs by Sonal Raj.

From the webpage:

This page serves a resource repository for my talk at Pycon India 2013 held at Bangalore, India on 30th August – 1st September, 2013. The talk introduces the basics of the Storm real-time distributed Computation Platform popularised by Twitter, and the Neo4J Graph Database and goes on to explain how they can be used in conjunction to perform real-time computations on Graph Data with the help of emerging python libraries – py2neo (for Neo4J) and petrel (for Storm)

Great slides, code skeletons, pointers to references and a live visualization!

See the video at: PyCon India 2013.

Demo gremlins mar the demonstration part but you can see:

A Storm Topology on AWS showing signup locations for people joining based on a sample Social Network data
http://www.enfoss.org/map-aws/storm-aws-visual.html

A quote from the slides that sticks with me:

Process Infinite Streams of data one-tuple-at-a-time.

😉

November 13, 2013

Exploring football …[with]… Clojure and friends

Filed under: Clojure,Graphs,Neo4j — Patrick Durusau @ 5:44 pm

Exploring football data & ranking teams using Clojure and friends by Mark Needham.

U.S. readers be forewarned that Mark doesn’t use the term “football” as you might expect. Think soccer.

A slide deck on transforming sports data with Clojure and using Neo4j (graph database) for storage and queries.

Sports are a popular topic, so the results could ease others into Clojure and Neo4j.

November 8, 2013

Pragmatic Cypher Optimization (2.0 M06)

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 7:34 pm

Pragmatic Cypher Optimization (2.0 M06)

From the post:

I’ve seen a few stack overflow and google group questions about queries that are slow, and I think there are some things that need to be said regarding Cypher optimization. These techniques are a few ways of improving your queries that aren’t necessarily intuitive. Before reading this, you should have an understanding of WITH (see my other post: The Mythical With).

First, let me throw out a nice disclaimer that these rules of thumb I’ve discovered are by no means definitively best practices, and you should measure your own results with cold and warm caches, running queries 3+ times to see realistic results with a warm cache.

Second, let me throw out another disclaimer, that Cypher is improving rapidly, and that these rules of thumb may only be valid for a few milestone releases. I’ll try to make future updates, but I’m sure there’s always danger of becoming out of date.

Ok, let’s get to it.

If you are looking for faster Cypher query results (who isn’t?), this is a good starting place for you!
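The measurement advice in the quoted post (run each query several times and separate the first, possibly cold, run from the warm ones) can be sketched as a small timing helper. The `run_query` argument is a stand-in for whatever actually executes your Cypher. It is not a Neo4j API, and the first run is only truly “cold” if the caches were empty beforehand.

```python
import time


def time_runs(run_query, repeats=3):
    """Time run_query `repeats` times; report the first run and the mean of the rest."""
    if repeats < 2:
        raise ValueError("need at least two runs to compare cold and warm timings")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    warm = timings[1:]
    return {"cold": timings[0], "warm": sum(warm) / len(warm)}


# Stand-in workload; replace the lambda with a call that runs a real query.
result = time_runs(lambda: sum(range(100000)), repeats=4)
print(result)
```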

November 7, 2013

Musicbrainz in Neo4j – Part 1

Filed under: Cypher,Graphs,Music,Music Retrieval,Neo4j — Patrick Durusau @ 9:06 am

Musicbrainz in Neo4j – Part 1 by Paul Tremberth.

From the post:

What is MusicBrainz?

Quoting Wikipedia, MusicBrainz is an “open content music database [that] was founded in response to the restrictions placed on the CDDB.(…) MusicBrainz captures information about artists, their recorded works, and the relationships between them.”

Anyone can browse the database at http://musicbrainz.org/. If you create an account with them you can contribute new data or fix existing records details, track lengths, send in cover art scans of your favorite albums etc. Edits are peer reviewed, and any member can vote up or down. There are a lot of similarities with Wikipedia.

With this first post, we want to show you how to import the Musicbrainz data into Neo4j for some further analysis with Cypher in the second post. See below for what we will end up with:

MusicBrainz data

MusicBrainz currently has around 1000 active users, nearly 800,000 artists, 75,000 record labels, around 1,200,000 releases, more than 12,000,000 tracks, and short under 2,000,000 URLs for these entities (Wikipedia pages, official homepages, YouTube channels etc.) Daily fixes by the community makes their data probably the freshest and most accurate on the web.
You can check the current numbers here and here.

This rocks!

Interesting data, a walk-through of loading the data into Neo4j, and the promise of more interesting activities to follow.

However, I urge caution on showing this to family members. 😉

You may wind up scripting daily data updates and teaching Cypher to family members and no doubt their friends.

Up to you.

I first saw this in a tweet by Peter Neubauer.

November 6, 2013

Taming Galactus [Entity Fluidity, Complex Bibliography, Hyperedges]

Filed under: Biography,Graphs,Neo4j,Time — Patrick Durusau @ 2:57 pm

Taming Galactus by Peter Olson.

From the description:

Marvel Entertainment’s Peter Olson talks about how Marvel uses graph theory and the emerging NoSQL space to understand, model and ultimately represent the uncanny Marvel Universe.

Marvel Comics by any other name. 😉

From the slides:

  • 70+ Years of Stories
  • 30,000+ Comic Issues
  • 5,000+ Creators
  • 8,000+ Named Characters
  • 32 Movies (Marvel Studios and Licensed Movies)
  • 30+ Television Series
  • 100+ Video Games

Peter’s question: “How do you model a world where anything can happen?”

Main problems addressed are:

  • Entity fluidity, that is entities changing over time (sort of like people tracked by the NSA).
  • Complex bibliography, that is publication order isn’t story order. Not to mention that characters “reboot.”

Marvel uses graph databases.

Using hyperedges for modeling.

For example, the relationship between a character and person who plays the character is represented by a hyperedge that includes a node for the moment when that relationship is true.

Very good illustration of why hyperedges are useful.
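To make the hyperedge idea concrete, here is a minimal sketch in which one edge connects three role players at once: a character, a person, and the moment for which the portrayal holds. It is not Marvel's data model. The `Hyperedge` type, the role names, and the placeholder values are all hypothetical.

```python
from collections import namedtuple

# A hyperedge links any number of nodes; `members` maps role name -> node id.
Hyperedge = namedtuple("Hyperedge", ["kind", "members"])


def players_of(hyperedges, character, moment):
    """People who portray `character` at `moment`, read from PORTRAYS hyperedges."""
    return [
        edge.members["person"]
        for edge in hyperedges
        if edge.kind == "PORTRAYS"
        and edge.members.get("character") == character
        and edge.members.get("moment") == moment
    ]


# Placeholder data: the same character is played by different people at different moments.
portrayals = [
    Hyperedge("PORTRAYS", {"character": "Character X", "person": "Actor A", "moment": "Film 1"}),
    Hyperedge("PORTRAYS", {"character": "Character X", "person": "Actor B", "moment": "Film 2"}),
]

print(players_of(portrayals, "Character X", "Film 2"))
```

With only binary edges, the moment would have to be pushed into an edge property or an intermediate node. The hyperedge keeps all three participants on an equal footing.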

Makes you wonder.

If a comic book company is using hypergraph techniques with its data, why are governments sharing data with data dumpster methods?

Like the data dumpster where Snowden obtained his supply of documents.

BTW, for experiments with graphs, sans the hyperedges, Marvel is using Neo4j.

October 24, 2013

I Mapreduced a Neo store:…

Filed under: Graphs,Hadoop,Neo4j — Patrick Durusau @ 2:14 pm

I Mapreduced a Neo store: Creating large Neo4j Databases with Hadoop by Kris Geusebroek. (Berlin Buzzwords 2013)

From the description:

When exploring very large raw datasets containing massive interconnected networks, it is sometimes helpful to extract your data, or a subset thereof, into a graph database like Neo4j. This allows you to easily explore and visualize networked data to discover meaningful patterns.

When your graph has 100M+ nodes and 1000M+ edges, using the regular Neo4j import tools will make the import very time-intensive (as in many hours to days).

In this talk, I’ll show you how we used Hadoop to scale the creation of very large Neo4j databases by distributing the load across a cluster and how we solved problems like creating sequential row ids and position-dependent records using a distributed framework like Hadoop.
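One general way to get sequential ids out of a distributed job, sketched in plain Python. This is an assumption about the common technique, not the method shown in the talk: count the records in each partition first, then give each partition a starting offset equal to the running total of the partitions before it. The function names are hypothetical.

```python
from itertools import accumulate


def partition_offsets(partition_sizes):
    """Starting id per partition: the record count of all earlier partitions."""
    return [0] + list(accumulate(partition_sizes))[:-1]


def assign_ids(partitions):
    """Give every record a globally sequential id.

    Once its offset is known, each partition can be numbered independently,
    which is what lets the work spread across a cluster.
    """
    offsets = partition_offsets([len(part) for part in partitions])
    return [
        [(offset + i, record) for i, record in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]
```

The counting pass is cheap compared with the numbering pass, so the extra step costs little.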

If you find the slides hard to read (I did) you may want to try:

Combining Neo4J and Hadoop (part I) and,

Combining Neo4J and Hadoop (part II)

A recent update from Kris: I MapReduced a Neo4j store.

BTW, the code is on github.

Just in case you have any modest sized graphs that you want to play with in Neo4j. 😉

PS: I just found Kris’s slides: http://www.slideshare.net/godatadriven/i-mapreduced-a-neo-store-creating-large-neo4j-databases-with-hadoop

October 18, 2013

Exploring Neo4j Datasets

Filed under: Graphs,Neo4j — Patrick Durusau @ 5:55 pm

Neo4j: Exploring new data sets with help from Neo4j browser by Mark Needham.

From the post:

One of the things that I’ve found difficult when looking at a new Neo4j database is working out the structure of the data it contains.

I’m used to relational databases where you can easily get a list of the table and the foreign keys that allow you to join them to each other.

This has traditionally been difficult when using Neo4j but with the release of the Neo4j browser we can now easily get this type of overview by clicking on the Neo4j icon at the top left of the browser.

We’ll see something similar to the image on the left which shows the structure of my football graph and we can now discover parts of the graph by clicking on the various labels, properties or relationships.

See Mark’s post to follow along.

Oh, you don’t have the latest release?

Best correct that oversight before reading Mark’s post!

Updated conclusions about the graph database benchmark…

Filed under: Benchmarks,Graphs,Neo4j — Patrick Durusau @ 4:44 pm

Updated conclusions about the graph database benchmark – Neo4j can perform much better by Alex Popescu.

You may recall in Benchmarking Graph Databases I reported on a comparison of Neo4j against three relational databases, MySQL, Vertica and VoltDB.

Alex has listed resources relevant to the response from the original testers:

Our conclusions from this are that, like any of the complex systems we tested, properly tuning Neo4j can be tricky and getting optimal performance may require some experimentation with parameters. Whether a user of Neo4j can expect to see runtimes on graphs like this measured in milliseconds or seconds depends on workload characteristics (warm / cold cache) and whether setup steps can be amortized across many queries or not.

The response, Benchmarking Graph Databases – Updates, shows that Neo4j on shortest path outperforms MySQL, Vertica and VoltDB.

But shortest path scores for MySQL, Vertica and VoltDB don’t appear in the “Updates” post.

Let me help you with that.

Here is the original comparison:

Original comparison on shortest path

Here is Neo4j shortest path after reading the docs and suggestions from Neo4j tech support:

Neo4j shortest path

First graph has time in seconds, second graph has time in milliseconds.

Set up correctly, Neo4j measures shortest path in milliseconds. SQL solutions, well, the numbers speak for themselves.

The moral here is to read software documentation and contact tech support before performing and publishing benchmarks.

Graph Modeling Do’s and Don’ts

Filed under: Graphs,Neo4j — Patrick Durusau @ 2:48 pm

Graph Modeling Do’s and Don’ts by Mark Needham.

Mark Needham credits Ian Robinson [Corrected from “Ian Anderson,” sorry.] with these slides.

Some ninety-six (96) slides in all, nearly all of which you will find useful for graph modeling.

Mark posted these in, let us say, a non-PDF format and I have converted them to PDF for your viewing pleasure.

😉

Enjoy!

October 17, 2013

GraphConnect SF 2013 Videos!

Filed under: Graphs,Neo4j — Patrick Durusau @ 5:50 pm

GraphConnect SF 2013 Videos!

By title:

By author:

Neo4j 2.0.0-M06 – Introducing Neo4j’s Browser

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 3:25 pm

Neo4j 2.0.0-M06 – Introducing Neo4j’s Browser by Andreas Kollegger.

From the post:

Type in a Cypher query, hit Enter, then watch a graph visualization unfold. Want some data? Switch to the table view and download as CSV. Neo4j’s new Browser interface is a fluid developer experience, with iterative query authoring and graph visualization.

Available today in Neo4j 2.0.0 Milestone 6, download now to try out this shiny new user interface.

Like the man said: Download now! 😉

Andreas also suggests:

Ask questions on Stack Overflow.

Discuss ideas on our Google Group.

Enjoy!

October 8, 2013

kaiso 0.12.0

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:41 pm

kaiso 0.12.0

From the webpage:

A graph based queryable object persistence framework built on top of Neo4j.

In addition to objects, Kaiso also stores the class information in the graph. This allows us to use cypher to query instance information, but also to answer questions about our types.

Early stages of the project but this looks interesting.

October 7, 2013

Markov Chains in Neo4j

Filed under: Graphs,Markov Decision Processes,Mathematics,Neo4j — Patrick Durusau @ 2:41 pm

Markov Chains in Neo4j by Nicole White.

From the post:

My new favorite thing lately is Neo4j, a graph database. It’s simple yet powerful: a graph database contains nodes and relationships, each of which has properties. I recently made this submission to Neo4j’s GraphGist Challenge, which I did pretty well in.

After discovering Neo4j and graph databases a little over a month and a half ago, I’ve become subject to this weird syndrome where I think to myself, “Could I put that into a graph database?” with literally everything I encounter. The answer is usually yes.

Markov Chains

I realized the other day that nodes can have relationships with themselves, and for some reason, this immediately reminded me of Markov chains. The term Markov chain sounds intimidating at first (it did to me when I first saw the term on a syllabus), but they’re actually pretty simple: Markov chains consist of states and probabilities. The number of possible states is finite, and the Markov chain is a stochastic process that transitions, with certain probabilities, from one state to another over what I like to call time-steps.

The most important property of a Markov chain is that it is memoryless; that is, the probability of entering the next state depends only on the current state. We don’t care about where the process has been, only about where it is now.

If you wander over to the Wikipedia page on Markov chains, you’ll see pretty quickly why they are an obvious candidate for a graph database. The main profile picture for the page shows a Markov chain in graph form, where the states are nodes and the probabilities of transitioning from one state to another are the relationships between those nodes. The reason my realization mentioned earlier was important is that there is often a non-zero probability, given a Markov chain is in state A, that it will ‘enter’ state A in the next time-step. This is represented by a node that has a relationship with itself.
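The states-as-nodes picture can be sketched in plain Python (not Cypher, and not Nicole's code). The two-state chain and its probabilities are hypothetical. Each state maps to its outgoing relationships, and the `"A" -> "A"` entry is the self-relationship described above.

```python
import random

# Hypothetical chain: state -> {next state: transition probability}.
chain = {
    "A": {"A": 0.6, "B": 0.4},  # "A" -> "A" is a node related to itself
    "B": {"A": 0.7, "B": 0.3},
}


def is_valid(chain, tol=1e-9):
    """Outgoing probabilities of every state must sum to 1."""
    return all(
        abs(sum(edges.values()) - 1.0) <= tol for edges in chain.values()
    )


def step(chain, state, rng=random):
    """Move one time-step. Only the current state is consulted (memoryless)."""
    targets = list(chain[state])
    weights = [chain[state][t] for t in targets]
    return rng.choices(targets, weights=weights, k=1)[0]
```

Note that `step` never looks at where the process has been, only at `state`, which is the memoryless property in code form.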

Interesting use of Neo4j to create a transition model.

Curious what you think of Nicole’s use of queries to avoid matrix multiplication?

It works but how often do you want to know the probability of one element in one state of a system?

Or would you extend the one element probability query to query more elements in a particular state?
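To make the trade-off concrete, here is a sketch in plain Python rather than Cypher, using a hypothetical chain. A single-element query sums the probability of every path of length n between two states. Extending it to every state amounts to propagating the whole distribution, which is the matrix arithmetic the query was avoiding.

```python
# Hypothetical chain: state -> {next state: transition probability}.
chain = {
    "A": {"A": 0.5, "B": 0.5},
    "B": {"A": 1.0},
}


def path_probability(chain, start, target, n):
    """Probability of being in `target` after n steps from `start`,
    summing the product of probabilities over every path of length n."""
    if n == 0:
        return 1.0 if start == target else 0.0
    return sum(
        p * path_probability(chain, nxt, target, n - 1)
        for nxt, p in chain[start].items()
    )


def distribution_after(chain, start, n):
    """Probability of every state after n steps, one propagation per step."""
    dist = {start: 1.0}
    for _ in range(n):
        next_dist = {}
        for state, mass in dist.items():
            for nxt, p in chain[state].items():
                next_dist[nxt] = next_dist.get(nxt, 0.0) + mass * p
        dist = next_dist
    return dist
```

The path-summing version enumerates paths, so its cost grows quickly with n. The propagation version does a fixed amount of work per step and answers the question for all states at once.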
