I was listening to Ian Robinson’s recent presentation on Dr. Who and Neo4j when Ian remarked that similarity could be modeled as a relationship.
It seemed like an off-hand remark at the time but it struck me as having immediate relevance to using Neo4j with topic maps.
One of my concerns about using Neo4j with topic maps has been the TMDM specification of merging topic items as:
1. Create a new topic item C.
2. Replace A by C wherever it appears in one of the following properties of an information item: [topics], [scope], [type], [player], and [reifier].
3. Repeat for B.
4. Set C’s [topic names] property to the union of the values of A and B’s [topic names] properties.
5. Set C’s [occurrences] property to the union of the values of A and B’s [occurrences] properties.
6. Set C’s [subject identifiers] property to the union of the values of A and B’s [subject identifiers] properties.
7. Set C’s [subject locators] property to the union of the values of A and B’s [subject locators] properties.
8. Set C’s [item identifiers] property to the union of the values of A and B’s [item identifiers] properties.
(TMDM, 6.2 Merging Topic Items)
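To make the cost of those steps concrete, here is a minimal Python sketch of the procedure quoted above. It is only an illustration under stated assumptions: the `Topic` class, the `merge_topics` function and the dictionary-based "referring items" are hypothetical stand-ins of mine, not part of the TMDM or of any topic map engine, and the sketch ignores the further merging of names and occurrences that the TMDM specifies elsewhere.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified topic item: only the set-valued properties
# named in TMDM 6.2. eq=False keeps identity-based hashing, so topic
# items can be members of sets such as [topics] and [scope].
@dataclass(eq=False)
class Topic:
    topic_names: set = field(default_factory=set)
    occurrences: set = field(default_factory=set)
    subject_identifiers: set = field(default_factory=set)
    subject_locators: set = field(default_factory=set)
    item_identifiers: set = field(default_factory=set)

SET_VALUED = ("topics", "scope")               # properties holding sets of topics
SINGLE_VALUED = ("type", "player", "reifier")  # properties holding one topic

def merge_topics(a, b, referrers):
    """Merge topic items a and b into a new item c.

    `referrers` is an assumed list of dicts standing in for the other
    information items (associations, roles, names, ...) that point at topics.
    """
    c = Topic()                                        # step 1
    for item in referrers:                             # steps 2 and 3
        for prop in SET_VALUED:
            members = item.get(prop)
            if members is not None and (a in members or b in members):
                item[prop] = (members - {a, b}) | {c}
        for prop in SINGLE_VALUED:
            if item.get(prop) is a or item.get(prop) is b:
                item[prop] = c
    for prop in ("topic_names", "occurrences", "subject_identifiers",
                 "subject_locators", "item_identifiers"):
        setattr(c, prop, getattr(a, prop) | getattr(b, prop))  # steps 4 to 8
    return c
```

Note that steps 2 and 3 touch every item that refers to A or B, which is the pointer rewriting that worries me in a graph store.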
Obviously the TMDM is specifying an end result and not how you get there, but there still has to be a mechanism by which a query that finds A or B also returns the “union of the values of A and B’s [topic names] properties” (and the results of the other operations the TMDM specifies here and elsewhere).
Ian’s reference to similarity being modeled as a relationship made me realize that similarity relationships could be created between nodes that share the same [subject identifiers] property value (and other conditions for merging). Thus, when querying a topic map, there should be the main query, followed by a query for “sameness” relationships for any returned objects.
This avoids the performance “hit” of having to update pointers to information items that are literally re-written with new identifiers. Not to mention that processing the information items that will be presented to the user as one object could even be off-loaded onto the client, with a further saving in server-side processing.
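A rough sketch of the idea, using plain Python dictionaries rather than Neo4j itself. Everything here is an assumption for illustration: the node layout, the function names `build_sameness_edges`, `same_as` and `merged_view`, and the choice to treat sameness as transitive. The point is only that the nodes are never rewritten; the union is computed when someone asks for it.

```python
from collections import defaultdict
from itertools import combinations

def build_sameness_edges(nodes):
    """Return (left, right, identifier) edges between nodes that share
    a subject identifier. `nodes` maps a node id to a dict of sets."""
    by_identifier = defaultdict(list)
    for node_id, props in nodes.items():
        for sid in props["subject_identifiers"]:
            by_identifier[sid].append(node_id)
    edges = set()
    for sid, ids in by_identifier.items():
        for left, right in combinations(sorted(ids), 2):
            edges.add((left, right, sid))
    return edges

def same_as(node_id, edges):
    """Follow sameness edges in either direction, transitively."""
    seen = {node_id}
    frontier = [node_id]
    while frontier:
        current = frontier.pop()
        for left, right, _sid in edges:
            if current == left:
                other = right
            elif current == right:
                other = left
            else:
                continue
            if other not in seen:
                seen.add(other)
                frontier.append(other)
    return seen

def merged_view(node_id, nodes, edges, prop):
    """The follow-up query: union one property over all 'same' nodes."""
    result = set()
    for member in same_as(node_id, edges):
        result |= nodes[member][prop]
    return result

nodes = {
    "a": {"subject_identifiers": {"http://example.org/doctor"},
          "topic_names": {"Doctor"}},
    "b": {"subject_identifiers": {"http://example.org/doctor"},
          "topic_names": {"The Doctor"}},
}
edges = build_sameness_edges(nodes)
names = merged_view("a", nodes, edges, "topic_names")
```

Since `merged_view` needs only the returned nodes and their sameness edges, it is the sort of work that could run on the client.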
There is an added bonus to this approach, particularly for complex merging conditions beyond the TMDM. Since the directed edges have properties, it would be possible to dynamically specify merging conditions beyond those of the TMDM based on those properties. Which means that “merging” operations could be “unrolled” as it were.
Or would that be “rolled down?” Thinking that a user could step through each addition of a “merging” condition and observe the values as they were added, along with their source. Perhaps even set “break points” as in debugging software.
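Here is one way the stepping might look, again as a hedged Python sketch rather than anything Neo4j provides. I am assuming each sameness edge carries a single `condition` property naming the rule that created it; `step_through_merge` and the condition names in the example are made up for illustration.

```python
def step_through_merge(start, nodes, edges, conditions, prop="topic_names"):
    """Switch merging conditions on one at a time and yield
    (condition, source_node, new_values) for each node pulled in.

    `edges` is a list of (left, right, condition) tuples; `conditions`
    is the order in which the user wants the rules applied.
    """
    members = {start}
    seen_values = set(nodes[start][prop])
    enabled = set()
    for condition in conditions:
        enabled.add(condition)
        changed = True
        while changed:  # keep expanding until this condition adds nothing more
            changed = False
            for left, right, edge_condition in edges:
                if edge_condition not in enabled:
                    continue
                for src, dst in ((left, right), (right, left)):
                    if src in members and dst not in members:
                        members.add(dst)
                        new_values = nodes[dst][prop] - seen_values
                        seen_values |= new_values
                        changed = True
                        yield condition, dst, new_values

nodes = {
    "a": {"topic_names": {"Doctor"}},
    "b": {"topic_names": {"The Doctor"}},
    "c": {"topic_names": {"Doctor", "John Smith"}},
}
edges = [("a", "b", "subject_identifier"), ("b", "c", "same_actor")]
for condition, source, added in step_through_merge(
        "a", nodes, edges, ["subject_identifier", "same_actor"]):
    print(condition, source, sorted(added))
```

Because it is a generator, a caller can stop after any step, which is about as close to a “break point” as this sketch gets.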
Will have to think about this some more and work up some examples in Neo4j. Comments/suggestions most welcome!
PS: You know, if this works, Neo4j already has a query language, Cypher. I don’t know if Cypher supports declaration of routines that can be invoked as parts of queries, but I will investigate that possibility, just to keep users from having to write really complex queries to gather up all the information items on a subject. That won’t help people using other key/value stores but there are some interesting possibilities there as well. It will depend on the use cases and the nature of the “merging” requirements.