I started wondering how common merging is in topic maps because I noticed a gap I had not seen mentioned before: there aren’t any large test collections of topic maps for CS types to break their clusters against, the sort of thing that challenges their algorithms and hardware.
But test collections should bear some resemblance to actual data sets, at least where the characteristics of actual data sets are known with any degree of certainty. Failing that, they should at least be drawn from the available data sets.
As a first step towards exploring this issue, I grepped for topic elements in the Opera and CIA Fact Book topic maps and got:
- Opera topic map: 29,738
- CIA Fact Book: 111,154
for a total of 140,892 topic elements. After merging the two maps, there were 126,204 topic elements, so I count that as 14,688 topic elements merged away.
That is approximately 10% of the topics in the two sets.
A very crude way to go about this, but I was looking for rough numbers that might provoke some discussion and more refined measurements.
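For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It uses only the three counts reported above; the variable names are mine, and it says nothing about how the merge itself was performed.

```python
# Topic element counts reported above (from grepping the two maps).
opera_topics = 29_738
cia_topics = 111_154
merged_topics = 126_204  # topic elements left after merging the two maps

# Topic elements that went into the merge.
total_before = opera_topics + cia_topics

# Topic elements that disappeared because they merged with another topic.
merged_away = total_before - merged_topics

# Share of the original topic elements that were merged away.
merge_rate = merged_away / total_before

print(f"before: {total_before:,}  merged away: {merged_away:,}  rate: {merge_rate:.1%}")
```

Run as written, this gives a rate of about 10.4%, which is where the rough 10% figure comes from.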
I mention that because one thought I had was to simply “cat” the various topic maps at topicmapslab.de in CTM format together into one file and then “cat” that file onto itself until I have sets of approximately 1 million, 10 million and 100 million topics. Just a starter set to see what works and what doesn’t before scaling up the data sets.
Creating the files in this manner is going to result in a “merge heavy” topic map due to the duplication of content. That may not be a serious issue, and it is perhaps better that way in order to stress algorithms, etc. It would also have the advantage that we could merge the original set once and then project the number of merges that should be found in each of the larger sets.
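The projection could be sketched roughly as follows. This is only an illustration under a big assumption: that every topic in a repeated copy merges with its counterpart in the first copy, so the merged result never grows beyond the merged size of the original set. Topics with no stable identity would break that assumption. The function names are hypothetical, and the base counts are the Opera and CIA Fact Book numbers from above used as stand-ins, since I have not counted the topicmapslab.de maps.

```python
import math


def copies_needed(base_topics: int, target_topics: int) -> int:
    """Smallest number of copies of the base file that reaches the target size."""
    return math.ceil(target_topics / base_topics)


def projected_merges(base_topics: int, base_merged: int, copies: int) -> int:
    """Topic elements expected to be merged away in a file made of `copies` copies.

    Assumes every duplicate topic merges with its original, so the merged
    map stays at `base_merged` topics however many copies are concatenated.
    """
    return copies * base_topics - base_merged


# Stand-in numbers: the two maps counted above, not the topicmapslab.de maps.
base_topics = 140_892   # topic elements before merging
base_merged = 126_204   # topic elements after merging

for target in (1_000_000, 10_000_000, 100_000_000):
    n = copies_needed(base_topics, target)
    total = n * base_topics
    merges = projected_merges(base_topics, base_merged, n)
    print(f"target {target:,}: {n} copies, {total:,} topics, {merges:,} projected merges")
```

If a merge engine reports a noticeably different number of merges for one of these files, that would point either at the engine or at the assumption about duplicates, and either would be worth knowing.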
Suggestions/comments?