Experiments in genetic programming
Lars Marius Garshol writes:
I made an engine called Duke that can automatically match records to see if they represent the same thing. For more background, see a previous post about it. The biggest problem people seem to have with using it is coming up with a sensible configuration. I stumbled across a paper that described using so-called genetic programming to configure a record linkage engine, and decided to basically steal the idea.
You need to read about the experiments in the post, but I can almost hear Lars stating the conclusion:
The result is pretty clear: the genetic configurations are much the best. The computer can configure Duke better than I can. That’s almost shocking, but there you are. I guess I need to turn the script into an official feature.
😉
Excellent post and approach, by the way!
Lars also posted a link to Reddit about his experiments. Several links appear in the comments there, and I have turned them into short posts to draw more attention to them.
Another tool for your topic mapping toolbox.
Question: I wonder what it would look like to use the intermediate results for mapping, only to replace them as “better” mappings become available? The process would have a terminating condition, but new content could trigger additional cycles, though only where relevant to that content.
Or would queries count as new content, if they expressed synonymy or other relations?
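To make the question a little more concrete, here is a minimal sketch of a search that publishes each intermediate "best so far" configuration as soon as it appears. It is not Duke's API and not Lars's script: the toy similarity scores, the `threshold` and `weights` fields, and every function name are assumptions invented for illustration. It also uses only mutation and keeps the top half of each generation, which is a much simpler scheme than the genetic programming described in the post.

```python
import random

# Hypothetical toy data: per-property similarity scores for a pair of
# records, plus whether the pair really is the same thing. Invented numbers.
PAIRS = [
    ({"name": 0.95, "address": 0.80}, True),
    ({"name": 0.90, "address": 0.30}, True),
    ({"name": 0.40, "address": 0.90}, False),
    ({"name": 0.20, "address": 0.10}, False),
    ({"name": 0.85, "address": 0.75}, True),
    ({"name": 0.55, "address": 0.60}, False),
]
PROPS = ["name", "address"]


def random_config(rng):
    """A candidate configuration: one match threshold and one weight per property."""
    return {"threshold": rng.random(), "weights": {p: rng.random() for p in PROPS}}


def is_match(config, scores):
    """Weighted average of the property scores, compared against the threshold."""
    total = sum(config["weights"].values()) or 1.0  # avoid dividing by zero
    combined = sum(config["weights"][p] * scores[p] for p in PROPS) / total
    return combined >= config["threshold"]


def f_measure(config, pairs):
    """Fitness: harmonic mean of precision and recall on the labelled pairs."""
    tp = fp = fn = 0
    for scores, same in pairs:
        predicted = is_match(config, scores)
        if predicted and same:
            tp += 1
        elif predicted:
            fp += 1
        elif same:
            fn += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mutate(config, rng):
    """Copy the configuration and nudge one randomly chosen value, clamped to [0, 1]."""
    child = {"threshold": config["threshold"], "weights": dict(config["weights"])}
    key = rng.choice(["threshold"] + PROPS)
    delta = rng.uniform(-0.2, 0.2)
    if key == "threshold":
        child["threshold"] = min(1.0, max(0.0, child["threshold"] + delta))
    else:
        child["weights"][key] = min(1.0, max(0.0, child["weights"][key] + delta))
    return child


def evolve(pairs, generations=30, population_size=20, seed=0):
    """Yield (generation, score, config) whenever a better configuration turns up."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    best_score = -1.0
    for generation in range(generations):
        population.sort(key=lambda c: f_measure(c, pairs), reverse=True)
        top_score = f_measure(population[0], pairs)
        if top_score > best_score:
            best_score = top_score
            yield generation, top_score, population[0]  # intermediate result
        if best_score == 1.0:
            return  # terminating condition: nothing left to improve
        survivors = population[: population_size // 2]
        population = survivors + [mutate(rng.choice(survivors), rng) for _ in survivors]


if __name__ == "__main__":
    for generation, score, config in evolve(PAIRS):
        print(generation, round(score, 3), config)
```

A consumer could start mapping with the first configuration yielded and swap in each later one as it arrives. Under the same assumptions, calling `evolve` again on an updated set of labelled pairs would be one way for new content to trigger another cycle.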