Entity Matching for Semistructured Data in the Cloud by Marcus Paradies.
From the slides:
Main Idea
…
- Use MapReduce and ChuQL to process semistructured data
- Use a search-based blocking to generate candidate pairs
- Apply similarity functions to candidate pairs within a block
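To make the three steps above concrete, here is a minimal single-machine sketch in plain Python. It is not the author's implementation: it does not use MapReduce or ChuQL, it substitutes simple token-based blocking for the search-based blocking in the slides, and it uses Jaccard similarity as one possible similarity function. The function names, the threshold, and the sample records are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_blocks(records):
    """Blocking step (assumed token-based): group record ids by shared token."""
    blocks = defaultdict(set)
    for rec_id, title in records.items():
        for tok in tokens(title):
            blocks[tok].add(rec_id)
    return blocks


def match(records, threshold=0.5):
    """Score each candidate pair that shares a block; keep pairs >= threshold."""
    seen = set()
    matches = {}
    for ids in build_blocks(records).values():
        for pair in combinations(sorted(ids), 2):
            if pair in seen:
                continue  # pair already scored via another shared block
            seen.add(pair)
            left, right = pair
            score = jaccard(tokens(records[left]), tokens(records[right]))
            if score >= threshold:
                matches[pair] = score
    return matches


# Hypothetical sample records; the ids and titles are illustrative only.
records = {
    "citeseer:1": "Entity Matching for Semistructured Data",
    "wikipedia:1": "Entity matching for semistructured data in the cloud",
    "wikipedia:2": "MapReduce programming model",
}

result = match(records)
```

In a MapReduce setting, the grouping in `build_blocks` would roughly correspond to the map and shuffle phases (emit a blocking key per record), and the pairwise comparison in `match` to the reduce phase, where only records sharing a key are compared.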
It uses two of my favorite sources, CiteSeer and Wikipedia.
Looks to me like the start of an authoring stage in a topic map workflow. You?