A couple of interesting posts from the LingPipe blog:
Processing Tweets with LingPipe #1: Search and CSV Data Structures
Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization
The second one, on duplicates, is the one that caught my eye.
After all, what are the merging conditions in the TMDM other than the detection of duplicates?
Of course, I am interested in TMDM merging but also in the detection of fuzzy subject identity.
Whether that is then represented by an IRI or kept as a native merging condition is an implementation issue.
This could be very important for some future leak of diplomatic tweets. 😉
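
To make the "duplicates as merging conditions" idea a little more concrete, here is a minimal Python sketch of normalize-then-hash duplicate detection. It is not the LingPipe code from the posts above; the normalization rules (lowercasing, dropping URLs, stripping a leading "RT @user:" prefix, removing punctuation) and the function names `normalize`, `identity_key` and `merge_duplicates` are my own assumptions for illustration.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical normalization rules, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+")
RT_RE = re.compile(r"^(rt\s+@\w+:?\s*)+")
PUNCT_RE = re.compile(r"[^\w\s#@]")


def normalize(text):
    """Reduce a tweet to a canonical form so near-identical tweets collide."""
    text = text.lower()
    text = URL_RE.sub(" ", text)            # shortened URLs differ per retweet
    text = RT_RE.sub("", text.strip())      # drop leading "rt @user:" prefixes
    text = PUNCT_RE.sub(" ", text)          # ignore punctuation differences
    return " ".join(text.split())           # collapse runs of whitespace


def identity_key(text):
    """Hash of the normalized text; equal keys act as the merging condition."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def merge_duplicates(tweets):
    """Group tweets that share an identity key, preserving input order."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[identity_key(tweet)].append(tweet)
    return list(groups.values())


tweets = [
    "Cables leaked today! http://example.com/a",
    "RT @someone: cables leaked today http://example.com/b",
    "Something else entirely",
]
merged = merge_duplicates(tweets)
```

Here the first two tweets end up in one group and the third stands alone. The hash plays the part of a subject identifier: it could be published as an IRI or kept internal as a native merging condition. Anything fuzzier than this exact match on normalized text would need a similarity measure rather than a hash.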