Processing Tweets with LingPipe #3: Near duplicate detection and evaluation
The post gives good coverage of tokenizing tweets and of using the Jaccard distance to measure how similar two tweets are.
Of course, in a topic map, similarity need not lead to a tweet being discarded; it may trigger other operations instead.
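
To make the idea concrete, here is a minimal sketch in plain Python. It does not use LingPipe's API: the regex tokenizer, the function names, and the 0.5 threshold are all assumptions chosen for illustration. The Jaccard distance between two token sets A and B is 1 - |A ∩ B| / |A ∪ B|. Rather than discarding near duplicates, the sketch groups them, leaving the follow-up operation up to the caller.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of tokens.

    Simplified, hypothetical tokenizer: keeps letters, digits,
    apostrophes, and the '#' and '@' characters used in tweets.
    """
    return set(re.findall(r"[a-z0-9#@']+", text.lower()))


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| over the token sets of a and b."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0  # two empty texts are treated as identical
    return 1.0 - len(ta & tb) / len(ta | tb)


def group_near_duplicates(tweets, threshold=0.5):
    """Group tweets whose distance to a group's first member is <= threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    groups = []
    for tweet in tweets:
        for group in groups:
            if jaccard_distance(tweet, group[0]) <= threshold:
                group.append(tweet)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([tweet])
    return groups
```

Calling `group_near_duplicates` on a list of tweet strings returns a list of groups. A de-duplication pass would keep only the first member of each group, while a topic map application could treat each group as a candidate set for some other operation.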