I was reading Detecting Near-Duplicates for Web Crawling when I ran across the following requirement:
Near-Duplicate Detection
Why is it hard in a crawl setting?
- Scale
  - Tens of billions of documents indexed
  - Millions of pages crawled every day
- Need to decide quickly!
This presentation and SimHash: Hash-based Similarity Detection are both of interest to the topic maps community, since your near-duplicate may be my same subject.
But the other aspect of this work that caught my eye was the starting presumption that near-duplicate detection always occurs under extreme conditions.
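For readers who have not met simhash before, here is a minimal sketch of the general idea behind hash-based fingerprinting: hash each token, let every token vote on each bit position, and keep the signs. This is my own illustration in Python, not the exact algorithm from either paper; the 64-bit width, the MD5 token hash, and the function names are arbitrary choices.

```python
import hashlib

def simhash(tokens, bits=64):
    """Fingerprint a token sequence so that similar documents get
    fingerprints that differ in only a few bit positions."""
    v = [0] * bits
    for tok in tokens:
        # Hash each token to a `bits`-wide integer (MD5 here, truncated).
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            v[i] += 1 if (h >> i) & 1 else -1
    # The sign of each accumulated vote becomes one fingerprint bit.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

d1 = "the quick brown fox jumps over the lazy dog".split()
d2 = "the quick brown fox jumped over the lazy dog".split()
# Typically well below the ~32 bits expected for unrelated documents.
print(hamming(simhash(d1), simhash(d2)))
```

Near-duplicates land within a small Hamming distance of one another, which is what makes the "need to decide quickly" requirement tractable at crawl scale.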
Questions:
- Do my considerations change if I have only a few hundred thousand documents? (3-5 pages, no citations)
- What similarity tests are computationally too expensive for millions/billions of documents but workable for hundreds of thousands (see the sketch after this list)? (3-5 pages, no citations)
- How would you establish empirically the break point for the application of near-duplicate techniques? (3-5 pages, no citations)
- Establish the break points for selected near-duplicate measures. (project)
- Analysis of near-duplicate measures. What accounts for the difference in performance? (project)
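As one concrete data point for the questions above, the kind of test that is comfortable at hundreds of thousands of documents but hopeless at web scale is an exact all-pairs comparison. The sketch below is my illustration, not a method from either paper; the shingle size, threshold, and function names are arbitrary. It computes exact Jaccard similarity over word shingles for every pair of documents, and the number of pairs grows quadratically with corpus size.

```python
from itertools import combinations

def shingles(text, k=3):
    """The set of k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a or b else 0.0

def near_duplicates(docs, threshold=0.7):
    """Brute-force all-pairs comparison: O(n^2) pairs, each costing a
    set intersection. Feasible for a small corpus, hopeless at web scale."""
    sets = {name: shingles(text) for name, text in docs.items()}
    return [(a, b) for a, b in combinations(sets, 2)
            if jaccard(sets[a], sets[b]) >= threshold]

docs = {
    "a": "tens of billions of documents indexed and millions of pages crawled every day",
    "b": "tens of billions of documents indexed and millions of pages crawled daily",
    "c": "near duplicate detection is hard in a crawl setting",
}
print(near_duplicates(docs))  # only the near-identical pair ('a', 'b') clears the threshold
```

At a few hundred thousand documents that is on the order of 10^10 pairwise set intersections, which a patient batch job can still grind through; at billions of documents it is out of the question, which is where fingerprint schemes like simhash come in.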