Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 14, 2011

SimHash – Depends on Where You Start

Filed under: Duplicates — Patrick Durusau @ 7:58 am

I was reading Detecting Near-Duplicates for Web Crawling when I ran across the following requirement:

Near-Duplicate Detection

Why is it hard in a crawl setting?

  • Scale
    • Tens of billions of documents indexed
    • Millions of pages crawled every day
  • Need to decide quickly!

This presentation and SimHash: Hash-based Similarity Detection are both of interest to the topic maps community, since your near-duplicate may be my same subject.
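For readers unfamiliar with the technique, here is a minimal SimHash sketch in Python. This is not the production implementation from the papers above; the choice of md5 for token hashing, a 64-bit fingerprint, and unweighted tokens are my assumptions for illustration only.

```python
import hashlib

def simhash(tokens, bits=64):
    # One weighted bit-vote slot per fingerprint bit.
    v = [0] * bits
    for tok in tokens:
        # Stable hash of the token, truncated to `bits` bits (md5 is an
        # arbitrary choice here; any well-mixed hash would do).
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Majority vote per bit position yields the fingerprint.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    # Number of differing fingerprint bits.
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox leaps over the lazy dog".split()
doc3 = "completely different text about topic maps".split()

# Near-duplicates typically land within a small Hamming radius of each
# other; unrelated documents typically do not.
print(hamming(simhash(doc1), simhash(doc2)))  # typically small
print(hamming(simhash(doc1), simhash(doc3)))  # typically large
```

The appeal at crawl scale is that the expensive pairwise comparison collapses to a Hamming-distance check on short fingerprints, which is exactly the property the presentation's "need to decide quickly" requirement is exploiting.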

But the other aspect of this work that caught my eye was the starting presumption that near-duplicate detection always occurs under extreme conditions.

Questions:

  1. Do my considerations change if I have only a few hundred thousand documents? (3-5 pages, no citations)
  2. What similarity tests are computationally too expensive for millions/billions of documents but that work for hundreds of thousands? (3-5 pages, no citations)
  3. How would you establish empirically the break point for the application of near-duplicate techniques? (3-5 pages, no citations)
  4. Establish the break points for selected near-duplicate measures. (project)
  5. Analysis of near-duplicate measures. What accounts for the difference in performance? (project)
