Duplicate and Near Duplicate Documents Detection: A Review

Authors: J Prasanna Kumar, P Govindarajulu

Keywords: Web Mining, Web Content Mining, Web Crawling, Web pages, Duplicate Document, Near duplicate pages, Near duplicate detection

Abstract:

The development of the Internet has flooded search results with numerous copies of web documents, making the results less relevant to users and creating a serious problem for Internet search engines. The perpetual growth of the Web and e-commerce has increased the demand for new Web sites and Web applications. Duplicated web pages that have identical structure but different data can be regarded as clones. The identification of similar or near-duplicate pairs in a large collection is a significant problem with widespread applications. The problem has been studied for diverse data types (e.g. textual documents, spatial points and relational records) in diverse settings. A contemporary instance of the problem is the efficient identification of near-duplicate Web pages. This is certainly challenging at web scale due to the voluminous data and high dimensionality of the documents. This survey paper aims to present an up-to-date review of the existing literature on duplicate and near-duplicate detection of general documents and web documents in web crawling. In addition, a classification of the existing duplicate and near-duplicate detection techniques and a detailed description of each are presented to make the survey more comprehensible. A brief introduction to web mining, web crawling, and duplicate document detection is also presented.
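
To make "near-duplicate pairs" concrete, here is a minimal sketch of one common approach: word shingling with Jaccard similarity. It is an illustration only, not necessarily a method the paper surveys. The function names (`shingles`, `jaccard`, `near_duplicate_pairs`), the shingle size `k=3`, and the `0.5` threshold are all assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.5, k=3):
    """Compare every pair of documents; return those at or above threshold.

    docs maps a document name to its text. The threshold is an assumed,
    illustrative value, not one taken from the paper.
    """
    sets = {name: shingles(text, k) for name, text in docs.items()}
    names = sorted(sets)
    pairs = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            score = jaccard(sets[x], sets[y])
            if score >= threshold:
                pairs.append((x, y, score))
    return pairs
```

Note that this sketch compares all pairs, so its cost grows quadratically with the number of documents. That is the web-scale difficulty the abstract points to, and it is why practical systems tend to rely on compact fingerprints or sketches rather than exhaustive comparison.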

Questions:

Duplicate document detection is a rapidly evolving field.

  1. What general considerations would govern keeping a topic map current in this field?
  2. What would we need to extract from this paper to construct such a map?
  3. What other technologies would we need to use in connection with such a map?
  4. What data sources should we use for such a map?
