Archive for the ‘Duplicates’ Category

Using Excel To Squash Duplicates

Tuesday, August 9th, 2016

How to use built-in Excel features to find duplicates by Susan Harkins.

From the post:

Duplicate values aren’t bad. In fact, most are necessary. However, duplicate records can skew reporting and analysis. Whether you’re finding duplicates in a single column or looking for duplicate records, Excel can do most of the work for you. In this article, I’ll show you easy ways to find duplicates by applying advanced filtering options and conditional formatting rules. First, we’ll define the term duplicate—it isn’t ambiguous, but context determines its meaning. Then, we’ll use Excel’s built-in features to find duplicates.

If the first paragraph hadn’t caught my attention, then:

Your definition of duplicate will depend on the business rule you’re applying.

certainly would have!

The same rule holds true for subject identity. It really depends on the business rule (read requirement) for your analysis.

In some cases subjects may appear as topics/proxies but be ignored. Or their associations with other subjects will be ignored.

Or for some purposes, what were separate topics/proxies may form group subjects with demographic characteristics such as age, gender, voting status, etc.

If you are required to use Excel and bedeviled by duplicates, you will find this post quite useful.
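Outside of Excel, the same point about business rules can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `find_duplicates` helper and the sample rows are made up for the example and are not from Susan's article. What counts as a duplicate changes with the columns the rule names.

```python
from collections import defaultdict

def find_duplicates(rows, key_columns):
    """Group rows by the chosen key columns and keep groups with more than one row."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        groups[key].append(row)
    return {key: members for key, members in groups.items() if len(members) > 1}

# Hypothetical sample data.
rows = [
    {"name": "Ann Lee", "city": "Austin", "amount": 100},
    {"name": "Ann Lee", "city": "Austin", "amount": 250},
    {"name": "Ann Lee", "city": "Boston", "amount": 100},
    {"name": "Bob Roy", "city": "Austin", "amount": 100},
]

by_name = find_duplicates(rows, ["name"])                        # rule: same name
by_name_city = find_duplicates(rows, ["name", "city"])           # rule: same name and city
by_record = find_duplicates(rows, ["name", "city", "amount"])    # rule: whole record
```

Three rows are duplicates under the first rule, two under the second, and none under the third. Same data, different business rule, different answer.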

Duplicate Tool Names

Friday, July 18th, 2014

You wait ages for somebody to develop a bioinformatics tool called ‘Kraken’ and then three come along at once by Keith Bradnam.

From the post:

So Kraken is either a universal genomic coordinate translator for comparative genomics, or a tool for ultrafast metagenomic sequence classification using exact alignments, or even a set of tools for quality control and analysis of high-throughput sequence data. The latter publication is from 2013, and the other two are from this year (2014).

Yet another illustration that names are not enough.

A URL identifier would not help unless you recognize the URL.

Identification with name/value plus other key/value pairs?

Leaves everyone free to choose whatever names they like.

It also enables the rest of us to tell apart tools (or other subjects) that share the same name.
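A minimal sketch of the name plus key/value idea, using the three Krakens from Keith's post. The property names (`purpose`, `year`) and the `same_subject` helper are my own assumptions for illustration, not any standard.

```python
def same_subject(a, b, keys):
    """Two descriptions identify the same subject only if every chosen key is present and equal."""
    return all(k in a and k in b and a[k] == b[k] for k in keys)

# Descriptions paraphrased from the quoted post; the key names are hypothetical.
kraken_translator = {"name": "Kraken", "purpose": "genomic coordinate translation", "year": 2014}
kraken_classifier = {"name": "Kraken", "purpose": "metagenomic sequence classification", "year": 2014}
kraken_qc = {"name": "Kraken", "purpose": "quality control of sequence data", "year": 2013}

# The name alone cannot separate them.
name_only = same_subject(kraken_translator, kraken_classifier, ["name"])
# Adding one more key/value pair does.
name_and_purpose = same_subject(kraken_translator, kraken_classifier, ["name", "purpose"])
```

Here `name_only` is `True` and `name_and_purpose` is `False`, and nobody had to rename anything.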

Simple concept. Easy to apply. Disappoints people who want to be in charge of naming things.

Sounds like three good reasons to me, especially the last one.

Duplicate News Story Detection Revisited

Wednesday, December 25th, 2013

Duplicate News Story Detection Revisited by Omar Alonso, Dennis Fetterly, and Mark Manasse.


In this paper, we investigate near-duplicate detection, particularly looking at the detection of evolving news stories. These stories often consist primarily of syndicated information, with local replacement of headlines, captions, and the addition of locally-relevant content. By detecting near-duplicates, we can offer users only those stories with content materially different from previously-viewed versions of the story. We expand on previous work and improve the performance of near-duplicate document detection by weighting the phrases in a sliding window based on the term frequency within the document of terms in that window and inverse document frequency of those phrases. We experiment on a subset of a publicly available web collection that is comprised solely of documents from news web sites. News articles are particularly challenging due to the prevalence of syndicated articles, where very similar articles are run with different headlines and surrounded by different HTML markup and site templates. We evaluate these algorithmic weightings using human judgments to determine similarity. We find that our techniques outperform the state of the art with statistical significance and are more discriminating when faced with a diverse collection of documents.

Detecting duplicates or near-duplicates of subjects (such as news stories) is part and parcel of a topic maps toolkit.
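To make the abstract's weighted sliding window a little more concrete, here is a rough sketch. It is not the authors' algorithm: the particular weighting (mean term frequency of the words in a window times a smoothed inverse document frequency of the phrase) and the weighted Jaccard comparison are my own simplifying assumptions.

```python
import math
from collections import Counter

def shingles(text, k=3):
    """All k-word phrases produced by sliding a window over the text."""
    words = text.lower().split()
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

def shingle_weights(docs, k=3):
    """Weight each phrase by mean in-document term frequency times smoothed phrase IDF."""
    doc_shingles = [set(shingles(d, k)) for d in docs]
    df = Counter(s for ss in doc_shingles for s in ss)
    n = len(docs)
    weights = []
    for doc, ss in zip(docs, doc_shingles):
        tf = Counter(doc.lower().split())
        weights.append({
            s: (sum(tf[w] for w in s) / k) * (math.log(n / df[s]) + 1.0)
            for s in ss
        })
    return weights

def weighted_jaccard(wa, wb):
    """Sum of minimum weights over sum of maximum weights; 0.0 for two empty documents."""
    keys = set(wa) | set(wb)
    num = sum(min(wa.get(s, 0.0), wb.get(s, 0.0)) for s in keys)
    den = sum(max(wa.get(s, 0.0), wb.get(s, 0.0)) for s in keys)
    return num / den if den else 0.0
```

A syndicated story with a local headline bolted on should score well above an unrelated story, which shares no phrases at all and scores zero.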

What I found curious about this paper was the definition of “content” to mean the news story and not online comments as well.

That’s a rather limited view of near-duplicate content. And it has a pernicious impact.

If a story quotes a lead paragraph or two from a New York Times story, comments may be made at the “near-duplicate” site, not the New York Times.

How much of a problem is that? When was the last time you saw a comment that was not in English in the New York Times?

Answer: Very unlikely you have ever seen such a comment:

If you are writing a comment, please be thoughtful, civil and articulate. In the vast majority of cases, we only accept comments written in English; foreign language comments will be rejected. Comments & Readers’ Reviews

If a story appears in the New York Times and a “near-duplicate” in Arizona, Italy, and Sudan, with comments, according to the authors, you will not have the opportunity to see that content.

That’s replacing American Exceptionalism with American Myopia.

Doesn’t sound like a winning solution to me.

I first saw this at Full Text Reports as Duplicate News Story Detection Revisited.


Thursday, August 22nd, 2013

duplitector by Paweł Rychlik.

From the webpage:


A duplicate data detector engine based on Elasticsearch. It’s been successfully used as a proof of concept, piloting a full-blown enterprise solution.


In certain systems we have to deal with lots of low-quality data, containing some typos, malformatted or missing fields, erroneous bits of information, sometimes coming from different sources, like careless humans, faulty sensors, multiple external data providers, etc. Datasets of this kind often contain vast numbers of duplicate or similar entries. If this is the case – then these systems might struggle to deal with such unnatural, often unforeseen, conditions. It might, in turn, affect the quality of service delivered by the system.

This project is meant to be a playground for developing a deduplication algorithm, and is currently aimed at the domain of various sorts of organizations (e.g. NPO databases). Still, it’s small and generic enough, so that it can be easily adjusted to handle other data schemes or data sources.

The repository contains a set of crafted organizations and their duplicates (partially fetched from IRS, partially intentionally modified, partially made up), so that it’s convenient to test the algorithm’s pieces.

Paweł also points to this article by Andrei Zmievski: Duplicates Detection with ElasticSearch. Andrei merges tags for locations based on their proximity to particular coordinates.

I am looking forward to the use of indexing engines for deduplication of data in situ, as it were. That is, without transforming the data into some other format for processing.

Duplicate Detection on GPUs

Saturday, March 23rd, 2013

Duplicate Detection on GPUs by Benedikt Forchhammer, Thorsten Papenbrock, Thomas Stening, Sven Viehmeier, Uwe Draisbach, Felix Naumann.


With the ever increasing volume of data and the ability to integrate different data sources, data quality problems abound. Duplicate detection, as an integral part of data cleansing, is essential in modern information systems. We present a complete duplicate detection workflow that utilizes the capabilities of modern graphics processing units (GPUs) to increase the efficiency of finding duplicates in very large datasets. Our solution covers several well-known algorithms for pair selection, attribute-wise similarity comparison, record-wise similarity aggregation, and clustering. We redesigned these algorithms to run memory-efficiently and in parallel on the GPU. Our experiments demonstrate that the GPU-based workflow is able to outperform a CPU-based implementation on large, real-world datasets. For instance, the GPU-based algorithm deduplicates a dataset with 1.8m entities 10 times faster than a common CPU-based algorithm using comparably priced hardware.

Synonyms: Duplicate detection = entity matching = record linkage (and all the other alternatives for those terms).
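The four stages named in the abstract (pair selection, attribute-wise comparison, record-wise aggregation, clustering) fit in a small CPU-only sketch. Nothing here is from the paper: blocking on one field, `difflib` string similarity, a weighted average, and union-find clustering are all stand-ins I chose to show the shape of the workflow.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def candidate_pairs(records, block_key):
    """Pair selection: only compare records that share a blocking value."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[block_key].lower()].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)

def record_similarity(a, b, weights):
    """Attribute-wise string similarity, aggregated as a weighted average."""
    total = sum(weights.values())
    score = sum(
        w * SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f, w in weights.items()
    )
    return score / total

def cluster(records, block_key, weights, threshold):
    """Clustering: connected components over pairs at or above the threshold."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in candidate_pairs(records, block_key):
        if record_similarity(records[i], records[j], weights) >= threshold:
            parent[find(j)] = find(i)
    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())

# Hypothetical records, weights and threshold.
records = [
    {"name": "Jon Smith", "city": "Berlin"},
    {"name": "John Smith", "city": "Berlin"},
    {"name": "Jane Doe", "city": "Berlin"},
    {"name": "Jon Smith", "city": "Paris"},
]
clusters = cluster(records, "city", {"name": 0.7, "city": 0.3}, threshold=0.85)
```

Note that the Paris record never gets compared to the Berlin ones. Pair selection trades recall for speed, which is exactly the step that matters most at scale.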

This looks wicked cool!

I first saw this in a tweet by Stefano Bertolo.

PlagSpotter [Ghost of Topic Map Past?]

Monday, December 10th, 2012

I found a link to PlagSpotter in the morning mail.

I found it quite responsive, although I thought the “Share and Help Your Friends Protect Their Web Content” rather limiting.

Here’s why:

To test the software, I chose a blog entry from another blog, one I quoted late yesterday, to check the timeliness of PlagSpotter.

And it worked!

While looking at the results, I saw people I expected to quote the same post, but then noticed there were people unknown to me on the list.

Rather than detecting plagiarism, the first off-label use of PlagSpotter is to identify communities quoting the same content.

With just a little more effort, the second off-label use of PlagSpotter is to track the spread of content across a community, by time. (With a little post processing, location, language as well.)

A third off-label use of PlagSpotter is to generate a list of sources that use the same content, a great seed list for a private search engine for a particular area/community.

The earliest identifiable discussion of topic maps as topic maps involved detection of duplicated content (with duplicated charges for that content) for documentation in government contracts.

Perhaps why topic maps never gained much traction in government contracting. Cheats dislike being identified as cheats.

Ah, a fourth off-label use of PlagSpotter, detecting duplicated documentation submitted as part of weapon system or other documentation.

I find all four off-label uses of PlagSpotter more persuasive than protecting content.

Content only has value when other people use it, hopefully with attribution.

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection

Sunday, July 8th, 2012

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection by Peter Christen.

In the Foreword, William E. Winkler (U. S. Census Bureau and dean of record linkage), writes:

Within this framework of historical ideas and needed future work, Peter Christen’s monograph serves as an excellent compendium of the best existing work by computer scientists and others. Individuals can use the monograph as a basic reference to which they can gain insight into the most pertinent record linkage ideas. Interested researchers can use the methods and observations as building blocks in their own work. What I found very appealing was the high quality of the overall organization of the text, the clarity of the writing, and the extensive bibliography of pertinent papers. The numerous examples are quite helpful because they give real insight into a specific set of methods. The examples, in particular, prevent the researcher from going down some research directions that would often turn out to be dead ends.

I saw the alert for this volume today so haven’t had time to acquire and read it.

Given the high praise from Winkler, I expect it to be a pleasure to read and use.

Superfastmatch: A text comparison tool

Tuesday, April 17th, 2012

Superfastmatch: A text comparison tool by Donovan Hide.

Slides on a Chrome extension that compares news stories for unique content.

Would be interesting to compare 24-hour news channels both to themselves and to others on the basis of duplicate content.

Could even have a 15-minute highlights-of-the-news program that delivers most of the non-duplicate content (well, omitting the commercials as well) for any 24-hour period.

Until then, visit this project and see what you think.

Identifying duplicate content using statistically improbable phrases

Friday, November 18th, 2011

Identifying duplicate content using statistically improbable phrases by Mounir Errami, Zhaohui Sun, Angela C. George, Tara C. Long, Michael A. Skinner, Jonathan D. Wren and Harold R. Garner.


Motivation: Document similarity metrics such as PubMed’s ‘Find related articles’ feature, which have been primarily used to identify studies with similar topics, can now also be used to detect duplicated or potentially plagiarized papers within literature reference databases. However, the CPU-intensive nature of document comparison has limited MEDLINE text similarity studies to the comparison of abstracts, which constitute only a small fraction of a publication’s total text. Extending searches to include text archived by online search engines would drastically increase comparison ability. For large-scale studies, submitting short phrases encased in direct quotes to search engines for exact matches would be optimal for both individual queries and programmatic interfaces. We have derived a method of analyzing statistically improbable phrases (SIPs) for assistance in identifying duplicate content.

Results: When applied to MEDLINE citations, this method substantially improves upon previous algorithms in the detection of duplication citations, yielding a precision and recall of 78.9% (versus 50.3% for eTBLAST) and 99.6% (versus 99.8% for eTBLAST), respectively.

Availability: Similar citations identified by this work are freely accessible in the Déjà vu database, under the SIP discovery method category at

I ran across this article today while looking for other material on the Déjà vu database.

Why should Amazon have all the fun? 😉

Depending on the breadth of the search, I can imagine creating graphs of search data that display more than one SIP per article, allowing researchers to choose paths through the literature. Well, that is beyond what the authors intend here, but adapting their work to search and refinement of research data seems like a natural extension.

And depending on how finely data from sensors or other automatic sources is segmented, it isn’t hard to imagine something similar for sensor data. Not really plagiarism but duplication that might warrant further investigation.
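For a feel of what a statistically improbable phrase is, here is a toy version. It is not the authors' method: I am assuming word trigrams as phrases and treating a phrase as improbable when it appears in at most `max_df` documents of a background collection.

```python
from collections import Counter

def ngrams(text, n=3):
    """The set of n-word phrases in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def improbable_phrases(doc, background, n=3, max_df=0):
    """Phrases of doc that occur in at most max_df background documents."""
    df = Counter(p for other in background for p in ngrams(other, n))
    return {p for p in ngrams(doc, n) if df[p] <= max_df}

def shared_sips(doc_a, doc_b, background, n=3):
    """Improbable phrases of doc_a that also turn up in doc_b."""
    return improbable_phrases(doc_a, background, n) & ngrams(doc_b, n)

# Hypothetical abstracts.
background = [
    "the results of this study show that treatment was effective",
    "the results of this study show no significant difference",
]
doc_a = "the results of this study show that zebrafish fin regeneration requires retinoic acid"
doc_b = "we find that zebrafish fin regeneration requires retinoic acid signaling"
shared = shared_sips(doc_a, doc_b, background)
```

Boilerplate such as "the results of" drops out because the background is full of it, while "zebrafish fin regeneration" survives as a phrase worth a second look.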

Mneme: Scalable Duplicate Filtering Service

Saturday, November 12th, 2011

Mneme: Scalable Duplicate Filtering Service

From the post:

Detecting and dealing with duplicates is a common problem: sometimes we want to avoid performing an operation based on this knowledge, and at other times, like in a case of a database, we may want to only permit an operation based on a hit in the filter (ex: skip disk access on a cache miss). How do we build a system to solve the problem? The solution will depend on the amount of data, frequency of access, maintenance overhead, language, and so on. There are many ways to solve this puzzle.

In fact, that is the problem – there are too many ways. Having reimplemented at least half a dozen solutions in various languages and with various characteristics at PostRank, we arrived at the following requirements: we want a system that is able to scale to hundreds of millions of keys, we want it to be as space efficient as possible, have minimal maintenance, provide low latency access, and impose no language barriers. The tradeoff: we will accept a certain (customizable) degree of error, and we will not persist the keys forever.

Mneme: Duplicate filter & detection

Mneme is an HTTP web-service for recording and identifying previously seen records – aka, duplicate detection. To achieve the above requirements, it is implemented via a collection of bloomfilters. Each bloomfilter is responsible for efficiently storing the set membership information about a particular key for a defined period of time. Need to filter your keys for the trailing 24 hours? Mneme can create and automatically rotate 24 hourly filters on your behalf – no maintenance required.

Interesting in several respects:

  1. Duplicate detection
  2. Duplicate detection for a defined period of time
  3. Duplicate detection for a defined period of time with “customizable” degree of error

Would depend on your topic map project requirements. Assuming absolute truth forever and ever isn’t one of them, detecting duplicate subject representatives for some time period at a specified error rate may be the concepts you are looking for.

Enables a discussion of how much certainty (error rate), for how long (time period), for detection of duplicates (subject representatives), and on what basis. All of those are going to impact project complexity and duration.

It is also interesting as a solution that will work quite well for some duplicate detection requirements.
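The rotating Bloom filter idea described in the post is easy to sketch. To be clear, this is not Mneme's code or its HTTP API: the class names, the SHA-256 based hashing, and the default sizes are all my own assumptions.

```python
import hashlib

class BloomFilter:
    """A plain Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

class RotatingFilter:
    """Keep one filter per time window; rotating drops the oldest window."""

    def __init__(self, windows=24):
        self.filters = [BloomFilter() for _ in range(windows)]

    def rotate(self):
        self.filters.pop()                     # forget the oldest window
        self.filters.insert(0, BloomFilter())  # start a fresh current window

    def record(self, key):
        self.filters[0].add(key)

    def seen(self, key):
        return any(key in f for f in self.filters)
```

Calling `rotate()` once an hour with `windows=24` gives the trailing 24 hours of memory. The error rate is tuned through `size_bits` and `num_hashes`, and the time period through the number of windows, which are the two dials worth discussing with a project's stakeholders.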

Tracking Unique Terms in News Feeds

Thursday, October 13th, 2011

Tracking Unique Terms in News Feeds by Matthew Hurst.

From the post:

I’ve put together a simple system which reads news feeds (the BBC, NPR, the Economist and Reuters) in approximately real time and maintains a record of the distribution of terms found in the articles. It then indicates in a stream visualization the articles and unique terms that are observed by the system for the first time within them. The result being that articles which contain no new terms at all are grayed out.

The larger idea here is to build a ‘linguistic dashboard’ for the web which captures real time evolution of language.

This is a very cool idea! It could certainly be a news “filter” that would cut down on clutter in news feeds. No new terms = No new news? Something to think about.
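The core of the idea fits in a few lines. This is my own guess at the mechanics, not Matthew's system: I assume whitespace tokenization and an in-memory set of every term seen so far.

```python
def annotate_stream(articles):
    """Yield (title, new_terms) per article; an empty list means nothing new."""
    seen = set()
    for title, text in articles:
        terms = set(text.lower().split())
        new_terms = terms - seen
        seen |= terms
        yield title, sorted(new_terms)

# Hypothetical feed items.
feed = [
    ("First report", "markets fall on rate fears"),
    ("Follow up", "markets fall again on fresh rate fears"),
    ("Rehash", "rate fears markets fall"),
]
annotated = list(annotate_stream(feed))
```

The third item comes back with no new terms, so it is the one that would be grayed out.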


Thursday, April 14th, 2011


Lars Marius Garshol’s slides from an internal Bouvet conference on deduplication of data.

And, DUplicate KillEr, DUKE.

As Lars points out, people have been here before.

I am not sure I share Lars’ assessment of the current state of record linkage software.

Consider for example, FRIL – Fine-Grained Record Integration and Linkage Tool, which is described as:

FRIL is FREE open source tool that enables fast and easy record linkage. The tool extends traditional record linkage tools with a richer set of parameters. Users may systematically and iteratively explore the optimal combination of parameter values to enhance linking performance and accuracy.
Key features of FRIL include:

  • Rich set of user-tunable parameters
  • Advanced features of schema/data reconciliation
  • User-tunable search methods (e.g. sorted neighborhood method, blocking method, nested loop join)
  • Transparent support for multi-core systems
  • Support for parameters configuration
  • Dynamic analysis of parameters
  • And many, many more…

I haven’t used FRIL but do note that it has documentation, videos, etc. for user instruction.
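Of the search methods in that list, the sorted neighborhood method is the quickest to illustrate. The sketch below is generic and has nothing to do with FRIL's implementation: the function name, the sort key, and the window size are assumptions of mine.

```python
def sorted_neighborhood_pairs(records, sort_key, window=3):
    """Sort records by a key, then pair each record with the next window-1 records."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

# Hypothetical names, sorted as plain strings.
names = ["smith john", "doe jane", "smyth john", "doe jayne", "adams amy"]
pairs = sorted_neighborhood_pairs(names, sort_key=lambda name: name, window=2)
```

With five records a nested loop join compares ten pairs. A window of two compares four, and the likely duplicates ("smith john"/"smyth john", "doe jane"/"doe jayne") are still among them because sorting put them next to each other.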

I have reservations about record linkage in general, but those are concerns about re-use of semantic mappings and not record linkage per se.

SimHash – Depends on Where You Start

Monday, March 14th, 2011

I was reading Detecting Near-Duplicates for Web Crawling when I ran across the following requirement:

Near-Duplicate Detection

Why is it hard in a crawl setting?

  • Scale
    • Tens of billions of documents indexed
    • Millions of pages crawled every day
  • Need to decide quickly!

This presentation and SimHash: Hash-based Similarity Detection are both of interest to the topic maps community, since your near-duplicate may be my same subject.
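For readers who have not met SimHash, a bare-bones version looks like the following. It is a sketch under my own assumptions (whitespace tokens, unweighted, MD5 as the token hash), not the implementation from either presentation.

```python
import hashlib

def simhash(text, bits=64):
    """Each token votes +1 or -1 on every bit; the sign of the total sets that bit."""
    counts = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Documents that share most of their tokens tend to get fingerprints that differ in only a few bits, so "near-duplicate" becomes "Hamming distance below some threshold". Where to set that threshold is exactly the kind of question that depends on where you start.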

But the other aspect of this work that caught my eye was the starting presumption that near-duplicate detection always occurs under extreme conditions.


  1. Do my considerations change if I have only a few hundred thousand documents? (3-5 pages, no citations)
  2. What similarity tests are computationally too expensive for millions/billions but that work for hundreds of thousands? (3-5 pages, no citations)
  3. How would you establish empirically the break point for the application of near-duplicate techniques? (3-5 pages, no citations)
  4. Establish the break points for selected near-duplicate measures. (project)
  5. Analysis of near-duplicate measures. What accounts for the difference in performance? (project)

Bayesian identity resolution – Post

Friday, February 11th, 2011

Bayesian identity resolution

Lars Marius Garshol walks through finding duplicate records in data records.

As Lars notes, there are commercial products for this same task but I think this is a useful exercise.
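If you want to try the exercise yourself, one common naive Bayes style way to combine per-field evidence is shown below. Treat it as my assumption about the general approach rather than code from Lars's post; the function names and sample probabilities are made up.

```python
def combine(p1, p2):
    """Combine two independent match probabilities.

    Undefined (division by zero) when one input is exactly 0.0 and the other exactly 1.0.
    """
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))

def match_probability(field_probs):
    """Start from a neutral 0.5 and fold in the probability from each compared field."""
    result = 0.5
    for p in field_probs:
        result = combine(result, p)
    return result

# Hypothetical per-field probabilities that two records describe the same entity.
agreeing = match_probability([0.8, 0.8])     # two fields that both suggest a match
conflicting = match_probability([0.8, 0.2])  # one field for, one against
```

Two fields at 0.8 reinforce each other to roughly 0.94, while 0.8 and 0.2 cancel out to 0.5. Fields near 0.5 barely move the result, which is the behavior you want from uninformative comparisons.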

It isn’t that hard to imagine the creation of test data sets with a variety of conditions to underscore lessons about detecting duplicate records.

I suspect such training data may already be available.

Will have to see what I can find and post about it.

PS: Lars is primary editor of the TMDM, working on TMCL and several other parts of the topic maps standard.

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation – Post

Monday, January 3rd, 2011

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

Good coverage of tokenization of tweets and the use of the Jaccard Distance measure to determine similarity.

Of course, for a topic map, similarity may not lead to being discarded but trigger other operations instead.
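Jaccard distance over token sets is short enough to show in full. This is plain Python with whitespace tokenization, an assumption of mine, and not LingPipe's Java API or its tweet tokenizer.

```python
def jaccard_distance(a, b):
    """1 minus (shared tokens / all tokens); 0.0 means identical token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

d_same = jaccard_distance("the cat sat", "The cat sat")
d_half = jaccard_distance("the cat sat", "the cat ran")
```

Here `d_same` is `0.0` and `d_half` is `0.5`. What happens below a chosen distance (discard, merge, flag for review) is up to the application.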

Duplicate and Near Duplicate Documents Detection: A Review

Tuesday, December 14th, 2010

Duplicate and Near Duplicate Documents Detection: A Review

Authors: J Prasanna Kumar, P Govindarajulu

Keywords: Web Mining, Web Content Mining, Web Crawling, Web pages, Duplicate Document, Near duplicate pages, Near duplicate detection


The development of Internet has resulted in the flooding of numerous copies of web documents in the search results making them futilely relevant to the users thereby creating a serious problem for internet search engines. The outcome of perpetual growth of Web and e-commerce has led to the increase in demand of new Web sites and Web applications. Duplicated web pages that consist of identical structure but different data can be regarded as clones. The identification of similar or near-duplicate pairs in a large collection is a significant problem with wide-spread applications. The problem has been deliberated for diverse data types (e.g. textual documents, spatial points and relational records) in diverse settings. Another contemporary materialization of the problem is the efficient identification of near-duplicate Web pages. This is certainly challenging in the web-scale due to the voluminous data and high dimensionalities of the documents. This survey paper has a fundamental intention to present an up-to-date review of the existing literature in duplicate and near duplicate detection of general documents and web documents in web crawling. Besides, the classification of the existing literature in duplicate and near duplicate detection techniques and a detailed description of the same are presented so as to make the survey more comprehensible. Additionally a brief introduction of web mining, web crawling, and duplicate document detection are also presented.


Duplicate document detection is a rapidly evolving field.

  1. What general considerations would govern a topic map to remain current in this field?
  2. What would we need to extract from this paper to construct such a map?
  3. What other technologies would we need to use in connection with such a map?
  4. What data sources should we use for such a map?

Detecting “Duplicates” (same subject?)

Friday, December 3rd, 2010

A couple of interesting posts from the LingPipe blog:

Processing Tweets with LingPipe #1: Search and CSV Data Structures

Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization

The second one on duplicates being the one that caught my eye.
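The general recipe of normalize, hash, and compare can be sketched as follows. The normalization steps here (lowercasing, dropping URLs and punctuation, collapsing whitespace) are my own choices for illustration and may differ from those in the LingPipe post.

```python
import hashlib
import re

def normalize(tweet):
    """Lowercase, strip URLs and punctuation, and collapse whitespace."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(text.split())

def fingerprint(tweet):
    """Hash of the normalized text; equal fingerprints mean 'duplicate' under this rule."""
    return hashlib.sha1(normalize(tweet).encode("utf-8")).hexdigest()

def drop_duplicates(tweets):
    """Keep the first tweet for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for tweet in tweets:
        fp = fingerprint(tweet)
        if fp not in seen:
            seen.add(fp)
            unique.append(tweet)
    return unique

# Hypothetical tweets.
tweets = [
    "Breaking: Big News! http://t.co/abc",
    "breaking big news http://t.co/xyz",
    "Something else entirely",
]
unique = drop_duplicates(tweets)
```

The normalization function is the merging condition in miniature: change what it ignores and you change what counts as the same.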

After all, what are merging conditions in the TMDM other than the detection of duplicates?

Of course, I am interested in TMDM merging but also in the detection of fuzzy subject identity.

Whether that is then represented by an IRI or kept as a native merging condition is an implementation issue.

This could be very important for some future leak of diplomatic tweets. 😉

Similarity and Duplicate Detection System for an OAI Compliant Federated Digital Library

Tuesday, September 28th, 2010

Similarity and Duplicate Detection System for an OAI Compliant Federated Digital Library

Authors: Haseebulla M. Khan, Kurt Maly and Mohammad Zubair

Keywords: OAI – duplicate detection – digital library – federation service


The Open Archives Initiative (OAI) is making feasible to build high level services such as a federated search service that harvests metadata from different data providers using the OAI protocol for metadata harvesting (OAI-PMH) and provides a unified search interface. There are numerous challenges to build and maintain a federation service, and one of them is managing duplicates. Detecting exact duplicates where two records have identical set of metadata fields is straight-forward. The problem arises when two or more records differ slightly due to data entry errors, for example. Many duplicate detection algorithms exist, but are computationally intensive for large federated digital library. In this paper, we propose an efficient duplication detection algorithm for a large federated digital library like Arc.

The authors discovered that title weight was more important than author weight in the discovery of duplicates. They worked with a subset of 73 archives containing 465,440 records. It would be interesting to apply this insight to a resource like WorldCat, where duplicates are a noticeable problem.