Archive for the ‘Deduplication’ Category

Some tools for lifting the patent data treasure

Monday, December 15th, 2014

Some tools for lifting the patent data treasure by Michele Peruzzi and Georg Zachmann.

From the post:

…Our work can be summarized as follows:

  1. We provide an algorithm that allows researchers to find the duplicates inside Patstat in an efficient way
  2. We provide an algorithm to connect Patstat to other kinds of information (CITL, Amadeus)
  3. We publish the results of our work in the form of source code and data for Patstat Oct. 2011.

More technically, we used or developed probabilistic supervised machine-learning algorithms that minimize the need for manual checks on the data, while keeping performance at a reasonably high level.

The post has links for source code and data for these three papers:

A flexible, scaleable approach to the international patent “name game” by Mark Huberty, Amma Serwaah, and Georg Zachmann

In this paper, we address the problem of having duplicated patent applicants’ names in the data. We use an algorithm that efficiently de-duplicates the data, needs minimal manual input and works well even on consumer-grade computers. Comparisons between entries are not limited to their names, and thus this algorithm is an improvement over earlier ones that required extensive manual work or overly cautious clean-up of the names.

A scaleable approach to emissions-innovation record linkage by Mark Huberty, Amma Serwaah, and Georg Zachmann

PATSTAT has patent applications as its focus. This means it lacks important information on the applicants and/or the inventors. In order to have more information on the applicants, we link PATSTAT to the CITL database. This way the patenting behaviour can be linked to climate policy. Because of the structure of the data, we can adapt the deduplication algorithm to use it as a matching tool, retaining all of its advantages.

Remerge: regression-based record linkage with an application to PATSTAT by Michele Peruzzi, Georg Zachmann, Reinhilde Veugelers

We further extend the information content in PATSTAT by linking it to Amadeus, a large database of companies that includes financial information. Patent microdata is now linked to financial performance data of companies. This algorithm compares records using multiple variables, learning their relative weights by asking the user to find the correct links in a small subset of the data. Since it is not limited to comparisons among names, it is an improvement over earlier efforts and is not overly dependent on the name-cleaning procedure in use. It is also relatively easy to adapt the algorithm to other databases, since it uses the familiar concept of regression analysis.
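The regression-based linkage idea in the abstract above can be sketched in a few lines. This is a toy illustration, not the authors' Remerge code: the field names, the similarity measure, and the tiny labeled training set are all invented for the example.

```python
import math
from difflib import SequenceMatcher

def sim(a, b):
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def features(r1, r2):
    """One similarity score per field for a candidate record pair."""
    return [sim(r1["name"], r2["name"]), sim(r1["city"], r2["city"])]

def train(X, y, steps=2000, lr=0.5):
    """Learn per-field weights by logistic regression (plain gradient descent)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        for x, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label                    # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# The user labels a small subset of candidate pairs: match (1) or non-match (0).
labeled = [
    ({"name": "Siemens AG", "city": "Munich"},  {"name": "SIEMENS AG", "city": "Muenchen"},      1),
    ({"name": "Siemens AG", "city": "Munich"},  {"name": "Bosch GmbH", "city": "Stuttgart"},     0),
    ({"name": "Acme Corp",  "city": "Phoenix"}, {"name": "Acme Corporation", "city": "Phoenix"}, 1),
    ({"name": "Acme Corp",  "city": "Phoenix"}, {"name": "Apex Ltd",   "city": "Dallas"},        0),
]
X = [features(a, b) for a, b, _ in labeled]
y = [label for _, _, label in labeled]
w, b = train(X, y)

def match_prob(r1, r2):
    """Probability that r1 and r2 refer to the same company."""
    z = sum(wi * xi for wi, xi in zip(w, features(r1, r2))) + b
    return 1.0 / (1.0 + math.exp(-z))
```

The learned weights then score every remaining candidate pair, so only pairs near the decision boundary need a manual look.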

Record linkage is a form of merging that originated in epidemiology in the late 1940s. To "link" (read: merge) records across different formats, records were transposed into a uniform format and "linking" characteristics were chosen to gather matching records together. It is a very powerful technique that has been in continuous use and development ever since.
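The transpose-then-link workflow can be sketched as follows. The record layouts and the choice of name plus date of birth as linking characteristics are hypothetical, purely for illustration.

```python
from collections import defaultdict

# Two record sets in different layouts (field names and layouts are invented).
hospital_records = [{"surname": "DOE", "given": "JOHN", "dob": "1921-03-04"}]
registry_records = [("John", "Doe", "04/03/1921")]

def from_hospital(r):
    """Transpose a hospital record into the uniform (given, surname, ISO date) format."""
    return (r["given"].title(), r["surname"].title(), r["dob"])

def from_registry(r):
    """Transpose a registry tuple (day/month/year date) into the same uniform format."""
    given, surname, dob = r
    day, month, year = dob.split("/")
    return (given.title(), surname.title(), f"{year}-{month}-{day}")

# The linking characteristics gather matching records from both sources.
linked = defaultdict(list)
for rec in hospital_records:
    linked[from_hospital(rec)].append(("hospital", rec))
for rec in registry_records:
    linked[from_registry(rec)].append(("registry", rec))
```

Each key of `linked` is one entity in the uniform format; its value shows which source records were merged under it, which is exactly the mapping that usually goes undocumented.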

One major difference from topic maps is that record linkage has undisclosed subjects, that is, the subjects that make up the common format and the association of the original data sets with that format. I assume in many cases the mapping is documented, but it doesn't appear as part of the final work product, thereby rendering the merging process opaque and inaccessible to future researchers. All you can say is "…this is the data set that emerged from the record linkage."

Sufficient for some purposes, but if you want to reduce the 80% of your time that is spent munging data that has been munged before, it is better to have the mapping documented and to use disclosed subjects with identifying properties.

Having said all of that, these are tools you can use now on patents and/or extend them to other data sets. The disambiguation problems addressed for patents are the common ones you have encountered with other names for entities.

If a topic map underlies your analysis, the less time you will spend on the next analysis of the same information. Think of it as reducing your intellectual overhead in subsequent data sets.

Income – Less overhead = Greater revenue for you. 😉

PS: Don’t be confused, you are looking for EPO Worldwide Patent Statistical Database (PATSTAT). Naturally there is a US organization, that is just patent litigation statistics.

PPS: Sam Hunting, the source of so many interesting resources, pointed me to this post.

Rule-based deduplication…

Friday, January 17th, 2014

Rule-based deduplication of article records from bibliographic databases by Yu Jiang et al.


We recently designed and deployed a metasearch engine, Metta, that sends queries and retrieves search results from five leading biomedical databases: PubMed, EMBASE, CINAHL, PsycINFO and the Cochrane Central Register of Controlled Trials. Because many articles are indexed in more than one of these databases, it is desirable to deduplicate the retrieved article records. This is not a trivial problem because data fields contain a lot of missing and erroneous entries, and because certain types of information are recorded differently (and inconsistently) in the different databases. The present report describes our rule-based method for deduplicating article records across databases and includes an open-source script module that can be deployed freely. Metta was designed to satisfy the particular needs of people who are writing systematic reviews in evidence-based medicine. These users want the highest possible recall in retrieval, so it is important to err on the side of not deduplicating any records that refer to distinct articles, and it is important to perform deduplication online in real time. Our deduplication module is designed with these constraints in mind. Articles that share the same publication year are compared sequentially on parameters including PubMed ID number, digital object identifier, journal name, article title and author list, using text approximation techniques. In a review of Metta searches carried out by public users, we found that the deduplication module was more effective at identifying duplicates than EndNote without making any erroneous assignments.
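The rule cascade the abstract describes might look something like this simplified sketch. The thresholds and field names are my guesses, not Metta's actual open-source module; the key property it preserves is erring on the side of not merging.

```python
from difflib import SequenceMatcher

def text_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_duplicate(a, b):
    """Err on the side of NOT merging: only call two records duplicates
    when the rules are confident."""
    if a.get("year") != b.get("year"):
        return False                          # compare only within a publication year
    if a.get("pmid") and b.get("pmid"):
        return a["pmid"] == b["pmid"]         # a shared identifier settles the question
    if a.get("doi") and b.get("doi"):
        return a["doi"].lower() == b["doi"].lower()
    # Otherwise fall back to approximate comparison of title and first author.
    return (text_sim(a["title"], b["title"]) > 0.95
            and text_sim(a["authors"][0], b["authors"][0]) > 0.8)
```

Because every rule either fires confidently or falls through, records that cannot be confidently matched are left as distinct, which is what systematic reviewers want.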

I found this report encouraging, particularly when read alongside Rule-based Information Extraction is Dead!…, with regard to merging rules authored by human editors.

Both reports indicate a pressing need for more complex rules than matching a URI for purposes of deduplication (merging in topic maps terminology).

I assume such rules would need to be easier for the average users to declare than TMCL.

Duplicate News Story Detection Revisited

Wednesday, December 25th, 2013

Duplicate News Story Detection Revisited by Omar Alonso, Dennis Fetterly, and Mark Manasse.


In this paper, we investigate near-duplicate detection, particularly looking at the detection of evolving news stories. These stories often consist primarily of syndicated information, with local replacement of headlines, captions, and the addition of locally-relevant content. By detecting near-duplicates, we can offer users only those stories with content materially different from previously-viewed versions of the story. We expand on previous work and improve the performance of near-duplicate document detection by weighting the phrases in a sliding window based on the term frequency within the document of terms in that window and inverse document frequency of those phrases. We experiment on a subset of a publicly available web collection that is comprised solely of documents from news web sites. News articles are particularly challenging due to the prevalence of syndicated articles, where very similar articles are run with different headlines and surrounded by different HTML markup and site templates. We evaluate these algorithmic weightings using human judgments to determine similarity. We find that our techniques outperform the state of the art with statistical significance and are more discriminating when faced with a diverse collection of documents.
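A much-simplified sketch of the core idea: shingle the document with a sliding word window, then weight shingles so that phrases common across the collection (syndicated boilerplate, templates) count for less. The inverse-frequency weighting below is a crude stand-in for the paper's tf/idf formulation.

```python
import re
from collections import Counter

def shingles(text, k=3):
    """Overlapping k-word phrases from a sliding window over the document."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def idf_weight(corpus):
    """Down-weight phrases that occur in many documents."""
    df = Counter(s for doc in corpus for s in shingles(doc))
    return lambda s: 1.0 / (1.0 + df[s])

def weighted_jaccard(a, b, weight):
    sa, sb = shingles(a), shingles(b)
    union = sum(weight(s) for s in sa | sb)
    return sum(weight(s) for s in sa & sb) / union if union else 0.0

corpus = [
    "City Desk: The mayor announced a new budget plan for the city on Tuesday.",
    "Breaking news: the mayor announced a new budget plan for the city on Tuesday, officials said.",
    "The football team won its third straight game last night in overtime.",
]
weight = idf_weight(corpus)
```

The first two "articles" share the syndicated body under different headlines and so score high; the third shares nothing and scores zero.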

Detecting duplicates or near-duplicates of subjects (such as news stories) is part and parcel of a topic maps toolkit.

What I found curious about this paper was the definition of “content” to mean the news story and not online comments as well.

That’s a rather limited view of near-duplicate content. And it has a pernicious impact.

If a story quotes a lead paragraph or two from a New York Times story, comments may be made at the “near-duplicate” site, not the New York Times.

How much of a problem is that? When was the last time you saw a comment that was not in English in the New York Times?

Answer: Very unlikely you have ever seen such a comment:

If you are writing a comment, please be thoughtful, civil and articulate. In the vast majority of cases, we only accept comments written in English; foreign language comments will be rejected. (Comments & Readers’ Reviews)

If a story appears in the New York Times and a “near-duplicate” in Arizona, Italy, and Sudan, with comments, according to the authors, you will not have the opportunity to see that content.

That’s replacing American Exceptionalism with American Myopia.

Doesn’t sound like a winning solution to me.

I first saw this at Full Text Reports as Duplicate News Story Detection Revisited.

DataCleaner 3.5.7 released

Friday, November 22nd, 2013

DataCleaner 3.5.7 released

A point release, but I haven’t mentioned DataCleaner since before version 2.4. Sorry.

True, DataCleaner doesn’t treat all information structures as subjects, etc., but then you don’t need a topic map for every data handling job.

Opps! I don’t think I was supposed to say that. 😉

Seriously, you need to evaluate every data technology and/or tool on the basis of your requirements.

Topic maps included.

Elasticsearch Entity Resolution

Thursday, September 12th, 2013

elasticsearch-entity-resolution by Yann Barraud.

From the webpage:

This project is an interactive entity resolution plugin for Elasticsearch based on Duke. Basically, it uses Bayesian probabilities to compute the likelihood of a match. You can pretty much use it as an interactive deduplication engine.

It is usable as is, though cleaners are not yet implemented.

To understand basics, go to Duke project documentation.

A list of available comparators is available here.
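Duke's Bayesian flavor of matching boils down to combining per-property probabilities. Here is a minimal sketch of that combination step; it mirrors the general naive-Bayes style of combination rather than Duke's exact implementation.

```python
def combine(probs):
    """Combine per-property match probabilities into one overall probability.

    0.5 means a property says nothing; values above/below 0.5 push the
    overall estimate toward match/non-match, naive-Bayes style."""
    num, den = 1.0, 1.0
    for p in probs:
        num *= p
        den *= 1.0 - p
    return num / (num + den)
```

Two properties that both lean toward a match reinforce each other, while a strong match on one property and an equally strong mismatch on another cancel out to 0.5.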

Interactive deduplication? Now that sounds very useful for topic map authoring.

Appropriate that I saw this in a Tweet by Duke‘s author, Lars Marius Garshol.

Interactive Entity Resolution in Relational Data… [NG Topic Map Authoring]

Wednesday, June 5th, 2013

Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its Evaluation by Hyunmo Kang, Lise Getoor, Ben Shneiderman, Mustafa Bilgic, Louis Licamele.


Databases often contain uncertain and imprecise references to real-world entities. Entity resolution, the process of reconciling multiple references to underlying real-world entities, is an important data cleaning process required before accurate visualization or analysis of the data is possible. In many cases, in addition to noisy data describing entities, there is data describing the relationships among the entities. This relational data is important during the entity resolution process; it is useful both for the algorithms which determine likely database references to be resolved and for visual analytic tools which support the entity resolution process. In this paper, we introduce a novel user interface, D-Dupe, for interactive entity resolution in relational data. D-Dupe effectively combines relational entity resolution algorithms with a novel network visualization that enables users to make use of an entity’s relational context for making resolution decisions. Since resolution decisions often are interdependent, D-Dupe facilitates understanding this complex process through animations which highlight combined inferences and a history mechanism which allows users to inspect chains of resolution decisions. An empirical study with 12 users confirmed the benefits of the relational context visualization on the performance of entity resolution tasks in relational data in terms of time as well as users’ confidence and satisfaction.

Talk about a topic map authoring tool!

Even chains entity resolution decisions together!

Not to be greedy, but interactive data deduplication and integration in Hadoop would be a nice touch. 😉

Software: D-Dupe: A Novel Tool for Interactive Data Deduplication and Integration.

Data deduplication tactics with HDFS and MapReduce [Contractor Plagiarism?]

Wednesday, February 13th, 2013

Data deduplication tactics with HDFS and MapReduce

From the post:

As the amount of data continues to grow exponentially, there has been increased focus on stored data reduction methods. Data compression, single instance store and data deduplication are among the common techniques employed for stored data reduction.

Deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, the data is not changed; instead, the storage capacity consumed by identical data is eliminated. Data deduplication offers significant advantages in terms of reduction in storage and network bandwidth, and promises increased scalability.

From a simplistic use case perspective, we can see application in removing duplicates in Call Detail Record (CDR) for a Telecom carrier. Similarly, we may apply the technique to optimize on network traffic carrying the same data packets.

Covers five (5) tactics:

  1. Using HDFS and MapReduce only
  2. Using HDFS and HBase
  3. Using HDFS, MapReduce and a Storage Controller
  4. Using Streaming, HDFS and MapReduce
  5. Using MapReduce with Blocking techniques
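The first tactic can be sketched outside Hadoop as a pair of map and reduce functions over fixed-size chunks. The chunk size, hashing choice, and in-memory "store" are simplifications of what HDFS block-level deduplication would actually do.

```python
import hashlib
from collections import defaultdict

CHUNK = 4  # toy block size; real systems use blocks of tens of KB to several MB

def chunks(data):
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

def map_phase(file_id, data):
    """Mapper: emit (chunk-hash, (file, chunk)) for every block of the file."""
    for c in chunks(data):
        yield hashlib.sha256(c).hexdigest(), (file_id, c)

def reduce_phase(pairs):
    """Reducer: group by hash, store each distinct chunk once,
    and remember which files reference it."""
    store, refs = {}, defaultdict(list)
    for h, (file_id, c) in pairs:
        store.setdefault(h, c)
        refs[h].append(file_id)
    return store, refs

files = {"a.bin": b"AAAABBBBCCCC", "b.bin": b"AAAACCCC"}
pairs = [kv for fid, data in files.items() for kv in map_phase(fid, data)]
store, refs = reduce_phase(pairs)
```

Five chunks go in, three distinct chunks come out, and the reference map records which files share which blocks.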

In these times of the “Great Sequestration,” how much are you spending on duplicated contractor documentation?

You do get electronic forms of documentation. Yes?

Not that difficult to document prior contractor self-plagiarism. Teasing out what you “mistakenly” paid for may be harder.

Question: Would you rather find out now and correct or have someone else find out?

PS: For the ambitious in government employment. You might want to consider how discovery of contractor self-plagiarism reflects on your initiative and dedication to “good” government.

…Efficient Approximate Data De-Duplication in Streams [Approximate Merging?]

Tuesday, December 18th, 2012

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams by Suman K. Bera, Sourav Dutta, Ankur Narang, Souvik Bhattacherjee.


Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of such massively exponential amount of data from diverse sources. De-duplication or Intelligent Compression in streaming scenarios for approximate identification and elimination of duplicates from such unbounded data stream is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) addresses this problem to a certain extent.
In this work, we present several novel algorithms for the problem of approximate detection of duplicates in data streams. We propose the Reservoir Sampling based Bloom Filter (RSBF) combining the working principle of reservoir sampling and Bloom Filters. We also present variants of the novel Biased Sampling based Bloom Filter (BSBF) based on biased sampling concepts. We also propose a randomized load balanced variant of the sampling Bloom Filter approach to efficiently tackle the duplicate detection. In this work, we thus provide a generic framework for de-duplication using Bloom Filters. Using detailed theoretical analysis we prove analytical bounds on the false positive rate, false negative rate and convergence rate of the proposed structures. We exhibit that our models clearly outperform the existing methods. We also demonstrate empirical analysis of the structures using real-world datasets (3 million records) and also with synthetic datasets (1 billion records) capturing various input distributions.
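To see why Bloom filters fit streaming dedup, here is a minimal sketch: a plain Bloom filter with no deletions, so unlike the SBF/RSBF variants in the paper it will slowly saturate on an unbounded stream. Sizes and hash count are arbitrary toy values.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array probed by k hash functions; membership tests can
    give false positives (a new item reported as seen) but never false negatives."""

    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def dedupe_stream(stream):
    """Yield items the first time they appear. A rare false positive drops a
    genuinely new item; true duplicates are always suppressed."""
    bf = BloomFilter()
    for item in stream:
        if item not in bf:
            bf.add(item)
            yield item
```

The filter's memory is constant no matter how long the stream runs, which is exactly the trade (approximate answers for bounded space) that the paper's variants refine.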

If you think of more than one representative for a subject as “duplication,” then merging is a special class of “deduplication.”

Deduplication that discards redundant information but that preserves unique additional information and relationships to other subjects.

As you move away from static topic maps and towards transient ones, representations of subjects in real-time data streams, this and similar techniques will become very important.

I first saw this in a tweet from Stefano Bertolo.

PS: A new equivalent term (to me) for deduplication: “intelligent compression.” Pulls about 46K+ “hits” in a popular search engine today. May want to add it to your routine search queries.

Swoosh: a generic approach to entity resolution

Wednesday, August 1st, 2012

Swoosh: a generic approach to entity resolution by Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom, The VLDB Journal (2008).

Do you remember Swoosh?

I saw it today in Five Short Links by Pete Warden.


We consider the Entity Resolution (ER) problem (also known as deduplication, or merge-purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating the functions for comparing and merging records as black-boxes, which permits expressive and extensible ER solutions. We identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh that exploit the 4 properties. F-Swoosh in addition assumes knowledge of the “features” ( e.g., attributes) used by the match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties do not hold, if an “approximate” result is acceptable.
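The match/merge black-box idea is easy to demonstrate. Below is a sketch of R-Swoosh with toy match and merge functions; the record format (dicts of value sets) and the matching rules are invented for illustration, not taken from the paper's evaluation.

```python
def r_swoosh(records, match, merge):
    """R-Swoosh: compare each record against the resolved set; on a match,
    merge and reprocess, since the merged record may now match others."""
    todo, done = list(records), []
    while todo:
        r = todo.pop()
        partner = next((s for s in done if match(r, s)), None)
        if partner is None:
            done.append(r)
        else:
            done.remove(partner)
            todo.append(merge(r, partner))
    return done

# Toy black-box functions supplied by the caller.
def match(a, b):
    return bool(a["name"] & b["name"]) or bool(a["phone"] & b["phone"])

def merge(a, b):
    return {k: a[k] | b[k] for k in a}

records = [
    {"name": {"J. Doe"},   "phone": {"555-0101"}},
    {"name": {"John Doe"}, "phone": {"555-0101"}},
    {"name": {"John Doe"}, "phone": {"555-0199"}},
    {"name": {"A. Smith"}, "phone": {"555-0202"}},
]
resolved = r_swoosh(records, match, merge)
```

Note the chaining: the first two records merge on a shared phone number, and the merged record then matches the third on name, pulling all three Doe records into one entity.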

It sounds familiar.

Running some bibliographic searches, it looks like 100 references since 2011. That’s going to take a while! But it all looks like good stuff.

GNU C++ hash_set vs STL std::set: my notebook

Tuesday, July 10th, 2012

GNU C++ hash_set vs STL std::set: my notebook by Pierre Lindenbaum.

Pierre compares std::set from the C++ Standard Template Library with the GNU non-standard hash-based hash_set, inserting and removing a set of random numbers.

The results may surprise you.

Worth investigating if you are removing duplicates post-query.
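The same trade-off Pierre is benchmarking (hash-based versus ordered set membership) shows up in post-query dedup in any language. A small Python sketch of the hash-based approach, keeping first occurrences:

```python
def dedupe_post_query(rows, key=lambda r: r):
    """Remove duplicate result rows after a query, preserving order of first
    occurrence. Membership tests on the hash-based set are O(1) on average."""
    seen, out = set(), []
    for r in rows:
        k = key(r)
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out
```

An ordered (tree-based) set gives sorted iteration in exchange for logarithmic lookups; if you only need "have I seen this row?", hashing usually wins, which is the point of the benchmark.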

A Parallel Architecture for In-Line Data De-duplication

Monday, March 19th, 2012

A Parallel Architecture for In-Line Data De-duplication by Seetendra Singh Sengar, Manoj Mishra. (2012 Second International Conference on Advanced Computing & Communication Technologies)


Recently, data de-duplication, the hot emerging technology, has received a broad attention from both academia and industry. Some researches focus on the approach by which more redundant data can be reduced and others investigate how to do data de-duplication at high speed. In this paper, we show the importance of data de-duplication in the current digital world and aim at reducing the time and space requirement for data de-duplication. Then, we present a parallel architecture with one node designated as a server and multiple storage nodes. All the nodes, including the server, can do block level in-line de-duplication in parallel. We have built a prototype of the system and present some performance results. The proposed system uses magnetic disks as a storage technology.

Apologies but all I have at the moment is the abstract.

Duke 0.4

Friday, January 13th, 2012

Duke 0.4

New release of deduplication software written in Java on top of Lucene by Lars Marius Garshol.

From the release notes:

This version of Duke introduces:

  • Added JNDI data source for connecting to databases via JNDI (thanks to FMitzlaff).
  • In-memory data source added (thanks to FMitzlaff).
  • Record linkage mode now more flexible: can implement different strategies for choosing optimal links (with FMitzlaff).
  • Record linkage API refactored slightly to be more flexible (with FMitzlaff).
  • Added utilities for building equivalence classes from Duke output.
  • Made the XML config loader more robust.
  • Added a special cleaner for English person names.
  • Fixed bug in NumericComparator ( issue 66 )
  • Uses own Lucene query parser to avoid issues with search strings.
  • Upgraded to Lucene 3.5.0.
  • Added many more tests.
  • Many small bug fixes to core, NTriples reader, etc.

BTW, the documentation is online only.

Factual Resolve

Friday, October 28th, 2011

Factual Resolve

Factual has a new API – Resolve:

From the post:

The Internet is awash with data. Where ten years ago developers had difficulty finding data to power applications, today’s difficulty lies in making sense of its abundance, identifying signal amidst the noise, and understanding its contextual relevance. To address these problems Factual is today launching Resolve — an entity resolution API that makes partial records complete, matches one entity against another, and assists in de-duping and normalizing datasets.

The idea behind Resolve is very straightforward: you tell us what you know about an entity, and we, in turn, tell you everything we know about it. Because data is so commonly fractured and heterogeneous, we accept fragments of an entity and return the matching entity in its entirety. Resolve allows you to do a number of things that will make your data engineering tasks easier:

  • enrich records by populating missing attributes, including category, lat/long, and address
  • de-dupe your own place database
  • convert multiple daily deal and coupon feeds into a single normalized, georeferenced feed
  • identify entities unequivocally by their attributes

For example: you may be integrating data from an app that provides only the name of a place and an imprecise location. Pass what you know to Factual Resolve via a GET request, with the attributes included as JSON-encoded key/value pairs:
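The post's actual request example is not reproduced above, so here is a hypothetical reconstruction of what building such a GET request could look like. The host, path, and parameter name are placeholders for illustration, not Factual's documented API.

```python
import json
from urllib.parse import urlencode

def build_resolve_url(known_attributes):
    """Build a GET URL carrying whatever is known about the entity as
    JSON-encoded key/value pairs (endpoint and parameter name are invented)."""
    query = urlencode({"values": json.dumps(known_attributes)})
    return f"https://api.factual.com/places/resolve?{query}"

# Pass a fragment of an entity: a name and an imprecise location.
url = build_resolve_url({"name": "Buena Vista", "latitude": 34.06})
```

The service would then answer with the full matching entity, filling in the attributes you did not supply.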

I particularly like the line:

identify entities unequivocally by their attributes

I don’t know about the “unequivocally” part but the rest of it rings true. At least in my experience.

Dedupe, Merge, and Purge: the Art of Normalization

Friday, October 28th, 2011

Dedupe, Merge, and Purge: the Art of Normalization by Tyler Bell and Leo Polovets.

From the description:

Big Noise always accompanies Big Data, especially when extracting entities from the tangle of duplicate, partial, fragmented and heterogeneous information we call the Internet. The ~17m physical businesses in the US, for example, are found on over 1 billion webpages and endpoints across 5 million domains and applications. Organizing such a disparate collection of pages into a canonical set of things requires a combination of distributed data processing and human-based domain knowledge. This presentation stresses the importance of entity resolution within a business context and provides real-world examples and pragmatic insight into the process of canonicalization.

I like the Big Noise line. That may get some traction. It will certainly be the case that when users start having contact with unfiltered big data, they are likely to be as annoyed as they are with web searching.

The answer to their questions is likely “out there” but it lies just beyond their grasp. Failing, they won’t blame the Big Noise or their own lack of skill, but the inoffensive (and ineffective) tools at hand. Guess it is a good thing search engines are free save for advertising.

PS: The slides have a number of blanks. I have written to the authors, well, to their company, asking for corrected slides to be posted.

Paper: A Study of Practical Deduplication

Thursday, June 9th, 2011

Paper: A Study of Practical Deduplication

From the post:

With BigData comes BigStorage costs. One way to store less is simply not to store the same data twice. That’s the radically simple and powerful notion behind data deduplication. If you are one of those who got a good laugh out of the idea of eliminating SQL queries as a rather obvious scalability strategy, you’ll love this one, but it is a powerful feature and one I don’t hear talked about outside the enterprise. A parallel idea in programming is the once-and-only-once principle of never duplicating code.

Someone asked the other day about how to make topic maps profitable.

Well, selling a solution to issues like duplication of data would be one of them.

You do know that the kernel of the idea for topic maps arose out of a desire to avoid paying 2X, 3X, 4X, or more for the same documentation on military equipment. Yes? Ultimately it didn’t fly because of the markup that contractors get on documentation, which then funds their hiring of military retirees. That doesn’t mean the original idea was a bad one.

Now, applying a topic map to military documentation systems and demonstrating the duplication of content, perhaps using one of Lars Marius Garshol’s similarity measures, would be a rocking topic map application. Particularly in budget-cutting times.