Archive for the ‘Deduplication’ Category

Data deduplication tactics with HDFS and MapReduce [Contractor Plagiarism?]

Wednesday, February 13th, 2013

Data deduplication tactics with HDFS and MapReduce

From the post:

As the amount of data continues to grow exponentially, there has been increased focus on stored data reduction methods. Data compression, single instance store and data deduplication are among the common techniques employed for stored data reduction.

Deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, data is not changed and eliminates storage capacity for identical data. Data deduplication offers significant advantage in terms of reduction in storage, network bandwidth and promises increased scalability.

From a simplistic use case perspective, we can see application in removing duplicates in Call Detail Record (CDR) for a Telecom carrier. Similarly, we may apply the technique to optimize on network traffic carrying the same data packets.

Covers five (5) tactics:

  1. Using HDFS and MapReduce only
  2. Using HDFS and HBase
  3. Using HDFS, MapReduce and a Storage Controller
  4. Using Streaming, HDFS and MapReduce
  5. Using MapReduce with Blocking techniques

In these times of “Great Sequestration,” how much you are spending on duplicated contractor documentation?

You do get electronic forms of documentation. Yes?

Not that difficult to document prior contractor self-plagiarism. Teasing out what you “mistakenly” paid for it may be harder.

Question: Would you rather find out now and correct or have someone else find out?

PS: For the ambitious in government employment. You might want to consider how discovery of contractor self-plagiarism reflects on your initiative and dedication to “good” government.

…Efficient Approximate Data De-Duplication in Streams [Approximate Merging?]

Tuesday, December 18th, 2012

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams by Suman K. Bera, Sourav Dutta, Ankur Narang, Souvik Bhattacherjee.

Abstract:

Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of such massively exponential amount of data from diverse sources. De-duplication or Intelligent Compression in streaming scenarios for approximate identification and elimination of duplicates from such unbounded data stream is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) addresses this problem to a certain extent.
.
In this work, we present several novel algorithms for the problem of approximate detection of duplicates in data streams. We propose the Reservoir Sampling based Bloom Filter (RSBF) combining the working principle of reservoir sampling and Bloom Filters. We also present variants of the novel Biased Sampling based Bloom Filter (BSBF) based on biased sampling concepts. We also propose a randomized load balanced variant of the sampling Bloom Filter approach to efficiently tackle the duplicate detection. In this work, we thus provide a generic framework for de-duplication using Bloom Filters. Using detailed theoretical analysis we prove analytical bounds on the false positive rate, false negative rate and convergence rate of the proposed structures. We exhibit that our models clearly outperform the existing methods. We also demonstrate empirical analysis of the structures using real-world datasets (3 million records) and also with synthetic datasets (1 billion records) capturing various input distributions.

If you think of more than one representative for a subject as “duplication,” then merging is a special class of “deduplication.”

Deduplication that discards redundant information but that preserves unique additional information and relationships to other subjects.

As you move away from static and towards transient topic maps, representations of subjects in real time data streams, this and similar techniques will become very important.

I first saw this in a tweet from Stefano Bertolo.

PS: A new equivalent term (to me) for deduplication: “intelligent compression.” Pulls about 46K+ “hits” in a popular search engine today. May want to add it to your routine search queries.

Swoosh: a generic approach to entity resolution

Wednesday, August 1st, 2012

Swoosh: a generic approach to entity resolution by Benjelloun, Omar and Garcia-Molina, Hector and Menestrina, David and Su, Qi and Whang, Steven Euijong and Widom, Jennifer (2008) Swoosh: a generic approach to entity resolution. The VLDB Journal.

Do you remember Swoosh?

I saw it today in Five Short Links by Pete Warden.

Abstract:

We consider the Entity Resolution (ER) problem (also known as deduplication, or merge-purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating the functions for comparing and merging records as black-boxes, which permits expressive and extensible ER solutions. We identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh that exploit the 4 properties. F-Swoosh in addition assumes knowledge of the “features” ( e.g., attributes) used by the match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties do not hold, if an “approximate” result is acceptable.

It sounds familiar.

Running some bibliographic searches, looks like 100 references since 2011. That’s going to take a while! But it all looks like good stuff.

GNU C++ hash_set vs STL std::set: my notebook

Tuesday, July 10th, 2012

GNU C++ hash_set vs STL std::set: my notebook by Pierre Lindenbaum.

Pierre compares the C++ template set of the C++ Standard Template library to the GNU non-standard hash-based set on a set of random numbers to insert/remove.

The results may surprise you.

Worth investigating if you are removing duplicates post-query.

A Parallel Architecture for In-Line Data De-duplication

Monday, March 19th, 2012

A Parallel Architecture for In-Line Data De-duplication by Seetendra Singh Sengar, Manoj Mishra. (2012 Second International Conference on Advanced Computing & Communication Technologies)

Abstract:

Recently, data de-duplication, the hot emerging technology, has received a broad attention from both academia and industry. Some researches focus on the approach by which more redundant data can be reduced and others investigate how to do data de-duplication at high speed. In this paper, we show the importance of data de-duplication in the current digital world and aim at reducing the time and space requirement for data de-duplication. Then, we present a parallel architecture with one node designated as a server and multiple storage nodes. All the nodes, including the server, can do block level in-line de-duplication in parallel. We have built a prototype of the system and present some performance results. The proposed system uses magnetic disks as a storage technology.

Apologies but all I have at the moment is the abstract.

Duke 0.4

Friday, January 13th, 2012

Duke 0.4

New release of deduplication software written in Java on top of Lucene by Lars Marius Garshol.

From the release notes:

This version of Duke introduces:

  • Added JNDI data source for connecting to databases via JNDI (thanks to FMitzlaff).
  • In-memory data source added (thanks to FMitzlaff).
  • Record linkage mode now more flexible: can implement different strategies for choosing optimal links (with FMitzlaff).
  • Record linkage API refactored slightly to be more flexible (with FMitzlaff).
  • Added utilities for building equivalence classes from Duke output.
  • Made the XML config loader more robust.
  • Added a special cleaner for English person names.
  • Fixed bug in NumericComparator ( issue 66 )
  • Uses own Lucene query parser to avoid issues with search strings.
  • Upgraded to Lucene 3.5.0.
  • Added many more tests.
  • Many small bug fixes to core, NTriples reader, ec.

BTW, the documentation is online only: http://code.google.com/p/duke/wiki/GettingStarted.

Factual Resolve

Friday, October 28th, 2011

Factual Resolve

Factual has a new API – Resolve:

From the post:

The Internet is awash with data. Where ten years ago developers had difficulty finding data to power applications, today’s difficulty lies in making sense of its abundance, identifying signal amidst the noise, and understanding its contextual relevance. To address these problems Factual is today launching Resolve — an entity resolution API that makes partial records complete, matches one entity against another, and assists in de-duping and normalizing datasets.

The idea behind Resolve is very straightforward: you tell us what you know about an entity, and we, in turn, tell you everything we know about it. Because data is so commonly fractured and heterogeneous, we accept fragments of an entity and return the matching entity in its entirety. Resolve allows you to do a number of things that will make your data engineering tasks easier:

  • enrich records by populating missing attributes, including category, lat/long, and address
  • de-dupe your own place database
  • convert multiple daily deal and coupon feeds into a single normalized, georeferenced feed
  • identify entities unequivocally by their attributes

For example: you may be integrating data from an app that provides only the name of a place and an imprecise location. Pass what you know to Factual Resolve via a GET request, with the attributes included as JSON-encoded key/value pairs:

I particularly like the line:

identify entities unequivocally by their attributes

I don’t know about the “unequivocally” part but the rest of it rings true. At least in my experience.

Dedupe, Merge, and Purge: the Art of Normalization

Friday, October 28th, 2011

Dedupe, Merge, and Purge: the Art of Normalization by Tyler Bell and Leo Polovets.

From the description:

Big Noise always accompanies Big Data, especially when extracting entities from the tangle of duplicate, partial, fragmented and heterogeneous information we call the Internet. The ~17m physical businesses in the US, for example, are found on over 1 billion webpages and endpoints across 5 million domains and applications. Organizing such a disparate collection of pages into a canonical set of things requires a combination of distributed data processing and human-based domain knowledge. This presentation stresses the importance of entity resolution within a business context and provides real-world examples and pragmatic insight into the process of canonicalization.

I like the Big Noise line. That may have some traction. Certainly will be the case that when users started having contact with unfiltered big data, they are likely to be as annoyed as they are with web searching.

The answer to their questions is likely “out there” but it lies just beyond their grasp. Failing they won’t blame the Big Noise or their own lack of skill but the inoffensive (and ineffective) tools at hand. Guess it is a good thing search engines are free save for advertising.

PS: The slides have a number of blanks. I have written to the authors, well, to their company, asking for corrected slides to be posted.

Paper: A Study of Practical Deduplication

Thursday, June 9th, 2011

Paper: A Study of Practical Deduplication

From the post:

With BigData comes BigStorage costs. One way to store less is simply not to store the same data twice. That’s the radically simple and powerful notion behind data deduplication. If you are one of those who got a good laugh out of the idea of eliminating SQL queries as a rather obvious scalability strategy, you’ll love this one, but it is a powerful feature and one I don’t hear talked about outside the enterprise. A parallel idea in programming is the once-and-only-once principle of never duplicating code.

Someone asked the other day about how to make topic maps profitable.

Well, selling a solution to issues like duplication of data would be one of them.

You do know that the kernel of the idea for topic maps arose out of a desire to avoid paying 2X, 3X, 4X, or more for the same documentation on military equipment. Yes? Didn’t fly ultimately because of the markup that contractors get on documentation, which then funds their hiring military retirees. That doesn’t mean the original idea was a bad one.

Now, applying a topic map to military documentation systems and demonstrating the duplication of content, perhaps using one of Lars Marius Garshol’s similarity measures, that sounds like a rocking topic map application. Particularly in budget cutting times.