Data deduplication tactics with HDFS and MapReduce
From the post:
As the amount of data continues to grow exponentially, there has been increased focus on stored data reduction methods. Data compression, single instance store and data deduplication are among the common techniques employed for stored data reduction.
Deduplication often refers to the elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, the data itself is not transformed; instead, storage for identical copies of data is eliminated. Data deduplication offers significant advantages in terms of reduced storage and network bandwidth, and promises increased scalability.
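The chunk-level idea above can be sketched in a few lines: fingerprint each fixed-size chunk, store only the first copy of each fingerprint, and keep a "recipe" of fingerprints to rebuild the original. This is a minimal illustration (function names and the fixed-size chunking are my assumptions, not the post's implementation):

```python
import hashlib

def dedupe_chunks(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep only one stored
    instance per unique chunk, keyed by its SHA-256 fingerprint."""
    store = {}   # fingerprint -> the single stored chunk instance
    recipe = []  # ordered fingerprints needed to rebuild the data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)  # only first occurrence is kept
        recipe.append(fp)
    return store, recipe

def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the chunk store and recipe."""
    return b"".join(store[fp] for fp in recipe)
```

With highly redundant input, the chunk store holds far fewer bytes than the original, while the recipe preserves exact reconstruction.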
From a simplistic use-case perspective, consider removing duplicate Call Detail Records (CDRs) for a telecom carrier. Similarly, the technique may be applied to reduce network traffic carrying identical data packets.
Covers five (5) tactics:
- Using HDFS and MapReduce only
- Using HDFS and HBase
- Using HDFS, MapReduce and a Storage Controller
- Using Streaming, HDFS and MapReduce
- Using MapReduce with Blocking techniques
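For the first tactic (HDFS and MapReduce only), the common pattern is to emit each record as the map output key, let the shuffle phase group identical records together, and have the reducer write each key exactly once. Here is a toy driver simulating that pattern in plain Python (the `mapper`/`reducer`/`run_job` names are mine, not the Hadoop API):

```python
from itertools import groupby

def mapper(record):
    # Emit the whole record as the key; the value is irrelevant.
    yield (record, None)

def reducer(key, values):
    # The shuffle has already grouped identical keys: emit the key once.
    yield key

def run_job(records):
    """Toy stand-in for the framework's map -> shuffle/sort -> reduce flow."""
    mapped = [kv for r in records for kv in mapper(r)]
    mapped.sort(key=lambda kv: kv[0])  # shuffle/sort phase groups duplicates
    output = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        output.extend(reducer(key, [v for _, v in group]))
    return output
```

Applied to duplicated CDR lines, `run_job(["a,1", "b,2", "a,1"])` yields one copy of each record, which is exactly the deduplication the reducer-side collapse provides.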
In these times of “Great Sequestration,” how much are you spending on duplicated contractor documentation?
You do get electronic forms of documentation. Yes?
It is not that difficult to document prior contractor self-plagiarism. Teasing out what you “mistakenly” paid for may be harder.
Question: Would you rather find out now and correct or have someone else find out?
PS: For the ambitious in government employment. You might want to consider how discovery of contractor self-plagiarism reflects on your initiative and dedication to “good” government.