Mneme: Scalable Duplicate Filtering Service
From the post:
Detecting and dealing with duplicates is a common problem: sometimes we want to avoid performing an operation based on this knowledge, and at other times, like in the case of a database, we may want to only permit an operation based on a hit in the filter (ex: skip disk access on a cache miss). How do we build a system to solve the problem? The solution will depend on the amount of data, frequency of access, maintenance overhead, language, and so on. There are many ways to solve this puzzle.
In fact, that is the problem – there are too many ways. Having reimplemented at least half a dozen solutions in various languages and with various characteristics at PostRank, we arrived at the following requirements: we want a system that is able to scale to hundreds of millions of keys, we want it to be as space efficient as possible, have minimal maintenance, provide low latency access, and impose no language barriers. The tradeoff: we will accept a certain (customizable) degree of error, and we will not persist the keys forever.
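To make the space/error tradeoff concrete, the standard bloom filter sizing formulas give a rough idea of what "hundreds of millions of keys" costs at a given error rate. This is a back-of-the-envelope sketch in Python, not anything from Mneme itself; the 100 million keys and 1% error rate are assumed figures:

```python
import math

def bloom_size(num_keys, error_rate):
    """Bits and hash functions for a standard bloom filter:
    m = -n * ln(p) / ln(2)^2, k = (m / n) * ln(2)."""
    bits = -num_keys * math.log(error_rate) / (math.log(2) ** 2)
    hashes = (bits / num_keys) * math.log(2)
    return math.ceil(bits), round(hashes)

# Assumed workload: 100 million keys at a 1% false-positive rate.
bits, hashes = bloom_size(100_000_000, 0.01)
print(f"{bits / 8 / 2**20:.0f} MiB, {hashes} hash functions")  # roughly 114 MiB, 7 hashes
```

At roughly 10 bits per key, an entire trailing window of that size fits comfortably in memory, which is what makes the low latency requirement plausible.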
Mneme: Duplicate filter & detection
Mneme is an HTTP web-service for recording and identifying previously seen records – aka, duplicate detection. To achieve the above requirements, it is implemented via a collection of bloomfilters. Each bloomfilter is responsible for efficiently storing the set membership information about a particular key for a defined period of time. Need to filter your keys for the trailing 24 hours? Mneme can create and automatically rotate 24 hourly filters on your behalf – no maintenance required.
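The rotation scheme is easy to picture in code. The sketch below is a minimal in-process illustration of the idea (one bloom filter per hourly bucket, old buckets dropped as they age out); it is not Mneme's implementation, which exposes this behaviour as an HTTP service, and the class and parameter names here are invented for the example:

```python
import hashlib
import math
import time


class BloomFilter:
    """Simple bloom filter over a bytearray bit set, using double hashing."""

    def __init__(self, capacity, error_rate):
        # Standard sizing: m bits and k hash functions for the target error rate.
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.num_hashes = max(1, round((self.num_bits / capacity) * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        # Kirsch–Mitzenmacher double hashing: position_i = h1 + i * h2.
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class RotatingDuplicateFilter:
    """Keeps one bloom filter per hourly bucket and drops buckets older than `hours`."""

    def __init__(self, hours=24, capacity=1_000_000, error_rate=0.01):
        self.hours = hours
        self.capacity = capacity
        self.error_rate = error_rate
        self.filters = {}  # hour bucket -> BloomFilter

    def _bucket(self):
        return int(time.time() // 3600)

    def _expire(self, current_bucket):
        cutoff = current_bucket - self.hours
        for bucket in [b for b in self.filters if b <= cutoff]:
            del self.filters[bucket]

    def seen(self, key):
        """Return True if `key` was (probably) recorded in the trailing window, then record it."""
        bucket = self._bucket()
        self._expire(bucket)
        duplicate = any(key in f for f in self.filters.values())
        current = self.filters.setdefault(bucket, BloomFilter(self.capacity, self.error_rate))
        current.add(key)
        return duplicate
```

A single seen(key) call both checks the trailing window and records the key in the current bucket, so a caller gets the filter-and-record behaviour in one step.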
Interesting in several respects:
- Duplicate detection
- Duplicate detection for a defined period of time
- Duplicate detection for a defined period of time with “customizable” degree of error
How relevant that is would depend on your topic map project requirements. Assuming absolute truth forever and ever isn’t one of them, detecting duplicate subject representatives for some time period at a specified error rate may be exactly what you are looking for.
It enables a discussion of how much certainty (error rate), for how long (time period), for detection of duplicates (subject representatives), and on what basis. All of those are going to impact project complexity and duration.
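To make that discussion concrete, here is how the sketch from above might be pointed at subject representatives, with the time period and error rate as explicit, tunable knobs. The identifiers and settings are invented for illustration:

```python
# Trailing 7 days of subject identifiers, tolerating a 0.1% false-positive rate.
# (RotatingDuplicateFilter is the illustrative class sketched earlier, not a Mneme API.)
dedup = RotatingDuplicateFilter(hours=7 * 24, capacity=10_000_000, error_rate=0.001)

for psi in ["http://example.org/subject/42", "http://example.org/subject/42"]:
    if dedup.seen(psi):
        print("possible duplicate subject representative:", psi)
```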
It is also interesting as a solution that will work quite well for some duplicate detection requirements.