From the Rapleaf blog:
We’re really excited to announce the open-source debut of a cool piece of Rapleaf’s internal infrastructure, a distributed database project we call Hank.
Our use case is very particular: we have tons of data that needs to be processed, producing a lot of data points for individual people, which then need to be made randomly accessible so they can be served through our API. You can think of it as the “process and publish” pattern.
For the processing component, Hadoop and Cascading were an obvious choice. However, making our results randomly accessible for the API was more challenging. We couldn’t find an existing solution that was fast, scalable, and perhaps most importantly, wouldn’t degrade performance during updates. Our API needs to have lightning-fast responses so that our customers can use it in realtime to personalize their users’ experiences, and it’s just not acceptable for us to have periods where reads contend with writes while we’re updating.
- Random reads need to be fast – reliably on the order of a few milliseconds.
- Datastores need to scale to terabytes, with keys and values on the order of kilobytes.
- We need to be able to push out hundreds of millions of updates a day, but they don’t have to happen in realtime. Most will come from our Hadoop cluster.
- Read performance should not suffer while updates are in progress.
- During the update process, it doesn’t matter if there is more than one version of our datastores available. Our application is tolerant of this inconsistency.
- We have no need for random writes.
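The requirements above describe a version-and-swap design: reads always hit a complete, immutable version of the datastore, while a batch job builds the next version offline and publishes it in a single step, so reads never contend with writes. A minimal sketch of that idea, assuming nothing about Hank's actual API (the `VersionedStore` class and its method names here are purely illustrative):

```python
class VersionedStore:
    """Illustrative batch-updated key-value store: readers see only
    complete, published versions; there are no random writes."""

    def __init__(self):
        self._current = {}  # the snapshot currently served to readers

    def get(self, key):
        # Reads go against the current snapshot only, so an
        # in-progress batch build never slows them down.
        return self._current.get(key)

    def publish(self, new_version):
        # A batch job (e.g. a Hadoop/Cascading flow) produces a whole
        # new version; swapping the reference is the only "write".
        self._current = dict(new_version)


store = VersionedStore()
store.publish({"user:1": "profile-a"})
print(store.get("user:1"))  # reads come from the published version

# The next batch run replaces the entire version at once; briefly
# serving either version is acceptable per the requirements above.
store.publish({"user:1": "profile-b", "user:2": "profile-c"})
print(store.get("user:2"))
```

The key property is that an update is a pointer swap over an immutable snapshot, which is why the post can promise that read performance does not suffer while updates are in progress.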
If you read updates as merges, then the relevance of this post to topic maps becomes a bit clearer.
Not all topic map systems will have the same requirements and non-requirements.
(This resource was pointed out to me by Jack Park.)