Open Source Release: postgresql-hll
From the post:
We’re happy to announce the first open-source release of AK’s PostgreSQL extension for building and manipulating HyperLogLog data structures in SQL, postgresql-hll. We are releasing this code under the Apache License, Version 2.0, which we feel strikes an excellent balance between permissive usage and liability limitation.
What is it and what can I do with it?
The extension introduces a new data type, hll, which represents a probabilistic distinct value counter that is a hybrid between a HyperLogLog data structure (for large cardinalities) and a simple set (for small cardinalities). These structures support the basic HLL methods: insert, union, and cardinality, and we’ve also provided aggregate and debugging functions that make using and understanding these structures a breeze. Finally, we’ve included a way to do schema versioning of the binary representation of hlls, which should allow a clear path to upgrading the algorithm as new engineering insights come up.
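To give a feel for the basic workflow, here is a minimal sketch in SQL; the table and column names (daily_uniques, facts, user_id) are hypothetical, while the functions are the ones documented in REFERENCE.markdown:

```sql
-- Hypothetical table: one hll of visitor ids per day.
CREATE TABLE daily_uniques (
    date  date UNIQUE,
    users hll
);

-- Insert: hash each raw value, then fold the hashes into an hll.
INSERT INTO daily_uniques (date, users)
SELECT date, hll_add_agg(hll_hash_integer(user_id))
FROM facts
GROUP BY date;

-- Cardinality: approximate distinct users for each day.
SELECT date, hll_cardinality(users) FROM daily_uniques;

-- Union: approximate distinct users across all days.
SELECT hll_cardinality(hll_union_agg(users)) FROM daily_uniques;
```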
A quick overview of what’s included in the release:
- C-based extension that provides the hll data structure and algorithms
- Austin Appleby’s MurmurHash3 implementation and SQL-land wrappers for integer numerics, bytes, and text
- Full storage specification in STORAGE.markdown
- Full function reference in REFERENCE.markdown
- .spec file for rpmbuild
- Full test suite
A quick note on why we included MurmurHash3 in the extension: we’ve done a good bit of research on the importance of a good hash function when using sketching algorithms like HyperLogLog, and we came to the conclusion that it wouldn’t be very user-friendly to force the user to figure out how to get a good hash function into SQL-land. Sure, there are plenty of cryptographic hash functions available, but those are (computationally) overkill for what is needed. We did the research and found MurmurHash3 to be an excellent non-cryptographic hash function in both theory and practice, and we’ve been using it in production for a while now with excellent results. As mentioned in the README, it’s of crucial importance to reliably hash the inputs to hlls.
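For reference, those SQL-land wrappers look like this (a sketch; each returns an hll_hashval that can be fed to hll_add or hll_add_agg):

```sql
-- MurmurHash3 wrappers shipped with the extension.
SELECT hll_hash_integer(42);                  -- integer numerics
SELECT hll_hash_bigint(9223372036854775807);  -- 64-bit integers
SELECT hll_hash_bytea('\x0102ff'::bytea);     -- raw bytes
SELECT hll_hash_text('postgresql-hll');       -- text
```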
Would you say topic maps aggregate data?
I thought so.
How would you adapt HLL to synonymous identifications?
I ask because of this line in the post:
Essentially, we’re doing interactive set-intersection of operands with millions or billions of entries in milliseconds. This is intersection computed using the inclusion-exclusion principle as applied to hlls: |A ∩ B| = |A| + |B| − |A ∪ B|.
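Reusing the hypothetical daily_uniques table from the sketch above, that computation might look like this in SQL (hll_union is the documented two-hll union function; the dates are placeholders):

```sql
-- Estimate |A ∩ B| for two days via inclusion-exclusion:
-- |A ∩ B| = |A| + |B| − |A ∪ B|
SELECT hll_cardinality(a.users)
     + hll_cardinality(b.users)
     - hll_cardinality(hll_union(a.users, b.users)) AS intersection_estimate
FROM daily_uniques a, daily_uniques b
WHERE a.date = DATE '2013-01-01'   -- hypothetical dates
  AND b.date = DATE '2013-01-02';
```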
Performing “…interactive set-intersection of operands with millions or billions of entries in milliseconds…” sounds like an attractive feature for topic map software.
Yes?