Damn Cool Algorithms: Cardinality Estimation by Nick Johnson.
From the post:
Suppose you have a very large dataset – far too large to hold in memory – with duplicate entries. You want to know how many duplicate entries, but your data isn’t sorted, and it’s big enough that sorting and counting is impractical. How do you estimate how many unique entries the dataset contains? It’s easy to see how this could be useful in many applications, such as query planning in a database: the best query plan can depend greatly on not just how many values there are in total, but also on how many unique values there are.
I’d encourage you to give this a bit of thought before reading onwards, because the algorithms we’ll discuss today are quite innovative – and while simple, they’re far from obvious.
They are singing our song!
I found this looking around for newer entries after stumbling on the older one.