MongoDB + Fractal Tree Indexes = High Compression by Tim Callaghan.
You may have heard that MapR Technologies broke the MinuteSort record by sorting 15 billion 100-byte records in 60 seconds. The run used 2,103 virtual instances in the Google Compute Engine; each instance had four virtual cores and one virtual disk, for a total of 8,412 virtual cores and 2,103 virtual disks. See: Google Compute Engine, MapR Break MinuteSort Record.
So, the next time you have 8,412 virtual cores and 2,103 virtual disks, you know what is possible. :-)
But if you have less firepower than that, you will need to be clever:
One doesn’t have to look far to see that there is strong interest in MongoDB compression. MongoDB has an open ticket from 2009 titled “Option to Store Data Compressed” with Fix Version/s planned but not scheduled. The ticket has a lot of comments, mostly from MongoDB users explaining their use-cases for the feature. For example, Khalid Salomão notes that “Compression would be very good to reduce storage cost and improve IO performance” and Andy notes that “SSD is getting more and more common for servers. They are very fast. The problems are high costs and low capacity.” There are many more in the ticket.
In prior blogs we’ve written about significant performance advantages when using Fractal Tree Indexes with MongoDB. Compression has always been a key feature of Fractal Tree Indexes. We currently support the LZMA, quicklz, and zlib compression algorithms, and our architecture allows us to easily add more. Our large block size creates another advantage as these algorithms tend to compress large blocks better than small ones.
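The point about block size can be checked with a rough sketch. The Python below is not the Fractal Tree implementation; it only compresses the same bytes in independent blocks of different sizes and compares the totals. The sample documents, the helper names (make_sample, compressed_size) and the block sizes are assumptions made for illustration, and only zlib and LZMA are shown because both ship in the Python standard library.

```python
import json
import lzma
import zlib


def make_sample(n_docs: int = 2000) -> bytes:
    """Build repetitive, document-like sample data (hypothetical, not the benchmark's data)."""
    docs = [
        {"_id": i, "customer": f"customer-{i % 50}", "status": "active", "amount": i * 3 % 1000}
        for i in range(n_docs)
    ]
    return "\n".join(json.dumps(d) for d in docs).encode("utf-8")


def compressed_size(data: bytes, block_size: int, compress) -> int:
    """Compress data in independent blocks of block_size bytes; return the total compressed size."""
    return sum(
        len(compress(data[start:start + block_size]))
        for start in range(0, len(data), block_size)
    )


if __name__ == "__main__":
    sample = make_sample()
    # Block sizes are arbitrary choices for the comparison, not values taken from the post.
    for name, compress in (("zlib", zlib.compress), ("lzma", lzma.compress)):
        for block_size in (4 * 1024, 64 * 1024, 4 * 1024 * 1024):
            total = compressed_size(sample, block_size, compress)
            print(f"{name:5s} block={block_size:>8d} bytes -> {total} bytes ({total / len(sample):.1%})")
```

Each block is compressed on its own, so small blocks pay the per-stream overhead repeatedly and cannot reuse repetition that falls outside the block. Real data will show different ratios than this synthetic sample.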
Given the interest in compression for MongoDB and our capabilities to address this functionality, we decided to do a benchmark to measure the compression achieved by MongoDB + Fractal Tree Indexes using each available compression type. The benchmark loads 51 million documents into a collection and measures the size of all files in the file system (--dbpath).
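To repeat that kind of measurement on your own data, one possible approach is to sum the sizes of every file under the database directory after loading. The sketch below is an assumption about how such a measurement could be made, not the script used in the benchmark; the function name directory_size_bytes is hypothetical, and the path is whatever directory you passed to mongod as its data path.

```python
import os
import sys


def directory_size_bytes(path: str) -> int:
    """Return the total apparent size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symbolic links so a linked file is not counted here.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


if __name__ == "__main__":
    # Pass the data directory as the first argument; defaults to the current directory.
    dbpath = sys.argv[1] if len(sys.argv) > 1 else "."
    size = directory_size_bytes(dbpath)
    print(f"{dbpath}: {size} bytes ({size / 1024 ** 3:.2f} GiB)")
```

This reports apparent file sizes, which can differ from the blocks actually allocated on disk (for example with sparse or preallocated files), so compare like with like across runs.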
More benchmarks are to follow, and you should remember that all benchmarks are just that: benchmarks.
Benchmarks do not represent experience with your data, under your operating load and network conditions, etc.
Investigate software based on the first, purchase software based on the second.