A quick summary of the Tokutek repositories at Github and pointers to Google groups for discussion of TokuDB.
Archive for the ‘Tokutek’ Category
Wanted: Evaluators to Try MongoDB with Fractal Tree Indexing by Tim Callaghan.
From the post:
We recently resumed our discussion around bringing Fractal Tree indexes to MongoDB. This effort includes Tokutek’s interview with Jeff Kelly at Strata as well as my two recent tech blogs which describe the compression achieved on a generic MongoDB data set and performance improvements we measured using our implementation of Sysbench for MongoDB. I have a full line-up of benchmarks and blogs planned for the next few months, as our project continues. Many of these will be deeply technical and written by the Tokutek developers.
We have a group of evaluators running MongoDB with Fractal Tree Indexes, but more feedback is always better. So …
Do you want to participate in the process of bringing high compression and extreme performance gains to MongoDB? We’re looking for MongoDB experts to test our build on your real-world workloads and benchmarks. Evaluator feedback will be used in creating the product road map. Please email me at firstname.lastname@example.org if interested.
You keep reading about the performance numbers on MongoDB.
Aren’t you curious if those numbers are true for your use case?
Here’s your opportunity to find out!
NoSQL is Great, But You Still Need Indexes by Martin Farach-Colton.
From the post:
I’ve said it before, and, as is the nature of these things, I’ll almost certainly say it again: your database performance is only as good as your indexes.
That’s the grand thesis, so what does that mean? In any DB system — SQL, NoSQL, NewSQL, PostSQL, … — data gets ingested and organized. And the system answers queries. The pain point for most users is around the speed to answer queries. And the query speed (both latency and throughput, to be exact) depends on how the data is organized. In short: Good Indexes, Fast Queries; Poor Indexes, Slow Queries.
But building indexes is hard work, or at least it has been for the last several decades, because almost all indexing is done with B-trees. That’s true of commercial databases, of MySQL, and of most NoSQL solutions that do indexing. (The ones that don’t do indexing solve a very different problem and probably shouldn’t be confused with databases.)
It’s not true of TokuDB. We build Fractal Tree Indexes, which are much easier to maintain but can still answer queries quickly. So with TokuDB, it’s Fast Indexes, More Indexes, Fast Queries. TokuDB is usually thought of as a storage engine for MySQL and MariaDB. But it’s really a B-tree substitute, so we’re always on the lookout for systems where we can improve the indexing.
Enter MongoDB. MongoDB is beloved because it makes deployment fast. But when you peel away the layers, you get down to a B-tree, with all the performance headaches and workarounds that they necessitate.
That’s the theory, anyway. So we did some testing. We ripped out the part of MongoDB that takes care of secondary indices and plugged in TokuDB. We’ve posted the blogs before, but here they are again, the greatest hits of TokuDB+MongoDB: we show a 10x insertion performance, a 268x query performance, and a 532x (or 53,200% if you prefer) multikey index insertion performance. We also discussed covered indexes vs. clustered Fractal Tree Indexes.
Did somebody declare February 20th to be performance release day?
Did I miss that memo?
Like every geek, I like faster. But, here’s my question:
Have there been any studies on the impact of faster systems on searching and decision making by users?
My assumption is the faster I get a non-responsive result from a search, the sooner I can improve it.
But that’s an assumption on my part.
Is that really true?
From the post:
In Part 1, we showed performance results of some of the work that’s gone in to TokuDB v6.6. In this post, we’ll take a closer look at how this happened, on the engineering side, and how to think about the performance characteristics in the new version.
It’s easiest to think about our concurrency changes in terms of a Fractal Tree® index that has nodes like a B-tree index, and buffers on each node that batch changes for the subtree rooted at that node. We have materials that describe this available here, but we can proceed just knowing that:
- To inject data into the tree, you need to store a message in a buffer at the root of the tree. These messages are moved down the tree, so you can find messages in all the internal nodes of the tree (the mechanism that moves them is irrelevant for now).
- To read data out of the tree, you need to find a leaf node that contains your key, check the buffers on the path up to the root for messages that affect your query, and apply any such messages to the value in the leaf before using that value to answer your query.
It’s these operations that modify and examine the buffers in the root that were the main reason we used to serialize operations inside a single index.
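To make the two bullet points concrete, here is a toy sketch of a tree with per-node message buffers. It is my own illustration, not Tokutek’s code: the names (`Leaf`, `Internal`, `inject`, `flush`, `query`, `build_demo_tree`), the fixed tree shape, the "flush everything when the buffer overflows" policy, and the upsert-only messages are all assumptions made to keep it short.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self):
        self.values = {}          # key -> value already applied at the leaf

class Internal:
    def __init__(self, pivots, children, buffer_limit=2):
        self.pivots = pivots      # sorted keys separating the children
        self.children = children  # len(children) == len(pivots) + 1
        self.buffer = []          # pending (key, value) messages, oldest first
        self.buffer_limit = buffer_limit

    def child_for(self, key):
        return self.children[bisect_right(self.pivots, key)]

def inject(root, key, value):
    """Write path: store a message in the root buffer, flushing on overflow."""
    root.buffer.append((key, value))
    if len(root.buffer) > root.buffer_limit:
        flush(root)

def flush(node):
    """Move every buffered message one level down (a deliberately naive policy)."""
    pending, node.buffer = node.buffer, []
    for key, value in pending:
        child = node.child_for(key)
        if isinstance(child, Leaf):
            child.values[key] = value          # message finally applied
        else:
            child.buffer.append((key, value))  # still pending, one level lower
    for child in node.children:
        if isinstance(child, Internal) and len(child.buffer) > child.buffer_limit:
            flush(child)

def query(root, key):
    """Read path: find the leaf, then apply buffered messages from the path."""
    path, node = [], root
    while isinstance(node, Internal):
        path.append(node)
        node = node.child_for(key)
    value = node.values.get(key)
    # Buffers nearer the root hold newer messages, so apply the deepest first.
    for internal in reversed(path):
        for k, v in internal.buffer:
            if k == key:
                value = v
    return value

def build_demo_tree():
    """A fixed three-level tree: root -> two internal nodes -> four leaves."""
    left = Internal([25], [Leaf(), Leaf()])
    right = Internal([75], [Leaf(), Leaf()])
    return Internal([50], [left, right])
```

Every `inject` and every `query` in this sketch touches the root’s buffer, which is the contention point the post describes.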
Just so not everything today is “soft” user stuff.
Interesting avoidance of the root node as an I/O bottleneck.
Sort of thing that gets me to thinking about distributed topic map writing/querying.
Tracking 5.3 Billion Mutations: Using MySQL for Genomic Big Data by Lawrence Schwartz.
From the post:
The Organization: The Philip Awadalla Laboratory is the Medical and Population Genomics Laboratory at the University of Montreal. Working with empirical genomic data and modern computational models, the laboratory addresses questions relevant to how genetics and the environment influence the frequency and severity of diseases in human populations. Its research includes work relevant to all types of human diseases: genetic, immunological, infectious, chronic and cancer. Using genomic data from single-nucleotide polymorphisms (SNP), next-generation re-sequencing, and gene expression, along with modern statistical tools, the lab is able to locate genome regions that are associated with disease pathology and virulence as well as study the mechanisms that cause the mutations.
The Challenge: The lab’s genomic research database is following 1400 individuals with 3.7 million shared mutations, which means it is tracking 5.3 billion mutations. Because the representation of genomic sequence is a highly compressible series of letters, the database requires less hardware than a typical one. However, it must be able to store and retrieve data quickly in order to respond to research requests.
Thibault de Malliard, the researcher tasked with managing the lab’s data, adds hundreds of thousands of records every day to the lab’s MySQL database. The database must be able to process the records ASAP so that the researchers can make queries and find information quickly. However, as the database grew to 200 GB, its performance plummeted. de Malliard determined that the database’s MyISAM storage engine was having difficulty keeping up with the fire hose of data, pointing out that a single sequencing batch could take days to run.
Anticipating that the database could grow to 500 GB or even 1 TB within the next year, de Malliard began to search for a storage engine that would maintain performance no matter how large his database got.
Insertion Performance: “For us, TokuDB proved to be over 50x faster to add or update data into big tables,” according to de Malliard. “Adding 1M records took 51 min for MyISAM, but 1 min for TokuDB. So inserting one sequencing batch with 48 samples and 1.5M positions would take 2.5 days for MyISAM but one hour with TokuDB.”
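The batch figures in that quote can be checked with a little arithmetic. A rough sketch, assuming (my assumption, not stated in the post) that a batch produces one record per sample per position, so 48 × 1.5M records; `batch_minutes` is just a name I made up:

```python
# Minutes to add one million records, as quoted by de Malliard.
MINUTES_PER_MILLION = {"MyISAM": 51, "TokuDB": 1}

samples = 48
positions_per_sample = 1_500_000
# Assumption: one record per sample per position.
records = samples * positions_per_sample

def batch_minutes(engine):
    """Estimated minutes to insert one sequencing batch with the given engine."""
    return records / 1_000_000 * MINUTES_PER_MILLION[engine]

myisam_days = batch_minutes("MyISAM") / 60 / 24
tokudb_hours = batch_minutes("TokuDB") / 60
print(f"MyISAM: {myisam_days:.2f} days, TokuDB: {tokudb_hours:.1f} hours")
```

Under that assumption a batch is 72 million records, which works out to roughly 2.5 days for MyISAM and a bit over an hour for TokuDB, in line with the quoted numbers.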
OK, so it’s not “big data.” But it was critical data to the lab.
Maybe instead of “big data” we should be talking about “critical” or even “relevant” data.
Remember the story of the data analyst with “830 million GPS records of 80 million taxi trips” whose analysis confirmed what taxi drivers already knew: they stop driving when it rains. Could have asked a taxi driver or two. (See Starting Data Analysis with Assumptions.)
Take a look at TokuDB when you need a “relevant” data solution.
Fractal Tree Indexing Overview by Martin Farach-Colton.
From the post:
We get a lot of questions about how Fractal Tree indexes work. It’s a write-optimized index with fast queries, but which write-optimized indexing structure is it?
Suggestion: Watch the video along with the slides. (Some of the slides are less than intuitive. Trust me on this one.)
Martin Gardner explaining fractals in SciAm it’s not, but it will give you a better appreciation for fractal trees.
BTW, did you know B-Trees are forty years old this year?
Best Practices for a Successful TokuDB Evaluation by Gerry Narvaja
Date: December 11th
Time: 2 PM EST / 11 AM PST
From the webpage:
In this webinar we will show step by step how to install, configure, and test TokuDB for a typical performance evaluation. We’ll also be flagging potential pitfalls that can ruin the eval results. It will describe the differences between installing from scratch and replacing an existing MySQL / MariaDB installation. It will also review the most common issues that may arise when running TokuDB binaries.
You have seen the TokuDB numbers on their data.
Now you can see what numbers you can get with your data.
From the post:
The tutorial was organized as follows:
- Module 0: Tutorial overview and introductions. We describe an observed (but not necessary) tradeoff in ingestion, querying, and freshness in traditional databases.
- Module 1: I/O model and cache-oblivious analysis.
- Module 2: Write-optimized data structures. We give the optimal trade-off between inserts and point queries. We show how to build data structures that lie on this tradeoff curve.
- Module 2 continued: Write-optimized data structures perform writes much faster than point queries; this asymmetry affects the design of an ACID compliant database.
- Module 3: Case study – TokuFS. How to design and build a write-optimized file system.
- Module 4: Page-replacement algorithms. We give relevant theorems on the performance of page-replacement strategies such as LRU.
- Module 5: Index design, including covering indexes.
- Module 6: Log-structured merge trees and fractional cascading.
- Module 7: Bloom filters.
These algorithms and data structures are used both in NoSQL implementations such as MongoDB and HBase and in SQL-oriented implementations such as MySQL and TokuDB.
The slides are available here.
If you are committed to defending your current implementation choices against all comers, don’t bother with the slides.
If you want a peek at one future path in data structures, get the slides. You won’t be disappointed.
Forbes: “Tokutek Makes Big Data Dance” by Lawrence Schwartz.
From the post:
According to the article, “Fractal Tree indexing is helping organizations analyze big data more efficiently due to its ability to improve database efficiency thanks to faster ‘database insertion speed, quicker input/output performance, operational agility, and data compression.’” As a start-up based on “the first algorithm-based breakthrough in the database world in 40 years,” Tokutek is following in the footsteps of firms such as Google and RSA, which also relied on novel algorithm advances as core to their technology.
To read the full article, and to see how Tokutek is helping companies tackle big data, see here.
I would ignore Peter Cohan’s mistakes about the nature of credit card processing. You don’t wait for the “ok” on your account balance.
Remember “What if all transactions required strict global consistency?” by Matthew Aslett of the 451 Group? Eventual consistency works right now.
I would have picked “hot schema” changes as a feature to highlight but that might not play as well with a business audience.
Looking for MongoDB users to test Fractal Tree Indexing by Tim Callaghan.
In my three previous blogs I wrote about our implementation of Fractal Tree Indexes on MongoDB, showing a 10x insertion performance increase, a 268x query performance increase, and a comparison of covered indexes and clustered indexes. The benchmarks show the difference that rich and efficient indexing can make to your MongoDB workload.
It’s one thing for us to benchmark MongoDB + TokuDB and another to measure real world performance. If you are looking for a way to improve the performance or scalability of your MongoDB deployment, we can help and we’d like to hear from you. We have a preview build available for MongoDB v2.2 that you can run with your existing data folder, drop/add Fractal Tree Indexes, and measure the performance differences. Please email me at email@example.com if interested.
Here is your chance to try these speed improvements out on your data!
Tim Callaghan mentions this as coming up but here is the description from the on-demand version of this webinar:
Application performance often depends on how fast a query can respond and query performance almost always depends on good indexing. So one of the quickest and least expensive ways to increase application performance is to optimize the indexes. This talk presents three simple and effective rules on how to construct indexes around queries that result in good performance.
This is a general discussion applicable to all databases using indexes and is not specific to any particular MySQL® storage engine (e.g., InnoDB, TokuDB®, etc.). The rules are explained using a simple model that does NOT rely on understanding B-trees, Fractal Tree® indexing, or any other data structure used to store the data on disk.
Zardosht Kasheff presenting.
From the post:
Last week I wrote about our 10x insertion performance increase with MongoDB. We’ve continued our experimental integration of Fractal Tree® Indexes into MongoDB, adding support for clustered indexes. A clustered index stores all non-index fields as the “value” portion of the index, as opposed to a standard MongoDB index that stores a pointer to the document data. The benefit is that indexed lookups can immediately return any requested values instead of needing to do an additional lookup (and potential disk IOs) for the requested fields.
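The difference between the two index styles is easy to see in miniature. A hedged sketch using plain Python dictionaries as stand-ins for the document store and the indexes; the sample data, the function names, and the one-document-per-key simplification are my own and say nothing about how MongoDB or TokuDB lay things out on disk:

```python
# Stand-in for document storage: location -> full document (hypothetical data).
documents = {
    0: {"user": "ann", "city": "Oslo", "age": 31},
    1: {"user": "bob", "city": "Lima", "age": 45},
}

def build_standard_index(docs, field):
    """Each index entry holds only a pointer (here, the location) to the document."""
    return {doc[field]: loc for loc, doc in docs.items()}

def build_clustered_index(docs, field):
    """Each index entry holds all non-index fields as its value."""
    return {
        doc[field]: {k: v for k, v in doc.items() if k != field}
        for doc in docs.values()
    }

def lookup_standard(index, docs, key, wanted):
    loc = index[key]           # first lookup: the index
    return docs[loc][wanted]   # second lookup: the document (potential disk I/O)

def lookup_clustered(index, key, wanted):
    return index[key][wanted]  # single lookup: the value is already in the index

standard = build_standard_index(documents, "user")
clustered = build_clustered_index(documents, "user")
```

Both lookups return the same answer; the clustered one simply never goes back to `documents`, at the cost of storing the other fields a second time inside the index.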
I’m trying to recover from learning about scalable subgraph matching, Efficient Subgraph Matching on Billion Node Graphs [Parallel Graph Processing], and now the nice folks at Tokutek post a 26,816% query performance increase for MongoDB.
They claim to not be MongoDB experts. I guess that’s right. The increase in performance would have been higher.
Serious question: How long will it take this sort of performance increase to impact the modeling and design of information systems?
And in what way?
With high enough performance, can subject identity be modeled interactively?
It doesn’t get much better or fresher (for non-attendees) than this!
- Dr Jim Webber of Neo Technology starts the day by welcoming everyone to the first of many annual NOSQL eXchanges. View the podcast here…
- Emil Eifrém gives a Keynote talk to the NOSQL eXchange on the past, present and future of NOSQL, and the state of NOSQL today. View the podcast here…
- HANDLING CONFLICTS IN EVENTUALLY CONSISTENT SYSTEMS In this talk, Russell Brown examines how conflicting values are kept to a minimum in Riak and illustrates some techniques for automating semantic reconciliation. There will be practical examples from the Riak Java Client and other places.
- MONGODB + SCALA: CASE CLASSES, DOCUMENTS AND SHARDS FOR A NEW DATA MODEL Brendan McAdams — creator of Casbah, a Scala toolkit for MongoDB — will give a talk on “MongoDB + Scala: Case Classes, Documents and Shards for a New Data Model”
- REAL LIFE CASSANDRA Dave Gardner: In this talk for the NOSQL eXchange, Dave Gardner introduces why you would want to use Cassandra, and focuses on a real-life use case, explaining each Cassandra feature within this context.
- DOCTOR WHO AND NEO4J Ian Robinson: Armed only with a data store packed full of geeky Doctor Who facts, by the end of this session we’ll have you tracking down pieces of memorabilia from a show that, like the graph theory behind Neo4j, is older than Codd’s relational model.
- BUILDING REAL WORLD SOLUTION WITH DOCUMENT STORAGE, SCALA AND LIFT Aleksa Vukotic will look at how his company assessed and adopted CouchDB in order to rapidly and successfully deliver a next generation insurance platform using Scala and Lift.
- ROBERT REES ON POLYGLOT PERSISTENCE Robert Rees: Based on his experiences of mixing CouchDB and Neo4J at Wazoku, an idea management startup, Robert talks about the theory of mixing your stores and the practical experience.
- PARKBENCH DISCUSSION This Park Bench discussion will be chaired by Jim Webber.
- THE FUTURE OF NOSQL AND BIG DATA STORAGE Tom Wilkie: Tom Wilkie takes a whistle-stop tour of developments in NOSQL and Big Data storage, comparing and contrasting new storage engines from Google (LevelDB), RethinkDB, Tokutek and Acunu (Castle).
And yes, I made a separate blog post on Neo4j and Dr. Who. What can I say? I am a fan of both.