Archive for the ‘MySQL’ Category

Deep Feature Synthesis:… [Replacing Human Intuition?, Calling Bull Shit]

Friday, October 30th, 2015

Deep Feature Synthesis: Towards Automating Data Science Endeavors by James Max Kanter and Kalyan Veeramachaneni.


In this paper, we develop the Data Science Machine, which is able to derive predictive models from raw data automatically. To achieve this automation, we first propose and develop the Deep Feature Synthesis algorithm for automatically generating features for relational datasets. The algorithm follows relationships in the data to a base field, and then sequentially applies mathematical functions along that path to create the final feature. Second, we implement a generalizable machine learning pipeline and tune it using a novel Gaussian Copula process based approach. We entered the Data Science Machine in 3 data science competitions that featured 906 other data science teams. Our approach beats 615 teams in these data science competitions. In 2 of the 3 competitions we beat a majority of competitors, and in the third, we achieved 94% of the best competitor’s score. In the best case, with an ongoing competition, we beat 85.6% of the teams and achieved 95.7% of the top submissions score.

The most common phrase I saw in headlines about this paper included some variation on: MIT algorithm replaces human intuition or words to that effect. For example, MIT developing a system that replaces human intuition for big data analysis siliconAngle, An Algorithm May Be Better Than Humans at Breaking Down Big Data Newsweek, Is an MIT algorithm better than human intuition? Christian Science Monitor, and A new AI algorithm can outperform human intuition The World Weekly, just to name a few.

Being the generous sort of reviewer that I am, ;-), I am going to assume that the reporters who wrote about the imperiled status of human intuition either didn’t read the article or were working from a poorly written press release.

The error is not buried in a deeply mathematical or computational part of the paper.

Take a look at the second, fourth and seventh paragraphs of the introduction to see if you can spot the error:

To begin with, we observed that many data science problems, such as the ones released by KAGGLE, and competitions at conferences (KDD cup, IJCAI, ECML) have a few common properties. First, the data is structured and relational, usually presented as a set of tables with relational links. Second, the data captures some aspect of human interactions with a complex system. Third, the presented problem attempts to predict some aspect of human behavior, decisions, or activities (e.g., to predict whether a customer will buy again after a sale [IJCAI], whether a project will get funded by donors [KDD Cup 2014], or even where a taxi rider will choose to go [ECML]). [Second paragraph of introduction]

Transforming raw data into features is often the part of the process that most heavily involves humans, because it is driven by intuition. While recent developments in deep learning and automated processing of images, text, and signals have enabled significant automation in feature engineering for those data types, feature engineering for relational and human behavioral data remains iterative, human-intuition driven, and challenging, and hence, time consuming. At the same time, because the efficacy of a machine learning algorithm relies heavily on the input features [1], any replacement for a human must be able to engineer them acceptably well. [Fourth paragraph of introduction]

With these components in place, we present the Data Science Machine — an automated system for generating predictive models from raw data. It starts with a relational database and automatically generates features to be used for predictive modeling. Most parameters of the system are optimized automatically, in pursuit of good general purpose performance. [Seventh paragraph of introduction]

Have you spotted the problem yet?

In the first paragraph the authors say:

First, the data is structured and relational, usually presented as a set of tables with relational links.

In the fourth paragraph the authors say:

Transforming raw data into features is often the part of the process that most heavily involves humans, because it is driven by intuition.

In the seventh paragraph the authors say:

…an automated system for generating predictive models from raw data. It starts with a relational database and automatically generates features…

That is the first time I have ever heard relational database tables and links called raw data.

Human intuition was baked into the data by the construction of the relational tables and links between them, before the Data Science Machine was ever given the data.

The Data Science Machine is wholly and solely dependent upon the human intuition already baked into the relational database data to work at all.

The researchers say as much in the seventh paragraph, unless you think data spontaneously organizes itself into relational tables. Spontaneous relational tables?

If you doubt that human intuition (decision making) is involved in the creation of relational tables, take a quick look at: A Quick-Start Tutorial on Relational Database Design.

This isn’t to take anything away from Kanter and Veeramachaneni. Their Data Science Machine builds upon human intuition captured in relational databases. That is no mean feat. Human intuition should be captured and used to augment machine learning whenever possible.

That isn’t the same as “replacing” human intuition.

PS: Please forward to any news outlet/reporter who has been repeating false information about “deep feature synthesis.”

I first saw this in a tweet by Kirk Borne.

Amazon Aurora – New Cost-Effective MySQL-Compatible Database Engine for Amazon RDS

Friday, November 14th, 2014

Amazon Aurora – New Cost-Effective MySQL-Compatible Database Engine for Amazon RDS by Jeff Barr.

From the post:

We launched the Amazon Relational Database Service (RDS) service way back in 2009 to help you to set up, operate, and scale a MySQL database in the cloud. Since that time, we have added a multitude of options to RDS including extensive console support, three additional database engines (Oracle, SQL Server, and PostgreSQL), high availability (multiple Availability Zones) and dozens of other features.

We have come a long way in five years, but there’s always room to do better! The database engines that I listed above were designed to function in a constrained and somewhat simplistic hardware environment — a constrained network, a handful of processors, a spinning disk or two, and limited opportunities for parallel processing or a large number of concurrent I/O operations.

The RDS team decided to take a fresh look at the problem and to create a relational database designed for the cloud. Starting from a freshly scrubbed white board, they set as their goal a material improvement in the price-performance ratio and the overall scalability and reliability of existing open source and commercial database engines. They quickly realized that they had a unique opportunity to create an efficient, integrated design that encompassed the storage, network, compute, system software, and database software, purpose-built to handle demanding database workloads. This new design gave them the ability to take advantage of modern, commodity hardware and to eliminate bottlenecks caused by I/O waits and by lock contention between database processes. It turned out that they were able to increase availability while also driving far more throughput than before.

In preview now but you can sign up at the end of Jeff’s post.

Don’t become confused between Apache Aurora (“a service scheduler that runs on top of Mesos”) and Amazon Aurora, the MySQL compatible database from Amazon. (I guess all the good names have been taken for years.)

What am I missing?

Oh, following announcement of open source from Microsoft, Intel, Mapillary (to name the ones I noticed this week), I can’t find any reference to the source code for Amazon Aurora.

Do you think Amazon Aurora is closed source? One of those hiding places for government surveillance/malware? Hopefully not.

Perhaps Jeff just forgot to mention the GitHub respository with the Amazon Aurora source code.

It’s Friday (my location) so let’s see what develops by next Monday, 17 November 2014. If there is no announcement that Amazon Aurora is open source, …, well, at least everyone can factor that into their database choices.

PS: Open source does not mean bug or malware free. Open source means that you have a sporting chance at finding (and correcting) bugs and malware. Non-open source software may have bugs and malware which you will experience but not be able to discover/fix/correct.

multiMiR R package and database:…

Sunday, August 10th, 2014

The multiMiR R package and database: integration of microRNA–target interactions along with their disease and drug associations by Yuanbin Ru, et al. ( Nucl. Acids Res. (2014) doi: 10.1093/nar/gku631)


microRNAs (miRNAs) regulate expression by promoting degradation or repressing translation of target transcripts. miRNA target sites have been catalogued in databases based on experimental validation and computational prediction using various algorithms. Several online resources provide collections of multiple databases but need to be imported into other software, such as R, for processing, tabulation, graphing and computation. Currently available miRNA target site packages in R are limited in the number of databases, types of databases and flexibility. We present multiMiR, a new miRNA–target interaction R package and database, which includes several novel features not available in existing R packages: (i) compilation of nearly 50 million records in human and mouse from 14 different databases, more than any other collection; (ii) expansion of databases to those based on disease annotation and drug microRNAresponse, in addition to many experimental and computational databases; and (iii) user-defined cutoffs for predicted binding strength to provide the most confident selection. Case studies are reported on various biomedical applications including mouse models of alcohol consumption, studies of chronic obstructive pulmonary disease in human subjects, and human cell line models of bladder cancer metastasis. We also demonstrate how multiMiR was used to generate testable hypotheses that were pursued experimentally.

Amazing what you can do with R and a MySQL database!

The authors briefly describe their “cleaning” process for the consolidation of these databases on page 2 but then note on page 4:

For many of the databases, the links are available. However, in Supplementary Table S2 we have listed the databases where links may be broken due to outdated identifiers in those databases. We also listed the databases that do not have the option to search by miR NA-gene pairs.

Perhaps due to editing standards (available for free lance work) I have allergy to terms like “many,” especially when it is possible to enumerate the “many.”

In this particular case, you have to download and consult Supplementary Table S2, which reads:


The explanation for this table reads:

For each database, the columns indicate whether external links are available to include as part of multiMiR, whether those databases use identifiers that are updated and whether the links are based on miRNA-gene pairs. For those database that do not have updated identifiers, some links may be broken. For the other databases, where you can only search by miRNA or gene but not pairs, the links are provided by gene, except for ElMMo which is by miRNA because of its database structure.

Counting I see ten (10) databases with a blank under “Undated Identifiers” or Search by miRNA-gene,” or both.

I guess ten (10) out of fourteen (14) qualifies as “many,” but saying seventy-one percent (71%) of the databases in this study lack either “Updated Identifiers,” “Search by miRNA-gene,” or both, would have been more informative.

Potential records with these issues? EIMMo, version 4 has human (50M) and mouse (15M), MicroCosm / miRBase human (879054), and miRanda (assuming human, Good mirSVR score, Conserved miRNA), 1097069. For the rest you can consult Supplemental Table 1, which lists URLs for the databases and dates of access, but where multiple human options are available, not which one(s) were selected.

The number of records for each database that may have these problems also merits mention in the description of the data.

I can’t comment on the usefulness of this R package for exploring the data but the condition of the data it explores needs more prominent mention.

How to find bugs in MySQL

Sunday, April 20th, 2014

How to find bugs in MySQL by Roel Van de Paar.

From the post:

Finding bugs in MySQL is not only fun, it’s also something I have been doing the last four years of my life.

Whether you want to become the next Shane Bester (who is generally considered the most skilled MySQL bug hunter worldwide), or just want to prove you can outsmart some of the world’s best programmers, finding bugs in MySQL is a skill not reserved anymore to top QA engineers armed with a loads of scripts, expensive flash storage and top-range server hardware. Off course, for professionals that’s still the way to go, but now anyone with an average laptop and a standard HDD can have a lot of fun trying to find that elusive crash…

If you follow this post carefully, you may well be able to find a nice crashing bug (or two) running RQG (an excellent database QA tool). Linux would be the preferred testing OS, but if you are using Windows as your main OS, I would recommend getting Virtual Box and running a Linux guest in a suitably sized (i.e. large) VM. In terms of the acronym “RQG”, this stands for “Random Query Generator,” also named “randgen.”

If you’re not just after finding any bug out there (“bug hunting”), you can tune the RQG grammars (files that define what sort of SQL RQG executes) to more or less match your “issue area.” For example, if you are always running into a situation where the server crashes on a DELETE query (as seen at the end of the mysqld error log for example), you would want an SQL grammar that definitely has a variety of DELETE queries in it. These queries should be closely matched with the actual crashing query – crashes usually happen due to exactly the same, or similar statements with the same clauses, conditions etc.

Just in case you feel a bit old for an Easter egg hunt today, consider going on a MySQL bug hunt.

Curious, do you know of RQG-like suites for noSQL databases?

PS: RQG Documentation (github)


Thursday, March 27th, 2014


From the webpage:

What is WebScaleSQL?

WebScaleSQL is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale, and seek greater performance from a database technology tailored for their needs.

Our goal in launching WebScaleSQL is to enable the scale-oriented members of the MySQL community to work more closely together in order to prioritize the aspects that are most important to us. We aim to create a more integrated system of knowledge-sharing to help companies leverage the great features already found in MySQL 5.6, while building and adding more features that are specific to deployments in large scale environments. In the last few months, engineers from all four companies have contributed code and provided feedback to each other to develop a new, more unified, and more collaborative branch of MySQL.

But as effective as this collaboration has been so far, we know we’re not the only ones who are trying to solve these particular challenges. So we will keep WebScaleSQL open as we go, to encourage others who have the scale and resources to customize MySQL to join in our efforts. And of course we will welcome input from anyone who wants to contribute, regardless of what they are currently working on.

Who is behind WebScaleSQL?

WebScaleSQL currently includes contributions from MySQL engineering teams at Facebook, Google, LinkedIn, and Twitter. Together, we are working to share a common base of code changes to the upstream MySQL branch that we can all use and that will be made available via open source. This collaboration will expand on existing work by the MySQL community, and we will continue to track the upstream branch that is the latest, production-ready release (currently MySQL 5.6).

Correct me if I’m wrong but don’t teams from Facebook, Google, LinkedIn and Twitter know a graph when they see one? 😉

Even people who recognize graphs may need an SQL solution every now and again. Besides, solutions should not drive IT policy.

Requirements and meeting those requirements should drive IT policy. You are less likely to own very popular, expensive and ineffectual solutions when requirements rule. (Even iterative requirements in the agile approach are requirements.)

A reminder that MySQL/WebScaleSQL compiles from source with:

A working ANSI C++ compiler. GCC 4.2.1 or later, Sun Studio 10 or later, Visual Studio 2008 or later, and many current vendor-supplied compilers are known to work. (INSTALL-SOURCE)

Which makes it a target, sorry, subject for analysis of any vulnerabilities with joern.

I first saw this in a post by Derrick Harris, Facebook — with help from Google, LinkedIn, Twitter — releases MySQL built to scale.

Data Preparation for Machine Learning using MySQL

Thursday, October 31st, 2013

Data Preparation for Machine Learning using MySQL

From the post:

Most Machine Learning algorithms require data to be into a single text file in tabular format, with each row representing a full instance of the input dataset and each column one of its features. For example, imagine data in normal form separated in a table for users, another for movies, and another for ratings. You can get it in machine-learning-ready format in this way (i.e., joining by userid and movieid and removing ids and names):

Just in case you aren’t up to the Stinger level of SQL but still need to prepare data for machine learning.

Excellent tutorial on using MySQL for machine learning data preparation.

Database and Query Analysis Tools for MySQL:…

Thursday, October 24th, 2013

Database and Query Analysis Tools for MySQL: Exploiting Hypertree and Hypergraph Decompositions by Selvameenal Chokkalingam.


A database is an organized collection of data. Database systems are widely used and have a broad range of applications. It is thus essential to find efficient database query evaluation techniques. In the recent years, new theories and algorithms for database query optimization have been developed that exploit advanced graph theoretic concepts. In particular, the graph theoretic concepts of hypergraphs, hypergraph decompositions, and hypertree decompositions have played an important role in the recent research.

This thesis studies algorithms that employ hypergraph decompositions in order to detect the cyclic or acyclic degree of database schema, and describes implementations of those algorithms. The main contribution of this thesis is a collection of software tools for MySQL that exploit hypergraph properties associated with database schema and query structures.

If you remember hypergraphs from database theory this may be the refresher for you.

I stumbled across it earlier today while running down references on hypergraphs.

Benchmarking Graph Databases

Wednesday, September 25th, 2013

Benchmarking Graph Databases by Alekh Jindal.

Speaking of data skepticism.

From the post:

Graph data management has recently received a lot of attention, particularly with the explosion of social media and other complex, inter-dependent datasets. As a result, a number of graph data management systems have been proposed. But this brings us to the question: What happens to the good old relational database systems (RDBMSs) in the context of graph data management?

The article names some of the usual graph database suspects.

But for its comparison, it selects only one (Neo4j) and compares it against three relational databases, MySQL, Vertica and VoltDB.

What’s missing? How about expanding to include GraphLab (GraphLab – Next Generation [Johnny Come Lately VCs]) and Giraph (Scaling Apache Giraph to a trillion edges) or some of the other heavy hitters (insert your favorite) in the graph world?

Nothing against Neo4j. It is making rapid progress on a query language and isn’t hard to learn. But it lacks the raw processing power of an application like Apache Giraph. Giraph, after all, is used to process the entire Facebook data set, not a “4k nodes and 88k edges” Facebook sample as in this comparison.

Not to mention that only two algorithms were used in this comparison: PageRank and Shortest Paths.

Personally I can imagine users being interested in running more than two algorithms. But that’s just me.

Every benchmarking project has to start somewhere but this sort of comparison doesn’t really advance the discussion of competing technologies.

Not that any comparison would be complete without a discussion of typical uses cases and user observations on how each candidate did or did not meet their expectations.

Comparing MongoDB, MySQL, and TokuMX Data Layout

Tuesday, July 30th, 2013

Comparing MongoDB, MySQL, and TokuMX Data Layout by Zardosht Kasheff.

From the post:

A lot is said about the differences in the data between MySQL and MongoDB. Things such as “MongoDB is document based”, “MySQL is relational”, “InnoDB has a clustering key”, etc.. Some may wonder how TokuDB, our MySQL storage engine, and TokuMX, our MongoDB product, fit in with these data layouts. I could not find anything describing the differences with a simple google search, so I figured I’d write a post explaining how things compare.

So who are the players here? With MySQL, users are likely familiar with two storage engines: MyISAM, the original default up until MySQL 5.5, and InnoDB, the current default since MySQL 5.5. MongoDB has only one storage engine, and we’ll refer to it as “vanilla Mongo storage”. And of course, there is TokuDB for MySQL, and TokuMX.

First, let’s get some quick terminology out of the way. Documents and collections in MongoDB can be thought of as rows and tables in MySQL, respectively. And while not identical, fields in MongoDB are similar to columns in MySQL. A full SQL to MongoDB mapping can be found here. When I refer to MySQL, what I say applies to TokuDB, InnoDB, and MyISAM. When I say MongoDB, what I say applies to TokuMX and vanilla Mongo storage.

Great contrast of MongoDB and MySQL data formats.

Data formats are essential to understanding the capabilities and limitations of any software package.

Mapping Wikipedia – Update

Thursday, July 18th, 2013

I have spent a good portion of today trying to create an image of the MediaWiki table structure.

While I think the SQLFairy (aka SQL Translator) is going to work quite well, it has rather cryptic error messages.

For instance, if the SQL syntax isn’t supported by its internal parser, the error message references the start of the table.

Which means, of course, that you have to compare statements in the table to the subset of SQL that is supported.

I am rapidly losing my SQL parsing skills as the night wears on so I am stopping with a little over 50% of the MediaWiki schema parsing.

Hopefully will finish correcting the SQL file tomorrow and will post the image of the MediaWiki schema.

Plus notes on what I found to not be recognized in SQLFairy to ease your use of it on other SQL schemas.

Announcing TokuDB v7: Open Source and More

Tuesday, April 23rd, 2013

Announcing TokuDB v7: Open Source and More by Martin Farach-Colton.

From the post:

The free Community Edition is fully functional and fully performant. It has all the compression you’ve come to expect from TokuDB. It has hot schema changes: no-down-time column insertion, deletion, renaming, etc., as well as index creation. It has clustering secondary keys. We are also announcing an Enterprise Edition (coming soon) with additional benefits, such as a support package and advanced backup and recovery tools.

You may have noticed those screaming performance numbers I have cited from TokuDB posts?

Now the origin of those numbers is open source.

Curious, what questions are you going to ask differently or what different questions will you ask as processing power increases?

Or to ask it the other way, what questions have you not asked because of a lack of processing power?

Really Large Queries: Advanced Optimization Techniques, Feb. 27

Tuesday, February 19th, 2013

Percona MySQL Webinar: Really Large Queries: Advanced Optimization Techniques, Feb. 27 by Peter Boros.

From the post:

Do you have a query you never dared to touch?
Do you know it’s bad, but it’s needed?
Does it fit your screen?
Does it really have to be that expensive?
Do you want to do something about it?

During the next Percona webinar on February 27, I will present some techniques that can be useful when troubleshooting such queries. We will go through case studies (each case study is made from multiple real-world cases). In these cases we were often able to reduce query execution time from 10s of seconds to a fraction of a second.

If you have SQL queries in your work flow, this will definitely be of interest.


Friday, February 8th, 2013

Shard-Query: Open Source MPP database engine

From the webpage:

What is Shard-Query

Shard-Query is a high performance MySQL query engine which offers increased parallelism compared to stand-alone MySQL. This increased parallelism is achieved by taking advantage of MySQL partitioning, sharding, common query features, or some combination thereof (see more below).

The primary goal of Shard-Query is to enable low-latency query access to extremely large volumes of data utilizing commodity hardware and open source database software. Shard-Query is a federated query engine which is designed to perform as much work in parallel as possible.

What kind of queries are supported?

  • You can run just about all SQL queries over your dataset:
  • For SELECT queries:
    • All aggregate functions are supported.
      • SUM,COUNT,MIN,MAX and AVG are the fastest aggregate operations
      • SUM/COUNT(DISTINCT ..) are supported, but are slower
      • STD/VAR/etc are supported but aggregation is not pushed down at all (slowest)
      • Custom aggregate functions are now also supported.
        • PERCENTILE(expr, N) – take a percentile, for example percentile(score,90)
  • JOINs are supported (no self joins, or joins of tables sharded by different keys)
  • Also upports INSERT, UPDATE, DELETE
  • Also supports DDL such as CREATE TABLE, ALTER TABLE and DROP TABLE

The numbers on a 24 core server are impressive. Worth a closer look.

I first saw this at Justin Swanhart’s webinar announcement: Building a highly scaleable distributed… [Webinar, MySQL/Shard-Query]

Building a highly scaleable distributed… [Webinar, MySQL/Shard-Query]

Friday, February 8th, 2013

Webinar: Building a highly scaleable distributed row, document or column store with MySQL and Shard-Query by Justin Swanhart.

From the post:

On Friday, February 15, 2013 10:00am Pacific Standard Time, I will be delivering a webinar entitled “Building a highly scaleable distributed row, document or column store with MySQL and Shard-Query”

The first part of this webinar will focus on why distributed databases are needed, and on the techniques employed by Shard-Query to implement a distributed MySQL database. The focus will then proceed to the types of distributed (massively parallel processing) database applications which can be deployed with Shard-Query and the performance aspects of each.

The following types of implementations will be described:

  • Distributed row store using XtraDB cluster
  • Distributed append-only column store using Infobright Community Edition
  • Distributed “document store” using XtraDB cluster and Flexviews

If you are using (or planning on using) MySQL as a topic map backend, this could be the webinar for you!

Oracle’s MySQL 5.6 released

Wednesday, February 6th, 2013

Oracle’s MySQL 5.6 released

From the post:

Just over two years after the release of MySQL 5.5, the developers at Oracle have released a GA (General Availability) version of Oracle MySQL 5.6, labelled MySQL 5.6.10. In MySQL 5.5, the developers replaced the old MyISAM backend and used the transactional InnoDB as the default for database tables. With 5.6, the retrofitting of full-text search capabilities has enabled InnoDB to now take on the position of default storage engine for all purposes.

Accelerating the performance of sub-queries was also a focus of development; they are now run using a process of semi-joins and materialise much faster; this means it should not be necessary to replace subqueries with joins. Many operations that change the data structures, such as ALTER TABLE, are now performed online, which avoids long downtimes. EXPLAIN also gives information about the execution plans of UPDATE, DELETE and INSERT commands. Other optimisations of queries include changes which can eliminate table scans where the query has a small LIMIT value.

MySQL’s row-oriented replication now supports “row image control” which only logs the columns needed to identify and make changes on each row rather than all the columns in the changing row. This could be particularly expensive if the row contained BLOBs, so this change not only saves disk space and other resources but it can also increase performance. “Index Condition Pushdown” is a new optimisation which, when resolving a query, attempts to use indexed fields in the query first, before applying the rest of the WHERE condition.

MySQL 5.6 also introduces a “NoSQL interface” which uses the memcached API to offer applications direct access to the InnoDB storage engine while maintaining compatibility with the relational database engine. That underlying InnoDB engine has also been enhanced with persistent optimisation statistics, multithreaded purging and more system tables and monitoring data available.

Download MySQL 5.6.

I mentioned Oracle earlier today (When Oracle bought MySQL [Humor]) so it’s only fair that I point out their most recent release of MySQL.

Tracking 5.3 Billion Mutations: Using MySQL for Genomic Big Data

Friday, February 1st, 2013

Tracking 5.3 Billion Mutations: Using MySQL for Genomic Big Data by Lawrence Schwartz.

From the post:

The Organization: The The Philip Awadalla Laboratory is the Medical and Population Genomics Laboratory at the University of Montreal. Working with empirical genomic data and modern computational models, the laboratory addresses questions relevant to how genetics and the environment influence the frequency and severity of diseases in human populations. Its research includes work relevant to all types of human diseases: genetic, immunological, infectious, chronic and cancer. Using genomic data from single-nucleotide polymorphisms (SNP), next-generation re-sequencing, and gene expression, along with modern statistical tools, the lab is able to locate genome regions that are associated with disease pathology and virulence as well as study the mechanisms that cause the mutations.

The Challenge: The lab’s genomic research database is following 1400 individuals with 3.7 million shared mutations, which means it is tracking 5.3 billion mutations. Because the representation of genomic sequence is a highly compressible series of letters, the database requires less hardware than a typical one. However, it must be able to store and retrieve data quickly in order to respond to research requests.

Thibault de Malliard, the researcher tasked with managing the lab’s data, adds hundreds of thousands of records every day to the lab’s MySQL database. The database must be able to process the records ASAP so that the researchers can make queries and find information quickly. However, as the database grew to 200 GB, its performance plummeted. de Malliard determined that the database’s MyISAM storage engine was having difficulty keeping up with the fire hose of data, pointing out that a single sequencing batch could take days to run.

Anticipating that the database could grow to 500 GB or even 1 TB within the next year, de Malliard began to search for a storage engine that would maintain performance no matter how large his database got.

Insertion Performance: “For us, TokuDB proved to be over 50x faster to add or update data into big tables,” according to de Malliard. “Adding 1M records took 51 min for MyISAM, but 1 min for TokuDB. So inserting one sequencing batch with 48 samples and 1.5M positions would take 2.5 days for MyISAM but one hour with TokuDB.”

OK, so it’s not “big data.” But it was critical data to the lab.

Maybe instead of “big data” we should be talking about “critical” or even “relevant” data.

Remember the story of the data analyst with “830 million GPS records of 80 million taxi trips” whose analysis confirmed what taxi drivers already knew, they stop driving when it rains. Could have asked a taxi driver or two. Starting Data Analysis with Assumptions

Take a look at TukoDB when you need a “relevant” data solution.

Announcing TokuDB v6.6: Performance Improvements

Wednesday, January 9th, 2013

Announcing TokuDB v6.6: Performance Improvements

From the post:

We are excited to announce TokuDB® v6.6, the latest version of Tokutek’s flagship storage engine for MySQL and MariaDB.

This version offers three types of performance improvements: in-memory, multi-client and fast updates.

Although TokuDB is optimized for large tables, which are larger than memory, many workloads consist of a mix of large and small tables. TokuDB v6.6 offers improvements on in-memory performance, with a more than 100% improvement on Sysbench at many concurrency levels and more than 200% improvement on TPC-C at many concurrency levels. Details to follow.

We have also made improvements in multi-threaded performance. For example, single threaded trickle loads have always been fast in TokuDB. But now multi-threaded trickle loads are even faster. An iibench run with four writers shows an increase from ~18K insertions/sec to ~28K insertions/sec. With a writer and reader running concurrently, we achieve ~13K insertions/sec.

Leif Walsh, one of our engineers, will be posting some details of how this particular improvement was achieved. So stay tuned for this and posts comparing our concurrent iibench performance with InnoDB’s.

A bit late for Christmas but performance improvements on top of already impressive performance are always welcome!

Looking forward to hearing more of the details!

Up to Date on Open Source Analytics

Tuesday, October 23rd, 2012

Up to Date on Open Source Analytics by Steve Miller.

Steve updates his Wintel laptop with the latest releases of open source analytics tools.

Steve’s list:

What’s on your list?

I first saw this mentioned at KDNuggets.

Forbes: “Tokutek Makes Big Data Dance”

Saturday, October 6th, 2012

Forbes: “Tokutek Makes Big Data Dance” by Lawrence Schwartz.

From the post:

Recently, our CEO, John Partridge had a chance to talk about novel database technologies for “Big Data” with Peter Cohan of Forbes.

According to the article, “Fractal Tree indexing is helping organizations analyze big data more efficiently due to its ability to improve database efficiency thanks to faster ‘database insertion speed, quicker input/output performance, operational agility, and data compression.’” As a start-up based on “the first algorithm-based breakthrough in the database world in 40 years,” Toktuetek is following in the footsteps of firms such as Google and RSA, which also relied on novel algortithm advances as core to their technology.

To read the full article, and to see how Tokutek is helping companies tackle big data, see here.

I would ignore Peter Cohan’s mistakes about the nature of credit card processing. You don’t wait for the “ok” on your account balance.

Remember What if all transactions required strict global consistency? by Matthew Aslett of the 451 Group? Eventual consistency works right now.

I would have picked “hot schema” changes as a feature to highlight but that might not play as well with a business audience.

Webinar: Introduction to TokuDB v6.5 (Oct. 10, 2012)

Saturday, October 6th, 2012

Webinar: Introduction to TokuDB v6.5

From the post:

TokuDB® is a proven solution that scales MySQL® and MariaDB® from GBs to TBs with unmatched insert and query speed, compression, replication performance and online schema flexibility. Tokutek’s recently launched TokuDB v6.5 delivers all of these features and more, not just for HDDs, but also for flash memory.

Date: October 10th
Time: 2 PM EST / 11 AM PST

TokuDB v6.5:

  • Stores 10x More Data – TokuDB delivers 10x compression without any performance degradation. Users can therefore take advantage of much greater amounts of available space without paying more for additional storage.
  • Delivers High Insertion Speed – TokuDB Fractal Tree® indexes continue to change the game with huge insertion rates and greater scalability. Our latest release delivers an order of magnitude faster insertion performance than the competition, ideal for applications that must simultaneously query and update large volumes of rapidly arriving data (e.g., clickstream analytics).
  • Allows Hot Schema Changes — Hot column addition/deletion/rename/resize provides the ability to add/drop/change a column to a database without taking the database offline, enabling database administrators to redefine or add new fields with no downtime.
  • Extends Wear Life for Flash– TokuDB’s proprietary Fractal Tree indexing writes fewer, larger blocks which reduces overall wear, and more efficiently utilizes the FTL (Flash Translation Layer). This extends the life of flash memory by an order of magnitude for many applications.

This webinar covers TokuDB features, latest performance results, and typical use cases.

You have seen the posts about fractal indexing! Now see the demos!

MySQL Schema Agility on SSDs

Saturday, September 29th, 2012

MySQL Schema Agility on SSDs by Tim Callaghan.

From the post:

TokuDB v6.5 adds the ability to expand certain column types without downtime. Users can now enlarge char, varchar, varbinary, and integer columns with no interruption to insert/update/delete statements on the altered table. Prior to this feature, enlarging one of these column types required a full table rebuild. InnoDB blocks all insert/update/delete operations to a table during column expansion as it rebuilds the table and all indexes.

Not sure how often you will need the ability to enlarge columns types without downtime but when you do, suspect it is mission critical.

Something to keep in mind while planning for uncertain data futures.

Three Ways that Fractal Tree Indexes Improve SSD for MySQL

Friday, September 28th, 2012

Three Ways that Fractal Tree Indexes Improve SSD for MySQL

The three advantages:

  • Advantage 1: Index maintenence performance.
  • Advantage 2: Compression.
  • Advantage 3: Reduced wear.

See the post for details and the impressive numbers one expects from Fractal tree indexes.

Announcing TokuDB v6.5: Optimized for Flash [Disambiguation]

Tuesday, September 25th, 2012

Announcing TokuDB v6.5: Optimized for Flash

Semantic confusion follows me around. Like the harpies that tormented Phineus. Well, maybe not quite that bad. 😉

But I see in the news feed that TukoDB v6.5 has been optimized for Flash.

First thought: Why? Who would want a database optimized for Flash?

But they did not mean Flash, or one of the other seventy-five (75) meanings of Flash, but Flash.

I’m glad we had this conversation and cleared that up!

The “Flash” in this case refers to “flash memory.” And so this is an exciting announcement:

We are excited to announce TokuDB® v6.5, the latest version of Tokutek’s flagship storage engine for MySQL and MariaDB.

This version offers optimization for Flash as well as more hot schema change operations for improved agility.

We’ll be posting more details about the new features and performance, so here’s an overview of what’s in store.

TokuDB v6.5 continues the great Toku-tradition of fast insertions. On flash drives, we show an order-of-magnitude (9x) faster insertion rate than InnoDB. TokuDB’s standard compression works just as well on flash and helps you get the most out of your storage system. And TokuDB reduces wear on solid-state drives by more than an order of magnitude. The full technical details will be subject of a future blog post. In summary though, when TokuDB writes to disk, it updates many rows, whereas InnoDB may write a leaf to disk with a single modified row, in some circumstances. More changes per write means fewer writes, which makes the flash drive wear out much more slowly.

More Hot Schema Changes
TokuDB already has hot column addition, deletion and renaming. In this release we add hot column expansion, so you can change the size of the integers in a column or the number of characters in a field. These operations incurs no down time and the changes are immediately available on the table. In this release, we have also extended hot schema changes to partitioned tables.

Every disambiguation page at, in every language, is testimony to a small part of the need for semantic disambiguation.

Did you know that as of today, there are 218,765 disambiguation pages in Wikipedia? Disambiguation Pages.

How many disambiguations could you use for an index at work, that don’t appear in Wikipedia?

You can stop at ten (10). Point made.


Friday, September 14th, 2012

JMyETL, an easy to use ETL tool that supports 10 different RDBMS by Esen Sagynov.

From the post:

JMyETL is a very useful and simple Java based application for Windows OS which allows users to import and export data from/to various database systems. For example:

  • CUBRID –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, SQL Server, DB2, Access, SQLite
  • MySQL –> Sybase ASE/ASA, Oracle, Access, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • Sybase ASE –> Sybase ASA, MySQL, Oracle, Access, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • Sybase ASA –> Sybase ASE, MySQL, Oracle, Access, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • Oracle –> Sybase ASA, Sybase ASE, MySQL, Access, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • Access –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • PostgreSQL –> Sybase ASE, Sybase ASA, MySQL, Oracle, Access, SQL Server, DB2, SQLite, CUBRID
  • SQL Server –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, Access, DB2, SQLite, CUBRID
  • DB2 –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, SQL Server, Access, SQLite, CUBRID
  • SQLite –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, SQL Server, DB2, Access, CUBRID

Just in case you need a database to database ETL utility.

I first saw this at DZone.

Small Data (200 MB up to 10 GB) [MySQL, MapReduce and Hive by the Numbers]

Monday, September 3rd, 2012

Study Stacks MySQL, MapReduce and Hive

From the post:

Many small and medium sized businesses would like to get in on the big data game but do not have the resources to implement parallel database management systems. That being the case, which relational database management system would provide small businesses the highest performance?

This question was asked and answered by Marissa Hollingsworth of Boise State University in a graduate case study that compared the performance rates of MySQL, Hadoop MapReduce, and Hive at scales no larger than nine gigabytes.

Hollingsworth also used only relational data, such as payment information, which stands to reason since anything more would require a parallel system. “This experiment,” said Hollingsworth “involved a payment history analysis which considers customer, account, and transaction data for predictive analytics.”

The case study, the full text of which can be found here, concluded that MapReduce would beat out MySQL and Hive for datasets larger than one gigabyte. As Hollingsworth wrote, “The results show that the single server MySQL solution performs best for trial sizes ranging from 200MB to 1GB, but does not scale well beyond that. MapReduce outperforms MySQL on data sets larger than 1GB and Hive outperforms MySQL on sets larger than 2GB.”

Although your friends may not admit it, some of them have small data. Or interact with clients with small data.

You print this post out and put it in their inbox. Anonymously. They will appreciate it even if they can’t acknowledge having seen it.

When thinking about data and data storage, you might want to keep the comparisons you will find at: How much is 1 byte, kilobyte, megabyte, gigabyte, etc.? in mind.

Roughly speaking, 1 GB is the equivalent of 4,473 books.

The 10 GB limit in this study is roughly 44,730 books.

Sometimes all you need is small data.

Where to get a BZR tree of the latest MySQL releases

Friday, August 17th, 2012

Where to get a BZR tree of the latest MySQL releases by Stewart Smith.

Sometimes, being difficult can develop into its own reward. Not always appreciated when it arrives but a reward none the less.

Announcing Percona Server 5.6 Alpha

Wednesday, August 15th, 2012

Announcing Percona Server 5.6 Alpha by Stewart Smith

From the post:

We are very happy to announce our first alpha of Percona Server 5.6. Based on MySQL 5.6.5 and all the improvements contained within, this is the first step towards a full Percona Server 5.6 release.

Binaries are available to download from our downloads site here:

We will post binaries to our EXPERIMENTAL repositories later, we’re undergoing final testing to ensure that it won’t cause problems for those running Percona Server < 5.6 from EXPERIMENTAL. Percona Server 5.6.5-alpha60.0 does not contain all the features of Percona Server 5.5. We are going to “release early, release often” as we add features from Percona Server 5.5. As such, our documentation will not be complete for a little while yet and these release notes are currently the best source of information – please bear with us.

Go ahead, take a walk on the wild side! 😉

Using MySQL Full-Text Search in Entity Framework

Wednesday, July 25th, 2012

Using MySQL Full-Text Search in Entity Framework

Another database/text search post not for the faint of heart.

MySQL database supports an advanced functionality of full-text search (FTS) and full-text indexing described comprehensively in the documentation:

We decided to meet the needs of our users willing to take advantage of the full-text search in Entity Framework and implemented the full-text search functionality in our Devart dotConnect for MySQL ADO.NET Entity Framework provider.

Hard to say why Beyond Search picked up the Oracle post but left the MySQL one hanging.

I haven’t gone out and counted noses but I suspect there are a lot more installs of MySQL than Oracle 11g. Just my guess. Don’t buy or sell stock based on my guesses.

MicrobeDB: a locally maintainable database of microbial genomic sequences

Sunday, July 8th, 2012

MicrobeDB: a locally maintainable database of microbial genomic sequences by Morgan G. I. Langille, Matthew R. Laird, William W. L. Hsiao, Terry A. Chiu, Jonathan A. Eisen, and Fiona S. L. Brinkman. (Bioinformatics (2012) 28 (14): 1947-1948. doi: 10.1093/bioinformatics/bts273)


Summary: Analysis of microbial genomes often requires the general organization and comparison of tens to thousands of genomes both from public repositories and unpublished sources. MicrobeDB provides a foundation for such projects by the automation of downloading published, completed bacterial and archaeal genomes from key sources, parsing annotations of all genomes (both public and private) into a local database, and allowing interaction with the database through an easy to use programming interface. MicrobeDB creates a simple to use, easy to maintain, centralized local resource for various large-scale comparative genomic analyses and a back-end for future microbial application design.

Availability: MicrobeDB is freely available under the GNU-GPL at:

No doubt a useful project but the article seems to be at war with itself:

Although many of these centers provide genomic data in a variety of static formats such as Genbank and Fasta, these are often inadequate for complex queries. To carry out these analyses efficiently, a relational database such as MySQL ( can be used to allow rapid querying across many genomes at once. Some existing data providers such as CMR allow downloading of their database files directly, but these databases are designed for large web-based infrastructures and contain numerous tables that demand a steep learning curve. Also, addition of unpublished genomes to these databases is often not supported. A well known and widely used system is the Generic Model Organism Database (GMOD) project ( GMOD is an open-source project that provides a common platform for building model organism databases such as FlyBase (McQuilton et al., 2011) and WormBase (Yook et al., 2011). GMOD supports a variety of options such as GBrowse (Stein et al., 2002) and a variety of database choices including Chado (Mungall and Emmert, 2007) and BioSQL ( GMOD provides a comprehensive system, but for many researchers such a complex system is not needed.

On one hand, current solutions are “…often inadequate for complex queries” and just a few lines later, “…such a complex system is not needed.”

I have no doubt that using unfamiliar and complex table structures is a burden on any user. Not to mention lacking the ability to add “unpublished genomes” or fixing versions of data for analysis.

What concerns me is the “solution” being seen as yet another set of “local” options. Which impedes the future use of the now “localized” data.

The issue raised here need to be addressed but one-off solutions seem like a particularly poor choice.

Jetpants: a toolkit for huge MySQL topologies [“action” on MySQL]

Friday, June 8th, 2012

Jetpants: a toolkit for huge MySQL topologies

From the webpage:

Tumblr is one of the largest users of MySQL on the web. At present, our data set consists of over 60 billion relational rows, adding up to 21 terabytes of unique relational data. Managing over 200 dedicated database servers can be a bit of a handful, so naturally we engineered some creative solutions to help automate our common processes.

Today, we’re happy to announce the open source release of Jetpants, Tumblr’s in-house toolchain for managing huge MySQL database topologies. Jetpants offers a command suite for easily cloning replicas, rebalancing shards, and performing master promotions. It’s also a full Ruby library for use in developing custom billion-row migration scripts, automating database manipulations, and copying huge files quickly to multiple remote destinations.

Dynamically resizable range-based sharding allows you to scale MySQL horizontally in a robust manner, without any need for a central lookup service or massive pre-allocation of tiny shards. Jetpants supports this range-based model by providing a fast way to split shards that are approaching capacity or I/O limitations. On our hardware, we can split a 750GB, billion-row pool in half in under six hours.

Jetpants can be obtained via GitHub or RubyGems.

Interested in this type of work? We’re hiring!

I am reminded of the line from The Blues Brothers film when Ray of Ray’s Music Exchange (played by Ray Charles) tells Jake and Elwood, “E-excuse me, uh, I don’t think there’s anything wrong with the action on this piano.”

Doesn’t look like there is anything wrong with the “action” on MySQL. 😉

Certainly worth taking a look.

I first saw this Alex Popescu’s myNoSQL.