Archive for the ‘SQL’ Category

Streaming SQL for Apache Kafka

Wednesday, December 27th, 2017

Streaming SQL for Apache Kafka by Jojjat Jafarpour.

From the post:

We are very excited to announce the December release of KSQL, the streaming SQL engine for Apache Kafka! As we announced in the November release blog, we are releasing KSQL on a monthly basis to make it even easier for you to get up and running with the latest and greatest functionality of KSQL to solve your own business problems.

The December release, KSQL 0.3, includes both new features that have been requested by our community as well as under-the-hood improvements for better robustness and resource utilization. If you have already been using KSQL, we encourage you to upgrade to this latest version to take advantage of the new functionality and improvements.

The KSQL Github page links to:

  • KSQL Quick Start: Demonstrates a simple workflow using KSQL to write streaming queries against data in Kafka.
  • Clickstream Analysis Demo: Shows how to build an application that performs real-time user analytics.

These are just quick start materials but are your ETL projects ever as simple as USERID to USERID? Or have such semantically transparent fields? Or what Itake to be semantically transparent fields (they may not be).

As I pointed out in Where Do We Write Down Subject Identifications? earlier today, where do I record what I know about what appears in those fields? Including on what basis to merge them with other data?

If you see where KSQL is offering that ability, please ping me because I’m missing it entirely. Thanks!

Datasette: instantly create and publish an API for your SQLite databases

Tuesday, November 14th, 2017

Datasette: instantly create and publish an API for your SQLite databases by Simon Willison.

From the webpage:

I just shipped the first public version of datasette, a new tool for creating and publishing JSON APIs for SQLite databases.

You can try out out right now at, where you can explore SQLite databases I built from Creative Commons licensed CSV files published by FiveThirtyEight. Or you can check out, derived from the database of world political parties which illustrates some advanced features such as SQLite views.

That sounds really great but then I read:

Or you can try it out on your own machine. If you run OS X and use Google Chrome, try running the following:

pip3 install datasette
datasette ~/Library/Application\ Support/Google/Chrome/Default/History

This will start a web server on displaying an interface that will let you browse your Chrome browser history, which is conveniently stored in a SQLite database.

Warning – Warning:: Don’t have datasette on your laptop at a conference. Yes?

Other than the caution about your own security, this looks very cool!


Cloudera Introduces Topic Maps Extra-Lite

Wednesday, May 10th, 2017

New in Cloudera Enterprise 5.11: Hue Data Search and Tagging by Romain Rigaux.

From the post:

Have you ever struggled to remember table names related to your project? Does it take much too long to find those columns or views? Hue now lets you easily search for any table, view, or column across all databases in the cluster. With the ability to search across tens of thousands of tables, you’re able to quickly find the tables that are relevant for your needs for faster data discovery.

In addition, you can also now tag objects with names to better categorize them and group them to different projects. These tags are searchable, expediting the exploration process through easier, more intuitive discovery.

Through an integration with Cloudera Navigator, existing tags and indexed objects show up automatically in Hue, any additional tags you add appear back in Cloudera Navigator, and the familiar Cloudera Navigator search syntax is supported.
… (emphasis in original)

Seventeen (17) years ago, ISO/IEC 13250:2000 offered users the ability to have additional names for tables, columns and/or any other subject of interest.

Additional names that could have scope (think range of application, such as a language), that could exist in relationships to their creators/users, exposing as much or as little information to a particular user as desired.

For commonplace needs, perhaps tagging objects with names, displayed as simple string is sufficient.

But if viewed from a topic maps perspective, that string display to one user could in fact represent that string, along with who created it, what names it is used with, who uses similar names, just to name a few of the possibilities.

All of which makes me think topic maps should ask users:

  • What subjects do you need to talk about?
  • How do you want to identify those subjects?
  • What do you want to say about those subjects?
  • Do you need to talk about associations/relationships?

It could be, that for day to day users, a string tag/name is sufficient. That doesn’t mean that greater semantics don’t lurk just below the surface. Perhaps even on demand.

Google Spanner Inspires CockroachDB To Outrun It

Thursday, February 23rd, 2017

Google Spanner Inspires CockroachDB To Outrun It by Timothy Prickett Morgan.

From the post:

There is an old joke that in the post-apocalyptic world that comes about because of plague or nuclear war, only two things will be left alive: cockroaches and Keith Richards, the guitarist for the Rolling Stones. As it hails from New York City, you can understand why Cockroach Labs, the upstart software company that is cloning Google’s Spanner distributed relational database, chose that particular bug to epitomize a system that will stay alive no matter what. But, they could have just as easily called it RichardsDB.

When discussing Google’s cloud implementation of Spanner, which launched in beta earlier this week, we promised that we would have a sit down with the people behind the CockroachDB effort, who not coincidentally all hail from the search engine giant and who all worked on key parts of the software infrastructure that sets Google apart. We think that CockroachDB has the potential to shake up the database market, and not just because the people behind it understand deeply what Google did but more importantly because they have come to understand the very different approach that is necessary to commercialize the ideas that are embodied in Spanner and that are being coded into a free and open source CockroachDB.

Cockroach Labs has the potential to be Yahoo’s Hadoop and HDFS to Google’s MapReduce and Google File System, but in a database market that is ripe for change and among enterprise customers who want simplicity from complex systems and they want them to scale instantly, easily, and geographically.

“The backbreaking work is making CockroachDB deployable,” Kimball explains. “In fact, CockroachDB is more deployable than any other distributed database out there. It is an incredibly simple model in that every single node – we call them roach nodes – is symmetric and self-contained with a single binary. There are no external dependencies, so you don’t need Zookeeper running, you don’t need etcd or a distributed or shared file system underneath like Oracle needs with its Real Application Clusters. It is simple as taking one statically linked binary with no dependencies and putting that out on some VM somewhere and pointing it at any other node in a CockroachDB cluster and it joins up and they self organize. So making that work with a single binary was a huge challenge and a big departure from the way Google chose to build Spanner.”

Another big difference is support for SQL, and Kimball had plenty to say about this.

“This is a really interesting point,” he says excitedly. “We have had to build CockroachDB to be deployable as open source in a way that doesn’t need TrueTime and that meets all of the needs of these companies. With open source and there being such a low barrier to try it out and kick the tires, you really need to have a very comprehensive, compatible interface with the outside world. You can’t really cut corners. So we decided to make CockroachDB look like Postgres because it is a better thought out database than MySQL. The nice thing about making it look like Postgres is that all of the applications and languages have drivers for it.”

There are other capabilities built into CockroachDB but being deployable (compared to other distributed databases) and looking like Postgres are keys to a large marketplace share.

The gap between software available to governments/industry and individuals/small groups continues to shrink.

Only you can answer the question of your skill versus theirs.

Fast and Flexible Query Analysis at MapD with Apache Calcite [Merging Data?]

Thursday, February 9th, 2017

Fast and Flexible Query Analysis at MapD with Apache Calcite by Alex Şuhan.

From the post:

After evaluating a few other options, we decided for Apache Calcite, an incubation stage project at the time. It takes SQL queries and generates extended relational algebra, using a highly configurable cost-based optimizer. Several projects use Calcite already for SQL parsing and query optimization.

One of the main strengths of Calcite is its highly modular structure, which allows for multiple integration points and creative uses. It offers a relational algebra builder, which makes moving to a different SQL parser (or adding a non-SQL frontend) feasible.

In our product, we need runtime functions which are not recognized by Calcite by default. For example, trigonometric functions are necessary for on-the-fly geo projections used for point map rendering. Fortunately, Calcite allows specifying such functions and they become first-class citizens, with proper type checking in place.

Calcite also includes a highly capable and flexible cost-based optimizer, which can apply high-level transformations to the relational algebra based on query patterns and statistics. For example, it can push part of a filter through a join in order to reduce the size of the input, like the following figure shows:


You can find this example and more about the cost-based optimizer in Calcite in this presentation on using it in the Apache Phoenix project. Such optimizations complement the low-level optimizations we do ourselves to achieve great speed improvements.

Relational algebra example
Let’s take a simple query: SELECT A.x, COUNT(*) FROM test JOIN B ON A.x = B.x WHERE A.y > 41 GROUP BY A.x; and analyze the relational algebra generated for it.

In Calcite relational algebra, there are a few main node types, corresponding to the theoretical extended relational algebra model: Scan, Filter, Project, Aggregate and Join. Each type of node, except Scan, has one or more (in the case of Join) inputs and its output can become the input of another node. The graph of nodes connected by data flow relationships is a
directed acyclic graph (abbreviated as “DAG”). For our query, Calcite outputs the following DAG:


The Scan nodes have no inputs and output all the rows and the columns in tables A and B, respectively. The Join node specifies the join condition (in our case A.x = B.x) and its output contains the columns in A and B concatenated. The Filter node only allows the rows which pass the specified condition and its output preserves all columns of input. The Project node only preserves the specified expressions as columns in the output. Finally, the Aggregate specifies the group by expressions and aggregates.

The physical implementation of the nodes is up to the system using Calcite as a frontend. Nothing in the Join node mandates a certain implementation of the join operation (equijoin in our case). Indeed, using a condition which can’t be implemented as a hash join, like A.x < B.x, would only be reflected by the condition in the Filter node.

You’re not MapD today but that’s no excuse for poor query performance.

Besides, learning Apache Calcite will increase your attractiveness as data and queries on it become more complex.

I haven’t read all the documentation but the “metadata” in Apache Calcite is as flat as any you will find.

Which means integration of different data sources is either luck of the draw or you asked someone the “meaning” of the metadata.

The tutorial has this example:


The column header “GENDER” for example appears to presume the common male/female distinction. But without further exploration of the data set, there could be other genders encoded in that field as well.

If “GENDER” seems too easy, what would you say about “NAME,” bearing in mind that Japanese family names are written first and given names written second. How would those appear under “NAME?”

Apologies! My screen shot missed field “S.”

I have utterly no idea what “S” may or may not represent as a field header. Do you?

If the obviousness of field headers fails with “GENDER” and “NAME,” what do you suspect will happen with less “obvious” field headers?

How successful will merging of data be?

Where would you add subject identity information and how would you associate it with data processed by Apache Calcite?

RDBL – manipulate data in-database with R code only

Monday, August 29th, 2016

RDBL – manipulate data in-database with R code only by David Kun.

From the post:

In this post I introduce our own package RDBL, the R DataBase Layer. With this package you can manipulate data in-database without writing SQL code. The package interprets the R code and sends out the corresponding SQL statements to the database, fully transparently. To minimize overhead, the data is only fetched when absolutely necessary, allowing the user to create the relevant joins (merge), filters (logical indexing) and groupings (aggregation) in R code before the SQL is run on the database. The core idea behind RDBL is to let R users with little or no SQL knowledge to utilize the power of SQL database engines for data manipulation.

It is important to note that the SQL statements generated in the background are not executed unless explicitly requested by the command Hence, you can merge, filter and aggregate your dataset on the database side and load only the result set into memory for R.

In general the design principle behind RDBL is to keep the models as close as possible to the usual data.frame logic, including (as shown later in detail) commands like aggregate, referencing columns by the \($\) operator and features like logical indexing using the \([]\) operator.

RDBL supports a connection to any SQL-like data source which supports a DBI interface or an ODBC connection, including but not limited to Oracle, MySQL, SQLite, SQL Server, MS Access and more.

Not as much fun as surfing mall wifi for logins/passwords, but it is something you can use at work.

The best feature is that you load resulting data sets only. RDBL uses databases for what they do well. Odd but efficient practices do happen from time to time.

I first saw this in a tweet by Christophe Lalanne.

sqlite3 test suite

Sunday, March 20th, 2016

sqlite3 test suite by Nelson Minar.

From the post:

I felt guilty complaining about sqlite3’s source distribution, so I went to look at the real source, what the authors work with. It’s not managed by git but rather in Fossil (an SCM written by the sqlite3 author). Happily the web view is quite good.

One of the miraculous things about sqlite3 is its incredible test suite. There are 683,932 lines of test code. Compare to 273,000 lines of C code for the library and all its extensions. sqlite3 has a reputation for being solid and correct. It’s not an accident.

The test size is overcounted a bit because there’s a lot of test data. For instance the test for the Porter Stemmer is 24k lines of code, but almost all of that is a giant list of words and their correct stemming. Still very useful tests! But not quite as much human effort as it looks on first blush.

Just a quick reminder that test suites have the same mixture of code and data subjects as the code being tested.

So your software passes the test. What was being tested? What was not (the weird machines input) being tested?

If you don’t think that is a serious question, consult the page of SQLite vulnerabilities.

I saw this in a tweet by Julia Evans.

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Thursday, January 7th, 2016

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Let’s reverse the order of the announcement, to be in reader-friendly order:


Press kit

Release Notes

What’s New in 9.5

Edit: I moved my comments above the fold as it were:

Just so you know, PostgreSQL 9.5 documentation, XMLEXISTS says:

Also note that the SQL standard specifies the xmlexists construct to take an XQuery expression as first argument, but PostgreSQL currently only supports XPath, which is a subset of XQuery.

Apologies, you will have to scroll for the subsection, there was no anchor at

If you are looking to make a major contribution to PostgreSQL, note that XQuery is on the todo list.

Now for all the stuff that you will skip reading anyway. 😉

(I would save the prose for use in reports to management about using or transitioning to PostgreSQL 9.5.)

7 JANUARY 2016: The PostgreSQL Global Development Group announces the release of PostgreSQL 9.5. This release adds UPSERT capability, Row Level Security, and multiple Big Data features, which will broaden the user base for the world’s most advanced database. With these new capabilities, PostgreSQL will be the best choice for even more applications for startups, large corporations, and government agencies.

Annie Prévot, CIO of the CNAF, the French Child Benefits Office, said, “The CNAF is providing services for 11 million persons and distributing 73 billion Euros every year, through 26 types of social benefit schemes. This service is essential to the population and it relies on an information system that must be absolutely efficient and reliable. The CNAF’s information system is satisfyingly based on the PostgreSQL database management system.”


A most-requested feature by application developers for several years, “UPSERT” is shorthand for “INSERT, ON CONFLICT UPDATE”, allowing new and updated rows to be treated the same. UPSERT simplifies web and mobile application development by enabling the database to handle conflicts between concurrent data changes. This feature also removes the last significant barrier to migrating legacy MySQL applications to PostgreSQL.

Developed over the last two years by Heroku programmer Peter Geoghegan, PostgreSQL’s implementation of UPSERT is significantly more flexible and powerful than those offered by other relational databases. The new ON CONFLICT clause permits ignoring the new data, or updating different columns or relations in ways which will support complex ETL (Extract, Transform, Load) toolchains for bulk data loading. And, like all of PostgreSQL, it is designed to be absolutely concurrency-safe and to integrate with all other PostgreSQL features, including Logical Replication.

Row Level Security

PostgreSQL continues to expand database security capabilities with its new Row Level Security (RLS) feature. RLS implements true per-row and per-column data access control which integrates with external label-based security stacks such as SE Linux. PostgreSQL is already known as “the most secure by default.” RLS cements its position as the best choice for applications with strong data security requirements, such as compliance with PCI, the European Data Protection Directive, and healthcare data protection standards.

RLS is the culmination of five years of security features added to PostgreSQL, including extensive work by KaiGai Kohei of NEC, Stephen Frost of Crunchy Data, and Dean Rasheed. Through it, database administrators can set security “policies” which filter which rows particular users are allowed to update or view. Data security implemented this way is resistant to SQL injection exploits and other application-level security holes.

Big Data Features

PostgreSQL 9.5 includes multiple new features for bigger databases, and for integrating with other Big Data systems. These features ensure that PostgreSQL continues to have a strong role in the rapidly growing open source Big Data marketplace. Among them are:

BRIN Indexing: This new type of index supports creating tiny, but effective indexes for very large, “naturally ordered” tables. For example, tables containing logging data with billions of rows could be indexed and searched in 5% of the time required by standard BTree indexes.

Faster Sorts: PostgreSQL now sorts text and NUMERIC data faster, using an algorithm called “abbreviated keys”. This makes some queries which need to sort large amounts of data 2X to 12X faster, and can speed up index creation by 20X.

CUBE, ROLLUP and GROUPING SETS: These new standard SQL clauses let users produce reports with multiple levels of summarization in one query instead of requiring several. CUBE will also enable tightly integrating PostgreSQL with more Online Analytic Processing (OLAP) reporting tools such as Tableau.

Foreign Data Wrappers (FDWs): These already allow using PostgreSQL as a query engine for other Big Data systems such as Hadoop and Cassandra. Version 9.5 adds IMPORT FOREIGN SCHEMA and JOIN pushdown making query connections to external databases both easier to set up and more efficient.

TABLESAMPLE: This SQL clause allows grabbing a quick statistical sample of huge tables, without the need for expensive sorting.

“The new BRIN index in PostgreSQL 9.5 is a powerful new feature which enables PostgreSQL to manage and index volumes of data that were impractical or impossible in the past. It allows scalability of data and performance beyond what was considered previously attainable with traditional relational databases and makes PostgreSQL a perfect solution for Big Data analytics,” said Boyan Botev, Lead Database Administrator, Premier, Inc.

The History of SQL Injection…

Friday, November 20th, 2015

The History of SQL Injection, the Hack That Will Never Go Away by Joseph Cox.

From the post:

One of the hackers suspected of being behind the TalkTalk breach, which led to the personal details of at least 150,000 people being stolen, used a vulnerability discovered two years before he was even born.

That method of attack was SQL injection (SQLi), where hackers typically enter malicious commands into forms on a website to make it churn out juicy bits of data. It’s been used to steal the personal details of World Health Organization employees, grab data from the Wall Street Journal, and hit the sites of US federal agencies.

“It’s the most easy way to hack,” the pseudonymous hacker w0rm, who was responsible for the Wall Street Journal hack, told Motherboard. The attack took only a “few hours.”

But, for all its simplicity, as well as its effectiveness at siphoning the digital innards of corporations and governments alike, SQLi is relatively easy to defend against.

So why, in 2015, is SQLi still leading to some of the biggest breaches around?

SQLi was possibly first documented by Jeff Forristal in the hacker zine Phrack. Back then, Forristal went by the handle rain.forest.puppy, but he’s now CTO of mobile security at cybersecurity vendor Bluebox security.

Joseph’s history is another data point for the proposition:

To a vendor, your security falls under “…not my problem.

Spreadsheets are graphs too!

Wednesday, August 26th, 2015

Spreadsheets are graphs too! by Felienne Hermans.

Presentation with transcript.

Felienne starts with a great spreadsheet story:

When I was in grad school, I worked with an investment bank doing spreadsheet research. On my first day, I went to the head of the Excel team.

I said, ‘Hello, can I have a list of all your spreadsheets?’

There was no such thing.

‘We don’t have a list of all the spreadsheets,’ he said. ‘You could ask Frank in Accounting or maybe Harry over at Finance. He’s always talking about spreadsheets. I don’t really know, but I think we might have 10,000 spreadsheets.’

10,000 spreadsheets was a gold mine of research, so I went to the IT department and conducted my first spreadsheet scan with root access in Windows Explorer.

Within one second, it had already found 10,000 spreadsheets. Within an hour, it was still finding more, with over one million Excel files located. Eventually, we found 2.5 million spreadsheets.

In short, spreadsheets run the world.

She continues to outline spreadsheet horror stories and then demonstrates how complex relationships between cells can be captured by Neo4j.

Which are much easier to query with Cypher than SQL!

While I applaud:

I realized that spreadsheet information is actually very graphy. All the cells are connected to references to each other and they happen to be in a worksheet or on the spreadsheet, but that’s not really what matters. What matters is the connections.

I would be more concerned with the identity of the subjects between which connections have been made.

Think of it as documenting the column headers from a five year old spreadsheet, that you are now using by rote.

Knowing the connections between cells is a big step forward. Knowing what the cells are supposed to represent is an even bigger one.

q – Text as Data

Tuesday, April 7th, 2015

q – Text as Data by Harel Ben-Attia.

From the webpage:

q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).

q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOINs etc. It supports automatic column name and column type detection, and provides full support for multiple encodings.

q’s web site is It contains everything you need to download and use q in no time.

I’m not looking for an alternative to awk or sed for CSV/TSV files but you may be.

From the examples I suspect it would be “easier” in some sense of the word to teach than either awk or sed.

Give it a try and let me know what you think.

I first saw this in a tweet by Scott Chamberlain.


Wednesday, March 11th, 2015


From the webpage:

Convert text from a file or from stdin into SQL table and query it instantly. Uses sqlite as backend. The idea is to make SQL into a tool on the command line or in scripts.

Online manual:

So what can it do?

  • convert text/CSV files into sqlite database/table
  • work on stdin data on-the-fly
  • it can be used as swiss army knife kind of tool for extracting information from other processes that send their information to termsql via a pipe on the command line or in scripts
  • termsql can also pipe into another termsql of course
  • you can quickly sort and extract data
  • creates string/integer/float column types automatically
  • gives you the syntax and power of SQL on the command line

Sometimes you need the esoteric and sometimes not!


I first saw this in a tweet by Christophe Lalanne.

Airbnb open sources SQL tool built on Facebook’s Presto database

Friday, March 6th, 2015

Airbnb open sources SQL tool built on Facebook’s Presto database by Derrick Harris.

From the post:

Apartment-sharing startup Airbnb has open sourced a tool called Airpal that the company built to give more of its employees access to the data they need for their jobs. Airpal is built atop the Presto SQL engine that Facebook created in order to speed access to data stored in Hadoop.

Airbnb built Airpal about a year ago so that employees across divisions and roles could get fast access to data rather than having to wait for a data analyst or data scientist to run a query for them. According to product manager James Mayfield, it’s designed to make it easier for novices to write SQL queries by giving them access to a visual interface, previews of the data they’re accessing, and the ability to share and reuse queries.

It sounds a little like the types of tools we often hear about inside data-driven companies like Facebook, as well as the new SQL platform from a startup called Mode.

At this point, Mayfield said, “Over a third of all the people working at Airbnb have issued a query through Airpal.” He added, “The learning curve for SQL doesn’t have to be that high.”

From the GitHub page:

Airpal is a web-based, query execution tool which leverages Facebook’s PrestoDB to make authoring queries and retrieving results simple for users. Airpal provides the ability to find tables, see metadata, browse sample rows, write and edit queries, then submit queries all in a web interface. Once queries are running, users can track query progress and when finished, get the results back through the browser as a CSV (download it or share it with friends). The results of a query can be used to generate a new Hive table for subsequent analysis, and Airpal maintains a searchable history of all queries run within the tool.


  • Optional Access Control
  • Syntax highlighting
  • Results exported to a CSV for download or a Hive table
  • Query history for self and others
  • Saved queries
  • Table finder to search for appropriate tables
  • Table explorer to visualize schema of table and first 1000 rows


  • Java 7 or higher
  • MySQL database
  • Presto 0.77 or higher
  • S3 bucket (to store CSVs)
  • Gradle 2.2 or higher

I understand to some degree the need to make SQL “simpler” but fail to see how simpler controls translate into a solution. The controls may be obvious enough but if I don’t know the semantics of the column headers, the simplicity of the interface won’t be terribly helpful.

Or to put it another way, users seem to be assumed to know the semantics of the tables they encounter. True/False?


Sunday, November 2nd, 2014

Open Source Distributed Analytics Engine with SQL interface and OLAP on Hadoop by eBay – Kylin by Avkash Chauhan.

From the post:

Key Features:

  • Extremely Fast OLAP Engine at Scale:
    • Kylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data
  • ANSI-SQL Interface on Hadoop:
    • Kylin offers ANSI-SQL on Hadoop and supports most ANSI-SQL query functions
  • Interactive Query Capability:
    • Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset
  • MOLAP Cube:
    • User can define a data model and pre-build in Kylin with more than 10+ billions of raw data records
  • Seamless Integration with BI Tools:
    • Kylin currently offers integration capability with BI Tools like Tableau.
  • Other Highlights:
    • Job Management and Monitoring
    • Compression and Encoding Support
    • Incremental Refresh of Cubes
    • Leverage HBase Coprocessor for query latency
    • Approximate Query Capability for distinct Count (HyperLogLog)
    • Easy Web interface to manage, build, monitor and query cubes
    • Security capability to set ACL at Cube/Project Level
    • Support LDAP Integration

Find it at Github:

Learn more at:

More info:

Kylin OLAP Group

Kylin Developer Mail

A useful write-up for an overview of Kylin: Announcing Kylin: Extreme OLAP Engine for Big Data, the blog post from eBay that announces the open sourcing of Kylin.

What caught my eye was the use of pre-calculation of combinations of dimensions using Hadoop. Sounds promising!

Dynamic Columns Tutorial – Part 1: Introduction

Friday, October 3rd, 2014

Dynamic Columns Tutorial – Part 1: Introduction by Max Mether.

From the post:

For certain situations, the static structure of tables in a relational database can be very limited. Each column is statically defined, has a pre-defined type and you can only enter a value of that type into the column.You can be creative and list multiple values in one column, but then those values are not generally easily accessed and manipulated with other functions. You have to use an API or contortions of a function like SUBSTRING() to pull out a value you want. Even then, you have to know what is contained in the column to be able to manipulate it properly. These methods can require too much manual intervention to assess and access the data contained in the column.

If you want to add columns as the information stored in your tables grows and your needs change you need to do fairly expensive ALTER TABLE operations. These have traditionally been very expensive in MySQL and MariaDB although the performance is a bit better starting with MariaDB 10.0 and MySQL 5.6.

The other option for having a flexible structure is to use something like Anchor Modeling . This allows you to have a very flexible schema as adding an attribute basically just means adding a table. The problem with this approach is that you’ll end up with a lot of tables which means a lot of joins when looking for results which can easily become un-manageable, or at least hard to manage.

This is where dynamic columns steps into the picture. A good solution to the static structure problem is to use dynamic columns provided in MariaDB. It allows flexibility within a defined structure, within a column. A Dynamic Column is defined as a BLOB on the DDL level. However, within the BLOB column, you may set arbitrarily and dynamically defined attributes and values–for a maximum of 64k.

Dynamic columns are not in isolation: The usual functions will work fine with the values contained within them. And they can be used as join points for joining to other table as you would normally. This allows you to retain the power of Relational SQL while still mainting a flexibility with regards to your attributes for specific tables where it makes sense.

Probably channeling topic maps when I observe that dynamic columns are associating multiple properties with a subject. 😉

Very interested in seeing how joins are performed using dynamic columns, but that awaits in a future post.

I first saw this in a tweet by MariaDB

Understanding weak isolation is a serious problem

Wednesday, September 17th, 2014

Understanding weak isolation is a serious problem by Peter Bailis.

From the post:

Modern transactional databases overwhelmingly don’t operate under textbook “ACID” isolation, or serializability. Instead, these databases—like Oracle 11g and SAP HANA—offer weaker guarantees, like Read Committed isolation or, if you’re lucky, Snapshot Isolation. There’s a good reason for this phenomenon: weak isolation is faster—often much faster—and incurs fewer aborts than serializability. Unfortunately, the exact behavior of these different isolation levels is difficult to understand and is highly technical. One of 2008 Turing Award winner Barbara Liskov’s Ph.D. students wrote an entire dissertation on the topic, and, even then, the definitions we have still aren’t perfect and can vary between databases.

To put this problem in perspective, there’s a flood of interesting new research that attempts to better understand programming models like eventual consistency. And, as you’re probably aware, there’s an ongoing and often lively debate between transactional adherents and more recent “NoSQL” upstarts about related issues of usability, data corruption, and performance. But, in contrast, many of these transactional inherents and the research community as a whole have effectively ignored weak isolation—even in a single server setting and despite the fact that literally millions of businesses today depend on weak isolation and that many of these isolation levels have been around for almost three decades.2

That debates are occurring without full knowledge of the issues at hand isn’t all that surprising. Or as Job 38:2 (KJV) puts it: “Who is this that darkeneth counsel by words without knowledge?”

Peter raises a number of questions and points to resources that are good starting points for investigation of weak isolation.

What sort of weak isolation does your topic map storage mechanism use?

I first saw this in a tweet by Justin Sheehy.

OrientDB Manual – version 1.7.8

Tuesday, August 26th, 2014

OrientDB Manual – version 1.7.8

From the post:

Welcome to OrientDB – the first Multi-Model Open Source NoSQL DBMS that brings together the power of graphs and the flexibility of documents into one scalable, high-performance operational database. OrientDB is sponsored by Orient Technologies, LTD.

OrientDB has features of both Document and Graph DBMSs. Written in Java and designed to be exceptionally fast: it can store up to 150,000 records per second on common hardware. Not only can it embed documents like any other Document database, but it manages relationships like Graph Databases with direct connections among records. You can traverse parts of or entire trees and graphs of records in a few milliseconds.

OrientDB supports schema-less, schema-full and schema-mixed modes and it has a strong security profiling system based on users and roles. Thanks to the SQL layer, OrientDB query language is straightforward and easy to use, especially for those skilled in the relational DBMS world.

Take a look at some OrientDB Presentations.

A new version of the documentation for OrientDB. I saw this last week but forgot to mention it.


Apache™ Spark™ v1.0

Friday, May 30th, 2014

Apache™ Spark™ v1.0

From the post:

The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 170 Open Source projects and initiatives, announced today the availability of Apache Spark v1.0, the super-fast, Open Source large-scale data processing and advanced analytics engine.

Apache Spark has been dubbed a “Hadoop Swiss Army knife” for its remarkable speed and ease of use, allowing developers to quickly write applications in Java, Scala, or Python, using its built-in set of over 80 high-level operators. With Spark, programs can run up to 100x faster than Apache Hadoop MapReduce in memory.

“1.0 is a huge milestone for the fast-growing Spark community. Every contributor and user who’s helped bring Spark to this point should feel proud of this release,” said Matei Zaharia, Vice President of Apache Spark.

Apache Spark is well-suited for machine learning, interactive queries, and stream processing. It is 100% compatible with Hadoop’s Distributed File System (HDFS), HBase, Cassandra, as well as any Hadoop storage system, making existing data immediately usable in Spark. In addition, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box.

New in v1.0, Apache Spark offers strong API stability guarantees (backward-compatibility throughout the 1.X series), a new Spark SQL component for accessing structured data, as well as richer integration with other Apache projects (Hadoop YARN, Hive, and Mesos).

Spark Homepage.

A bit more technical note of the release from the project:

Spark 1.0.0 is a major release marking the start of the 1.X line. This release brings both a variety of new features and strong API compatibility guarantees throughout the 1.X line. Spark 1.0 adds a new major component, Spark SQL, for loading and manipulating structured data in Spark. It includes major extensions to all of Spark’s existing standard libraries (ML, Streaming, and GraphX) while also enhancing language support in Java and Python. Finally, Spark 1.0 brings operational improvements including full support for the Hadoop/YARN security model and a unified submission process for all supported cluster managers.

You can download Spark 1.0.0 as either a source package (5 MB tgz) or a prebuilt package for Hadoop 1 / CDH3, CDH4, or Hadoop 2 / CDH5 / HDP2 (160 MB tgz). Release signatures and checksums are available at the official Apache download site.

What a nice way to start the weekend!

I first saw this in a tweet by Sean Owen.

How to find bugs in MySQL

Sunday, April 20th, 2014

How to find bugs in MySQL by Roel Van de Paar.

From the post:

Finding bugs in MySQL is not only fun, it’s also something I have been doing the last four years of my life.

Whether you want to become the next Shane Bester (who is generally considered the most skilled MySQL bug hunter worldwide), or just want to prove you can outsmart some of the world’s best programmers, finding bugs in MySQL is a skill not reserved anymore to top QA engineers armed with a loads of scripts, expensive flash storage and top-range server hardware. Off course, for professionals that’s still the way to go, but now anyone with an average laptop and a standard HDD can have a lot of fun trying to find that elusive crash…

If you follow this post carefully, you may well be able to find a nice crashing bug (or two) running RQG (an excellent database QA tool). Linux would be the preferred testing OS, but if you are using Windows as your main OS, I would recommend getting Virtual Box and running a Linux guest in a suitably sized (i.e. large) VM. In terms of the acronym “RQG”, this stands for “Random Query Generator,” also named “randgen.”

If you’re not just after finding any bug out there (“bug hunting”), you can tune the RQG grammars (files that define what sort of SQL RQG executes) to more or less match your “issue area.” For example, if you are always running into a situation where the server crashes on a DELETE query (as seen at the end of the mysqld error log for example), you would want an SQL grammar that definitely has a variety of DELETE queries in it. These queries should be closely matched with the actual crashing query – crashes usually happen due to exactly the same, or similar statements with the same clauses, conditions etc.

Just in case you feel a bit old for an Easter egg hunt today, consider going on a MySQL bug hunt.

Curious, do you know of RQG-like suites for noSQL databases?

PS: RQG Documentation (github)

Metaphor: Web-based Functorial Data Migration

Thursday, March 6th, 2014

Metaphor: Web-based Functorial Data Migration

From the webpage:

Metaphor is a web-based implementation of functorial data migration. David Spivak and Scott Morrison are the primary contributors.

I discovered this while running some of the FQL material to ground.

While I don’t doubt the ability of category theory to create mappings between relational schemas, what I am not seeing is the basis for the mapping.

In other words, assume I have two schemas with only one element in each one, firstName in one and givenName in the other. Certainly I can produce a mapping between those schemas.

Question: On what basis did I make such a mapping?

In other words, what properties of those subjects had to be the same or different in order for me to make that mapping?

Unless and until you know that, how can you be sure that your mappings agree with those I have made?

FQL: A Functorial Query Language

Thursday, March 6th, 2014

FQL: A Functorial Query Language

From the webpage:

The FQL IDE is a visual schema mapping tool for developing FQL programs. It can run FQL programs, generate SQL from FQL, generate FQL from SQL, and generate FQL from schema correspondences. Using JDBC, it can run transparently using an external SQL engine and on external database instances. It can output RDF/OWL/XML and comes with many built-in examples. David Spivak and Ryan Wisnesky are the primary contributors. Requires Java 7.

As if FQL and the IDE weren’t enough, papers, slides, source code await you.

I first saw this in a tweet by Computer Science.

Graph Databases – 250% Spike in Popularity – Really?

Saturday, January 25th, 2014

I prefer graph databases for a number of reasons but the rhetoric about them has gotten completely out of hand.

The most recent Internet rumor is that graph database had a 250% spike in popularity.


Care to guess how that “measurement” was taken? It was more intellectually honest than Office of Management and Budget‘s sequestration numbers, but only just.

Here are the parameters for the 250% increase:

  • Number of mentions of the system on websites, measured as number of results in search engines queries. At the moment, we use Google and Bing for this measurement. In order to count only relevant results, we are searching for “ database”, e.g. “Oracle database”.
  • General interest in the system. For this measurement, we use the frequency of searches in Google Trends.
  • Frequency of technical discussions about the system. We use the number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.
  • Number of job offers, in which the system is mentioned. We use the number of offers on the leading job search engines Indeed and Simply Hired.
  • Number of profiles in professional networks, in which the system is mentioned. We use the internationally most popular professional network LinkedIn.

We calculate the popularity value of a system by standardizing and averaging of the individual parameters. These mathematical transformations are made in a way ​​so that the distance of the individual systems is preserved. That means, when system A has twice as large a value in the DB-Engines Ranking as system B, then it is twice as popular when averaged over the individual evaluation criteria.

The DB-Engines Ranking does not measure the number of installations of the systems, or their use within IT systems. It can be expected, that an increase of the popularity of a system as measured by the DB-Engines Ranking (e.g. in discussions or job offers) precedes a corresponding broad use of the system by a certain time factor. Because of this, the DB-Engines Ranking can act as an early indicator. (emphasis added) (Source: DB-Engines)

So, this 250% increase in popularity is like a high school cheerleader election. Yes?

Oracle, may have signed several nation level contracts in the past year but are outdistanced in the rankings by twitter traffic?

Not what I would call reliable intelligence.

PS: the rumor apparently originates with: Tables turning? Graph databases see 250% spike in popularity by Lucy Carey.

Personally I can’t see how Lucy got 250% out of the reported numbers. There is a story about repeating something so often that it is believed. Do you remember it?

Phoenix: Incubating at Apache!

Sunday, January 12th, 2014

Phoenix: Incubating at Apache!

From the webpage:

Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.

Tired of reading already and just want to get started? Take a look at our FAQs, listen to the Phoenix talks from Hadoop Summit 2013 and HBaseConn 2013, and jump over to our quick start guide here.

To see whats supported, go to our language reference. It includes all typical SQL query statement clauses, including SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, etc. It also supports a full set of DML commands as well as table creation and versioned incremental alterations through our DDL commands. We try to follow the SQL standards wherever possible.

Incubating at Apache is no guarantee of success but it does mean sane licensing and a merit based organization/process.

If you are interested in non-NSA corrupted software, consider supporting the Apache Software Foundation.

Cheat Sheet: Hive for SQL Users

Sunday, December 15th, 2013

Cheat Sheet: Hive for SQL Users

What looks like a very useful quick reference to have on or near your desk.

I say “looks like” because so far I haven’t found a way to capture the file for printing.

On Hortonworks (link above), it displays in a slideshare-like window. You can scroll, mail to to others, etc., but no save-file.

I searched for the title and found another copy at Slideshare.

If you are guessing that means I can save it to my Slideshare folder, right in one. That still doesn’t get it to my local machine.

It does have all the major social networks listed for you to share/embed the slides.

But why would I want to propagate this sort of annoyance?

Better that I ask readers of this blog to ping Hortonworks and ask that the no-download approach in:

not be repeated. (Politely. Hortonworks has done an enormous about of work on the Hadoop ecosystem, all on its own dime. This is probably just poor judgment on the part of a non-techie in a business office somewhere.)

Codd’s Relational Vision…

Thursday, December 12th, 2013

Codd’s Relational Vision – Has NoSQL Come Full Circle? by Doug Turnbull.

From the post:

Recently, I spoke at NoSQL Matters in Barcelona about database history. As somebody with a history background, I was pretty excited to dig into the past, beyond the hype and marketing fluff, and look specifically at what technical problems each generation of database solved and where they in-turn fell short.

However, I got stuck at one moment in time I found utterly fascinating: the original development of relational databases. So much of the NoSQL movement feels like a rebellion against the “old timey” feeling relational databases. So I thought it would be fascinating to be a contrarian, to dig into what value relational databases have added to the world. Something everyone thinks is obvious but nobody really understands.

It’s very easy and popular to criticize relational databases. What folks don’t seem to do is go back and appreciate how revolutionary relational databases were when they came out. We forget what problems they solved. We forget how earlier databases fell short, and how relational databases solved the problems of the first generation of databases. In short, relational databases were the noSomething, and I aimed to find out what that something was.

And from that apply those lessons to today’s NoSQL databases. Are today’s databases repeating mistakes of the past? Or are they filling an important niche (or both?).

This is a must read article if you are not choosing databases based on marketing hype.

It’s nice to hear IT history taken seriously.

Neo4j: The ‘thinking in graphs’ curve

Thursday, November 28th, 2013

Neo4j: The ‘thinking in graphs’ curve by Mark Needham

From the post:

In a couple of Neo4j talks I’ve done recently I’ve been asked how long it takes to get used to modelling data in graphs and whether I felt it’s simpler than alternative approaches.

My experience of ‘thinking in graphs’™ closely mirrors what I believe is a fairly common curve when learning technologies which change the way you think:

Learning curve for graphs

There is an initial stage where it seems really hard because it’s different to what we’re used to and at this stage we might want to go back to what we’re used to.

If we persevere, however, we will get over that hump and after that it will be difficult to imagine another way of modelling data – at least in domains where the connections between different bits of data are important.

Once we’re over the hump data modelling should seem like fun and we’ll start looking to see whether we can use graphs to solve other problems we’ve got.

I wasn’t sure whether modelling in graphs is simpler than alternative approaches so as a thought experiment I decided to see what part of my football graph would look like if it was modelled in a relational database.

See Mark’s post for the comparison between a normalized relational database model versus a graph model.

I suspect Mark is right about the difficulty of moving from a fully normalized relational paradigm to graphs, but no one grows up thinking in normalized relational databases.

Remember your first encounter with databases (mine was DBase III or was that DBase II?), the normalized relational paradigm seemed unnatural. On a par with unmentionable practices.

Here’s an experiment you can try with non-IT and IT people.

Show both groups Mark’s diagrams and ask them which one is easier to understand?

I think you know where my money is riding. 😉

Could be useful empirical knowledge in terms of preparing educational materials for the respective groups.


Thursday, November 28th, 2013


From the overview:

InfiniSQL is a relational database management system (RDBMS) composed entirely from the ground up. InfiniSQL’s goals are:

  • Horizontal Scalability
  • Continuous Availability
  • High Throughput
  • Low Latency
  • High Performance For Complex, Multi-Host Transactions
  • Ubiquity

InfiniSQL has been tested to support over 500,000 complex transactions per second with over 100,000 simultaneous connections. This was on a cluster of only 12 single socket x86-64 servers. Subscribed hardware in this environment was exhausted from this effort–so the true upper limits of capacity are unknown. InfiniSQL’s scalability across multiple nodes appears to be limitless!

From what I read on the website, InfiniSQL operates entirely in memory and so has not hit the I/O barrier to storage.

Very much at alpha stage of development but the “500,000 complex transactions per second” is enough to make it worth watching.


Friday, November 15th, 2013


Billed as:

The Irresistible Database for Java Combining Great RDBMS and NoSQL Features.

Supposed to appear by the end of November 2013 so it isn’t clear if SQL, NoSQL are about to be joined by Irresistable as a database category or not. 😉

The following caught my eye:

Very fast joins with graph-based relations

A single join has O(1) time complexity. A combination of multiple joins is internally processed as graph traversal with smart query optimization.

Without details, “very fast” has too wide a range of meanings to be very useful.

I don’t agree with the evaluation of Performance for RDBMS as “Limited.” People keep saying that as a truism when performance of any data store depends upon the architecture, data model, caching, etc.

I saw a performance test recently that depended upon (hopefully) a mis-understanding of one of the subjects of comparison. No surprise that it did really poorly in the comparison.

On the other hand, I am looking forward to the release of OhmDB as an early holiday surprise!

PS: I did subscribe to the newsletter on the theory that enough legitimate email might drown out the spam I get.

An In-Depth Look at Modern Database Systems

Sunday, October 27th, 2013

An In-Depth Look at Modern Database Systems by C. Mohan.


This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source, commercial and research systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented.

This is a revised version of a tutorial presented first at the 39th International Conference on Very Large Databases (VLDB2013) in Riva del Garda, Italy in August 2013. This is also a follow up to my EDBT2013 keynote talk “History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla” (see the paper at

Latest Bibliography.

The one thing I have not found for this tutorial is a video!

While highly enjoyable (from my perspective), detailed analysis of the database platforms and the ideas they missed or incorporated would be even more valuable.

It is one thing to say generally that an idea was missed and quite another to obtain agreement on that point.

A series of workshops documenting the intellectual history of databases would go a long way to hastening progress, as opposed to proliferation of wheels.

…Hive Functions in Hadoop

Sunday, September 22nd, 2013

Cheat Sheet: How To Work with Hive Functions in Hadoop by Marc Holmes.

From the post:

Just a couple of weeks ago we published our simple SQL to Hive Cheat Sheet. That has proven immensely popular with a lot of folk to understand the basics of querying with Hive. Our friends at Qubole were kind enough to work with us to extend and enhance the original cheat sheet with more advanced features of Hive: User Defined Functions (UDF). In this post, Gil Allouche of Qubole takes us from the basics of Hive through to getting started with more advanced uses, which we’ve compiled into another cheat sheet you can download here.

The cheat sheet will be useful but so is this observation in the conclusion of the post:

One of the key benefits of Hive is using existing SQL knowledge, which is a common skill found across business analysts, data analysts, software engineers, data scientist and others. Hive has nearly no barriers for new users to start exploring and analyzing data.

I’m sure use of existing SQL knowledge isn’t the only reason for Hive’s success, but the Hive PowerBy page shows it didn’t hurt!

Something to think about in creating a topic map query language. Yes, the queries executed by an engine will be traversing a topic map graph, but presenting it to the user as a graph query isn’t required.