Archive for the ‘Database’ Category


InfiniSQL

Thursday, November 28th, 2013


From the overview:

InfiniSQL is a relational database management system (RDBMS) composed entirely from the ground up. InfiniSQL’s goals are:

  • Horizontal Scalability
  • Continuous Availability
  • High Throughput
  • Low Latency
  • High Performance For Complex, Multi-Host Transactions
  • Ubiquity

InfiniSQL has been tested to support over 500,000 complex transactions per second with over 100,000 simultaneous connections. This was on a cluster of only 12 single socket x86-64 servers. Subscribed hardware in this environment was exhausted from this effort–so the true upper limits of capacity are unknown. InfiniSQL’s scalability across multiple nodes appears to be limitless!

From what I read on the website, InfiniSQL operates entirely in memory and so has not hit the I/O barrier to storage.

InfiniSQL is very much at the alpha stage of development, but “500,000 complex transactions per second” is enough to make it worth watching.


OhmDB

Friday, November 15th, 2013


Billed as:

The Irresistible Database for Java Combining Great RDBMS and NoSQL Features.

Supposed to appear by the end of November 2013, so it isn’t clear whether SQL and NoSQL are about to be joined by “Irresistible” as a database category or not. 😉

The following caught my eye:

Very fast joins with graph-based relations

A single join has O(1) time complexity. A combination of multiple joins is internally processed as graph traversal with smart query optimization.

Without details, “very fast” has too wide a range of meanings to be very useful.
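Absent those details, here is one plausible reading of the O(1) claim, sketched in Python: if each row stores direct references to its related rows (“graph-based relations”), then a single join step is a pointer dereference rather than an index probe. The names below are purely illustrative, not OhmDB’s API.

```python
# Hypothetical sketch: rows hold direct references to related rows,
# so one "join" step is an O(1) pointer dereference, not an index lookup.
customers = {
    1: {"name": "Ada"},
    2: {"name": "Bob"},
}

# Each order stores a reference to its customer row at write time.
orders = [
    {"id": 101, "customer": customers[1]},
    {"id": 102, "customer": customers[2]},
]

def order_customer_name(order):
    # The join from order to customer follows the stored reference,
    # independent of how many customers exist.
    return order["customer"]["name"]
```

A chain of such joins then becomes a graph traversal, which is presumably what the “smart query optimization” in the quote reorders.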

I don’t agree with the evaluation of RDBMS performance as “Limited.” People keep repeating that as a truism, when the performance of any data store depends upon its architecture, data model, caching, etc.

I saw a performance test recently that depended upon (hopefully) a misunderstanding of one of the subjects of comparison. No surprise that it did really poorly in the comparison.

On the other hand, I am looking forward to the release of OhmDB as an early holiday surprise!

PS: I did subscribe to the newsletter on the theory that enough legitimate email might drown out the spam I get.

List of NoSQL Databases (150 at present count)

Wednesday, October 30th, 2013

List of NoSQL Databases (150 at present count)

A tweet by John Troon pointed me to the current NoSQL listing, standing at 150 entries.

Is there a betting pool on how many more will appear by May 1, 2014?

Just curious.

An In-Depth Look at Modern Database Systems

Sunday, October 27th, 2013

An In-Depth Look at Modern Database Systems by C. Mohan.


This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source, commercial and research systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented.

This is a revised version of a tutorial presented first at the 39th International Conference on Very Large Databases (VLDB2013) in Riva del Garda, Italy in August 2013. This is also a follow up to my EDBT2013 keynote talk “History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla” (see the paper at

Latest Bibliography.

The one thing I have not found for this tutorial is a video!

While highly enjoyable (from my perspective), detailed analysis of the database platforms and the ideas they missed or incorporated would be even more valuable.

It is one thing to say generally that an idea was missed and quite another to obtain agreement on that point.

A series of workshops documenting the intellectual history of databases would go a long way to hastening progress, as opposed to proliferation of wheels.

Database and Query Analysis Tools for MySQL:…

Thursday, October 24th, 2013

Database and Query Analysis Tools for MySQL: Exploiting Hypertree and Hypergraph Decompositions by Selvameenal Chokkalingam.


A database is an organized collection of data. Database systems are widely used and have a broad range of applications. It is thus essential to find efficient database query evaluation techniques. In the recent years, new theories and algorithms for database query optimization have been developed that exploit advanced graph theoretic concepts. In particular, the graph theoretic concepts of hypergraphs, hypergraph decompositions, and hypertree decompositions have played an important role in the recent research.

This thesis studies algorithms that employ hypergraph decompositions in order to detect the cyclic or acyclic degree of database schema, and describes implementations of those algorithms. The main contribution of this thesis is a collection of software tools for MySQL that exploit hypergraph properties associated with database schema and query structures.

If you remember hypergraphs from database theory this may be the refresher for you.
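For a taste of what the thesis implements: the classic test for schema acyclicity is the GYO reduction, which repeatedly strips attributes that occur in only one hyperedge and drops hyperedges absorbed by others; the schema is acyclic iff the hypergraph reduces to nothing. A minimal sketch of that reduction (my own, not code from the thesis):

```python
from collections import Counter

def gyo_acyclic(schema):
    """GYO reduction: treat each relation as a hyperedge of attributes.
    The hypergraph is acyclic iff repeated reduction empties it."""
    edges = [set(e) for e in schema]
    changed = True
    while changed:
        changed = False
        # 1. Remove attributes that occur in exactly one hyperedge.
        count = Counter(a for e in edges for a in e)
        for e in edges:
            lone = {a for a in e if count[a] == 1}
            if lone:
                e -= lone
                changed = True
        # 2. Remove empty hyperedges and those contained in another.
        kept = []
        for i, e in enumerate(edges):
            absorbed = any(i != j and e <= f for j, f in enumerate(edges))
            if not e or absorbed:
                changed = True
            else:
                kept.append(e)
        edges = kept
    return not edges
```

A join-tree schema like R(a,b), S(b,c), T(c,d) reduces away (acyclic), while the triangle R(a,b), S(b,c), T(a,c) gets stuck (cyclic).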

I stumbled across it earlier today while running down references on hypergraphs.

RaptorDB – the Document Store

Friday, October 11th, 2013

RaptorDB – the Document Store by Mehdi Gholam.

From the post:

This article is the natural progression from my previous article about a persisted dictionary to a full blown NoSql document store database. While a key/value store is useful, it’s not as useful to everybody as a "real" database with "columns" and "tables". RaptorDB uses the following articles:

Some advanced R&D (for more than a year) went into RaptorDB, in regards to the hybrid bitmap index. Similar technology is being used by Microsoft’s Power Pivot for Excel and US Department of Energy Berkeley labs project called fastBit to track terabytes of information from particle simulations. Only the geeks among us care about this stuff and the normal person just prefer to sit in the Bugatti Veyron and drive, instead of marvel at the technological underpinnings.

To get here was quite a journey for me as I had to create a lot of technology from scratch, hopefully RaptorDB will be a prominent alternative, built on the .net platform to other document databases which are either java or c++ based. 

RaptorDB puts the joy back into programming, as you can see in the sample application section.

If you want to take a deep dive into a .net project, this may be the one for you.

The use of fastBit, developed at the US Department of Energy’s Berkeley Lab, is what caught my attention.

A project using DOE-developed software merits a long pause.

Latest version is dated October 10, 2013.

Bitsy 1.5

Friday, October 11th, 2013

Bitsy 1.5

Version 1.5 of Bitsy is out!

Bitsy is a small, fast, embeddable, durable in-memory graph database that implements the Blueprints API.

Slides: Improvements in Bitsy 1.5 by Sridhar Ramachandran.

The current production version is Bitsy 1.2; Bitsy 1.5 is for research, evaluation and development. The webpage reports that Bitsy 1.5 should be available for production by the end of 2013.


PostgreSQL 9.3 released!

Wednesday, September 11th, 2013

PostgreSQL 9.3 released!

From the post:

The PostgreSQL Global Development Group announces the release of PostgreSQL 9.3, the latest version of the world’s leading open source relational database system. This release expands PostgreSQL’s reliability, availability, and ability to integrate with other databases. Users are already finding that they can build applications using version 9.3 which would not have been possible before.

“PostgreSQL 9.3 provides features that as an app developer I can use immediately: better JSON functionality, regular expression indexing, and easily federating databases with the Postgres foreign data wrapper. I have no idea how I completed projects without 9.3,” said Jonathan S. Katz, CTO of VenueBook.

From the what’s new page, an item of particular interest:

Writeable Foreign Tables:

“Foreign Data Wrappers” (FDW) were introduced in PostgreSQL 9.1, providing a way of accessing external data sources from within PostgreSQL using SQL. The original implementation was read-only, but 9.3 will enable write access as well, provided the individual FDW drivers have been updated to support this. At the time of writing, only the Redis and PostgreSQL drivers have write support (need to verify this).

I haven’t gotten through the documentation on FDW but for data integration it sounds quite helpful.

Assuming you document the semantics of the data you are writing back and forth. 😉

This is a use case for a topic map that spans both the local and “foreign” data sources, or for separate topic maps over the local and “foreign” data sources that can then be merged together.

DHS Bridging Siloed Databases [Comments?]

Monday, August 26th, 2013

DHS seeks to bridge siloed databases by Adam Mazmanian.

From the post:

The Department of Homeland Security plans to connect databases containing information on legal foreign visitors as a prototype of a system to consolidate identity information from agency sources. The prototype is a first step in what could turn into comprehensive records overhaul that would erase lines between the siloed databases kept by DHS component agencies.

Currently, DHS personnel can access information from across component databases under the “One DHS” policy, but access can be hindered by the need to log into multiple systems and make multiple queries. The Common Entity Index (CEI) prototype pulls biographical information from DHS component agencies and correlates the data into a single comprehensive record. The CEI prototype is designed to find linkages inside source data – names and addresses as well as unique identifiers like passport and alien registration numbers – and connect the dots automatically, so DHS personnel do not have to.

DHS is trying to determine whether it is feasible to create “a centralized index of select biographic information that will allow DHS to provide a consolidated and correlated record, thereby facilitating and improving DHS’s ability to carry out its national security, homeland security, law enforcement, and benefits missions,” according to a notice in the Aug. 23 Federal Register.
(…) (emphasis added)

Adam goes on to summarize the data sources that DHS wants to include in its “centralized index of select biographic information.”

There isn’t enough information in the Federal Register notice to support technical comments on the prototype.

However, some comments about subject identity and the role of topic maps in collating information from diverse resources would not be inappropriate.

Especially since all public comments are made visible at:

DB-Engines Ranking

Wednesday, August 7th, 2013

Method of calculating the scores of the DB-Engines Ranking

From the webpage:

The DB-Engines Ranking is a list of database management systems ranked by their current popularity. We measure the popularity of a system by using the following parameters:

  • Number of mentions of the system on websites, measured as number of results in search engines queries. At the moment, we use Google and Bing for this measurement. In order to count only relevant results, we are searching for “<system name> database”, e.g. “Oracle database”.
  • General interest in the system. For this measurement, we use the frequency of searches in Google Trends.
  • Frequency of technical discussions about the system. We use the number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.
  • Number of job offers, in which the system is mentioned.
    We use the number of offers on the leading job search engines Indeed and Simply Hired.

  • Number of profiles in professional networks, in which the system is mentioned. We use the internationally most popular professional network LinkedIn.

We calculate the popularity value of a system by standardizing and averaging of the individual parameters. These mathematical transformations are made in a way so that the distance of the individual systems is preserved. That means, when system A has twice as large a value in the DB-Engines Ranking as system B, then it is twice as popular when averaged over the individual evaluation criteria.

The DB-Engines Ranking does not measure the number of installations of the systems, or their use within IT systems. It can be expected, that an increase of the popularity of a system as measured by the DB-Engines Ranking (e.g. in discussions or job offers) precedes a corresponding broad use of the system by a certain time factor. Because of this, the DB-Engines Ranking can act as an early indicator. (emphasis added in last paragraph)
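DB-Engines doesn’t publish the exact formula, but a ratio-preserving “standardize and average” can be sketched as a geometric mean of per-parameter ratios: scale each parameter against the best-scoring system, then average in log space. This is my guess at the shape of the calculation, not their actual code.

```python
import math

def popularity_scores(raw):
    """raw: {system: {parameter: value}}. Standardize each parameter
    against the largest observed value, then average in log space
    (a geometric mean), which preserves ratios between systems."""
    params = sorted(next(iter(raw.values())))
    best = {p: max(vals[p] for vals in raw.values()) for p in params}
    scores = {}
    for system, vals in raw.items():
        logs = [math.log(vals[p] / best[p]) for p in params]
        scores[system] = math.exp(sum(logs) / len(logs))
    return scores
```

With system A scoring twice system B on every parameter, A’s final score comes out exactly twice B’s, matching the “twice as popular” property claimed in the quote; an arithmetic mean over raw values would not preserve that.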

I mention this ranking explanation for two reasons.

First, it is a remarkably honest statement about how databases are ranked. It is as if the RIAA were to admit their “piracy” estimates are chosen more for verbal impact than for any relationship to a measurable reality.

Second, it demonstrates that semantics are lurking just behind the numbers of any ranking. True, DB-Engines said some ranking was X, but anyone who relies on that ranking needs to evaluate how it was arrived at.

Updated Database Landscape map – June 2013

Tuesday, June 11th, 2013

Updated Database Landscape map – June 2013 by Matthew Aslett.

database map

I appreciate all the work that went into the creation of the map but even in a larger size (see Matthew’s post), I find it difficult to use.

Or perhaps that’s part of the problem, I don’t know what use it was intended to serve?

If I understand the legend, then “search” isn’t found in the relational or grid/cache zones. Which I am sure would come as a surprise to the many vendors and products in those zones.

Moreover, the ordering of entries along each colored line isn’t clear. Taking graph databases for example, they are listed from top to bottom:

But GrapheneDB is Neo4j as a service. So shouldn’t they be together?

I have included links to all the listed graph databases in case you can see a pattern that I am missing.

BTW, GraphLab, which in May of 2013 raised $6.75M for further development (GraphLab – Next Generation [Johnny Come Lately VCs]), and GraphChi, a project at GraphLab, were both omitted from this list.

Are there other graph databases that are missing?

How would you present this information differently? What ordering would you use? What other details would you want to have accessible?

A Trillion Dollar Math Trick

Saturday, June 1st, 2013

A Trillion Dollar Math Trick by Dick Lipton.

Dick reviews a presentation by Mike Stonebraker at TTI Vanguard meeting on “Ginormous Systems” in DC.

In part:

In Mike’s wonderful talk he made seven points about the past, present, and the future of database technology. He has a great track record, so likely he is mostly right on his guesses. One of his predictions was about a way of re-organizing databases that has several remarkable properties:

  • It speeds up database operations 50x. That is to say, on typical queries—ones that companies actually do—it is fifty times faster than classical database implementations. As a theorist we like speedups, especially asymptotic ones. But 50x is pretty cool. That is enough to change a query from an hour to a minute.
  • It is not a new idea. But the time is finally right, and Mike predicts that future databases will use this method.
  • It is an idea that no one seems to know who invented it. I asked Mike, I asked other experts at the conference, and all shrugged and said effectively: “I have no idea.” Curious.

Let’s look quickly at the way databases work, and then consider the trick.

I won’t spoil the surprise for you, see Dick’s post for the details.

BTW, read the comments on historical uses of the same idea.

Then think about how to apply to topic maps.

I first saw this in Christophe Lalanne’s A bag of tweets / May 2013.

Designing Databases for Historical Research

Friday, May 3rd, 2013

Designing Databases for Historical Research by Matt Phillpott.

From the post:

The Institute of Historical Research now offer a wide selection of digital research training packages designed for historians and made available online on History SPOT. Most of these have received mention on this blog from time to time and hopefully some of you will have had a good look at them. These courses are freely available and we only ask that you register for History SPOT to access them (which is a free and easy process). Full details of our online and face-to-face courses can also be found on the IHR website. Here is a brief look at one of them.

Designing Databases for Historical Research was one of two modules that we launched alongside History SPOT late in 2011. Unlike most courses on databases that are generic in scope, this module focuses very much on the historian and his/her needs. The module is written in a handbook format by Dr Mark Merry. Mark runs our face to face databases course and is very much the man to go to for advice on building databases to house historical data.

The module looks at the theory behind using databases rather than showing you how to build them. It is very much a starting point, a place to go to before embarking on the lengthy time that databases require of their creators. Is your historical data appropriate for database use or should a different piece of software be used? What things should you consider before starting the database? Getting it right from the very beginning does save you a lot of time and frustration later on.

If you need more convincing then here is a snippet from the module, where Mark discusses the importance of thinking about the data and database before you even open up the software.

Great background material if you are working in history or academic circles.

LevelDB Review (in 18 parts, seriously)

Wednesday, May 1st, 2013

My first encounter with this series by Oren Eini was: Reviewing LevelDB: Part XVIII–Summary.

At first I thought it had to be a late April Fool’s day joke.

On further investigation, much to my delight, it was not!

Searching his blog returned a hodge-podge listing in no particular order, with some omissions.

As a service to you (and myself), I have collated the posts in order:

Reviewing LevelDB, Part I: What is this all about?

Reviewing LevelDB: Part II, Put some data on the disk, dude

Reviewing LevelDB: Part III, WriteBatch isn’t what you think it is

Reviewing LevelDB: Part IV: On std::string, buffers and memory management in C++

Reviewing LevelDB: Part V, into the MemTables we go

Reviewing LevelDB: Part VI, the Log is base for Atomicity

Reviewing LevelDB: Part VII–The version is where the levels are

Reviewing LevelDB: Part VIII–What are the levels all about?

Reviewing RaveDB [LevelDB]: Part IX- Compaction is the new black

Reviewing LevelDB: Part X–table building is all fun and games until…

Reviewing LevelDB: Part XI–Reading from Sort String Tables via the TableCache

Reviewing RavenDB [LevelDB]: Part XII–Reading an SST

Reviewing LevelDB: Part XIII–Smile, and here is your snapshot

Reviewing LevelDB: Part XIV– there is the mem table and then there is the immutable memtable

Reviewing LevelDB: Part XV–MemTables gets compacted too

Reviewing LevelDB: Part XVI–Recovery ain’t so tough?

Reviewing LevelDB: Part XVII– Filters? What filters? Oh, those filters …

Reviewing LevelDB: Part XVIII–Summary

Parts IX and XII have typos in the titles, RavenDB instead of LevelDB.

Now there is a model for reviewing database software!

Strange Loop 2013

Saturday, April 27th, 2013

Strange Loop 2013


  • Call for presentation opens: Apr 15th, 2013
  • Call for presentation ends: May 9, 2013
  • Speakers notified by: May 17, 2013
  • Registration opens: May 20, 2013
  • Conference dates: Sept 18-20th, 2013

From the webpage:

Below is some guidance on the kinds of topics we are seeking and have historically accepted.

  • Frequently accepted or desired topics: functional programming, logic programming, dynamic/scripting languages, new or emerging languages, data structures, concurrency, database internals, NoSQL databases, key/value stores, big data, distributed computing, queues, asynchronous or dataflow concurrency, STM, web frameworks, web architecture, performance, virtual machines, mobile frameworks, native apps, security, biologically inspired computing, hardware/software interaction, historical topics.
  • Sometimes accepted (depends on topic): Java, C#, testing frameworks, monads
  • Rarely accepted (nothing wrong with these, but other confs cover them well): Agile, JavaFX, J2EE, Spring, PHP, ASP, Perl, design, layout, entrepreneurship and startups, game programming

It isn’t clear why Strange Loop claims to have “archives:”


As far as I can tell, these are listings with bios of prior presentations, but no substantive content.

Am I missing something?

Schema on Read? [The virtues of schema on write]

Friday, April 19th, 2013

Apache Hadoop and Data Agility by Ofer Mendelevitch.

From the post:

In a recent blog post I mentioned the 4 reasons for using Hadoop for data science. In this blog post I would like to dive deeper into the last of these reasons: data agility.

In most existing data architectures, based on relational database systems, the data schema is of central importance, and needs to be designed and maintained carefully over the lifetime of the project. Furthermore, whatever data fits into the schema will be stored, and everything else typically gets ignored and lost. Changing the schema is a significant undertaking, one that most IT organizations don’t take lightly. In fact, it is not uncommon for a schema change in an operational RDBMS system to take 6-12 months if not more.

Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.

If a schema is supplied “on read,” how is data validation accomplished?

I don’t mean in terms of datatypes such as string, integer, double, etc. Those are trivial forms of data validation.

How do we validate the semantics of data when a schema is supplied on read?

Mistakes do happen in RDBMS systems but with a schema, which defines data semantics, applications can attempt to police those semantics.

I don’t doubt that schema “on read” supplies a lot of useful flexibility, but how do we limit the damage that flexibility can cause?

For example, many years ago, area codes (for telephones) in the USA were tied to geographic exchanges. Data from the era still exists in the bowels of some data stores. That is no longer true in many cases.

Let’s assume I have older data that has area codes tied to geographic areas and newer data that has area codes that are not. Without a schema to define the area code data in both cases, how would I know to treat the area code data differently?
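The area-code scenario can be made concrete. With schema on read, the knowledge that older records use geographically bound area codes has to live in the reading application, not in the store. A hypothetical sketch (the cutover year and field names are illustrative only, not historical fact):

```python
# Hypothetical sketch: two eras of phone records stored schema-free.
# The semantics ("is this area code geographic?") must be supplied on read.
raw_records = [
    {"phone": "415-555-0101", "captured": 1985},
    {"phone": "415-555-0199", "captured": 2012},
]

CUTOVER_YEAR = 1995  # illustrative cutover, not an authoritative date

def read_record(record, cutover=CUTOVER_YEAR):
    # The reader's "schema" encodes the era-dependent interpretation.
    area_code = record["phone"].split("-")[0]
    return {
        "area_code": area_code,
        "geographic": record["captured"] < cutover,
    }
```

Nothing in the store itself stops a different reader from applying the wrong interpretation to the same bytes, which is exactly the validation gap at issue here.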

I concede that schema “on read” can be quite flexible.

On the other hand, let’s not discount the value of schema “on write” as well.

How to Compare NoSQL Databases

Friday, April 19th, 2013

How to Compare NoSQL Databases by Ben Engber. (video)

From the description:

Ben Engber, CEO and founder of Thumbtack Technology, will discuss how to perform tuned benchmarking across a number of NoSQL solutions (Couchbase, Aerospike, MongoDB, Cassandra, HBase, others) and to do so in a way that does not artificially distort the data in favor of a particular database or storage paradigm. This includes hardware and software configurations, as well as ways of measuring to ensure repeatable results.

We also discuss how to extend benchmarking tests to simulate different kinds of failure scenarios to help evaluate the maintainability and recoverability of different systems. This requires carefully constructed tests and significant knowledge of the underlying databases — the talk will help evaluators overcome the common pitfalls and time sinks involved in trying to measure this.

Lastly we discuss the YCSB benchmarking tool, its significant limitations, and the significant extensions and supplementary tools Thumbtack has created to provide distributed load generation and failure simulation.

Ben makes a very good case for understanding the details of your use case versus the characteristics of particular NoSQL solutions.

Where you will find “better” performance depends on non-obvious details.

Watch the use of terms like “consistency” in this presentation.

The paper Ben refers to: Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs.

Forty-three pages of analysis and charts.

Slow but interesting reading.

If you are into the details of performance and NoSQL databases.


Saturday, April 6th, 2013

ODBMS.ORG – Object Database Management Systems

From the “about” page:

Launched in 2005, ODBMS.ORG was created to serve faculty and students at educational and research institutions as well as software developers in the open source community or at commercial companies.

It is designed to meet the fast-growing need for resources focusing on Big Data, Analytical data platforms, Scalable Cloud platforms, Object databases, Object-relational bindings, NoSQL databases, Service platforms, and new approaches to concurrency control.

This portal features an easy introduction to ODBMSs as well as free software, lecture notes, tutorials, papers and other resources for free download. It is complemented by listings of relevant books and vendors to provide a comprehensive and up-to-date overview of available resources.

The Expert Section contains exclusive contributions from 130+ internationally recognized experts such as Suad Alagic, Scott Ambler, Michael Blaha, Jose Blakeley, Rick Cattell, William Cook, Ted Neward, and Carl Rosenberger.

The ODBMS Industry Watch Blog is part of this portal and contains up to date Information, Trends, and Interviews with industry leaders on Big Data, New Data Stores (NoSQL, NewSQL Databases), New Developments and New Applications for Objects and Databases, New Analytical Data Platforms, Innovation.

The portal’s editor, Roberto V. Zicari, is Professor of Database and Information Systems at Frankfurt University and representative of the Object Management Group (OMG) in Europe. His interest in object databases dates back to his work at the IBM Research Center in Almaden, CA, in the mid ’80s, when he helped craft the definition of an extension of the relational data model to accommodate complex data structures. In 1989, he joined the design team of the Gip Altair project in Paris, later to become O2, one of the world’s first object database products.

All materials and downloads are free and anonymous.

Non-profit ODBMS.ORG is made possible by contributions from ODBMS.ORG’s Panel of Experts, and the support of the sponsors displayed in the right margin of these pages.

The free download page is what first attracted my attention.

By any measure, a remarkable collection of material.

Ironic isn’t it?

CS needs to develop better access strategies for its own output.

Database Landscape Map – February 2013

Wednesday, March 27th, 2013

Database Landscape Map – February 2013 by 451 Research.

Database map

A truly awesome map of available databases.

Originated from Neither fish nor fowl: the rise of multi-model databases by Matthew Aslett.

Matthew writes:

One of the most complicated aspects of putting together our database landscape map was dealing with the growing number of (particularly NoSQL) databases that refuse to be pigeon-holed in any of the primary databases categories.

I have begun to refer to these as “multi-model databases” in recognition of the fact that they are able to take on the characteristics of multiple databases. In truth though there are probably two different groups of products that could be considered “multi-model”:

I think I understand the grouping from the key to the map but the ordering within groups, if meaningful, escapes me.

I am sure you will recognize most of the names but equally sure there will be some you can’t quite describe.


VLDB 2013

Monday, March 18th, 2013

39th International Conference on Very Large Data Bases


Submissions still open:

Industrial & Application Papers, Demonstration Proposals, Tutorial Proposals, PhD Workshop Papers, due by March 31st, 2013, author notification: May 31st, 2013

Conference: August 26 – 30, 2013.

From the webpage:

VLDB is a premier annual international forum for data management and database researchers, vendors, practitioners, application developers, and users. The conference will feature research talks, tutorials, demonstrations, and workshops. It will cover current issues in data management, database and information systems research. Data management and databases remain among the main technological cornerstones of emerging applications of the twenty-first century.

VLDB 2013 will take place at the picturesque town of Riva del Garda, Italy. It is located close to the city of Trento, on the north shore of Lake Garda, which is the largest lake in Italy, formed by glaciers at the end of the last ice age. In the 17th century, Lake Garda became a popular destination for young central European nobility. The list of its famous guests includes Goethe, Freud, Nietzsche, the Mann brothers, Kafka, Lawrence, and more recently James Bond. Lake Garda attracts many tourists every year, and offers numerous opportunities for sightseeing in the towns along its shores (e.g., Riva del Garda, Malcesine, Torri del Benaco, Sirmione), outdoors activities (e.g., hiking, wind-surfing, swimming), as well as fun (e.g., Gardaland amusement theme park).

Smile when you point “big data” colleagues to 1st Very Large Data Bases VLDB 1975: Framingham, Massachusetts.

Some people catch on sooner than others. 😉

The god Architecture

Saturday, March 9th, 2013

The god Architecture

From the overview:

god is a scalable, performant, persistent, in-memory data structure server. It allows massively distributed applications to update and fetch common data in a structured and sorted format.

Its main inspirations are Redis and Chord/DHash. Like Redis it focuses on performance, ease of use and a small, simple yet powerful feature set, while from the Chord/DHash projects it inherits scalability, redundancy, and transparent failover behaviour.

This is a general architectural overview aimed at somewhat technically inclined readers interested in how and why god does what it does.

To try it out right now, install Go, git, Mercurial and gcc, go get, run god_server, browse to http://localhost:9192/.

For API documentation, go to

For the source, go to

I know, “in memory” means it’s not “web scale,” but to be honest, I have a lot of data needs that aren’t “web scale.”

There, I’ve said it. Some (most?) important data is not “web scale.”

And when it is, I only have to check my spam filter for options to deal with “web scale” data.

The set operations in particular look quite interesting.


I first saw this in Nat Torkington’s Four short links: 1 March 2013.

Survey of graph database models

Thursday, February 14th, 2013

Survey of graph database models by Renzo Angles and Claudio Gutierrez. (ACM Computing Surveys (CSUR) Surveys, Volume 40 Issue 1, February 2008, Article No. 1 )


Graph database models can be defined as those in which data structures for the schema and instances are modeled as graphs or generalizations of them, and data manipulation is expressed by graph-oriented operations and type constructors. These models took off in the eighties and early nineties alongside object-oriented models. Their influence gradually died out with the emergence of other database models, in particular geographical, spatial, semistructured, and XML. Recently, the need to manage information with graph-like nature has reestablished the relevance of this area. The main objective of this survey is to present the work that has been conducted in the area of graph database modeling, concentrating on data structures, query languages, and integrity constraints.

If you need an antidote for graph database hype, look no further than this thirty-nine (39) page survey article.

You will come away with a deeper appreciation of graph databases and their history.

If you are looking for a self-improvement reading program, you could do far worse than starting with this article and reading the cited references one by one.

History SPOT

Friday, February 8th, 2013

History SPOT

I discovered this site via a post entitled: Text Mining for Historians: Natural Language Processing.

From the webpage:

Welcome to History SPOT. This is a subsite of the IHR [Institute of Historical Research] website dedicated to our online research training provision. On this page you will find the latest updates regarding our seminar podcasts, online training courses and History SPOT blog posts.

Currently offered online training courses (free registration required):

  • Designing Databases for Historians
  • Podcasting for Historians
  • Sources for British History on the Internet
  • Data Preservation
  • Digital Tools
  • InScribe Palaeography

Not to mention over 300 podcasts!

Two thoughts:

First, a good way to learn about the tools and expectations that historians have of their digital tools. That should help you prepare an answer to: “What do topic maps have to offer over X technology?”

Second, I rather like the site and its module orientation. A possible template for topic map training online?

Oracle’s MySQL 5.6 released

Wednesday, February 6th, 2013

Oracle’s MySQL 5.6 released

From the post:

Just over two years after the release of MySQL 5.5, the developers at Oracle have released a GA (General Availability) version of Oracle MySQL 5.6, labelled MySQL 5.6.10. In MySQL 5.5, the developers replaced the old MyISAM backend and used the transactional InnoDB as the default for database tables. With 5.6, the retrofitting of full-text search capabilities has enabled InnoDB to now take on the position of default storage engine for all purposes.

Accelerating the performance of sub-queries was also a focus of development; they are now run using a process of semi-joins and materialise much faster; this means it should not be necessary to replace subqueries with joins. Many operations that change the data structures, such as ALTER TABLE, are now performed online, which avoids long downtimes. EXPLAIN also gives information about the execution plans of UPDATE, DELETE and INSERT commands. Other optimisations of queries include changes which can eliminate table scans where the query has a small LIMIT value.

MySQL’s row-oriented replication now supports “row image control” which only logs the columns needed to identify and make changes on each row rather than all the columns in the changing row. This could be particularly expensive if the row contained BLOBs, so this change not only saves disk space and other resources but it can also increase performance. “Index Condition Pushdown” is a new optimisation which, when resolving a query, attempts to use indexed fields in the query first, before applying the rest of the WHERE condition.
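The “Index Condition Pushdown” idea is easy to illustrate with a toy Python model (invented data, not MySQL internals): predicates on indexed columns are evaluated while scanning the index, so the expensive fetch of the full row happens only for entries that survive.

```python
# Toy model of Index Condition Pushdown (ICP); NOT MySQL internals.
# Index entries carry the indexed column's value, so a predicate on
# that column can be checked before the (costly) full-row fetch.

rows = {  # primary key -> full row
    1: {"id": 1, "age": 17, "bio": "Short bio."},
    2: {"id": 2, "age": 34, "bio": "Longer bio."},
    3: {"id": 3, "age": 42, "bio": "Another bio."},
}
age_index = [(17, 1), (34, 2), (42, 3)]  # (age, primary key), sorted

fetches = 0
def fetch_row(pk):
    """Simulated full-row fetch; counts how often it happens."""
    global fetches
    fetches += 1
    return rows[pk]

# WHERE age > 30 AND bio LIKE '%bio%': push the age test into the index scan
result = []
for age, pk in age_index:
    if age > 30:                  # evaluated on the index entry alone
        row = fetch_row(pk)       # full row fetched only for survivors
        if "bio" in row["bio"]:   # rest of the WHERE applied to the row
            result.append(pk)

print(result, fetches)  # [2, 3] 2  -- row 1 was never fetched
```

Without pushdown, all three rows would be fetched and the `age` test applied afterwards; with it, the row for `age = 17` is never touched.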

MySQL 5.6 also introduces a “NoSQL interface” which uses the memcached API to offer applications direct access to the InnoDB storage engine while maintaining compatibility with the relational database engine. That underlying InnoDB engine has also been enhanced with persistent optimisation statistics, multithreaded purging and more system tables and monitoring data available.

Download MySQL 5.6.

I mentioned Oracle earlier today (When Oracle bought MySQL [Humor]) so it’s only fair that I point out their most recent release of MySQL.

SQL, NoSQL =? CoSQL? Category Theory to the Rescue

Wednesday, January 30th, 2013

A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman.

I missed this when it appeared in March of 2011.

From the conclusion:

The nascent noSQL market is extremely fragmented, with many competing vendors and technologies. Programming, deploying, and managing noSQL solutions requires specialized and low-level knowledge that does not easily carry over from one vendor’s product to another.

A necessary condition for the network effect to take off in the noSQL database market is the availability of a common abstract mathematical data model and an associated query language for noSQL that removes product differentiation at the logical level and instead shifts competition to the physical and operational level. The availability of such a common mathematical underpinning of all major noSQL databases can provide enough critical mass to convince businesses, developers, educational institutions, etc. to invest in noSQL.

In this article we developed a mathematical data model for the most common form of noSQL—namely, key-value stores as the mathematical dual of SQL’s foreign-/primary-key stores. Because of this deep and beautiful connection, we propose changing the name of noSQL to coSQL. Moreover, we show that monads and monad comprehensions (i.e., LINQ) provide a common query mechanism for both SQL and coSQL and that many of the strengths and weaknesses of SQL and coSQL naturally follow from the mathematics.
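The paper’s claim that comprehensions give one query mechanism for both worlds can be seen in miniature in Python, where the same comprehension syntax queries a foreign-key table layout (SQL-style) and its dual, a key-value layout where parents contain their children (coSQL-style). This is only an analogy to the authors’ LINQ/monad-comprehension formalism, with invented data:

```python
# Miniature analogy to the paper's claim: one comprehension syntax
# queries both a foreign-key world and its key-value dual.
# (Python comprehensions stand in for LINQ / monad comprehensions.)

# SQL-style: child rows point AT the parent via a foreign key
products = [{"id": 1, "name": "book"}, {"id": 2, "name": "pen"}]
ratings = [{"product_id": 1, "stars": 5}, {"product_id": 1, "stars": 3},
           {"product_id": 2, "stars": 4}]
sql_style = [(p["name"], r["stars"])
             for p in products
             for r in ratings
             if r["product_id"] == p["id"]]

# coSQL/key-value style: the parent CONTAINS its children (the dual)
kv_store = {
    1: {"name": "book", "ratings": [5, 3]},
    2: {"name": "pen", "ratings": [4]},
}
cosql_style = [(doc["name"], stars)
               for doc in kv_store.values()
               for stars in doc["ratings"]]

print(sorted(sql_style) == sorted(cosql_style))  # True: same answers
```

The same question (“which stars go with which product name?”) is answered by the same comprehension shape; only the direction of the pointers differs, which is exactly the duality the paper builds on.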

The ACM Digital Library reports only 3 citations, which is unfortunate for such an interesting proposal.

I have heard about key/value pairs somewhere else. I will have to think about that and get back to you. (Hint for the uninitiated, try the Topic Maps Reference Model (TMRM). A new draft of the TMRM is due to appear in a week or so.)

A Formalism for Graph Databases and its Model of Computation

Tuesday, January 29th, 2013

A Formalism for Graph Databases and its Model of Computation by Tony Tan and Juan Reutter.


Graph databases are directed graphs in which the edges are labeled with symbols from a finite alphabet. In this paper we introduce a logic for such graphs in which the domain is the set of edges. We compare its expressiveness with the standard logic in which the domain is the set of vertices. Furthermore, we introduce a robust model of computation for such logic, the so called graph pebble automata.

The abstract doesn’t really do justice to the importance of this paper for graph analysis. From the paper:

For querying graph structured data, one normally wishes to specify certain types of paths between nodes. Most common examples of these queries are conjunctive regular path queries [1, 14, 6, 3]. Those querying formalisms have been thoroughly studied, and their algorithmic properties are more or less understood. On the other hand, there has been much less work devoted to formalisms other than graph reachability patterns, say, for example, the integrity constraints such as labels with unique names, typing constraints on nodes, functional dependencies, domain and range of properties. See, for instance, the survey [2] for more examples of integrity constraints.
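To make “regular path queries” over an edge-labeled graph concrete, here is a small Python sketch with invented data: the graph is just a set of (source, label, target) triples, and a query follows a sequence of labels from a start node.

```python
# Sketch: a graph database as an edge-labeled directed graph,
# queried along a sequence of labels (a simple path query).
# Data is invented for illustration.

edges = [  # (source, label, target)
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "worksAt", "acme"),
]

def follow(nodes, label):
    """All targets reachable from `nodes` over one edge with `label`."""
    return {t for (s, l, t) in edges if s in nodes and l == label}

def path_query(start, labels):
    """Nodes reachable from `start` along the given label sequence."""
    nodes = {start}
    for label in labels:
        nodes = follow(nodes, label)
    return nodes

# Where do friends-of-friends of Alice work?
print(path_query("alice", ["knows", "knows", "worksAt"]))  # {'acme'}
```

A full regular path query allows regular expressions over labels (e.g. `knows*`, meaning any number of `knows` edges), which adds a fixpoint to the loop above; the paper’s pebble automata are a machine model for evaluating logics over exactly this kind of edge-labeled structure.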

The survey referenced in that quote is: Renzo Angles and Claudio Gutierrez. 2008. Survey of graph database models. ACM Comput. Surv. 40, 1, Article 1 (February 2008), 39 pages. DOI=10.1145/1322432.1322433

The abstract for the survey reads:

Graph database models can be defined as those in which data structures for the schema and instances are modeled as graphs or generalizations of them, and data manipulation is expressed by graph-oriented operations and type constructors. These models took off in the eighties and early nineties alongside object-oriented models. Their influence gradually died out with the emergence of other database models, in particular geographical, spatial, semistructured, and XML. Recently, the need to manage information with graph-like nature has reestablished the relevance of this area. The main objective of this survey is to present the work that has been conducted in the area of graph database modeling, concentrating on data structures, query languages, and integrity constraints.

Recommended if you want to build upon what is already known and well-established about graph databases.

Schemaless Data Structures

Saturday, January 12th, 2013

Schemaless Data Structures by Martin Fowler.

From the first slide:

In recent years, there’s been an increasing amount of talk about the advantages of schemaless data. Being schemaless is one of the main reasons for interest in NoSQL databases. But there are many subtleties involved in schemalessness, both with respect to databases and in-memory data structures. These subtleties are present both in the meaning of schemaless and in the advantages and disadvantages of using a schemaless approach.

Martin points out that “schemaless” does not mean the lack of a schema but rather the lack of an explicit schema.
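The point is easy to demonstrate: “schemaless” code still assumes a schema, it just lives in the access patterns of the code rather than in the database. A small illustrative example:

```python
# "Schemaless" data still has an implicit schema: the code below
# silently assumes every record has "name" and "email" fields.

records = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Bob"},  # implicit schema violated: no "email"
]

def emails(recs):
    # The access pattern IS the schema -- just not an explicit one,
    # so violations are silently skipped rather than reported.
    return [r["email"] for r in recs if "email" in r]

# Making the schema explicit turns silent assumptions into checkable ones.
REQUIRED = {"name", "email"}

def violations(recs):
    """Indexes of records that break the now-explicit schema."""
    return [i for i, r in enumerate(recs) if not REQUIRED <= r.keys()]

print(emails(records))      # ['ann@example.com']
print(violations(records))  # [1]
```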

Sounds a great deal like the implicit subjects that topic maps have the ability to make explicit.

Is there a continuum of explicitness for any given subject/schema?

Starting from entirely implied, followed by an explicit representation, then further explication as in a data dictionary, and at some distance from the start, a subject defined as a set of properties, which are themselves defined as sets of properties, in relationships with other sets of properties.

How far you go down that road depends on your requirements.


Wednesday, December 12th, 2012


From the webpage:

A universal open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient sql-like query language or JavaScript/Ruby extensions.

Design considerations:

In a nutshell:

  • Schema-free schemas with shapes: Inherent structures at hand are automatically recognized and subsequently optimized.
  • Querying: ArangoDB is able to accomplish complex operations on the provided data (query-by-example and query-language).
  • Application Server: ArangoDB is able to act as application server on Javascript-devised routines.
  • Mostly memory/durability: ArangoDB is memory-based including frequent file system synchronizing.
  • AppendOnly/MVCC: Updates generate new versions of a document; automatic garbage collection.
  • ArangoDB is multi-threaded.
  • No indices on file: Only raw data is written on hard disk.
  • ArangoDB supports single nodes and small, homogenous clusters with zero administration.
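The query-by-example style mentioned in that list can be sketched independently of ArangoDB’s actual API. In this generic illustration (not ArangoDB code), a document matches when every field of the example document matches it:

```python
# Generic query-by-example sketch; this is NOT ArangoDB's API.
# A document matches when every (field, value) pair in the example
# is present in the document.

docs = [
    {"name": "anna", "city": "paris", "age": 30},
    {"name": "ben", "city": "paris", "age": 25},
    {"name": "chloe", "city": "rome", "age": 30},
]

def by_example(collection, example):
    return [d for d in collection
            if all(d.get(k) == v for k, v in example.items())]

print(by_example(docs, {"city": "paris"}))             # anna and ben
print(by_example(docs, {"city": "paris", "age": 30}))  # just anna
```

The appeal is that the query is itself a document in the same shape as the data, so no separate query language is needed for simple lookups; a real engine would back this with indexes rather than a linear scan.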

I have mentioned this before but ran across it again at: An experiment with Vagrant and Neo4J by Patrick Mulder.

Introduction to Databases [MOOC, Stanford, January 2013]

Thursday, December 6th, 2012

Introduction to Databases (info/registration link) – Starts January 15, 2013.

From the webpage:

About the Course

“Introduction to Databases” had a very successful public offering in fall 2011, as one of Stanford’s inaugural three massive open online courses. Since then, the course materials have been improved and expanded, and we’re excited to be launching a second public offering of the course in winter 2013. The course includes video lectures and demos with in-video quizzes to check understanding, in-depth standalone quizzes, a wide variety of automatically-checked interactive programming exercises, midterm and final exams, a discussion forum, optional additional exercises with solutions, and pointers to readings and resources. Taught by Professor Jennifer Widom, the curriculum draws from Stanford’s popular Introduction to Databases course.

Why Learn About Databases?

Databases are incredibly prevalent — they underlie technology used by most people every day if not every hour. Databases reside behind a huge fraction of websites; they’re a crucial component of telecommunications systems, banking systems, video games, and just about any other software system or electronic device that maintains some amount of persistent information. In addition to persistence, database systems provide a number of other properties that make them exceptionally useful and convenient: reliability, efficiency, scalability, concurrency control, data abstractions, and high-level query languages. Databases are so ubiquitous and important that computer science graduates frequently cite their database class as the one most useful to them in their industry or graduate-school careers.

Course Syllabus

This course covers database design and the use of database management systems for applications. It includes extensive coverage of the relational model, relational algebra, and SQL. It also covers XML data including DTDs and XML Schema for validation, and the query and transformation languages XPath, XQuery, and XSLT. The course includes database design in UML, and relational design principles based on dependencies and normal forms. Many additional key database topics from the design and application-building perspective are also covered: indexes, views, transactions, authorization, integrity constraints, triggers, on-line analytical processing (OLAP), JSON, and emerging NoSQL systems. Working through the entire course provides comprehensive coverage of the field, but most of the topics are also well-suited for “a la carte” learning.


Jennifer Widom is the Fletcher Jones Professor and Chair of the Computer Science Department at Stanford University. She received her Bachelors degree from the Indiana University School of Music in 1982 and her Computer Science Ph.D. from Cornell University in 1987. She was a Research Staff Member at the IBM Almaden Research Center before joining the Stanford faculty in 1993. Her research interests span many aspects of nontraditional data management. She is an ACM Fellow and a member of the National Academy of Engineering and the American Academy of Arts & Sciences; she received the ACM SIGMOD Edgar F. Codd Innovations Award in 2007 and was a Guggenheim Fellow in 2000; she has served on a variety of program committees, advisory boards, and editorial boards.

Another reason to take the course:

The structure and capabilities of databases shape the way we create solutions.

Consider normalization. An investment of time and effort that may be needed, for some problems, but not others.

Absent alternative approaches, you see every data problem as requiring normalization.

(You may anyway after taking this course. Education cannot impart imagination.)

Towards a Scalable Dynamic Spatial Database System [Watching Watchers]

Tuesday, November 20th, 2012

Towards a Scalable Dynamic Spatial Database System by Joaquín Keller, Raluca Diaconu, Mathieu Valero.


With the rise of GPS-enabled smartphones and other similar mobile devices, massive amounts of location data are available. However, no scalable solutions for soft real-time spatial queries on large sets of moving objects have yet emerged. In this paper we explore and measure the limits of actual algorithms and implementations regarding different application scenarios. And finally we propose a novel distributed architecture to solve the scalability issues.

At least in this version, you will find two copies of the same paper, the second copy sans the footnotes. So read the first twenty (20) pages and ignore the second eighteen (18) pages.

I thought the limitation of location to two dimensions understandable, for the use cases given, but am less convinced that treating a third dimension as an extra attribute is always going to be suitable.

Still, the results here are impressive as compared to current solutions so an additional dimension can be a future improvement.
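The core trick behind scalable spatial queries on moving objects, partitioning space into cells so a range query only scans nearby cells, can be sketched in a few lines. This is a simplified single-node illustration with invented data, not the paper’s distributed architecture:

```python
# Simplified 2-D grid index for moving objects; a single-node sketch
# of the idea distributed spatial systems build on, not the paper's design.
from collections import defaultdict

CELL = 10.0  # cell size in coordinate units

def cell_of(x, y):
    return (int(x // CELL), int(y // CELL))

grid = defaultdict(set)  # cell -> object ids
positions = {}           # object id -> (x, y)

def move(obj, x, y):
    """Update an object's position, re-bucketing it if it changed cell."""
    if obj in positions:
        grid[cell_of(*positions[obj])].discard(obj)
    positions[obj] = (x, y)
    grid[cell_of(x, y)].add(obj)

def nearby(x, y, radius):
    """Objects within `radius`, scanning only cells the circle can overlap."""
    r = int(radius // CELL) + 1
    cx, cy = cell_of(x, y)
    hits = set()
    for i in range(cx - r, cx + r + 1):
        for j in range(cy - r, cy + r + 1):
            for obj in grid[(i, j)]:
                ox, oy = positions[obj]
                if (ox - x) ** 2 + (oy - y) ** 2 <= radius ** 2:
                    hits.add(obj)
    return hits

move("taxi1", 5, 5)
move("taxi2", 8, 9)
move("taxi3", 95, 95)
print(nearby(6, 6, 5))  # {'taxi1', 'taxi2'}
```

Distributing this amounts to assigning cells to servers; the hard parts the paper tackles are load imbalance (hot cells) and objects moving across cell, and hence server, boundaries.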

The use case that I see missing is an ad hoc network of users feeding geo-based information back to a collection point.

While the watchers are certainly watching us, technology may be on the cusp of answering the question: “Who watches the watchers?” (The answer may be us.)

I first saw this in a tweet by Stefano Bertolo.