Archive for the ‘NoSQL’ Category

Making Sense Out of Datomic,…

Friday, June 28th, 2013

Making Sense Out of Datomic, The Revolutionary Non-NoSQL Database by Jakub Holy.

From the post:

I have finally managed to understand one of the most unusual databases of today, Datomic, and would like to share it with you. Thanks to Stuart Halloway and his workshop!

Why? Why?!?

As we shall see shortly, Datomic is very different from traditional RDBMSs as well as the various NoSQL databases. It isn’t even a database – it is a database on top of a database. I couldn’t wrap my head around that until now. The key to understanding Datomic and its unique design and advantages is actually simple.

The mainstream databases (and languages) have been designed around the following constraints of the 1970s:

  • memory is expensive
  • storage is expensive
  • it is necessary to use dedicated, expensive machines

Datomic is essentially an exploration of what database we would have designed if we didn’t have these constraints. What design would we choose, having gigabytes of RAM, networks with bandwidth and speed matching and exceeding hard-disk access, and the ability to spin up and kill servers at a whim?

But Datomic isn’t an academic project. It is pragmatic; it wants to fit into our existing environments and make it easy for us to start using its futuristic capabilities now. And it is not as fresh and green as it might seem. Rich Hickey, the mastermind behind Clojure and Datomic, has reportedly thought about both of these projects for years, and the designs have been really well thought through.


Deeply interesting summary of Datomic.

The only point I would have added about traditional databases is the requirement for normalized data, which places the load on designers and users instead of the software.
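The central idea is easier to see in miniature. Here is a toy Python sketch – entirely hypothetical, not Datomic’s actual API – of a database as an accumulation of immutable facts, where updates add new facts rather than overwrite old ones, so past states remain queryable:

```python
from collections import namedtuple

# A fact: entity, attribute, value, transaction id, added/retracted flag.
Datom = namedtuple("Datom", "e a v tx added")

class FactStore:
    """Toy append-only fact log with as-of queries (illustration only)."""
    def __init__(self):
        self.log = []
        self.tx = 0

    def transact(self, facts):
        """Append facts as a new transaction; nothing is ever overwritten."""
        self.tx += 1
        for e, a, v, added in facts:
            self.log.append(Datom(e, a, v, self.tx, added))
        return self.tx

    def value(self, e, a, as_of=None):
        """Value of attribute a on entity e, optionally as of a past tx."""
        current = None
        for d in self.log:
            if as_of is not None and d.tx > as_of:
                break
            if d.e == e and d.a == a:
                current = d.v if d.added else None
        return current

db = FactStore()
t1 = db.transact([("user-1", "email", "old@example.com", True)])
t2 = db.transact([("user-1", "email", "old@example.com", False),
                  ("user-1", "email", "new@example.com", True)])
db.value("user-1", "email")            # latest value
db.value("user-1", "email", as_of=t1)  # history is never lost
```

A real system layers indexes, a transactor, and pluggable storage on top of this log, but the “database on top of a database” phrase falls out of exactly this shape.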

Cassandra project chair: We’re taking on Oracle (Cassandra 2.0)

Sunday, June 16th, 2013

Cassandra project chair: We’re taking on Oracle by Paul Krill.

From the post:

Apache Cassandra is an open source, NoSQL database accommodating large-scale workloads and attracting a lot of attention, having been deployed in such organizations as Netflix, eBay, and Twitter. It was developed at Facebook, which open-sourced it in 2008, and its database can be deployed across multiple data centers and in cloud environments.

Jonathan Ellis is the chair of the project at Apache, and he serves as chief technical officer at DataStax, which has built a business around Cassandra. InfoWorld Editor-at-Large Paul Krill spoke with Ellis at the company’s recent Cassandra Summit 2013 conference in San Francisco, where Ellis discussed efforts to make the database easier to use and how it has become a viable competitor to Oracle’s relational database technology.

InfoWorld: What is the biggest value-add for Cassandra?

Ellis: It’s driving the Web applications. We’re the ones who power Netflix, Spotify. Cassandra is actually powering the applications directly. It lets you scale to millions of operations per second and software-as-a-service, machine-generated data, Web applications. Those are all really hot spots for Cassandra.

Cassandra 2.0 is targeted for the end of July, 2013. Lightweight transactions and triggers are on the menu.

NoSQL Matters 2013 (Videos/Slides)

Friday, June 14th, 2013

NoSQL Matters 2013 (Video/Slides)

A great set of videos, though in no particular order in the original listing. I have ordered them by the authors’ last names for quick scanning.

Unless otherwise noted, the titles link to videos at Vimeo. Abstracts follow each title when available.


Pavlo Baron – 100% Big Data, 0% Hadoop, 0% Java

“If your data is big enough, Hadoop it!” That’s simply not true – there is much more behind this term than just a tool. In this talk I will show one possible, practically working approach and the corresponding selection of tools that help collect, mine, move around, store and provision large amounts of unstructured data. Completely without Hadoop. And even completely without Java.

Pere Urbón Bayes – From Tables to Graph. Recommendation Systems, a Graph Database Use Case (No video, slides)

Recommendation engines have changed a lot during the last years, and the latest big change is NoSQL, especially Graph Databases. With this presentation we intend to show how to build a Graph Processing technology, based on our experience doing that for environments like Digital Libraries and Movies and Digital Media. First, we will introduce the state of the art in context-aware Recommendation Engines, with special interest in how people are using Graph Processing (NoSQL) systems to scale this kind of solution. After an introduction to the ecosystem, the next step is to have something to work with, so we will show the audience how to build a Recommendation Engine in a few steps.

The demonstration part will be made using the following technology stack: Sinatra as a simple web framework, Ruby as the programming language, and OrientDB, Neo4j, Redis, etc. as the NoSQL technology stack. The result of our demonstration will be a simple engine, accessible through a REST API, to play with and extend, so that attendees can learn by doing.

In the end our audience will have a full introduction to the field of Recommendation Engines, with special interest in Graph Processing (NoSQL) systems. Based on our experience building this technology for large-scale architectures, we think the best way to learn it is by doing it and having an example to play with.

Nic Caine – Leveraging InfiniteGraph for Big Data

Practical insight! Graph databases can help research and development institutions in healthcare and the life sciences manage Big Data sets, giving researchers entry to a new level of healthcare and pharmaceutical data analytics.

This talk explains the challenges developers face in detecting relationships and cross-linked interactions within data analysis. Graph database technology can answer questions that nobody has asked before!

William Candillon – JSONiq – The SQL of NoSQL

SQL has been good to the relational world, but what about query languages in the NoSQL space?

We introduce JSONiq: the SQL of NoSQL.

Like SQL, JSONiq enables developers to leverage the same productive high-level language across a variety of products.

But this is not your grandma’s SQL; it supports novel concepts purposely designed for flexible data.

Moreover, JSONiq is a highly optimizable language to query and update NoSQL stores.

We show how JSONiq can be used on top of products such as MongoDB, Couchbase, and DynamoDB.

Aparna Chaudhary – Look Ma! No more Blobs

GridFS is a storage mechanism for persisting large objects in MongoDB. The talk will cover a use case of content management using MongoDB. During the talk I would explain why we chose MongoDB over traditional relational database to store XML files. The talk would be accompanied by a live demo using Spring Data & GridFS.
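For readers unfamiliar with GridFS: it works around the per-document size limit by splitting each file into fixed-size chunk documents keyed by file id and sequence number, alongside a metadata document. The sketch below is a pure-Python illustration of that layout, not the driver API; the 255 kB default chunk size matches recent MongoDB drivers, but treat the numbers as illustrative:

```python
CHUNK_SIZE = 255 * 1024  # default chunk size in recent MongoDB drivers

def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Mimic GridFS's storage layout: one 'files' doc plus N 'chunks' docs."""
    chunks = [
        {"files_id": file_id, "n": i, "data": data[off:off + chunk_size]}
        for i, off in enumerate(range(0, len(data), chunk_size))
    ]
    file_doc = {"_id": file_id, "length": len(data), "chunkSize": chunk_size}
    return file_doc, chunks

def reassemble(chunks):
    """Read side: concatenate chunks ordered by their sequence number."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))

# A 600 kB XML payload splits into three chunks (255 + 255 + 90 kB).
doc, chunks = split_into_chunks("xml-1", b"x" * (600 * 1024))
```

In the real thing, the driver (e.g. via Spring Data’s GridFS support) performs this splitting for you and stores the pieces in the `fs.files` and `fs.chunks` collections.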

Sean Cribbs – Data Structures in Riak

Since the beginning, Riak has supported high write-availability using Dynamo-style multi-valued keys – also known as conflicts or siblings. The tradeoff for this type of availability is that the application must include logic to resolve conflicting updates. While it is convenient to say that the application can reason best about conflicts, ad hoc resolution is error-prone and can result in surprising anomalies, like the reappearing item problem in Dynamo’s shopping cart.

What is needed is a more formal and general approach to the problem of conflict resolution for complex data structures. Luckily, there are some formal strategies in recent literature, including Conflict-Free Replicated Data Types (CRDTs) and BloomL lattices. We’ll review these strategies and cover some recent work we’ve done toward adding automatically-convergent data structures to Riak.
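To give a flavor of what “automatically convergent” means, here is the textbook grow-only counter (G-Counter) CRDT in Python – the standard construction from the literature, not Riak’s actual implementation. Each replica increments only its own slot, and merge takes the element-wise maximum, so replicas converge no matter the order or number of merges:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge = element-wise max."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        # A replica only ever advances its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # which is exactly what makes convergence automatic.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    @property
    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b); b.merge(a)   # merging in any order converges to the same value
```

Richer structures (sets, maps, the BloomL lattices mentioned above) follow the same recipe: choose per-element merge rules with these three algebraic properties.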

David Czarnecki – Real-World Redis

Redis is a data structure server, yet all too often it is used for simple data caching. This is not because its internal data structures are not powerful but, I believe, because they require libraries which wrap the functionality into something meaningful for modeling a particular problem or domain. In this talk, we will cover 3 important use cases for Redis data structures that are drawn from real-world experience and production applications handling millions of users and GBs of data:

Leaderboards – also known as scoreboards or high score tables – used in video games or in gaming competition sites

Relationships (e.g. friendships) – used in “social” sites

Activity feeds – also known as timelines in “social” sites

The talk will cover these use cases in detail and the development of libraries around each separate use case. Particular attention for each service will be devoted to service failover, scaling and performance issues.
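The leaderboard case maps directly onto Redis sorted sets (the real commands are ZADD, ZREVRANGE, and ZREVRANK). Since no Redis server is needed to see the idea, here is a small pure-Python stand-in for the sorted-set operations involved, with the corresponding Redis commands noted in comments:

```python
class Leaderboard:
    """Pure-Python stand-in for a Redis sorted set used as a leaderboard."""
    def __init__(self):
        self.scores = {}

    def report_score(self, member, score):
        self.scores[member] = score          # Redis: ZADD board score member

    def top(self, n):
        """Best n (member, score) pairs, highest score first."""
        ranked = sorted(self.scores.items(), key=lambda kv: -kv[1])
        return ranked[:n]                    # Redis: ZREVRANGE board 0 n-1 WITHSCORES

    def rank(self, member):
        """1-based rank of a member (Redis ZREVRANK is 0-based)."""
        ranked = sorted(self.scores, key=lambda m: -self.scores[m])
        return ranked.index(member) + 1

board = Leaderboard()
for player, score in [("ada", 900), ("bob", 1200), ("cyd", 700)]:
    board.report_score(player, score)
```

The point of the sorted set is that Redis maintains the ordering incrementally, so top-N and rank queries are cheap even with millions of members – exactly what a wrapper library like the ones Czarnecki describes packages up.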

Lucas Dohmen – Rapid API Development with Foxx

Foxx is a feature of the upcoming version of the free and open-source NoSQL database ArangoDB. It allows you to build APIs directly on top of the database and thereby skip the middleman (Rails, Django, or whatever your favorite web framework is). This can, for example, be used to build a backend for single-page web applications. It is designed with simplicity and the specific use case of modern client-side MVC frameworks in mind, featuring tools like an asset delivery system.

Stefan Edlich – NoSQL in 5+ years

Currently it’s getting harder and harder to keep track of all the movements in the SQL, NoSQL and NewSQL world. Furthermore, even polyglot persistence can have many meanings and peculiarities. This talk shows possible directions in which NoSQL might move in this decade. We will discuss db-model, integration and storage aspects together with some of the hottest systems in the market that lead the way today. We conclude with some survival strategies for us as users / companies in this messy world.

Benjamin Engber – How to Compare NoSQL Databases: Determining True Performance and Recoverability Metrics For Real-World Use Cases

One of the major challenges in choosing an appropriate NoSQL solution is finding reliable information as to how a particular database performs for a given use case. Stories of high-profile systems failures abound, and are tossed around with widely varying benchmark numbers that seem to have no bearing on how tuned systems behave outside the lab. Many of the profiling tools and studies out there use deeply flawed tools or methodologies. Getting meaningful data out of published benchmark studies is difficult, and running internal benchmarks even more so.

In this presentation we discuss how to perform tuned benchmarking across a number of NoSQL solutions (Couchbase, Aerospike, MongoDB, Cassandra, HBase, others) and to do so in a way that does not artificially distort the data in favor of a particular database or storage paradigm. This includes hardware and software configurations, as well as ways of measuring to ensure repeatable results.

We also discuss how to extend benchmarking tests to simulate different kinds of failure scenarios to help evaluate the maintainability and recoverability of different systems. This requires carefully constructed tests and significant knowledge of the underlying databases — the talk will help evaluators overcome the common pitfalls and time sinks involved in trying to measure this.

Lastly we discuss the YCSB benchmarking tool, its significant limitations, and the significant extensions and supplementary tools Thumbtack has created to provide distributed load generation and failure simulation.

Ralf S. Engelschall – Polyglot Persistence: Boon and Bane for Software Architects

For two decades, RDBMSs have been the established standard for every type of data storage in business information systems. In the last few years, the NoSQL movement brought us a myriad of interesting alternative data storage approaches. Now Polyglot Persistence tells us to leverage the combination of multiple data storage approaches in a “best tool for the job” manner, including the combination of RDBMS and NoSQL.

This presentation addresses the following questions: What does Polyglot Persistence look like in practice when implementing a full-size business information system with it? What interesting use cases toward the persistence layer exist here? What technical challenges are we faced with here? What does the persistence layer architecture look like when using multiple storage backends? What are the alternative technical solutions for implementing Polyglot Persistence?

Martin Fowler – NoSQL Distilled to an hour

NoSQL databases offer a significant change to how enterprise applications are built, challenging the two-decade hegemony of relational databases. The question people face is whether NoSQL databases are an appropriate choice, either for new projects or to introduce to existing projects. I’ll give a rapid introduction to NoSQL databases: where they came from, the nature of the data models they use, and the different way you have to think about consistency. From this I’ll outline in what kinds of circumstances you should consider using them, why they will not make relational databases obsolete, and the important consequence of polyglot persistence.

Uwe Friedrichsen – How to survive in a BASE world

NoSQL, Big Data and Scale-out in general are leaving the hype plateau and starting to become enterprise reality. This usually means no more ACID transactions, but BASE transactions instead. When confronted with BASE, many developers just shrug and think “Okay, no more SQL, but that’s basically it, isn’t it?” They are terribly wrong!

BASE transactions no longer guarantee data consistency at all times, a property we became so used to in the ACID years that we barely think about it anymore. But if we continue to design and implement our applications as if there were still ACID transactions, system crashes and corrupt data will become our daily companions.

This session gives a quick introduction into the challenges of BASE transactions and explains how to design and implement a BASE-aware application using real code examples. Additionally we extract some concrete patterns in order to preserve the ideas in a concise way. Let’s get ready to survive in a BASE world!
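One recurring BASE-world pattern (my sketch, not code from the talk) is making updates idempotent, so that an operation retried after a timeout or crash cannot be applied twice. A minimal illustration, with a hypothetical account and operation ids:

```python
class Account:
    """Credit operations carry an operation id so retries are harmless."""
    def __init__(self):
        self.balance = 0
        self.applied = set()   # ids of operations already applied

    def credit(self, op_id, amount):
        if op_id in self.applied:
            return self.balance      # duplicate delivery: ignore silently
        self.applied.add(op_id)
        self.balance += amount
        return self.balance

acct = Account()
acct.credit("op-1", 100)
acct.credit("op-1", 100)   # the retry after a timeout is applied only once
```

Without the ACID safety net, “at-least-once plus idempotent” is usually far easier to get right than trying to guarantee exactly-once delivery.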

Lars George – HBase Schema Design

HBase is the Hadoop database, a random-access store that can easily scale to petabytes of data. It employs common logical concepts, such as rows and tables. But the lack of transactions and the simple CRUD API, combined with a nearly schema-less data layout, require a deeper understanding of its inner workings and how these affect performance. The talk will discuss the architecture behind HBase and lead into practical advice on how to structure data inside HBase to gain the best possible performance.

Lars George – Introduction to Hadoop

Apache Hadoop is the most popular solution for today’s big data problems. By connecting multiple servers, Hadoop provides a redundant and distributed platform to store and process large amounts of data. This presentation will introduce the architecture of Hadoop and the various interfaces to import and export data into it. Finally, a range of tools will be presented for accessing the data, including a NoSQL layer called HBase, a scripting-language layer called Pig, and also a good old SQL approach through Hive.

Kris Geusebroek – Creating Neo4j Graph Databases with Hadoop

When exploring very large raw datasets containing massive interconnected networks, it is sometimes helpful to extract your data, or a subset thereof, into a graph database like Neo4j. This allows you to easily explore and visualize networked data to discover meaningful patterns. When your graph has 100M+ nodes and 1000M+ edges, using the regular Neo4j import tools will make the import very time-intensive (as in many hours to days).

In this talk, I’ll show you how we used Hadoop to scale the creation of very large Neo4j databases by distributing the load across a cluster and how we solved problems like creating sequential row ids and position-dependent records using a distributed framework like Hadoop.
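The sequential-id problem is commonly solved with a two-pass trick: count records per partition, turn the counts into per-partition starting offsets via a prefix sum, then have each partition number its records from its offset. A plain-Python sketch of that idea outside Hadoop (the details of the talk’s actual implementation may differ):

```python
from itertools import accumulate

def assign_ids(partitions):
    """Give globally sequential ids to records spread across partitions.

    Pass 1: count records per partition.
    Pass 2: each partition numbers its records from its starting offset,
            with no coordination needed between partitions.
    """
    counts = [len(p) for p in partitions]
    offsets = [0] + list(accumulate(counts))[:-1]   # prefix sums
    return [
        [(offsets[i] + pos, record) for pos, record in enumerate(part)]
        for i, part in enumerate(partitions)
    ]

parts = [["a", "b"], ["c"], ["d", "e", "f"]]
numbered = assign_ids(parts)
# partition offsets are 0, 2 and 3, so ids run 0..5 across all partitions
```

The same shape works as two MapReduce jobs: a cheap counting job, then a numbering job whose mappers receive their partition’s offset.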

Tugdual Grall – Introduction to Map Reduce with Couchbase 2.0

MapReduce allows systems to delegate query processing to different machines in parallel. Couchbase 2.0 allows developers to use MapReduce to query JSON-based data over a large volume of data (and servers). Couchbase 2.0 features incremental map-reduce, which provides powerful aggregates and summaries, even with large datasets, for distributed real-time analytics use cases. In this session you will learn:
– What is MapReduce?
– How does incremental map-reduce work in Couchbase Server 2.0?
– How can you utilize incremental map-reduce for real-time analytics?
– What common use cases does this feature address?
The session includes a demonstration of how to create a new MapReduce function and use it in your application.
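The “incremental” part hinges on one property: the reduce function must also accept its own earlier outputs (“re-reduce”), so the engine can fold in contributions from newly changed documents instead of rescanning everything. A generic Python sketch of a count-style view illustrating the idea (Couchbase’s actual views are written in JavaScript):

```python
def map_doc(doc):
    """Emit one (key, value) pair per document, like a view's map function."""
    yield doc["category"], 1

def reduce_values(values, rereduce=False):
    """A reduce that works on raw values AND on earlier reduce outputs."""
    return sum(values)

docs = [{"category": "beer"}, {"category": "beer"}, {"category": "wine"}]

# Initial build: map every document, reduce per key.
partial = {}
for doc in docs:
    for key, value in map_doc(doc):
        partial[key] = reduce_values([partial.get(key, 0), value], rereduce=True)

# A new document arrives: only its contribution is folded into the
# previously reduced result, not the whole dataset.
for key, value in map_doc({"category": "beer"}):
    partial[key] = reduce_values([partial[key], value], rereduce=True)
```

Counts, sums, and other associative aggregates fit this mold directly, which is why they stay fast even as the dataset grows.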

Alex Hall – Processing a Trillion Cells per Mouse Click


Column-oriented database systems have been a real game changer for the industry in recent years. Highly tuned, performant systems have evolved that give users the possibility of answering ad hoc queries over large datasets in an interactive manner. In this paper we present the column-oriented datastore developed as one of the central components of PowerDrill (an internal Google data-analysis project).

It combines the advantages of a columnar data layout with other known techniques (such as composite range partitions) and extensive algorithmic engineering on key data structures. The main goal of the latter is to reduce the main-memory footprint and to increase efficiency in processing typical user queries. In combination we achieve large speed-ups. These enable a highly interactive Web UI where it is common that a single mouse click leads to processing a trillion values in the underlying dataset.

Randall Hauch – Elastic, consistent and hierarchical data storage with ModeShape 3

ModeShape 3 is an elastic, strongly consistent hierarchical database that supports queries, full-text search, versioning, events, locking and use of schema-rich or schema-less constraints. It’s perfect for storing files and hierarchically structured data that will be accessed by navigation or queries. You can choose where (if at all) you want ModeShape to enforce your schema, and your structure and schema can always evolve as your needs change. Sequencers make it easy to extract structure from stored files, and federation can bring information from external systems into your database. It’s fast, sits on top of an Infinispan data grid, and is open source. Learn about the benefits of ModeShape 3, and how to deploy and use it to store your own data.

Michael Hausenblas – Apache Drill In-Depth Dissection

The Apache Drill project ostensibly has goals that make it look a lot like Dremel in that the mainline use case involves SQL or SQL-like queries applied to a large distributed data store, possibly organized in a columnar format.

In fact, however, Drill is a highly flexible architecture that allows it to serve many needs. Moreover, Drill has standardized internal API’s which allow easy extension for experimentation with parallel query evaluation. This is achieved by defining a standard logical query data flow language with a standardized and very flexible JSON syntax. Operators can be added to this framework very easily with the only assumption being that operators have inputs that are record sequences and a single output consisting of a record sequence. A SQL to logical query translator and the operators necessary to evaluate these queries are part of the standard Drill, but alternative syntax is easily added and alternative semantics are easily substituted.

This talk will describe the overall architecture of Drill, report on the progress in building an open source development community and show how Drill can be used to do machine learning, how Drill can be embedded in a language like Scala or Groovy, and how new syntax components can be added to support a language like Pig. This will be done by a description of how new parsers and operators are added. In addition, I will provide a description of how Drill uses Optiq to do cost-based query optimization.

Michael Hunger – Cypher Query Language and Neo4j

The Neo4j graph database is all about relationships. It allows you to model domains of connected data easily. Querying using an imperative API is cumbersome, so we decided to develop a query language better suited to querying graph data and focused on readability.

Taking inspiration from SQL, SPARQL and others, and using Scala to implement it, turned out to be a good decision. Cypher has become one of the things people love about Neo4j. So in the talk we’ll introduce the language and its applicability for graph querying. We will focus on the expressive declaration of patterns, conditions and projections, as well as the updating capabilities.

Michael Hunger – Intro to Neo4j or Domain Modeling with Graphs
(Slides, no abstract)

Dr. Stefan Kaes & Sebastian Röbke – Triple R – Riak, Redis and RabbitMQ at XING

This talk will focus on how the team at XING, a popular social network for business professionals, rebuilt its activity stream with Riak.

We will present interesting challenges we had to solve, like:
* Finding the proper data structures for storing very large activity streams
* Dealing with eventual consistency
* Migrating legacy data
* Setting up Riak in production

Piotr Kolaczkowski – Scalable Full-Text Search with DataStax Enterprise

Cassandra is the scalable NoSQL database powering modern applications at companies like Netflix, eBay, Disney, and hundreds of others. Solr is the popular enterprise search platform from the Apache Lucene project, offering powerful full-text search, hit highlighting, faceted search, near-real-time indexing, dynamic clustering, and much more. The DataStax Enterprise platform integrates the two in such a way that data stored in Cassandra can be accessed and searched using Solr, and data stored in Solr can be accessed through Cassandra. This talk will describe the high-level architecture of DataStax Enterprise Search as well as the specific algorithms responsible for massive scalability and good load balancing of Solr-enabled clusters.

Steffen Krause – DynamoDB – on-demand NoSQL scaling as a service

Scaling a distributed NoSQL database and making it resilient to failure can be hard. With Amazon DynamoDB, you just specify the desired throughput, consistency level and upload your data. DynamoDB does all the heavy lifting for you. Come to this session to get an overview of an automated, self-managed key-value store that can seamlessly scale to hundreds of thousands of operations per second.

Hannu Krosing – PostSQL – using PostgreSQL as a better NoSQL

This talk will describe using PostgreSQL’s superior ACID data engine “in a NoSQL way” – that is, using PostgreSQL’s support for JSON and other dynamic/freeform types and embedded languages (pl/python, pl/jsv8) for data processing near the data.

Also demonstrated is scaling the Skype way, using the pl/proxy sharding language, the pgbouncer connection pooler, and the SkyTools data-moving and transforming multitool. Performance comparisons to popular NoSQL databases are also shown.

Fabian Lange – There and Back Again: A Cloud’s Tale

Want to see how we built our cloud-based document management solution CenterDevice? This talk will cover the joy and pain of working with MongoDB, Elasticsearch, GlusterFS, RabbitMQ, Java and more. I will show how we build, deploy, run and monitor our very own cloud. You will learn what we learned, what worked, and where hype met reality. Warning: no unicorns or pixie dust.

Dennis Meyer & Uwe Seiler – Map/Reduce in Action: Large Scale Reporting Based on Hadoop and Vertica

German-based ADTECH GmbH, an AOL Inc. company, is a leading international supplier of digital marketing solutions, delivering approx. 6 billion advertisements on a daily basis for customers in over 25 countries. Every ad delivery needs to be logged for billing purposes, but additionally ADTECH’s customers want to know as much detail as possible about those deliveries. Until recently the reporting part of ADTECH’s infrastructure was based on a custom C++ reporting solution used in combination with multiple databases. With ever-increasing traffic the performance was reaching its limits, especially for the customer-critical end-of-month reporting period. Furthermore, changes to custom reports were complex and time consuming due to the highly intermeshed architecture. The delays in producing these customizations were a source of customer dissatisfaction. To solve these issues ADTECH, in partnership with codecentric AG, made the move to a more flexible and performant architecture based on the Apache Hadoop ecosystem. With this new approach all details about the ad deliveries are logged using a combination of Avro and Flume, stored into HDFS and then aggregated using Map/Reduce with Pig before being stored in the NoSQL datastore Vertica. This talk aims to give an overview of the architecture and explain the architectural decisions made, with a strong focus on the lessons learned.

Konstantin Osipov – Persistent message queues with Tarantool/Box

Tarantool/Box is an open-source, persistent, transactional in-memory database with rich support for Lua as a stored-procedure and extension language.

In the diverse world of NoSQL, Tarantool/Box owns a niche of a database efficiently running important parts of Web application logic, thus smartly enabling Web application to scale. In this talk I’ll present several new use cases in which Tarantool/Box plays a special role, and our new features implemented to support them. In particular, I’ll explore the problem of persistent transactional message queues and the role they play in highly available software. I will demonstrate how Tarantool/Box can be used as a reliable message queue server, but customized with parts of application-specific logic.

I’ll showcase Tarantool/Box features designed to support message queues:

– inter-stored-procedure communication channels, to efficiently exchange messages between task producers and consumers
– triggers, fired upon system events such as connect or disconnect, and their usage in an efficient queue implementation
– new database index types: bitmap, partial and functional indexes, necessary to implement very large queues with a minimal memory footprint.

Panel discussion (No abstract)

Mahesh Paolini-Subramanya – NoSQL the Telco way

Being a decent-sized Telecommunications provider, we process a lot of calls (hundreds/second), and need to keep track of all the events on each call. Technically speaking, this is “A Lot” of data – data that our clients (and our own people!) want real-time access to in a myriad of ways. We’ve ended up going through quite a few NoSQL stores in our quest to satisfy everyone – and the way we do things now has very little to do with where we started out. Join me as I describe our experience and what we’ve learned, focusing on the Big 4, viz.

– The “solution-oriented” nature of NoSQL repeatedly changed our understanding of our problem-space – sometimes drastically.
– The system behavior, particularly the failure modes, was significantly different at scale
– The software model kept getting overhauled – regardless of how much we planned ahead
– We came to value agility – the ability to change direction – above all (yes, even at a Telco!)

Eric Redmond – Distributed Patterns You Should Know

Do you use Merkle trees? How about hash rings? Vector clocks? How many message patterns do you know? In this increasingly decentralized world, a firm grasp of the pantheon of data structures and patterns used to manage decentralized systems is no longer a luxury, it’s a necessity. You may never implement a pipeline, but chances are, you already use them.
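As a taste of the pantheon, here is a minimal consistent-hash ring – one of the patterns named above – in Python. Nodes and keys hash onto the same circle (with a few virtual nodes per server to smooth the distribution), and a key belongs to the first node clockwise from its position; node names and parameters here are illustrative:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: adding or removing a node only
    remaps the keys adjacent to it, not the whole keyspace."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")   # the same key always maps to the same node
```

Dynamo-style stores (Riak among them) combine exactly this kind of ring with vector clocks for versioning and Merkle trees for anti-entropy repair.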

Larysa Visengeriyeva – Introduction to Predictive Analytics for NoSQL Nerds (Slides, no video)

Rolling out and running a NoSQL database is only half the battle. NoSQL databases are used more and more in companies and start-ups with a huge need to dig up their ‘big data’ treasures. This requires profound knowledge of mathematics, statistics, AI, data mining and machine learning, where experts are rare. This talk will give an overview of the most important of these concepts. Furthermore, tools, techniques and experiences for successful data analysis will be introduced. Finally, the talk closes with a practical implementation for analyzing text – following the ‘IBM Watson’ idea.

Matthew Revell – Building scalability into your next project with Riak

Change is the one thing you can guarantee will happen in your next project. Fixing your schema before you’ve even launched means taking a massive gamble over what your users will want and how you’re going to deliver it.

The new generation of schema-free databases gives you the flexibility to learn what your users need and what your application should do. As a bonus, databases such as Riak give you huge scalability, meaning that you needn’t fear success.

Matthew introduces the new world of schema-free/NoSQL databases and focuses on how Riak can make your next project truly web-scale.

Martin Schoenert – CAP and Architectural Consequences

The blunt formulation of the CAP theorem states that any database system can achieve only 2 of the 3 properties: consistency, availability and partition-tolerance. We look at it more closely and see that this formulation is misleading, because there is not a single big design decision but several smaller ones in the design of a database system. We then concentrate on the architectural consequences for massively distributed database systems and argue that such systems must place restrictions on consistency and functionality.

Jan Steemann – Query Languages for Document Stores

SQL is the standard and established way to query relational databases.

As the name “NoSQL” suggests, NoSQL databases have gone another way, coming up with several approaches to querying, e.g. access by key, map/reduce, and even their own full-featured query languages.

We surely don’t want a super-fast key/value store to require us to use a full-blown query language and slow us down – but in several other cases querying via a language can still be convenient. This is especially the case in document stores, which have a wide range of use cases and allow us to look at different aspects of the same data.

As there isn’t yet an established standard for querying document databases, the talk will showcase some of the existing implementations such as UNQL, AQL, and JSONiq. Additionally, related topics such as graph query languages will be covered.

Kai Wähner – Big Data beyond Hadoop – How to integrate ALL your data

Big data represents a significant paradigm shift in enterprise technology. Big data radically changes the nature of the data management profession as it introduces new concerns about the volume, velocity and variety of corporate data.

Apache Hadoop is the open-source de facto standard for implementing big data solutions on the Java platform. Hadoop consists of its kernel, MapReduce, and the Hadoop Distributed Filesystem (HDFS). A challenging task is sending all data to Hadoop for processing and storage (and then getting it back to your application later), because in practice data comes from many different applications (SAP, Salesforce, Siebel, etc.) and databases (file, SQL, NoSQL), uses different technologies and concepts for communication (e.g. HTTP, FTP, RMI, JMS), and consists of different data formats using CSV, XML, binary data, or other alternatives.

This session shows the powerful combination of Apache Hadoop and Apache Camel to solve this challenging task. Learn how to use every thinkable kind of data with Hadoop – without a lot of complex or redundant boilerplate code. Besides supporting the integration of all the different technologies and data formats, Apache Camel also offers an easy, standardized DSL to transform, split or filter incoming data using the Enterprise Integration Patterns (EIP). Therefore, Apache Hadoop and Apache Camel are a perfect match for processing big data on the Java platform.

Simon Willnauer – With a hammer in your hand…

ElasticSearch combines the power of Apache Lucene (NoSQL since 2001) and the movement of distributed, scalable, high-performance NoSQL solutions into an easy-to-use, schema-free search engine that can serve full-text search requests, key-value lookups, schema-free analytics requests, facets, or even suggestions in real time. This talk will give an introduction to the key features of ElasticSearch with live examples.

The talk won’t be an exhaustive feature presentation but rather an overview of what and how ElasticSearch can do for you.

Randall Wilson – A Billion Person Family Tree with MongoDB

FamilySearch maintains a collaborative family tree with a billion individuals in it. The tree is edited in real time by thousands of concurrent users. Recent experiments to move the tree from a relational database to MongoDB yielded a huge gain in performance. This presentation reviews the lessons learned throughout this experience, including how to deal with operations that previously depended on transactional integrity. It also shares some insights into the experience gained by testing against Riak and other solutions.

I first saw this in a tweet by Eugene Dvorkin.

BrightstarDB 1.3 now available

Friday, June 14th, 2013

BrightstarDB 1.3 now available

From the post:

We are pleased to announce the release of BrightstarDB 1.3. This is the first “official” release of BrightstarDB under the open-source MIT license. All of the documentation and notices on the website should now have been updated to remove any mention of commercial licensing. To be clear: BrightstarDB is not dual licensed, the MIT license applies to all uses of BrightstarDB, commercial or non-commercial. If you spot something we missed in the docs that might indicate otherwise please let us know.

The main focus of this release has been to tidy up the licensing and use of third-party closed-source applications in the build process, but we also took the opportunity to extend the core RDF APIs to provide better support for named graphs within BrightstarDB stores. This release also incorporates the most recent version of dotNetRDF providing us with updated Turtle parsing and improved SPARQL query performance over the previous release.
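To make the named-graph support concrete: a quad store scopes every triple to a graph. A toy sketch in plain Python (my own illustration, not BrightstarDB's RDF API — `add` and `match` are made-up names):

```python
# A minimal quad store: each statement is (graph, subject, predicate,
# object), so the same store can answer graph-scoped queries.
quads = set()

def add(graph, s, p, o):
    quads.add((graph, s, p, o))

def match(graph=None, s=None, p=None, o=None):
    """Return quads matching the bound terms; None acts as a wildcard."""
    return {q for q in quads
            if all(term is None or term == value
                   for term, value in zip((graph, s, p, o), q))}

add("g:people", "ex:kal", "ex:wrote", "ex:brightstardb")
add("g:people", "ex:graham", "ex:wrote", "ex:brightstardb")
add("g:releases", "ex:brightstardb", "ex:version", "1.3")

# Named graphs let one store answer scoped queries:
assert len(match(graph="g:people", p="ex:wrote")) == 2
assert match(graph="g:releases", p="ex:version") == {
    ("g:releases", "ex:brightstardb", "ex:version", "1.3")}
```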

Just to tempt you into looking further, the features are:

  • Schema-Free Triple Store
  • High Performance
  • LINQ & OData Support
  • Historical Data Access
  • Transactional (ACID)
  • NoSQL Entity Framework
  • SPARQL Support
  • Automatic Indexing

From Kal Ahmed and Graham Moore if you don’t recognize the software.

Updated Database Landscape map – June 2013

Tuesday, June 11th, 2013

Updated Database Landscape map – June 2013 by Matthew Aslett.

database map

I appreciate all the work that went into the creation of the map but even in a larger size (see Matthew’s post), I find it difficult to use.

Or perhaps that’s part of the problem, I don’t know what use it was intended to serve?

If I understand the legend, then “search” isn’t found in the relational or grid/cache zones. Which I am sure would come as a surprise to the many vendors and products in those zones.

Moreover, the ordering of entries along each colored line isn’t clear. Taking graph databases for example, they are listed from top to bottom:

But GrapheneDB is Neo4j as a service. So shouldn’t they be together?

I have included links to all the listed graph databases in case you can see a pattern that I am missing.

BTW, GraphLab, which in May 2013 raised $6.75M for further development (GraphLab – Next Generation [Johnny Come Lately VCs]), and GraphChi, a project at GraphLab, were both omitted from this list.

Are there other graph databases that are missing?

How would you present this information differently? What ordering would you use? What other details would you want to have accessible?

FoundationDB Beta 2 [NSA Scale?]

Monday, June 10th, 2013

Beta 2 is here – with 100X increased capacity!

From the post:

We’re happy to announce that we’ve released FoundationDB Beta 2!

Most of our testing and tuning in the past has focused on data sets ranging up to 1TB, but our users have told us that they’re excited to begin applying FoundationDB’s transactional processing to data sets larger than 1 TB, so we made that our major focus for Beta 2.

db scale

Beta 2 significantly reduces memory and CPU usage while increasing server robustness when working with larger data sets. FoundationDB now supports data sets up to 100 TB of aggregate key-value size. Though if you are planning on going above 10 TB you might want to talk to us for some configuration recommendations—we're always happy to help.

Also new in Beta 2 is support for Node 0.10 and Ruby on Windows. Of course, there are a whole lot of behind-the-scenes improvements to both the core and our APIs, some of which are documented in the release notes.

New Website!

We also recently rolled out a cool new website to explain the transformative effect that ACID transactions have on NoSQL technology. Be sure to check it out, along with our community site where you can share your insights and get questions answered.

Do you think “web scale” is rather passé nowadays?

Really should be talking about NSA scale.


MongoDB: The Definitive Guide 2nd Edition is Out!

Thursday, May 23rd, 2013

MongoDB: The Definitive Guide 2nd Edition is Out! by Kristina Chodorow.

From the webpage:

The second edition of MongoDB: The Definitive Guide is now available from O’Reilly! It covers both developing with and administering MongoDB. The book is language-agnostic: almost all of the examples are in JavaScript.

Looking forward to enjoying the second edition as much as the first!

Although, I am not really sure that always using JavaScript means you are “language-agnostic.” 😉

Apache Drill

Sunday, May 19th, 2013

Michael Hausenblas at NoSQL Matters 2013 does a great lecture on Apache Drill.


Google’s Dremel Paper

Projects “beta” for Apache Drill by second quarter and GA by end of year.

Apache Drill User.

From the rationale:

There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel.

How do you handle ad hoc exploration of data sets as part of planning a topic map?

Being able to “test” merging against data prior to implementation sounds like a good idea.

Warp: Multi-Key Transactions for Key-Value Stores

Saturday, May 18th, 2013

Warp: Multi-Key Transactions for Key-Value Stores by Robert Escriva, Bernard Wong and Emin Gün Sirer.


Implementing ACID transactions has been a longstanding challenge for NoSQL systems. Because these systems are based on a sharded architecture, transactions necessarily require coordination across multiple servers. Past work in this space has relied either on heavyweight protocols such as Paxos or clock synchronization for this coordination.

This paper presents a novel protocol for coordinating distributed transactions with ACID semantics on top of a sharded data store. Called linear transactions, this protocol achieves scalability by distributing the coordination task to only those servers that hold relevant data for each transaction. It achieves high performance by serializing only those transactions whose concurrent execution could potentially yield a violation of ACID semantics. Finally, it naturally integrates chain-replication and can thus tolerate faults of both clients and servers. We have fully implemented linear transactions in a commercially available data store. Experiments show that this system achieves 1-9× more throughput than MongoDB, Cassandra and HyperDex on the Yahoo! Cloud Serving Benchmark, even though none of the latter systems provide transactional guarantees.
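The key observation is worth restating in code: two transactions need coordination only when their key sets overlap. A back-of-the-envelope sketch (a hypothetical helper of my own, not Warp's actual protocol):

```python
# Transactions whose key sets are disjoint can run concurrently;
# only overlapping ones must be serialized against each other.
def needs_serialization(txn_a_keys, txn_b_keys):
    return bool(set(txn_a_keys) & set(txn_b_keys))

assert needs_serialization({"k1", "k2"}, {"k2", "k9"})   # contend on k2
assert not needs_serialization({"k1"}, {"k7", "k8"})     # disjoint: free
                                                         # to run in parallel
```

In Warp the servers holding the contended keys do the coordination; the sketch only shows the contention test itself.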

Warp looks wicked cool!

Of particular interest is the non-ordering of transactions that have no impact on other transactions. That alone would be interesting for a topic map merging situation.

For more details, see the Warp page, or

Download Warp

Warp Tutorial

Warp Performance Benchmarks

I first saw this at High Scalability.

Solr 4, the NoSQL Search Server [Webinar]

Friday, May 17th, 2013

Solr 4, the NoSQL Search Server by Yonik Seeley

Date: Thursday, May 30, 2013
Time: 10:00am Pacific Time

From the description:

The long awaited Solr 4 release brings a large amount of new functionality that blurs the line between search engines and NoSQL databases. Now you can have your cake and search it too with Atomic updates, Versioning and Optimistic Concurrency, Durability, and Real-time Get!

Learn about new Solr NoSQL features and implementation details of how the distributed indexing of Solr Cloud was designed from the ground up to accommodate them.
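Optimistic concurrency is the feature I find most interesting. Solr 4 exposes it through a `_version_` field; here's a rough plain-Python stand-in for the idea (not the Solr API itself — the store and `update` helper are made up):

```python
# Optimistic concurrency: an update succeeds only if the version the
# client read is still current; a concurrent writer forces a retry.
store = {"doc1": {"_version_": 1, "title": "draft"}}

class VersionConflict(Exception):
    pass

def update(doc_id, expected_version, fields):
    current = store[doc_id]
    if current["_version_"] != expected_version:
        raise VersionConflict(f"expected {expected_version}, "
                              f"found {current['_version_']}")
    current.update(fields)
    current["_version_"] += 1
    return current["_version_"]

new_version = update("doc1", 1, {"title": "final"})
assert new_version == 2
try:
    update("doc1", 1, {"title": "stale write"})  # lost the race
    assert False, "should have conflicted"
except VersionConflict:
    pass
```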
Featured Presenter:

Yonik Seeley – Creator of Apache Solr and Chief Open Source Architect and Co-Founder at LucidWorks. Mr. Seeley is an Apache Lucene/Solr PMC member and committer and an expert in distributed search systems architecture and performance. His work experience includes CNET Networks, BEA and Telcordia. He earned his M.S. in Computer Science from Stanford University.

This could be a real treat!

Notes on the webinar to follow.

Strange Loop 2013

Saturday, April 27th, 2013

Strange Loop 2013


  • Call for presentation opens: Apr 15th, 2013
  • Call for presentation ends: May 9, 2013
  • Speakers notified by: May 17, 2013
  • Registration opens: May 20, 2013
  • Conference dates: Sept 18-20th, 2013

From the webpage:

Below is some guidance on the kinds of topics we are seeking and have historically accepted.

  • Frequently accepted or desired topics: functional programming, logic programming, dynamic/scripting languages, new or emerging languages, data structures, concurrency, database internals, NoSQL databases, key/value stores, big data, distributed computing, queues, asynchronous or dataflow concurrency, STM, web frameworks, web architecture, performance, virtual machines, mobile frameworks, native apps, security, biologically inspired computing, hardware/software interaction, historical topics.
  • Sometimes accepted (depends on topic): Java, C#, testing frameworks, monads
  • Rarely accepted (nothing wrong with these, but other confs cover them well): Agile, JavaFX, J2EE, Spring, PHP, ASP, Perl, design, layout, entrepreneurship and startups, game programming

It isn’t clear why Strange Loop claims to have “archives:”


As far as I can tell, these are listings with bios of prior presentations, but no substantive content.

Am I missing something?


Friday, April 19th, 2013


From the architecture overview:

Aerospike is a fast Key Value Store or Distributed Hash Table architected to be a flexible NoSQL platform for today’s high scale Apps. Designed to meet the reliability or ACID requirements of traditional databases, there is no single point of failure (SPOF) and data is never lost. Aerospike can be used as an in-memory database and is uniquely optimized to take advantage of the dramatic cost benefits of flash storage. Written in C, Aerospike runs on Linux.

Based on our own experiences developing mission-critical applications with high scale databases and our interactions with customers, we’ve developed a general philosophy of operational efficiency that guides product development. Three principles drive Aerospike architecture: NoSQL flexibility, traditional database reliability, and operational efficiency.

Technical details first published in Proceedings of the VLDB (Very Large Databases), Citrusleaf: A Real-Time NoSQL DB which Preserves ACID by V. Srinivasan and Brian Bulkowski.

You can guess why they changed the name. 😉

There is a free community edition, along with an SDK and documentation.

Relies on RAM and SDDs.

Timo Elliott was speculating about entirely RAM-based computing in: In-Memory Computing.

Imagine losing all the special coding tricks to get performance despite disk storage.

Simpler code and fewer operations should result in higher speed.

How to Compare NoSQL Databases

Friday, April 19th, 2013

How to Compare NoSQL Databases by Ben Engber. (video)

From the description:

Ben Engber, CEO and founder of Thumbtack Technology, will discuss how to perform tuned benchmarking across a number of NoSQL solutions (Couchbase, Aerospike, MongoDB, Cassandra, HBase, others) and to do so in a way that does not artificially distort the data in favor of a particular database or storage paradigm. This includes hardware and software configurations, as well as ways of measuring to ensure repeatable results.

We also discuss how to extend benchmarking tests to simulate different kinds of failure scenarios to help evaluate the maintainability and recoverability of different systems. This requires carefully constructed tests and significant knowledge of the underlying databases — the talk will help evaluators overcome the common pitfalls and time sinks involved in trying to measure this.

Lastly we discuss the YCSB benchmarking tool, its significant limitations, and the significant extensions and supplementary tools Thumbtack has created to provide distributed load generation and failure simulation.
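Ben's point about measurement discipline can be shown in a few lines: report percentiles over repeatable runs, not a single average that outliers can hide. A toy sketch of my own (a stand-in workload, not YCSB):

```python
import random

# A single outlier barely moves the mean but is exactly what the
# high percentiles expose.
def percentile(samples, p):
    ordered = sorted(samples)
    index = min(len(ordered) - 1, round(p / 100 * len(ordered)))
    return ordered[index]

random.seed(42)  # repeatable runs, as the talk stresses
latencies_ms = [random.gauss(5, 1) for _ in range(1000)] + [250.0]

mean = sum(latencies_ms) / len(latencies_ms)
assert mean < 6                                   # the outlier hides here
assert percentile(latencies_ms, 99.9) == 250.0    # ...but not here
```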

Ben makes a very good case for understanding the details of your use case versus the characteristics of particular NoSQL solutions.

Where you will find “better” performance depends on non-obvious details.

Watch the use of terms like “consistency” in this presentation.

The paper Ben refers to: Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs.

Forty-three pages of analysis and charts.

Slow but interesting reading.

If you are into the details of performance and NoSQL databases.

Mumps: The Proto-Database…

Wednesday, March 27th, 2013

Mumps: The Proto-Database (Or How To Build Your Own NoSQL Database) by Rob Tweed.

From the post:

I think that one of the problems with Mumps as a database technology, and something that many people don’t like about the Mumps database is that it is a very basic and low-level engine, without any of the frills and value-added things that people expect from a database these days. A Mumps database doesn’t provide built-in indexing, for example, nor does it have any high-level query language (eg SQL, Map/Reduce) built in, though there are add-on products that can provide such capabilities.

On the other hand, a raw Mumps database, such as GT.M, is actually an interesting beast, as it turns out to provide everything you need to design and create your own NoSQL (or pretty much any other kind of) database. As I’ve discussed and mentioned a number of times in these articles, it’s a Universal NoSQL engine.

Why, you might ask, would you want to create your own NoSQL database? I'd possibly agree, but there hardly seems to be a week that goes by without someone doing exactly that and launching yet another NoSQL database. So, there's clearly a perceived need or desire to do so.
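For anyone who hasn't met Mumps globals: they are one sorted, hierarchical key space, and everything else — indexes included — is built out of more keys. A toy plain-Python stand-in for GT.M globals (my illustration; `set_global` and `children` are made-up names):

```python
# A dict of tuple keys stands in for a Mumps global: one sparse,
# sorted, hierarchical key space.
store = {}

def set_global(*path_and_value):
    *path, value = path_and_value
    store[tuple(path)] = value

def children(*path):
    """First-level subscripts under a node, in sorted order - roughly
    the $ORDER primitive every Mumps index structure is built from."""
    depth = len(path)
    return sorted({k[depth] for k in store
                   if len(k) > depth and k[:depth] == tuple(path)})

set_global("person", 1, "name", "Ada")
set_global("person", 2, "name", "Bob")
# A hand-rolled secondary index is just more keys in the same space:
set_global("ix", "name", "Ada", 1, "")
set_global("ix", "name", "Bob", 2, "")

assert children("person") == [1, 2]
assert children("ix", "name") == ["Ada", "Bob"]
```

That "everything is keys" property is why Rob can call it a Universal NoSQL engine: indexing and query layers are things you build on top.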

I first saw this at Mumps: The Proto-Database by Alex Popescu.

Alex asks:

The question I’d ask myself is not “why would I build another NoSQL database”, but rather “why none of the popular ones are built using Mumps?”.

I suspect the answer is the same as for why popular NoSQL databases, such as MongoDB, are re-inventing text indexing. (see MongoDB 2.4 Release)

Database Landscape Map – February 2013

Wednesday, March 27th, 2013

Database Landscape Map – February 2013 by 451 Research.

Database map

A truly awesome map of available databases.

Originated from Neither fish nor fowl: the rise of multi-model databases by Matthew Aslett.

Matthew writes:

One of the most complicated aspects of putting together our database landscape map was dealing with the growing number of (particularly NoSQL) databases that refuse to be pigeon-holed in any of the primary databases categories.

I have begun to refer to these as “multi-model databases” in recognition of the fact that they are able to take on the characteristics of multiple databases. In truth though there are probably two different groups of products that could be considered “multi-model”:

I think I understand the grouping from the key to the map but the ordering within groups, if meaningful, escapes me.

I am sure you will recognize most of the names but equally sure there will be some you can’t quite describe.


MongoDB 2.4 Release

Tuesday, March 19th, 2013

MongoDB 2.4 Release

From the webpage:

Developer Productivity

  • Capped Arrays simplify development by making it easy to incorporate fixed, sorted lists for features like leaderboards and logging.
  • Geospatial Enhancements enable new use cases with support for polygon intersections and analytics based on geospatial data.
  • Text Search provides a simplified, integrated approach to incorporating search functionality into apps (Note: this feature is currently in beta release).


Operations

  • Hash-Based Sharding simplifies deployment of large MongoDB systems.
  • Working Set Analyzer makes capacity planning easier for ops teams.
  • Improved Replication increases resiliency and reduces administration.
  • Mongo Client creates an intuitive, consistent feature set across all drivers.


Performance

  • Faster Counts and Aggregation Framework Refinements make it easier to leverage real-time, in-place analytics.
  • V8 JavaScript Engine offers better concurrency and faster performance for some operations, including MapReduce jobs.


Monitoring

  • On-Prem Monitoring provides comprehensive monitoring, visualization and alerting on more than 100 operational metrics of a MongoDB system in real time, based on the same application that powers 10gen’s popular MongoDB Monitoring Service (MMS). On-Prem Monitoring is only available with MongoDB Enterprise.


Security

  • Kerberos Authentication enables enterprise and government customers to integrate MongoDB into existing enterprise security systems. Kerberos support is only available in MongoDB Enterprise.
  • Role-Based Privileges allow organizations to assign more granular security policies for server, database and cluster administration.
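Of the list above, Capped Arrays are the easiest to picture. In 2.4 they are driven by `$push` with `$slice`; here's a rough plain-Python sketch of the behavior (my stand-in, not MongoDB's update operator):

```python
# Push onto a sorted list and keep only the top N - the capped-array
# pattern for leaderboards and logs.
def push_capped(doc, field, item, cap, key):
    scores = doc.setdefault(field, [])
    scores.append(item)
    scores.sort(key=key, reverse=True)
    del scores[cap:]          # drop everything beyond the cap
    return doc

leaderboard = {"_id": "level-1"}
for name, score in [("ada", 90), ("bob", 70), ("eve", 95), ("mal", 80)]:
    push_capped(leaderboard, "top3", {"name": name, "score": score},
                cap=3, key=lambda e: e["score"])

assert [e["name"] for e in leaderboard["top3"]] == ["eve", "ada", "mal"]
```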

You can read more about the improvements to MongoDB 2.4 in the Release Notes. Also, MongoDB 2.4 is available for download on

Lots to look at in MongoDB 2.4!

But I am curious about the beta text search feature.

MongoDB Text Search: Experimental Feature in MongoDB 2.4 says:

Text search (SERVER-380) is one of the most requested features for MongoDB. 10gen is working on an experimental text-search feature, to be released in v2.4, and we’re already seeing some talk in the community about the native implementation within the server. We view this as an important step towards fulfilling a community need.

MongoDB text search is still in its infancy and we encourage you to try it out on your datasets. Many applications use both MongoDB and Solr/Lucene, but realize that there is still a feature gap. For some applications, the basic text search that we are introducing may be sufficient. As you get to know text search, you can determine when MongoDB has crossed the threshold for what you need. (emphasis added)

So, why isn’t MongoDB incorporating Solr/Lucene instead of a home grown text search feature?

Seems like users could leverage their Solr/Lucene skills with their MongoDB installations.


Why FoundationDB Might Be All It’s Cracked Up To Be

Friday, March 8th, 2013

Why FoundationDB Might Be All It’s Cracked Up To Be by Doug Turnbull.

From the post:

When I first heard about FoundationDB, I couldn’t imagine how it could be anything but vaporware. Seemed like Unicorns crapping happy rainbows to solve all your problems. As I’m learning more about it though, I realize it could actually be something ground breaking.

NoSQL: Lets Review…

So, I need to step back and explain one reason NoSQL databases have been revolutionary. In the days of yore, we used to normalize all our data across multiple tables on a single database living on a single machine. Unfortunately, Moore’s law eventually crapped out and maybe more importantly hard drive space stopped increasing massively. Our data and demands on it only kept growing. We needed to start trying to distribute our database across multiple machines.

Turns out, it’s hard to maintain transactionality in a distributed, heavily normalized SQL database. As such, a lot of NoSQL systems have emerged with simpler features, many promoting a model based around some kind of single row/document/value that can be looked up/inserted with a key. Transactionality for these systems is limited to a single key-value entry (“row” in Cassandra/HBase or “document” in Mongo/Couch — we’ll just call them rows here). Rows are easily stored in a single node, although we can replicate this row to multiple nodes. Despite being replicated, it turns out transactionally working with single rows in distributed NoSQL is easier than guaranteeing transactionality of an SQL query visiting potentially many SQL tables in a distributed system.

There are deep design ramifications/limitations to the transactional nature of rows. First, you always try to cram a lot of data related to the row’s key into a single row, ending up with massive rows of hierarchical or flat data that all relate to the row key. This lets you cover as much data as possible under the row-based transactionality guarantee. Second, as you only have a single key to use from the system, you must choose very wisely what your key will be. You may need to think hard about how your data will be looked up through its whole life; it can be hard to go back. Additionally, if you need to look up on a secondary value, you’d better hope that your database is friendly enough to have a secondary-key feature, or otherwise you’ll need to maintain a secondary row for storing the relationship. Then you have the problem of working across two rows, which doesn’t fit in the transactionality guarantee. Third, you might lose the ability to perform a join across multiple rows. In most NoSQL data stores, joining is discouraged and denormalization into large rows is the encouraged best practice.

FoundationDB Is Different

FoundationDB is a distributed, sorted key-value store with support for arbitrary transactions across multiple key-values — multiple “rows” — in the database.
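That one sentence is the whole pitch, and it's easiest to see with the classic two-key transfer that single-row stores can't make atomic. A hedged plain-Python sketch of all-or-nothing commit across keys (my illustration, not the FoundationDB API):

```python
# Two keys, one transaction: either both balances change or neither does.
store = {"balance:alice": 100, "balance:bob": 20}

def transfer(src, dst, amount):
    snapshot = dict(store)           # stage changes, then commit
    try:
        if snapshot[src] < amount:
            raise ValueError("insufficient funds")
        snapshot[src] -= amount
        snapshot[dst] += amount
    except Exception:
        return False                 # abort: store untouched
    store.update(snapshot)           # commit both keys together
    return True

assert transfer("balance:alice", "balance:bob", 30)
assert store == {"balance:alice": 70, "balance:bob": 50}
assert not transfer("balance:bob", "balance:alice", 999)  # aborts cleanly
assert store["balance:bob"] == 50
```

In a single-row NoSQL store you would have to build this yourself with two-phase patterns; FoundationDB's claim is that the store does it for you.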

As Doug points out, there is much left to be known.

Still, exciting to have something new to investigate.


Monday, March 4th, 2013


FoundationDB Beta 1 is now available!

It will take a while to sort out all of its features, etc.

I should mention that it is refreshing that the documentation contains Known Limitations.

All software has limitations but few ever acknowledge them up front.

You have to encounter one before one of the technical folks says: “…yes, we have been meaning to work on that.”

I would rather know up front what the limitations are.

Whether FoundationDB meets your requirements or not, it is good to see that kind of transparency.

NoSQL is Great, But You Still Need Indexes [MongoDB for example]

Wednesday, February 20th, 2013

NoSQL is Great, But You Still Need Indexes by Martin Farach-Colton.

From the post:

I’ve said it before, and, as is the nature of these things, I’ll almost certainly say it again: your database performance is only as good as your indexes.

That’s the grand thesis, so what does that mean? In any DB system — SQL, NoSQL, NewSQL, PostSQL, … — data gets ingested and organized. And the system answers queries. The pain point for most users is around the speed to answer queries. And the query speed (both latency and throughput, to be exact) depend on how the data is organized. In short: Good Indexes, Fast Queries; Poor Indexes, Slow Queries.

But building indexes is hard work, or at least it has been for the last several decades, because almost all indexing is done with B-trees. That’s true of commercial databases, of MySQL, and of most NoSQL solutions that do indexing. (The ones that don’t do indexing solve a very different problem and probably shouldn’t be confused with databases.)

It’s not true of TokuDB. We build Fractal Tree Indexes, which are much easier to maintain but can still answer queries quickly. So with TokuDB, it’s Fast Indexes, More Indexes, Fast Queries. TokuDB is usually thought of as a storage engine for MySQL and MariaDB. But it’s really a B-tree substitute, so we’re always on the lookout for systems where we can improving the indexing.

Enter MongoDB. MongoDB is beloved because it makes deployment fast. But when you peel away the layers, you get down to a B-tree, with all the performance headaches and workarounds that they necessitate.

That’s the theory, anyway. So we did some testing. We ripped out the part of MongoDB that takes care of secondary indices and plugged in TokuDB. We’ve posted the blogs before, but here they are again, the greatest hits of TokuDB+MongoDB: we show a 10x insertion performance, a 268x query performance, and a 532x (or 53,200% if you prefer) multikey index insertion performance. We also discussed covered indexes vs. clustered Fractal Tree Indexes.

Did somebody declare February 20th to be performance release day?

Did I miss that memo? 😉

Like every geek, I like faster. But, here’s my question:

Have there been any studies on the impact of faster systems on searching and decision making by users?

My assumption is the faster I get a non-responsive result from a search, the sooner I can improve it.

But that’s an assumption on my part.

Is that really true?

Hypertable Has Reached A Major Milestone

Thursday, February 14th, 2013

Hypertable Has Reached A Major Milestone by Doug Judd.

From the post:

RangeServer Failover

With the release of Hypertable version comes support for automatic RangeServer failover. Hypertable will now detect when a RangeServer has failed, logically remove it from the system, and automatically re-assign the ranges that it was managing to other RangeServers. This represents a major milestone for Hypertable and allows for very large scale deployments. We have been actively working on this feature, full-time, for 1 1/2 years. To give you an idea of the magnitude of the change, here are the commit statistics:

  • 441 changed files
  • 17,522 line additions
  • 6,384 line deletions

The reason that this feature has been a long time in the making is because we placed a very high standard of quality on this feature so that under no circumstances would a RangeServer failure lead to consistency problems or data loss. We’re confident that we’ve achieved 100% correctness under every conceivable circumstance. The two primary goals for the feature, robustness and application transparency, are described below.

That is a major milestone!

High-end data processing is becoming as crowded with viable options as low-end data processing. And the “low-end” of data processing keeps getting bigger.

MarkLogic Announces Free Developer License for Enterprise [With Odd Condition]

Wednesday, February 13th, 2013

MarkLogic Announces Free Developer License for Enterprise

From the post:

MarkLogic Corporation today announced the availability of a free Developer License for MarkLogic Enterprise Edition.

The Developer License provides access to the features available in MarkLogic Enterprise Edition, including integrated search, government-grade security, clustering, replication, failover, alerting, geospatial indexing, conversion, and a suite of application development tools. MarkLogic also announced the Mongo2MarkLogic converter, a Java-based tool for importing data from MongoDB into MarkLogic providing developers immediate access to features needed to build out enterprise-ready big data solutions.

“By providing a free Developer License we enable developers to quickly deliver reliable, scalable and secure information and analytic applications that are production-ready,” said Gary Bloom, CEO and President of MarkLogic. “Many of our customers first experimented with other free NoSQL products, but turned to MarkLogic when they recognized the need for search, security, support for ACID transactions and other features necessary for enterprise environments. Our goal is to eliminate the cost barrier for developers and give them access to the best enterprise NoSQL platform from the start.”

The Developer License for MarkLogic Enterprise Edition includes tools for faster application development, business intelligence (BI) tool integration, analytic functions and visualization tools, and the ability to create user-defined functions for fast and flexible analysis of huge volumes of data.

You would think that story would merit at least one link to the free developer program.

For your convenience: Developer License for Enterprise Edition. BTW, MarkLogic homepage.

That wasn’t hard. Two links and you have direct access to the topic of the story and the company.

One odd licensing condition:

Q. Can I publish my work done with MarkLogic Server?

A. We encourage you to share your work publicly, but note that you can not disclose, without MarkLogic prior written consent, any performance or capacity statistics or the results of any benchmark test performed on MarkLogic Server.

That sounds just a tad defensive doesn’t it?

I haven’t looked at MarkLogic for a couple of iterations but earlier versions had no need to fear statistics or benchmark tests.

Results vary depending on how testing is done but anyone authorized to recommend or sign acquisition orders should know that.

If they don’t, your organization has more serious problems than needing a MarkLogic server.

Oracle’s MySQL 5.6 released

Wednesday, February 6th, 2013

Oracle’s MySQL 5.6 released

From the post:

Just over two years after the release of MySQL 5.5, the developers at Oracle have released a GA (General Availability) version of Oracle MySQL 5.6, labelled MySQL 5.6.10. In MySQL 5.5, the developers replaced the old MyISAM backend and used the transactional InnoDB as the default for database tables. With 5.6, the retrofitting of full-text search capabilities has enabled InnoDB to now take on the position of default storage engine for all purposes.

Accelerating the performance of sub-queries was also a focus of development; they are now run using a process of semi-joins and materialise much faster; this means it should not be necessary to replace subqueries with joins. Many operations that change the data structures, such as ALTER TABLE, are now performed online, which avoids long downtimes. EXPLAIN also gives information about the execution plans of UPDATE, DELETE and INSERT commands. Other optimisations of queries include changes which can eliminate table scans where the query has a small LIMIT value.

MySQL’s row-oriented replication now supports “row image control” which only logs the columns needed to identify and make changes on each row rather than all the columns in the changing row. This could be particularly expensive if the row contained BLOBs, so this change not only saves disk space and other resources but it can also increase performance. “Index Condition Pushdown” is a new optimisation which, when resolving a query, attempts to use indexed fields in the query first, before applying the rest of the WHERE condition.

MySQL 5.6 also introduces a “NoSQL interface” which uses the memcached API to offer applications direct access to the InnoDB storage engine while maintaining compatibility with the relational database engine. That underlying InnoDB engine has also been enhanced with persistent optimisation statistics, multithreaded purging and more system tables and monitoring data available.

Download MySQL 5.6.

I mentioned Oracle earlier today (When Oracle bought MySQL [Humor]) so it’s only fair that I point out their most recent release of MySQL.

SQL, NoSQL =? CoSQL? Category Theory to the Rescue

Wednesday, January 30th, 2013

A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman.

I missed this when it appeared in March of 2011.

From the conclusion:

The nascent noSQL market is extremely fragmented, with many competing vendors and technologies. Programming, deploying, and managing noSQL solutions requires specialized and low-level knowledge that does not easily carry over from one vendor’s product to another.

A necessary condition for the network effect to take off in the noSQL database market is the availability of a common abstract mathematical data model and an associated query language for noSQL that removes product differentiation at the logical level and instead shifts competition to the physical and operational level. The availability of such a common mathematical underpinning of all major noSQL databases can provide enough critical mass to convince businesses, developers, educational institutions, etc. to invest in noSQL.

In this article we developed a mathematical data model for the most common form of noSQL—namely, key-value stores as the mathematical dual of SQL’s foreign-/primary-key stores. Because of this deep and beautiful connection, we propose changing the name of noSQL to coSQL. Moreover, we show that monads and monad comprehensions (i.e., LINQ) provide a common query mechanism for both SQL and coSQL and that many of the strengths and weaknesses of SQL and coSQL naturally follow from the mathematics.

The ACM Digital Library reports only 3 citations, which is unfortunate for such an interesting proposal.
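The key-value/foreign-key duality at the heart of the paper can be sketched in a few lines (my example data, not the authors’): in SQL a child row points at its parent via a foreign key; reversing those arrows yields a key-value store in which the parent key points at its group of children.

```python
# SQL view: children carry foreign keys pointing at parents.
sql_rows = [
    {"id": 1, "title": "Intro", "author_fk": "knuth"},
    {"id": 2, "title": "Vol 2", "author_fk": "knuth"},
    {"id": 3, "title": "SICP",  "author_fk": "abelson"},
]

# coSQL view: reverse the arrows, grouping child rows under the key
# they point to.  The "value" is the whole group of children.
kv_store = {}
for row in sql_rows:
    kv_store.setdefault(row["author_fk"], []).append(
        {"id": row["id"], "title": row["title"]})

print(kv_store["knuth"])   # the author's group of books
```

Same information, arrows reversed: exactly the categorical duality the paper builds on.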

I have heard about key/value pairs somewhere else. I will have to think about that and get back to you. (Hint for the uninitiated, try the Topic Maps Reference Model (TMRM). A new draft of the TMRM is due to appear in a week or so.)

11 Interesting Releases From the First Weeks of January

Thursday, January 24th, 2013

11 Interesting Releases From the First Weeks of January by Alex Popescu.

Alex has collected links for eleven (11) interesting NoSQL releases in January 2013!

Visit Alex’s post. You won’t be disappointed.

Static and Dynamic Semantics of NoSQL Languages […Combining Operators…]

Tuesday, December 25th, 2012

Static and Dynamic Semantics of NoSQL Languages (PDF) by Véronique Benzaken, Giuseppe Castagna, Kim Nguyễn and Jérôme Siméon.


We present a calculus for processing semistructured data that spans differences of application area among several novel query languages, broadly categorized as “NoSQL”. This calculus lets users define their own operators, capturing a wider range of data processing capabilities, whilst providing a typing precision so far typical only of primitive hard-coded operators. The type inference algorithm is based on semantic type checking, resulting in type information that is both precise, and flexible enough to handle structured and semistructured data. We illustrate the use of this calculus by encoding a large fragment of Jaql, including operations and iterators over JSON, embedded SQL expressions, and co-grouping, and show how the encoding directly yields a typing discipline for Jaql as it is, namely without the addition of any type definition or type annotation in the code.

From the conclusion:

On the structural side, the claim is that combining recursive records and pairs by unions, intersections, and negations suffices to capture all possible structuring of data, covering a palette ranging from comprehensions, to heterogeneous lists mixing typed and untyped data, through regular expression types and XML schemas. Therefore, our calculus not only provides a simple way to give a formal semantics to, reciprocally compare, and combine operators of different NoSQL languages, but also offers a means to equip these languages, in their current definition (i.e., without any type definition or annotation), with precise type inference.

With lots of work in between the abstract and conclusion.
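As a toy rendering of that structural claim (my sketch, not the paper’s semantic type-checking algorithm), one can model types as predicates over JSON-like values and build them up with union, intersection, and negation combinators:

```python
# Types as predicates over JSON-like Python values; combinators mirror the
# union/intersection/negation connectives of the calculus.
def record(**fields):
    """Type of dicts that have at least the given typed fields."""
    return lambda v: isinstance(v, dict) and all(
        k in v and t(v[k]) for k, t in fields.items())

union = lambda t1, t2: (lambda v: t1(v) or t2(v))
intersect = lambda t1, t2: (lambda v: t1(v) and t2(v))
negate = lambda t: (lambda v: not t(v))
string = lambda v: isinstance(v, str)
integer = lambda v: isinstance(v, int)

# "A person is a record with a name, and either an age or a birth year."
person = intersect(
    record(name=string),
    union(record(age=integer), record(born=integer)))

print(person({"name": "Ada", "born": 1815}))    # True
print(person({"name": "Ada"}))                  # False: no age or born
```

The paper’s contribution is doing this with *static* inference and far more precision; the sketch only shows how union, intersection, and negation combine record shapes.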

The capacity to combine operators of different NoSQL languages sounds relevant to a topic maps query language.


I first saw this in a tweet by Computer Science.

ArangoDB
Wednesday, December 12th, 2012


From the webpage:

A universal open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient sql-like query language or JavaScript/Ruby extensions.

Design considerations:

In a nutshell:

  • Schema-free schemas with shapes: Inherent structures at hand are automatically recognized and subsequently optimized.
  • Querying: ArangoDB is able to accomplish complex operations on the provided data (query-by-example and query-language).
  • Application Server: ArangoDB is able to act as application server on Javascript-devised routines.
  • Mostly memory/durability: ArangoDB is memory-based including frequent file system synchronizing.
  • AppendOnly/MVCC: Updates generate new versions of a document; automatic garbage collection.
  • ArangoDB is multi-threaded.
  • No indices on file: Only raw data is written on hard disk.
  • ArangoDB supports single nodes and small, homogenous clusters with zero administration.
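The query-by-example style mentioned above can be sketched over plain Python dicts (ArangoDB’s real API differs; this only shows the matching idea):

```python
# Query-by-example: a document matches if it contains every key/value pair
# of the example document.
docs = [
    {"name": "Anna", "city": "Oslo",   "age": 31},
    {"name": "Bob",  "city": "Bergen", "age": 25},
    {"name": "Cara", "city": "Oslo",   "age": 40},
]

def by_example(collection, example):
    """Return documents whose fields include all example key/value pairs."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in example.items())]

print(by_example(docs, {"city": "Oslo"}))   # Anna's and Cara's documents
```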

I have mentioned this before but ran across it again at: An experiment with Vagrant and Neo4J by Patrick Mulder.

NuoDB [Everything in NuoDB is an Atom]

Saturday, November 24th, 2012


I last wrote about NuoDB in February of 2012, it was still in private beta release.

You can now download a free community edition with a limit of two nodes for the usual platforms.

The “under the hood” page reads (in part):

Everything in NuoDB is an Atom

Under the hood, NuoDB is an asynchronous, decentralized, peer-to-peer database. The NuoDB system is also object-oriented. Objects in NuoDB know how to perform various actions that create specific behaviors in the overall database. And at the heart of every object in NuoDB is the Atom. An Atom in NuoDB is like a single bird in a flock.

Atoms are self-describing objects (data and metadata) that together comprise the database. Everything in the NuoDB database is an Atom, including the schema, the indexes, and even the underlying data. For example, each table is an Atom that describes the metadata for the table and can reference other Atoms; such as Atoms that describe ranges of records in the table and their versions.

Atoms are Powerful

Atoms are intelligent, powerful, self-describing objects that together form the NuoDB database. Atoms know how to perform many actions, like these:

  • Atoms know how to make copies of themselves.
  • Atoms keep all copies of themselves up to date.
  • Atoms can broadcast messages. Atoms listen for events and changes from other Atoms.
  • Atoms can request data from other Atoms.
  • Atoms can serialize themselves to persistent storage.
  • Atoms can retrieve data from storage.

The Atoms are the Database

Everything in the database is an Atom, and the Atoms are the database. The Atoms work in concert to form both the Transaction (or Compute) Tier, and the Storage Tier.

A NuoDB Transaction Engine is a process that executes the SQL layer and is comprised completely of Atoms. The Transaction Engine operates on Atoms, listens for changes, and communicates changes with other Transaction Engines in the database.

A NuoDB Storage Manager is simply a special kind of Transaction Engine that allows Atoms to serialize themselves to permanent storage (such as a local disk or Amazon S3, for example).

A NuoDB database can be as simple as a single Transaction Engine and a single Storage Manager, or can be as complex as tens of Transaction Engines and Storage Managers distributed across dozens of computer hosts.
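A loose sketch of the atom idea (mine, not NuoDB’s implementation, with invented names): self-describing objects that make copies of themselves and keep those copies up to date by broadcasting changes to their peers.

```python
# Toy "atoms": each atom knows its peer copies and broadcasts changes to
# them, so all copies stay up to date.
class Atom:
    def __init__(self, atom_id, data=None):
        self.atom_id = atom_id
        self.data = dict(data or {})
        self.peers = []            # other copies of this atom

    def copy(self):
        """Atoms know how to make copies of themselves."""
        twin = Atom(self.atom_id, self.data)
        self.peers.append(twin)
        twin.peers.append(self)
        return twin

    def update(self, key, value):
        self.data[key] = value
        self.broadcast(key, value)

    def broadcast(self, key, value):
        """Tell every peer copy about the change."""
        for peer in self.peers:
            if peer.data.get(key) != value:
                peer.data[key] = value   # peers apply; no re-broadcast loop

table = Atom("table:users", {"rows": 0})
replica = table.copy()
table.update("rows", 1)
print(replica.data["rows"])        # replica saw the broadcast: 1
```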

Some wag, in a report that reminded me to look at NuoDB again, was whining about how NuoDB would perform in query-intensive environments. I guess downloading a free copy to find out was too much effort.

Of course, you would have to define “query intensive” environment and that would be no easy task. Lots of users with simple queries? (Define “lots” and “simple.”)

Just a suspicion as I wait for my download URL to arrive: “query” in an atom-based system may not involve the same internal processes as in a traditional relational database. Or perhaps not entirely.

For example, what if the notion of “retrieval” from a location in memory is no longer operative? That is, what if a query is composed of atoms that begin messaging as they are composed, receiving information before the user reaches the end of the query string?

And more than that, query atoms that occur frequently could be persisted so the creation cost is not incurred in subsequent queries.

Hard to say without knowing more about it but it definitely should be on your short list of products to watch.

10 things never to do with a relational database

Saturday, November 17th, 2012

10 things never to do with a relational database (The data explosion demands new solutions, yet the hoary old RDBMS still rules. Here’s where you really shouldn’t use it) by Andrew C. Oliver.

From the post:

I am a NoSQLer and a big data guy. That’s a nice coincidence, because as you may have heard, data growth is out of control.

Old habits die hard. The relational DBMS still reigns supreme. But even if you’re a dyed-in-the-wool, Oracle-loving, PL/SQL-slinging glutton for the medieval RAC, think twice, think many times, before using your beloved technology for the following tasks.


If you guessed this post is from InfoWorld and that it’s rather ranty, you are right on both counts.

Andrew’s 10 things:

  1. Search
  2. Recommendations
  3. High-frequency trading
  4. Product cataloguing
  5. Users/groups and ACLs
  6. Log analysis
  7. Media repository
  8. Email
  9. Classified ads
  10. Time-series/forecasting

Andrew ducks and covers in his conclusion with:

Can you use the RDBMS for some or many of these? Sure — I have and people continue to. However, is it a good fit? Not really. I expect the cranky old men to disagree, but tradition alone is not a good reason to stick with the old way of doing things.

If you disagree with his assessment, you are by definition a “cranky old man,” and no one wants to be seen as a cranky old man.

Being a “cranky old man” myself, the label doesn’t sting, so I feel free to disagree. 😉

Andrew is right that tradition alone isn’t “a good reason to stick with the old way of doing things.”

On the other hand, the fact that something is new, or that venture capitalists have parted with cash, isn’t a reason to find a new way of doing things.

Your requirements aren’t only technical questions but questions of IT competence to deploy a new solution, training of staff to use a new solution, costs of retraining and construction, and others.

Ignoring the non-technical side of requirements is a step toward acquiring a white elephant to sleep in the middle of your office, interfering with day to day operations.

CodernityDB [Origin of the 3 V’s of Big Data]

Sunday, November 11th, 2012


From the webpage:

CodernityDB: pure Python, NoSQL, fast database

CodernityDB is opensource, pure Python (no 3rd party dependency), fast (really fast check Speed if you don’t believe in words), multiplatform, schema-less, NoSQL database. It has optional support for HTTP server version (CodernityDB-HTTP), and also Python client library (CodernityDB-PyClient) that aims to be 100% compatible with embedded version.

“The hills are alive, with the sound of NoSQL databases….”

Sorry, I usually only sing in the shower. 😉

I haven’t done a statistical survey (that may be in the offing) but it does seem like the stream of NoSQL databases continues unabated.

What I don’t know and you might: Has there always been a rumble of alternative databases, and does looking make them appear larger/more numerous? As in a side-view mirror.

If we can discover what makes NoSQL databases popular now, that may apply to semantic integration.

I don’t buy the 3 V’s, Velocity, Volume, Variety, as an explanation for NoSQL database adoption.

Doug Laney, now of Gartner, Inc., then of Meta Group coined that phrase in “3D Data Management: Controlling Data Volume, Velocity and Variety“, Date: 6 February 2001:*

E-Commerce, in particular, has exploded data management challenges along three dimensions: volumes, velocity, and variety.

I don’t recall a comparable level of interest in NoSQL databases when we faced the same problems in 2001.

So what else has changed? (I don’t know or I would say.)


I was alerted to the origin of the three V’s by a reference to Doug Laney by Stephen Swoyer in Big Data — Why the 3Vs Just Don’t Make Sense and then followed a reference in Big Data (Wikipedia) to find the link I reproduce above.

G-Store: [Multi key Access]

Friday, November 2nd, 2012

G-Store: A Scalable Data Store for Transactional Multi key Access in the Cloud by Sudipto Das, Divyakant Agrawal and Amr El Abbadi.


Cloud computing has emerged as a preferred platform for deploying scalable web-applications. With the growing scale of these applications and the data associated with them, scalable data management systems form a crucial part of the cloud infrastructure. Key-Value stores, such as Bigtable, PNUTS, Dynamo, and their open source analogues, have been the preferred data stores for applications in the cloud. In these systems, data is represented as Key-Value pairs, and atomic access is provided only at the granularity of single keys. While these properties work well for current applications, they are insufficient for the next generation web applications, such as online gaming, social networks, collaborative editing, and many more, which emphasize collaboration. Since collaboration by definition requires consistent access to groups of keys, scalable and consistent multi key access is critical for such applications. We propose the Key Group abstraction that defines a relationship between a group of keys and is the granule for on-demand transactional access. This abstraction allows the Key Grouping protocol to collocate control for the keys in the group to allow efficient access to the group of keys. Using the Key Grouping protocol, we design and implement G-Store which uses a key-value store as an underlying substrate to provide efficient, scalable, and transactional multi key access. Our implementation using a standard key-value store and experiments using a cluster of commodity machines show that G-Store preserves the desired properties of key-value stores, while providing multi key access functionality at a very low overhead.

I encountered this while researching the intrusive keys found in: KitaroDB [intrusive-keyed database]

The notion of a key group that consists of multiple keys, assigned to a single node, seems to me similar to the collection of item identifiers in the TMDM (13250-2) post-merging.

I haven’t gotten into the details but supporting collections of item identifiers on top of scalable key-value stores is an interesting prospect.
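A sketch of the Key Group idea (my simplification, with invented node names): ownership of every key in a group is collocated at a leader node, so a multi-key transaction touches one node instead of coordinating across the cluster.

```python
# Keys normally hash to nodes; forming a Key Group overrides that mapping
# so all keys in the group are controlled by one leader.
NODES = ["node-a", "node-b", "node-c"]

def default_owner(key):
    return NODES[hash(key) % len(NODES)]

group_owner = {}                   # key -> leader node, set per group

def form_group(keys, leader):
    """Key Grouping protocol, abstracted: collocate control at the leader."""
    for k in keys:
        group_owner[k] = leader

def owner(key):
    return group_owner.get(key, default_owner(key))

form_group(["alice", "bob"], leader="node-a")

# A transfer between alice and bob now needs only node-a:
nodes_touched = {owner("alice"), owner("bob")}
print(nodes_touched)               # {'node-a'}
```

The real protocol also handles joining/leaving groups and ownership hand-off safely; the sketch only shows why grouping makes multi-key transactions single-node.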

I created a category for G-Store but had to add “(Multikey)” because in looking for more details on it, I encountered another G-Store that you will be interested in.