## Archive for the ‘NoSQL’ Category

### Why FoundationDB Might Be All It’s Cracked Up To Be

Friday, March 8th, 2013

Why FoundationDB Might Be All It’s Cracked Up To Be by Doug Turnbull.

From the post:

When I first heard about FoundationDB, I couldn’t imagine how it could be anything but vaporware. Seemed like Unicorns crapping happy rainbows to solve all your problems. As I’m learning more about it though, I realize it could actually be something ground breaking.

NoSQL: Let’s Review…

So, I need to step back and explain one reason NoSQL databases have been revolutionary. In the days of yore, we used to normalize all our data across multiple tables on a single database living on a single machine. Unfortunately, Moore’s law eventually crapped out and, maybe more importantly, hard drive space stopped increasing massively. Our data and demands on it only kept growing. We needed to start trying to distribute our database across multiple machines.

Turns out, it’s hard to maintain transactionality in a distributed, heavily normalized SQL database. As such, a lot of NoSQL systems have emerged with simpler features, many promoting a model based around some kind of single row/document/value that can be looked up/inserted with a key. Transactionality for these systems is limited to a single key-value entry (“row” in Cassandra/HBase or “document” in Mongo/Couch — we’ll just call them rows here). Rows are easily stored in a single node, although we can replicate this row to multiple nodes. Despite being replicated, it turns out transactionally working with single rows in distributed NoSQL is easier than guaranteeing transactionality of an SQL query visiting potentially many SQL tables in a distributed system.

There are deep design ramifications/limitations to the transactional nature of rows. First, you always try to cram a lot of data related to the row’s key into a single row, ending up with massive rows of hierarchical or flat data that all relates to the row key. This lets you cover as much data as possible under the row-based transactionality guarantee. Second, as you only have a single key to use from the system, you must choose very wisely what your key will be. You may need to think hard about how your data will be looked up through its whole life; it can be hard to go back. Additionally, if you need to look up on a secondary value, you better hope that your database is friendly enough to have a secondary key feature or otherwise you’ll need to maintain a secondary row for storing the relationship. Then you have the problem of working across two rows, which doesn’t fit in the transactionality guarantee. Third, you might lose the ability to perform a join across multiple rows. In most NoSQL data stores, joining is discouraged and denormalization into large rows is the encouraged best practice.
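
The single-row guarantee and the secondary-index problem above can be sketched in a few lines of Python. This is a hypothetical toy store for illustration only, not any particular product's API:

```python
# Hypothetical toy row store: atomic per key only.
class RowStore:
    def __init__(self):
        self.rows = {}

    def put(self, key, row):      # atomic unit: one key, one row
        self.rows[key] = dict(row)

    def get(self, key):
        return self.rows.get(key)

store = RowStore()
# First: denormalize, cramming everything about the user under one key
# so the single-row guarantee covers as much data as possible.
store.put("user:42", {"name": "Ada", "email": "ada@example.com",
                      "orders": [{"id": 1, "total": 9.99}]})

# Second: looking up by email needs a manually maintained second row...
store.put("email:ada@example.com", {"user_key": "user:42"})
# ...and the two puts are two separate atomic operations, not one
# transaction: a crash between them leaves the rows inconsistent.
```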

FoundationDB Is Different

FoundationDB is a distributed, sorted key-value store with support for arbitrary transactions across multiple key-values — multiple “rows” — in the database.
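
A toy sketch of what makes that different — this illustrates the idea only, not FoundationDB's actual API: a sorted key-value store whose transactions buffer writes and commit them to multiple keys at once:

```python
# Toy sorted key-value store with multi-key transactions.
import bisect

class SortedKV:
    def __init__(self):
        self.keys, self.vals = [], []

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.vals[i]
        return None

    def _set(self, key, val):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.vals[i] = val
        else:
            self.keys.insert(i, key)
            self.vals.insert(i, val)

    def transact(self, fn):
        writes = {}                                   # buffered writes
        read = lambda k: writes[k] if k in writes else self.get(k)
        fn(read, lambda k, v: writes.__setitem__(k, v))
        for k, v in writes.items():                   # all-or-nothing
            self._set(k, v)

db = SortedKV()
db.transact(lambda get, put: (put("acct:a", 100), put("acct:b", 0)))

def transfer(get, put):
    # touches two "rows" in one transaction -- exactly what the
    # single-row stores cannot promise
    put("acct:a", get("acct:a") - 30)
    put("acct:b", get("acct:b") + 30)

db.transact(transfer)
```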

As Doug points out, there is much left to be learned.

Still, exciting to have something new to investigate.

### FoundationDB

Monday, March 4th, 2013

FoundationDB

FoundationDB Beta 1 is now available!

It will take a while to sort out all of its features, etc.

I should mention that it is refreshing that the documentation contains Known Limitations.

All software has limitations but few ever acknowledge them up front.

You have to encounter one before one of the technical folks says: “…yes, we have been meaning to work on that.”

I would rather know up front what the limitations are.

Whether FoundationDB meets your requirements or not, it is good to see that kind of transparency.

### NoSQL is Great, But You Still Need Indexes [MongoDB for example]

Wednesday, February 20th, 2013

NoSQL is Great, But You Still Need Indexes by Martin Farach-Colton.

From the post:

I’ve said it before, and, as is the nature of these things, I’ll almost certainly say it again: your database performance is only as good as your indexes.

That’s the grand thesis, so what does that mean? In any DB system — SQL, NoSQL, NewSQL, PostSQL, … — data gets ingested and organized. And the system answers queries. The pain point for most users is around the speed to answer queries. And the query speed (both latency and throughput, to be exact) depends on how the data is organized. In short: Good Indexes, Fast Queries; Poor Indexes, Slow Queries.

But building indexes is hard work, or at least it has been for the last several decades, because almost all indexing is done with B-trees. That’s true of commercial databases, of MySQL, and of most NoSQL solutions that do indexing. (The ones that don’t do indexing solve a very different problem and probably shouldn’t be confused with databases.)

It’s not true of TokuDB. We build Fractal Tree Indexes, which are much easier to maintain but can still answer queries quickly. So with TokuDB, it’s Fast Indexes, More Indexes, Fast Queries. TokuDB is usually thought of as a storage engine for MySQL and MariaDB. But it’s really a B-tree substitute, so we’re always on the lookout for systems where we can improve the indexing.

Enter MongoDB. MongoDB is beloved because it makes deployment fast. But when you peel away the layers, you get down to a B-tree, with all the performance headaches and workarounds that they necessitate.

That’s the theory, anyway. So we did some testing. We ripped out the part of MongoDB that takes care of secondary indices and plugged in TokuDB. We’ve posted the blogs before, but here they are again, the greatest hits of TokuDB+MongoDB: we show a 10x insertion performance, a 268x query performance, and a 532x (or 53,200% if you prefer) multikey index insertion performance. We also discussed covered indexes vs. clustered Fractal Tree Indexes.

Did somebody declare February 20th to be performance release day?

Did I miss that memo?

Like every geek, I like faster. But, here’s my question:

Have there been any studies on the impact of faster systems on searching and decision making by users?

My assumption is the faster I get a non-responsive result from a search, the sooner I can improve it.

But that’s an assumption on my part.

Is that really true?

### Hypertable Has Reached A Major Milestone

Thursday, February 14th, 2013

Hypertable Has Reached A Major Milestone by Doug Judd.

From the post:

RangeServer Failover

With the release of Hypertable version 0.9.7.0 comes support for automatic RangeServer failover. Hypertable will now detect when a RangeServer has failed, logically remove it from the system, and automatically re-assign the ranges that it was managing to other RangeServers. This represents a major milestone for Hypertable and allows for very large scale deployments. We have been actively working on this feature, full-time, for 1 1/2 years. To give you an idea of the magnitude of the change, here are the commit statistics:

• 441 changed files
• 6,384 line deletions

The reason that this feature has been a long time in the making is that we set a very high standard of quality for it, so that under no circumstance would a RangeServer failure lead to consistency problems or data loss. We’re confident that we’ve achieved 100% correctness under every conceivable circumstance. The two primary goals for the feature, robustness and application transparency, are described below.
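
The failover behavior described above — detect, logically remove, re-assign — can be sketched as a toy master process. This is purely illustrative; Hypertable's real protocol is far more involved:

```python
# Toy master that re-assigns ranges when a RangeServer fails.
class Master:
    def __init__(self, servers):
        self.servers = set(servers)
        self.assignment = {}                     # range -> server

    def load(self, s):
        return sum(1 for v in self.assignment.values() if v == s)

    def assign(self, rng):
        # trivial placement: least-loaded live server
        self.assignment[rng] = min(self.servers, key=self.load)

    def server_failed(self, dead):
        self.servers.discard(dead)               # logically remove it
        for rng, srv in list(self.assignment.items()):
            if srv == dead:
                self.assign(rng)                 # re-assign its ranges

m = Master(["rs1", "rs2", "rs3"])
for rng in ["a-f", "g-m", "n-s", "t-z"]:
    m.assign(rng)
m.server_failed("rs1")
# every range is still owned by a live server
assert all(s in m.servers for s in m.assignment.values())
```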

That is a major milestone!

High-end data processing is becoming as crowded with viable options as low-end data processing. And the “low-end” of data processing keeps getting bigger.

### MarkLogic Announces Free Developer License for Enterprise [With Odd Condition]

Wednesday, February 13th, 2013

MarkLogic Announces Free Developer License for Enterprise

From the post:

MarkLogic Corporation today announced the availability of a free Developer License for MarkLogic Enterprise Edition.

The Developer License provides access to the features available in MarkLogic Enterprise Edition, including integrated search, government-grade security, clustering, replication, failover, alerting, geospatial indexing, conversion, and a suite of application development tools. MarkLogic also announced the Mongo2MarkLogic converter, a Java-based tool for importing data from MongoDB into MarkLogic providing developers immediate access to features needed to build out enterprise-ready big data solutions.

“By providing a free Developer License we enable developers to quickly deliver reliable, scalable and secure information and analytic applications that are production-ready,” said Gary Bloom, CEO and President of MarkLogic. “Many of our customers first experimented with other free NoSQL products, but turned to MarkLogic when they recognized the need for search, security, support for ACID transactions and other features necessary for enterprise environments. Our goal is to eliminate the cost barrier for developers and give them access to the best enterprise NoSQL platform from the start.”

The Developer License for MarkLogic Enterprise Edition includes tools for faster application development, business intelligence (BI) tool integration, analytic functions and visualization tools, and the ability to create user-defined functions for fast and flexible analysis of huge volumes of data.

You would think that story would merit at least one link to the free developer program.

That wasn’t hard. Two links and you have direct access to the topic of the story and the company.

One odd licensing condition:

Q. Can I publish my work done with MarkLogic Server?

A. We encourage you to share your work publicly, but note that you cannot disclose, without MarkLogic’s prior written consent, any performance or capacity statistics or the results of any benchmark test performed on MarkLogic Server.

That sounds just a tad defensive doesn’t it?

I haven’t looked at MarkLogic for a couple of iterations but earlier versions had no need to fear statistics or benchmark tests.

Results vary depending on how testing is done but anyone authorized to recommend or sign acquisition orders should know that.

If they don’t, your organization has more serious problems than needing a MarkLogic server.

### Oracle’s MySQL 5.6 released

Wednesday, February 6th, 2013

Oracle’s MySQL 5.6 released

From the post:

Just over two years after the release of MySQL 5.5, the developers at Oracle have released a GA (General Availability) version of Oracle MySQL 5.6, labelled MySQL 5.6.10. In MySQL 5.5, the developers replaced the old MyISAM backend and used the transactional InnoDB as the default for database tables. With 5.6, the retrofitting of full-text search capabilities has enabled InnoDB to now take on the position of default storage engine for all purposes.

Accelerating the performance of sub-queries was also a focus of development; they are now run using a process of semi-joins and materialise much faster; this means it should not be necessary to replace subqueries with joins. Many operations that change the data structures, such as ALTER TABLE, are now performed online, which avoids long downtimes. EXPLAIN also gives information about the execution plans of UPDATE, DELETE and INSERT commands. Other optimisations of queries include changes which can eliminate table scans where the query has a small LIMIT value.

MySQL’s row-oriented replication now supports “row image control” which only logs the columns needed to identify and make changes on each row rather than all the columns in the changing row. This could be particularly expensive if the row contained BLOBs, so this change not only saves disk space and other resources but it can also increase performance. “Index Condition Pushdown” is a new optimisation which, when resolving a query, attempts to use indexed fields in the query first, before applying the rest of the WHERE condition.

MySQL 5.6 also introduces a “NoSQL interface” which uses the memcached API to offer applications direct access to the InnoDB storage engine while maintaining compatibility with the relational database engine. That underlying InnoDB engine has also been enhanced with persistent optimisation statistics, multithreaded purging and more system tables and monitoring data available.

I mentioned Oracle earlier today (When Oracle bought MySQL [Humor]) so it’s only fair that I point out their most recent release of MySQL.

### SQL, NoSQL =? CoSQL? Category Theory to the Rescue

Wednesday, January 30th, 2013

A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman.

I missed this when it appeared in March of 2011.

From the conclusion:

The nascent noSQL market is extremely fragmented, with many competing vendors and technologies. Programming, deploying, and managing noSQL solutions requires specialized and low-level knowledge that does not easily carry over from one vendor’s product to another.

A necessary condition for the network effect to take off in the noSQL database market is the availability of a common abstract mathematical data model and an associated query language for noSQL that removes product differentiation at the logical level and instead shifts competition to the physical and operational level. The availability of such a common mathematical underpinning of all major noSQL databases can provide enough critical mass to convince businesses, developers, educational institutions, etc. to invest in noSQL.

In this article we developed a mathematical data model for the most common form of noSQL—namely, key-value stores as the mathematical dual of SQL’s foreign-/primary-key stores. Because of this deep and beautiful connection, we propose changing the name of noSQL to coSQL. Moreover, we show that monads and monad comprehensions (i.e., LINQ) provide a common query mechanism for both SQL and coSQL and that many of the strengths and weaknesses of SQL and coSQL naturally follow from the mathematics.
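
The comprehension-as-common-query-mechanism point is easy to demo in Python, whose comprehensions are close cousins of LINQ. The data here is made up, not from the paper:

```python
# Made-up catalog: in a key-value world the ratings point at products
# (child points to parent), the reverse of SQL's foreign keys.
products = {1: {"name": "book", "price": 10},
            2: {"name": "pen", "price": 2}}
ratings = [{"product": 1, "stars": 5},
           {"product": 1, "stars": 4},
           {"product": 2, "stars": 3}]

# one comprehension "joins" both collections -- the same mechanism
# (a monad comprehension) queries SQL tables in LINQ
well_rated = [products[r["product"]]["name"]
              for r in ratings if r["stars"] >= 4]
print(well_rated)   # ['book', 'book']
```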

The ACM Digital Library reports only 3 citations, which is unfortunate for such an interesting proposal.

I have heard about key/value pairs somewhere else. I will have to think about that and get back to you. (Hint for the uninitiated, try the Topic Maps Reference Model (TMRM). A new draft of the TMRM is due to appear in a week or so.)

### 11 Interesting Releases From the First Weeks of January

Thursday, January 24th, 2013

11 Interesting Releases From the First Weeks of January by Alex Popescu.

Alex has collected links for eleven (11) interesting NoSQL releases in January 2013!

Visit Alex’s post. You won’t be disappointed.

### Static and Dynamic Semantics of NoSQL Languages [...Combining Operators...]

Tuesday, December 25th, 2012

Static and Dynamic Semantics of NoSQL Languages (PDF) by Véronique Benzaken, Giuseppe Castagna, Kim Nguyễn and Jérôme Siméon.

Abstract:

We present a calculus for processing semistructured data that spans differences of application area among several novel query languages, broadly categorized as “NoSQL”. This calculus lets users define their own operators, capturing a wider range of data processing capabilities, whilst providing a typing precision so far typical only of primitive hard-coded operators. The type inference algorithm is based on semantic type checking, resulting in type information that is both precise, and flexible enough to handle structured and semistructured data. We illustrate the use of this calculus by encoding a large fragment of Jaql, including operations and iterators over JSON, embedded SQL expressions, and co-grouping, and show how the encoding directly yields a typing discipline for Jaql as it is, namely without the addition of any type definition or type annotation in the code.

From the conclusion:

On the structural side, the claim is that combining recursive records and pairs by unions, intersections, and negations suffices to capture all possible structuring of data, covering a palette ranging from comprehensions, to heterogeneous lists mixing typed and untyped data, through regular expression types and XML schemas. Therefore, our calculus not only provides a simple way to give a formal semantics to, reciprocally compare, and combine operators of different NoSQL languages, but also offers a means to equip these languages, in their current definition (i.e., without any type definition or annotation), with precise type inference.

With lots of work in between the abstract and conclusion.

The capacity to combine operators of different NoSQL languages sounds relevant to a topic maps query language.

Yes?

I first saw this in a tweet by Computer Science.

### ArangoDB

Wednesday, December 12th, 2012

ArangoDB

From the webpage:

A universal open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient sql-like query language or JavaScript/Ruby extensions.

Design considerations:

In a nutshell:

• Schema-free schemas with shapes: Inherent structures at hand are automatically recognized and subsequently optimized.
• Querying: ArangoDB is able to accomplish complex operations on the provided data (query-by-example and query-language).
• Application Server: ArangoDB is able to act as application server on Javascript-devised routines.
• Mostly memory/durability: ArangoDB is memory-based including frequent file system synchronizing.
• AppendOnly/MVCC: Updates generate new versions of a document; automatic garbage collection.
• No indices on file: Only raw data is written on hard disk.
• ArangoDB supports single nodes and small, homogenous clusters with zero administration.

I have mentioned this before but ran across it again at: An experiment with Vagrant and Neo4J by Patrick Mulder.

### NuoDB [Everything in NuoDB is an Atom]

Saturday, November 24th, 2012

NuoDB

I last wrote about NuoDB in February of 2012, when it was still in private beta release.

You can now download a free community edition with a limit of two nodes for the usual platforms.

The “under the hood” page reads (in part):

Everything in NuoDB is an Atom

Under the hood, NuoDB is an asynchronous, decentralized, peer-to-peer database. The NuoDB system is also object-oriented. Objects in NuoDB know how to perform various actions that create specific behaviors in the overall database. And at the heart of every object in NuoDB is the Atom. An Atom in NuoDB is like a single bird in a flock.

Atoms are self-describing objects (data and metadata) that together comprise the database. Everything in the NuoDB database is an Atom, including the schema, the indexes, and even the underlying data. For example, each table is an Atom that describes the metadata for the table and can reference other Atoms; such as Atoms that describe ranges of records in the table and their versions.

Atoms are Powerful

Atoms are intelligent, powerful, self-describing objects that together form the NuoDB database. Atoms know how to perform many actions, like these:

• Atoms know how to make copies of themselves.
• Atoms keep all copies of themselves up to date.
• Atoms can broadcast messages. Atoms listen for events and changes from other Atoms.
• Atoms can request data from other Atoms.
• Atoms can serialize themselves to persistent storage.
• Atoms can retrieve data from storage.

The Atoms are the Database

Everything in the database is an Atom, and the Atoms are the database. The Atoms work in concert to form both the Transaction (or Compute) Tier, and the Storage Tier.

A NuoDB Transaction Engine is a process that executes the SQL layer and is comprised completely of Atoms. The Transaction Engine operates on Atoms, listens for changes, and communicates changes with other Transaction Engines in the database.

A NuoDB Storage Manager is simply a special kind of Transaction Engine that allows Atoms to serialize themselves to permanent storage (such as a local disk or Amazon S3, for example).

A NuoDB database can be as simple as a single Transaction Engine and a single Storage Manager, or can be as complex as tens of Transaction Engines and Storage Managers distributed across dozens of computer hosts.
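
A toy sketch of the "Atoms keep all copies of themselves up to date" behavior — nothing like NuoDB's actual internals, just the broadcast idea:

```python
# Toy "Atom": replicas stay in sync by broadcasting changes.
class Atom:
    def __init__(self, data):
        self.data = dict(data)
        self.peers = []                  # replicas listening for changes

    def replicate(self):
        copy = Atom(self.data)           # "Atoms know how to make copies"
        self.peers.append(copy)
        copy.peers.append(self)
        return copy

    def update(self, key, value):
        self._apply(key, value)
        for p in self.peers:             # broadcast to all copies
            p._apply(key, value)

    def _apply(self, key, value):
        self.data[key] = value

table = Atom({"name": "orders"})
replica = table.replicate()
table.update("row_count", 3)
assert replica.data["row_count"] == 3    # the copy kept itself up to date
```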

Some wag, in a report that reminded me to look at NuoDB again, was whining about how NuoDB would perform in query-intensive environments. I guess downloading a free copy to see was too much effort.

Of course, you would have to define “query intensive” environment and that would be no easy task. Lots of users with simple queries? (Define “lots” and “simple.”)

Just a suspicion as I wait for my download URL to arrive: “query” in an atom-based system may not have the same internal processes as in a traditional relational database. Or perhaps not entirely.

For example, what if the notion of “retrieval” from a location in memory is no longer operative? That is, a query is composed of atoms that begin messaging as they are composed, and so receive information before the user reaches the end of a query string?

And more than that, query atoms that occur frequently could be persisted so the creation cost is not incurred in subsequent queries.

Hard to say without knowing more about it but it definitely should be on your short list of products to watch.

### 10 things never to do with a relational database

Saturday, November 17th, 2012

10 things never to do with a relational database (The data explosion demands new solutions, yet the hoary old RDBMS still rules. Here’s where you really shouldn’t use it) by Andrew C. Oliver.

From the post:

I am a NoSQLer and a big data guy. That’s a nice coincidence, because as you may have heard, data growth is out of control.

Old habits die hard. The relational DBMS still reigns supreme. But even if you’re a dyed-in-the-wool, Oracle-loving, PL/SQL-slinging glutton for the medieval RAC, think twice, think many times, before using your beloved technology for the following tasks.

[ If you aren't going to use an RDBMS, which freaking database should you use? | See InfoWorld's comparative review of NoSQL databases. | Keep up with the latest developer news with InfoWorld's Developer World newsletter. ]

If you guessed this post is from InfoWorld and that it’s rather ranty, you are right on both counts.

Andrew’s 10 things:

1. Search
2. Recommendations
4. Product cataloguing
5. Users/groups and ACLs
6. Log analysis
7. Media repository
8. Email
10. Time-series/forecasting

Andrew ducks and covers in his conclusion with:

Can you use the RDBMS for some or many of these? Sure — I have and people continue to. However, is it a good fit? Not really. I expect the cranky old men to disagree, but tradition alone is not a good reason to stick with the old way of doing things.

If you disagree with his assessment, you are by definition a “cranky old man,” and no one wants to be seen as a cranky old man.

Being a “cranky old man,” the label doesn’t sting so I feel free to disagree.

Andrew is right that tradition alone isn’t “a good reason to stick with the old way of doing things.”

On the other hand, the fact that something is new, or that venture capitalists have parted with cash, isn’t a reason to find a new way of doing things.

Your requirements aren’t only technical questions but questions of IT competence to deploy a new solution, training of staff to use a new solution, costs of retraining and construction, and others.

Ignoring the non-technical side of requirements is a step toward acquiring a white elephant to sleep in the middle of your office, interfering with day to day operations.

### CodernityDB [Origin of the 3 V's of Big Data]

Sunday, November 11th, 2012

CodernityDB

From the webpage:

CodernityDB: pure Python, NoSQL, fast database

CodernityDB is an open source, pure Python (no 3rd party dependency), fast (really fast; check Speed if you don’t believe in words), multiplatform, schema-less, NoSQL database. It has optional support for an HTTP server version (CodernityDB-HTTP), and also a Python client library (CodernityDB-PyClient) that aims to be 100% compatible with the embedded version.

“The hills are alive, with the sound of NoSQL databases….”

Sorry, I usually only sing in the shower.

I haven’t done a statistical survey (that may be in the offing) but it does seem like the stream of NoSQL databases continues unabated.

What I don’t know and you might: Has there always been a rumble of alternative databases, with looking just making them appear larger/more numerous? As in a side-view mirror.

If we can discover what makes NoSQL databases popular now, that may apply to semantic integration.

I don’t buy the 3 V’s, Velocity, Volume, Variety, as an explanation for NoSQL database adoption.

Doug Laney, now of Gartner, Inc., then of Meta Group, coined that phrase in “3D Data Management: Controlling Data Volume, Velocity and Variety”, dated 6 February 2001:

E-Commerce, in particular, has exploded data management challenges along three dimensions: volumes, velocity, and variety.

I don’t recall anything like the current level of interest in NoSQL databases when we faced the same problems in 2001.

So what else has changed? (I don’t know or I would say.)

I was alerted to the origin of the three V’s by a reference to Doug Laney by Stephen Swoyer in Big Data — Why the 3Vs Just Don’t Make Sense and then followed a reference in Big Data (Wikipedia) to find the link I reproduce above.

### G-Store: [Multi key Access]

Friday, November 2nd, 2012

G-Store: A Scalable Data Store for Transactional Multi key Access in the Cloud by Sudipto Das, Divyakant Agrawal and Amr El Abbadi.

Abstract:

Cloud computing has emerged as a preferred platform for deploying scalable web-applications. With the growing scale of these applications and the data associated with them, scalable data management systems form a crucial part of the cloud infrastructure. Key-Value stores, such as Bigtable, PNUTS, Dynamo, and their open source analogues, have been the preferred data stores for applications in the cloud. In these systems, data is represented as Key-Value pairs, and atomic access is provided only at the granularity of single keys. While these properties work well for current applications, they are insufficient for the next generation web applications, such as online gaming, social networks, collaborative editing, and many more, which emphasize collaboration. Since collaboration by definition requires consistent access to groups of keys, scalable and consistent multi key access is critical for such applications. We propose the Key Group abstraction that defines a relationship between a group of keys and is the granule for on-demand transactional access. This abstraction allows the Key Grouping protocol to collocate control for the keys in the group to allow efficient access to the group of keys. Using the Key Grouping protocol, we design and implement G-Store which uses a key-value store as an underlying substrate to provide efficient, scalable, and transactional multi key access. Our implementation using a standard key-value store and experiments using a cluster of commodity machines show that G-Store preserves the desired properties of key-value stores, while providing multi key access functionality at a very low overhead.
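
The Key Group idea — collocating control for a group of keys so multi-key access hits a single owner — can be sketched like this. The names are hypothetical, and the paper's Key Grouping protocol handles far more (concurrent group creation, failures, and so on):

```python
# Toy Key Group: move ownership of a group of keys to one node so a
# multi-key update executes on a single owner.
class Node:
    def __init__(self, name):
        self.name, self.data = name, {}

nodes = [Node("n0"), Node("n1"), Node("n2")]

def owner(key):
    return nodes[hash(key) % len(nodes)]     # default key placement

def form_group(keys):
    leader = owner(keys[0])                  # collocate control here
    for k in keys[1:]:
        src = owner(k)
        if src is not leader and k in src.data:
            leader.data[k] = src.data.pop(k)
    return leader

owner("player:2").data["player:2"] = "idle"
leader = form_group(["player:1", "player:2"])
# both keys now live on one node, so this update is atomic there
leader.data.update({"player:1": "moved", "player:2": "captured"})
```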

I encountered this while researching the intrusive keys found in: KitaroDB [intrusive-keyed database]

The notion of a key group that consists of multiple keys, assigned to a single node, seems similar to me to the collection of item identifiers in the TMDM (13250-2) post merging.

I haven’t gotten into the details but supporting collections of item identifiers on top of scalable key-value stores is an interesting prospect.

I created a category for G-Store but had to add “(Multikey)” because in looking for more details on it, I encountered another G-Store that you will be interested in.

### KitaroDB [intrusive-keyed database]

Thursday, November 1st, 2012

KitaroDB

From the “What is KitaroDB” page:

KitaroDB is a fast, efficient and scalable NoSQL database that operates natively in the WinRT (Windows 8 UI), the Win32 (x86 and x64) and .NET environments. It features:

• Easy-to-use interfaces for C#, VB, C++, C and HTML5/JavaScript developers;
• A proven commercial database system;
• Support of large-sector drives;
• Minimal overhead, consuming less than a megabyte of memory resources;
• Durability as on-disk data store;
• High performance, capable of handling tens of thousands of operations per second;
• Asynchronous and synchronous operations;
• Storage limit of 100 TB;
• Flexibility as either a key/value data store or an intrusive key database with segmented key support.

The phrase “intrusive-keyed database” was unfamiliar.

Keys can be segmented into up to 255 segments with what appears to be a fairly limited range of data types. Some of the KitaroDB documentation on “intrusive-keyed databases” is found here: Creating a KitaroDB database.

Segmented keys aren’t unique to KitaroDB and they offer some interesting possibilities. More to follow on that score.
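
One common way segmented keys are built — a sketch of the general technique, not KitaroDB's actual on-disk format: tag and pack each segment so whole keys sort segment by segment as byte strings, and prefix scans fall out for free:

```python
# Sketch: pack typed segments into one ordered byte-string key.
import struct

def pack_key(*segments):                 # up to 255 segments, say
    out = b""
    for seg in segments:
        if isinstance(seg, str):
            out += b"\x01" + seg.encode("utf-8") + b"\x00"
        elif isinstance(seg, int):
            # non-negative ints only here; big-endian bytes sort
            # the same way the integers do
            out += b"\x02" + struct.pack(">Q", seg)
    return out

k1 = pack_key("sensor", 7, 1000)
k2 = pack_key("sensor", 7, 2000)
k3 = pack_key("sensor", 8, 500)
assert k1 < k2 < k3                      # ordered segment by segment

# a prefix scan over ("sensor", 7) finds k1 and k2 but not k3
prefix = pack_key("sensor", 7)
assert k1.startswith(prefix) and k2.startswith(prefix)
assert not k3.startswith(prefix)
```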

Storage being limited to 100 TB should not be an issue for “middling” data sized applications.

### NoSQL Bibliographic Records:…

Tuesday, October 30th, 2012

From the background:

Using the Library of Congress Bibliographic Framework for the Digital Age as the starting point for software development requirements, the FRBR-Redis-Datastore project is a proof-of-concept for a next-generation bibliographic NoSQL system within the context of improving upon the current MARC catalog and digital repository of a small academic library at a top-tier liberal arts college.

The FRBR-Redis-Datastore project starts with a basic understanding of the MARC, MODS, and FRBR implemented using a NoSQL technology called Redis.

This presentation guides you through the theories and technologies behind one such proof-of-concept bibliographic framework for the 21st century.

Hadoop was just too complicated compared to the simple three-step Redis server set-up.

Refreshing.

Simply because a technology is popular doesn’t mean it meets your requirements. Such as administration by non-full time technical experts.

An Oracle database could support an application managing garden club finances, but it would be a poor choice under most circumstances.

The Redis part of the presentation is apparently not working (I get Python errors) as of today and I have sent a note with the error messages.

A “proof-of-concept” that merits your attention!

### SPARQL and Big Data (and NoSQL) [Identifying Winners and Losers - Cui Bono?]

Saturday, October 27th, 2012

SPARQL and Big Data (and NoSQL) by Bob DuCharme.

From the post:

How to pursue the common ground?

I think it’s obvious that SPARQL and other RDF-related technologies have plenty to offer to the overlapping worlds of Big Data and NoSQL, but this doesn’t seem as obvious to people who focus on those areas. For example, the program for this week’s Strata conference makes no mention of RDF or SPARQL. The more I look into it, the more I see that this flexible, standardized data model and query language align very well with what many of those people are trying to do.

But, we semantic web types can’t blame them for not noticing. If you build a better mouse trap, the world won’t necessarily beat a path to your door, because they have to find out about your mouse trap and what it does better. This requires marketing, which requires talking to those people in language that they understand, so I’ve been reading up on Big Data and NoSQL in order to better appreciate what they’re trying to do and how.

A great place to start is the excellent (free!) booklet Planning for Big Data by Edd Dumbill. (Others contributed a few chapters.) For a start, he describes data that “doesn’t fit the strictures of your database architectures” as a good candidate for Big Data approaches. That’s a good start for us. Here are a few longer quotes that I found interesting, starting with these two paragraphs from the section titled “Ingesting and Cleaning” after a discussion about collecting data from multiple different sources (something else that RDF and SPARQL are good at):
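
Bob's point about collecting data from multiple sources is easy to illustrate in a few lines of Python: in a triple model, merging sources is just set union, and a SPARQL-style query reads as a comprehension. The data here is made up:

```python
# Made-up data: two sources contribute (subject, predicate, object)
# facts; no shared schema is agreed on up front.
crm = [("cust:1", "name", "Ada"), ("cust:1", "city", "Paris")]
web_log = [("cust:1", "last_visit", "2012-10-20"),
           ("cust:2", "name", "Alan")]

triples = set(crm) | set(web_log)        # merging is just set union

# a SPARQL-ish query as a comprehension:
# "names of customers whose city we know"
with_city = {s for s, p, o in triples if p == "city"}
names = sorted(o for s, p, o in triples if p == "name" and s in with_city)
print(names)   # ['Ada']
```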

Bob has a very good point: marketing “…requires talking to those people in language that they understand….”

That is, no matter how “good” we think a solution may be, it won’t interest others until we explain it in terms they “get.”

But “marketing” requires more than a lingua franca.

Once an offer is made and understood, it must interest the other person. Or it is very poor marketing.

We may think that any sane person would jump at the chance to reduce the time and expense of data cleaning. But that isn’t necessarily the case.

I once made a proposal that would substantially reduce the time and expense for maintaining membership records. Records that spanned decades and were growing every year (hard copy). I made the proposal, thinking it would be well received.

Hardly. I was called into my manager’s office and got a lecture on how the department in question had more staff, a larger budget, etc., than any other department. They had no interest whatsoever in my proposal, and I was told not to presume to offer further advice. (Years later my suggestion was adopted when budget issues forced the issue.)

Efficient information flow interested me but not management.

Bob and the rest of us need to ask the traditional question: Cui bono? (To whose benefit?)

Semantic technologies, just like any other, have winners and losers.

To effectively market our wares, we need to identify both.

### Metamarkets open sources distributed database Druid

Friday, October 26th, 2012

Metamarkets open sources distributed database Druid by Elliot Bentley.

From the post:

It’s no secret that the latest challenge for the ‘big data’ movement is moving from batch processing to real-time analysis. Metamarkets, who provide “Data Science-as-a-Service” business analytics, last year revealed details of in-house distributed database Druid – and have this week released it as an open source project.

Druid was designed to solve the problem of a database which allows multi-dimensional queries on data as and when it arrives. The company originally experimented with both relational and NoSQL databases, but concluded they were not fast enough for their needs and so rolled out their own.

The company claims that Druid’s scan speed is “33M rows per second per core”, able to ingest “up to 10K incoming records per second per node”. An earlier blog post outlines how the company managed to achieve scan speeds of 26B records per second using horizontal scaling. It does this via a distributed architecture, column orientation and bitmap indices.
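Bitmap indices are the least mysterious of those three ingredients, and easy to sketch. The fragment below is my own illustration, not Druid code: one bitmap per (dimension, value) pair, so a multi-dimensional filter reduces to bitwise ANDs over integers.

```python
# One bitmap per (dimension, value) pair, stored as an int where
# bit i means "row i has this value". Filtering is then bitwise AND.
rows = [
    {"city": "NY", "device": "mobile"},
    {"city": "SF", "device": "desktop"},
    {"city": "NY", "device": "desktop"},
    {"city": "NY", "device": "mobile"},
]

index = {}
for i, row in enumerate(rows):
    for dim, value in row.items():
        key = (dim, value)
        index[key] = index.get(key, 0) | (1 << i)

# city == "NY" AND device == "mobile" is a single AND over two ints.
hits = index[("city", "NY")] & index[("device", "mobile")]
matching = [i for i in range(len(rows)) if hits >> i & 1]
# matching → [0, 3]
```

Scanning millions of rows per second per core becomes plausible when a filter is a handful of word-wide AND operations instead of a per-row predicate.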

Now to see whether Druid is in fact that exciting!

Source code: https://github.com/metamx/druid

### Redis 2.6.2 Released!

Friday, October 26th, 2012

Redis 2.6.2 Released!

From the introduction to Redis:

Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.

You can run atomic operations on these types, like appending to a string; incrementing the value in a hash; pushing to a list; computing set intersection, union and difference; or getting the member with highest ranking in a sorted set.

In order to achieve its outstanding performance, Redis works with an in-memory dataset. Depending on your use case, you can persist it either by dumping the dataset to disk every once in a while, or by appending each command to a log.

Redis also supports trivial-to-setup master-slave replication, with very fast non-blocking first synchronization, auto-reconnection on net split and so forth.

Other features include a simple check-and-set mechanism, pub/sub and configuration settings to make Redis behave like a cache.

You can use Redis from most programming languages out there.

Redis is written in ANSI C and works in most POSIX systems like Linux, *BSD and OS X without external dependencies. Linux and OS X are the two operating systems where Redis is developed and tested the most, and we recommend using Linux for deploying. Redis may work in Solaris-derived systems like SmartOS, but the support is best effort. There is no official support for Windows builds, although you may have some options.
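For reference, the two persistence options described in the excerpt correspond to redis.conf directives along these lines (the thresholds are illustrative, not recommendations):

```conf
# Snapshotting (RDB): dump the dataset to disk if at least 1000
# keys changed within 60 seconds.
save 60 1000

# Append-only file (AOF): log every write command instead, and
# fsync once per second as a safety/speed middle ground.
appendonly yes
appendfsync everysec
```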

The “in-memory” nature of Redis will be a good excuse for more local RAM.

I noticed the most recent release of Redis at Alex Popescu’s myNoSQL.

### Spanner – …SQL Semantics at NoSQL Scale

Monday, October 22nd, 2012

Spanner – …SQL Semantics at NoSQL Scale by Todd Hoff.

From the post:

A lot of people seem to passionately dislike the term NewSQL, or pretty much any newly coined term for that matter, but after watching Alex Lloyd, Senior Staff Software Engineer Google, give a great talk on Building Spanner, that’s the term that fits Spanner best.

Spanner wraps the SQL + transaction model of OldSQL around the reworked bones of a globally distributed NoSQL system. That seems NewSQL to me.

As Spanner is a not so distant cousin of BigTable, the NoSQL component should be no surprise. Spanner is charged with spanning millions of machines inside any number of geographically distributed datacenters. What is surprising is how OldSQL has been embraced. In an earlier 2011 talk given by Alex at the HotStorage conference, the reason for embracing OldSQL was the desire to make it easier and faster for programmers to build applications. The main ideas will seem quite familiar:

• There’s a false dichotomy between little complicated databases and huge, scalable, simple ones. We can have features and scale them too.
• Complexity is conserved, it goes somewhere, so if it’s not in the database it’s pushed to developers.
• Push complexity down the stack so developers can concentrate on building features, not databases, not infrastructure.
• Keys for creating a fast-moving app team: ACID transactions; global Serializability; code a 1-step transaction, not 10-step workflows; write queries instead of code loops; joins; no user defined conflict resolution functions; standardized sync; pay as you go, get what you pay for predictable performance.

Spanner did not start out with the goal of becoming a NewSQL star. Spanner started as a BigTable clone, with a distributed file system metaphor. Then Spanner evolved into a global ProtocolBuf container. Eventually Spanner was pushed by internal Google customers to become more relational and application programmer friendly.

If you can’t stay for the full show, Todd provides a useful summary of the video. But if you have the time, take it and enjoy the presentation!

### Big data cube

Sunday, October 14th, 2012

Big data cube by John D. Cook.

From the post:

Erik Meijer’s paper Your Mouse is a Database has an interesting illustration of “The Big Data Cube” using three axes to classify databases.

Enjoy John’s short take, then spend some time with Erik’s paper.

Some serious time with Erik’s paper.

You won’t be disappointed.

### Distributed Algorithms in NoSQL Databases

Wednesday, October 10th, 2012

Distributed Algorithms in NoSQL Databases by Ilya Katsov.

From the post:

Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and many other capabilities. It sounds like a big umbrella, and it is. Although it can hardly be said that the NoSQL movement brought fundamentally new techniques into distributed data processing, it triggered an avalanche of practical studies and real-life trials of different combinations of protocols and algorithms. These developments gradually highlight a system of relevant database building blocks with proven practical efficiency. In this article I’m trying to provide a more or less systematic description of techniques related to distributed operations in NoSQL databases.

In the rest of this article we study a number of distributed activities, like replication or failure detection, that could happen in a database. These activities, highlighted in bold below, are grouped into three major sections:

• Data Consistency. Historically, NoSQL paid a lot of attention to tradeoffs between consistency, fault-tolerance and performance to serve geographically distributed systems, low-latency or highly available applications. Fundamentally, these tradeoffs spin around data consistency, so this section is devoted to data replication and data repair.
• Data Placement. A database should accommodate itself to different data distributions, cluster topologies and hardware configurations. In this section we discuss how to distribute or rebalance data in such a way that failures are handled rapidly, persistence guarantees are maintained, queries are efficient, and system resources like RAM or disk space are used evenly throughout the cluster.
• System Coordination. Coordination techniques like leader election are used in many databases to implement fault-tolerance and strong data consistency. However, even decentralized databases typically track their global state, detect failures and topology changes. This section describes several important techniques that are used to keep the system in a coherent state.
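As one concrete example from the Data Placement bucket, here is a toy consistent-hash ring in Python. It is my sketch, not code from the article, but it shows the property the rebalancing discussion relies on: removing a node only reassigns the keys that node owned.

```python
import bisect
import hashlib

def _pos(key: str) -> int:
    # Stable hash for placement; md5 is fine here, this is not security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring. A key belongs to the first virtual
    node at or after its position on the ring, wrapping around."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_pos(f"{node}:{v}"), node)
            for node in nodes
            for v in range(vnodes)
        )
        self._positions = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self._positions, _pos(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")  # deterministic, one of the three nodes
```

Keys whose owner survives a membership change keep their owner, so only the departed node's arc of the ring has to move.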

Slow going but well worth the effort.

Not the issues discussed in the puff-piece webinars extolling NoSQL solutions to “big data.”

I first saw this at Christophe Lalanne’s A bag of tweets / September 2012.

### Could Cassandra be the first breakout NoSQL database?

Thursday, October 4th, 2012

Could Cassandra be the first breakout NoSQL database? by Chris Mayer.

From the post:

Years of misunderstanding haven’t been kind to the NoSQL database. Aside from the confusing name (generally understood to mean ‘not only SQL’), there’s always been an air of reluctance from the enterprise world to move away from Oracle’s steady relational database, until there was a definite need to switch from tables to documents.

The emergence of Big Data in the past few years has been the kickstart NoSQL distributors needed. Relational databases cannot cope with the sheer amount of data coming in and can’t provide the immediacy large-scale enterprises need to obtain information.

Open source offerings have been lurking in the background for a while, with the highly-tunable Apache Cassandra becoming a community favourite quickly. Emerging from the incubator in October 2011, Cassandra’s beauty lies in its flexible schema, its hybrid data model (lying somewhere between a key-value and tabular database) and also through its high availability. Being from the Apache Software Foundation, there’s also intrinsic links to the big data ‘kernel’ Apache Hadoop, and search server Apache Solr giving users an extra dimension to their data processing and storage.

Using NoSQL on cheap servers for processing and querying data is proving an enticing option for companies of all sizes, especially in combination with MapReduce technology to crunch it all.

One company that appears to be leading this data-driven charge is DataStax, who this week announced the completion of a $25 million C round of funding. Having already permeated the environments of some large companies (notably Netflix), the San Mateo startup are making big noises about their enterprise platform, melding the worlds of Cassandra and Hadoop together. Netflix is a client worth crowing about, with DataStax’s enterprise option being used as one of their primary data stores.

Chris mentions some other potential players, MongoDB comes to mind, along with the Hadoop crowd.

I take the move from tables to documents as a symptom of a deeper issue.

Relational databases rely on normalization to achieve their performance and reliability. So what happens if data is too large or coming too quickly to be normalized?

Relational databases remain the weapon of choice for normalized data but that doesn’t mean they work well with “dirty” data.

“Dirty data,” as opposed to “documents,” seems to catch the real shift for which NoSQL solutions are better adapted.

Your results are only as good as the data, but at least you know that up front. Not when you realize your “normalized” data wasn’t.

That has to be a sinking feeling.

### Accumulo: Why The World Needs Another NoSQL Database

Tuesday, September 4th, 2012

Accumulo: Why The World Needs Another NoSQL Database by Jeff Kelly.

From the post:

If you’ve been unable to keep up with all the competing NoSQL databases that have hit the market over the last several years, you’re not alone. To name just a few, there’s HBase, Cassandra, MongoDB, Riak, CouchDB, Redis, and Neo4J.

To that list you can add Accumulo, an open source database originally developed at the National Security Agency. You may be wondering why the world needs yet another database to handle large volumes of multi-structured data. The answer is, of course, that none of these NoSQL databases has yet checked all the feature/functionality boxes that most enterprises require before deploying a new technology.

In the Big Data world, that means the ability to handle the three V’s (volume, variety and velocity) of data, the ability to process multiple types of workloads (analytical vs. transactional), and the ability to maintain ACID (atomicity, consistency, isolation and durability) compliance at scale. With each new NoSQL entrant, hope springs eternal that this one will prove the NoSQL messiah.

So what makes Accumulo different than all the rest? According to proponents, Accumulo is capable of maintaining consistency even as it scales to thousands of nodes and petabytes of data; it can both read and write data in near real-time; and, most importantly, it was built from the ground up with cell-level security functionality.

It’s the third feature – cell-level security – that has the Big Data community most excited. Accumulo is being positioned as an all-purpose Hadoop database and a competitor to HBase. While HBase, like Accumulo, is able to scale to thousands of machines while maintaining a relatively high level of consistency, it was not designed with any security, let alone cell-level security, in mind.

The current security documentation on Accumulo reads (in part):

Accumulo extends the BigTable data model to implement a security mechanism known as cell-level security. Every key-value pair has its own security label, stored under the column visibility element of the key, which is used to determine whether a given user meets the security requirements to read the value. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while preserving data confidentiality.

Security labels consist of a set of user-defined tokens that are required to read the value the label is associated with. The set of tokens required can be specified using syntax that supports logical AND and OR combinations of tokens, as well as nesting groups of tokens together.
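To see how those label expressions behave, here is a toy Python evaluator for Accumulo-style visibility expressions. It illustrates the semantics only; Accumulo's actual implementation is Java and parses labels properly rather than leaning on eval().

```python
import re

def authorized(expression: str, tokens: set) -> bool:
    """Return True if a user holding `tokens` may read a value
    labeled with `expression` (supports &, | and parentheses)."""
    # Replace each token with True/False depending on membership,
    # then evaluate the resulting boolean expression. eval() is
    # acceptable for a sketch, not for production.
    py = re.sub(r"\w+", lambda m: str(m.group(0) in tokens), expression)
    return eval(py.replace("&", " and ").replace("|", " or "))

# One row can mix labels: a value readable by C alone next to a
# value readable only by users holding both A and B.
print(authorized("(A&B)|C", {"C"}))   # → True
print(authorized("A&B", {"A"}))       # → False
```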

If that sounds impressive, realize that:

• Users can overwrite data they cannot see, unless you set the table visibility constraint.
• Users can avoid the table visibility constraint, using the bulk import method. (Which you can also disable.)

More secure than a completely insecure solution but nothing to write home about, yet.

Can you imagine the complexity that is likely to be exhibited in an inter-agency context for security labels?

BTW, how do I determine the semantics of a proposed security label? What if it conflicts with another security label?

I first saw this at Alex Popescu’s myNoSQL.

### MongoDB-as-a-service for private clouds rolled out by ScaleGrid, in MongoDirector

Monday, July 23rd, 2012

MongoDB-as-a-service for private clouds rolled out by ScaleGrid, in MongoDirector by Chris Mayer.

From the post:

Of all the NoSQL databases emerging at the moment, there appears to be one constant discussion taking place – are you using MongoDB?

It appears to be the open source, document-oriented NoSQL database solution of choice, mainly due to its high performance nature, its dynamism and its similarities to the JSON data structure (in BSON). Despite being written in C++, it is attracting attention from developers of different creeds. Its enterprise level features have helped a fair bit in its charge up the rankings to leading NoSQL database, with it being the ideal datastore for highly scalable environments. Just a look at the latest in-demand skills on Indeed.com shows you that 10gen’s flagship product has infiltrated the enterprise well and truly.

Quite often, an enterprise can find the switch from SQL to NoSQL daunting and needs a helping hand. Due to this, many MongoDB-related products are arriving just as quickly as MongoDB converts. The latest to launch as a public beta is MongoDirector from Seattle start-up ScaleGrid. MongoDirector offers an end-to-end lifecycle manager for MongoDB to guide newcomers along.

I don’t have anything negative to say about MongoDB but I’m not sure the discussion of NoSQL solutions is quite as one-sided as Chris seems to think.

The Indeed.com site is a fun one to play around with but I would not take the numbers all that seriously. For one thing, it doesn’t appear to control for duplicate job ads posted in different sources, for example. But that’s a nitpicking objection.

A more serious one is when you start to explore the site and discover the top three job titles for IT.

Care to guess what they are? Would you believe they don’t have anything to do with databases or MongoDB?

At least as of today, and I am sure it changes over time, Graphic Designer, Technical Writer, and Project Manager all rank higher than Data Analyst, where you would hope to find some MongoDB jobs. (Information Technology Industry – 23 July 2012)

BTW, for your amusement, when I was looking for information on database employment, I encountered Database Administrators, from the Bureau of Labor Statistics in the United States. The data is available for download as XLS files.

The site says blanks on the maps are from lack of data. I suspect the truth is there are no database administrators in Wyoming. Or at least I could point to the graphic as some evidence for my claim.

I think you need to consider the range of database options, from very traditional SQL vendors to bleeding edge No/New/Maybe/SQL solutions, including MongoDB. The question is which one meets your requirements, whether flavor of the month or no.

### Thinking in Datomic: Your data is not square

Tuesday, July 10th, 2012

Thinking in Datomic: Your data is not square by Pelle Braendgaard.

From the post:

Datomic is so different than regular databases that your average developer will probably choose to ignore it. But for the developer and startup who takes the time to understand it properly I think it can be a real unfair advantage as a choice for a data layer in your application.

In this article I will deal with the core fundamental definition of how data is stored in Datomic. This is very different from all other databases so before we even deal with querying and transactions I think it’s a good idea to look at it.

Yawn, “your data is not square.” Just teasing.

But we have all heard the criticism of relational tables. I think writers can assume that much, at least in technical forums.

The lasting value of the NoSQL movement (in addition to whichever software packages survive) will be its emphasis on analysis of your data. Your data may fit perfectly well into a square but you need to decide that after looking at your data, not before.

The same can be said about the various NoSQL offerings. Your data may or may not be suited for a particular NoSQL option. The data analysis “cat being out of the bag,” it should be applied to NoSQL options as well. True, almost any option will work, your question should be why is option X the best option for my data/use case?

### Intro to HBase Internals and Schema Design

Tuesday, July 10th, 2012

Intro to HBase Internals and Schema Design by Alex Baranau.

You will be disappointed by the slide that reads:

HBase will not adjust cluster settings to optimal based on usage patterns automatically.

Sorry, but we just aren’t quite to drag-n-drop software that optimizes to arbitrary data without user intervention.

Not sure we could keep that secret from management very long in any case so perhaps all for the best.

Once you get over your chagrin at still having to work, you will find Alex’s presentation a high-level peek at the internals of HBase. It should be enough to get you motivated to learn more on your own. Not guaranteeing that, but it should be the average result.

### Intro to HBase [Augmented Marketing Opportunities]

Tuesday, July 10th, 2012

Intro to HBase by Alex Baranau.

Slides from a presentation Alex did on HBase for a meetup in New York City.

Fairly high level overview but one of the better ones.

Should leave you with a good orientation to HBase and its capabilities.

Just in case you are looking for a project, it would be interesting to point into a slide deck like this one with links into tutorials and documentation for the product.

Thinking of the old “hub” document concept from HyTime so you would not have to hard code links in the source but could update them as newer material comes along.

Just in case you need some encouragement, think of every slide deck as an augmented marketing opportunity. Where you are leveraging not only the presentation but the other documentation and materials created by your group.

### Implementing Aggregation Functions in MongoDB

Tuesday, June 26th, 2012

Implementing Aggregation Functions in MongoDB by Arun Viswanathan and Shruthi Kumar.

From the post:

With the amount of data that organizations generate exploding from gigabytes to terabytes to petabytes, traditional databases are unable to scale up to manage such big data sets. Using these solutions, the cost of storing and processing data will significantly increase as the data grows. This is resulting in organizations looking for other economical solutions such as NoSQL databases that provide the required data storage and processing capabilities, scalability and cost effectiveness. NoSQL databases do not use SQL as the query language. There are different types of these databases, such as document stores, key-value stores, graph databases, object databases, etc.

Typical use cases for NoSQL databases include archiving old logs, event logging, ecommerce application logs, gaming data, social data, etc., due to their fast read-write capability. The stored data then needs to be processed to gain useful insights on customers and their usage of the applications.

The NoSQL database we use in this article is MongoDB, an open source document-oriented NoSQL database system written in C++. It provides high-performance document-oriented storage as well as support for writing MapReduce programs to process data stored in MongoDB documents. It is easily scalable and supports auto partitioning. MapReduce can be used for aggregation of data through batch processing. MongoDB stores data in BSON (Binary JSON) format, supports a dynamic schema and allows for dynamic queries. The Mongo Query Language is expressed as JSON and is different from the SQL queries used in an RDBMS. MongoDB provides an Aggregation Framework that includes utility functions such as count, distinct and group. However, more advanced aggregation functions such as sum, average, max, min, variance and standard deviation need to be implemented using MapReduce.

This article describes the method of implementing common aggregation functions like sum, average, max, min, variance and standard deviation on a MongoDB document using its MapReduce functionality. Typical applications of aggregations include business reporting of sales data such as calculation of total sales by grouping data across geographical locations, financial reporting, etc.

Not terribly advanced but enough to get you started with creating aggregation functions.

Includes “testing” of the aggregation functions that are written in the article.
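The map/reduce/finalize pattern the authors implement in MongoDB's JavaScript can be sketched in plain Python; the names and data below are my own illustration, not the article's code.

```python
def map_reduce(docs, mapper, reducer, finalizer=None):
    """Group mapper emissions by key, reduce each group, then
    optionally finalize: the shape MongoDB's mapReduce uses."""
    groups = {}
    for doc in docs:
        for key, value in mapper(doc):
            groups.setdefault(key, []).append(value)
    results = {k: reducer(k, vs) for k, vs in groups.items()}
    if finalizer:
        results = {k: finalizer(k, v) for k, v in results.items()}
    return results

# Variance per region: emit count, sum and sum of squares, add them
# up in the reducer, derive mean and variance in the finalizer.
def mapper(doc):
    a = doc["amount"]
    yield doc["region"], {"n": 1, "sum": a, "sumsq": a * a}

def reducer(key, values):
    acc = {"n": 0, "sum": 0, "sumsq": 0}
    for v in values:
        for field in acc:
            acc[field] += v[field]
    return acc

def finalizer(key, v):
    mean = v["sum"] / v["n"]
    return {"mean": mean, "variance": v["sumsq"] / v["n"] - mean ** 2}

sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 20},
    {"region": "west", "amount": 30},
]
stats = map_reduce(sales, mapper, reducer, finalizer)
# stats["east"] → {"mean": 15.0, "variance": 25.0}
```

The reducer must be associative (partial sums can be re-reduced), which is exactly the constraint MongoDB places on reduce functions.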

If Python is more your cup of tea, see: Aggregation in MongoDB (part 1) and Aggregation in MongoDB (part 2).

### NoSQL Standards [query languages - tuples anyone?]

Sunday, June 10th, 2012

Andrew Oliver writes at InfoWorld: The time for NoSQL standards is now – Like Larry Ellison’s yacht, the RDBMS is sailing into the sunset. But if NoSQL is to take its place, a standard query language and APIs must emerge soon.

A bit dramatic for my taste but a good overview of possible areas for standardization for NoSQL.

Problem: NoSQL query languages are tied to the base format/data structure of their implementation.

For that matter, you could say the same thing about SQL. The query language is tied to the data structure.

I am not sure how you can have a query language that isn’t tied to some notion of structure, even a very abstract one that a NoSQL implementation could map against its own data structure.

Tuples anyone?
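To make "tuples anyone?" a little more concrete, here is a minimal sketch (mine, not Andrew's) of pattern-based querying over tuples, the abstract structure a standard query language could target regardless of the store underneath:

```python
def match(pattern, tuples):
    """Return variable bindings for every tuple matching the pattern.
    Items starting with '?' are variables; anything else must match
    literally. A repeated variable must bind to the same value."""
    results = []
    for t in tuples:
        binding = {}
        for p, v in zip(pattern, t):
            if p.startswith("?"):
                if binding.get(p, v) != v:
                    break
                binding[p] = v
            elif p != v:
                break
        else:
            results.append(binding)
    return results

triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "age", "42"),
]
print(match(("?x", "knows", "?y"), triples))
# → [{'?x': 'alice', '?y': 'bob'}, {'?x': 'bob', '?y': 'carol'}]
```

A key-value store, a document store, or a triple store could each map its native structure onto patterns like these, which is roughly what a store-neutral query standard would require.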

Pointers and resources welcome!