Archive for the ‘NoSQL’ Category

NoSQL Now! 2015

Wednesday, June 3rd, 2015

NoSQL Now! 2015

There is a strong graph track but if your interests lie elsewhere, you won’t be disappointed!

BTW, register by July 17, 2015 for a 20% discount off the standard price. (That gets the full event below $500. For three days in San Jose? That’s a real bargain.)

Cell Stores

Monday, May 18th, 2015

Cell Stores by Ghislain Fourny.

Abstract:

Cell stores provide a relational-like, tabular level of abstraction to business users while leveraging recent database technologies, such as key-value stores and document stores. This allows to scale up and out the efficient storage and retrieval of highly dimensional data. Cells are the primary citizens and exist in different forms, which can be explained with an analogy to the state of matter: as a gas for efficient storage, as a solid for efficient retrieval, and as a liquid for efficient interaction with the business users. Cell stores were abstracted from, and are compatible with the XBRL standard for importing and exporting data. The first cell store repository contains roughly 200GB of SEC filings data, and proves that retrieving data cubes can be performed in real time (the threshold acceptable by a human user being at most a few seconds).

Github: http://github.com/28msec/cellstore

Demonstration with 200 GB of SEC data.

Tutorial: An Introduction To The Cell Store REST API.

From the tutorial:

Cell stores are a new paradigm of databases. It is decoupled from XBRL and has a data model of its own, yet it natively supports XBRL as a file format to exchange data between cell stores.

Traditional relational databases are focused on tables. Document stores are focused on trees. Triple stores are focused on graphs. Well, cell stores are focused on cells. Cells are units of data and also called facts, measures, etc. Think of taking an Excel spreadsheet and a pair of scissors, and of splitting the sheet into its cells. Put these cells in a bag. Pour some more cells that come from other spreadsheets. Many. Millions of cells. Billions of cells. Trillions of cells. You have a cell store.

Why is it so important to store all these cells in a single, big bag? That’s because the main use case for cell stores is the ability to query data across filings. Cell stores are very good at this. They were designed from day one to do this.

Cell stores are very good at reconstructing tables in the presence of highly dimensional data. The idea behind this is based on hypercubes and is called NoLAP (NoSQL Online Analytical Processing). NoLAP extends the OLAP paradigm by removing hypercube rigidity and letting users generate their own hypercubes on the fly on the same pool of cells.

For business users, all of this is completely transparent and hidden. The look and feel of a cell store, in the end, is that of a spreadsheet like Excel. If you are familiar with the pivot table functionality of Excel, cell stores will be straightforward to understand. Also the underlying XBRL is hidden.

XBRL is to cell store what the inside format of .xlsx files is to Excel. How many of us have tried to unzip and open an Excel file with a text editor for any other reason than mere curiosity? The same goes for cell stores.

Forget about the complexity of XBRL. Get things done with your data.
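
To make the “bag of cells” idea concrete, here is a minimal sketch in Python. It is not the Cell Store REST API: the Cell class, the dimension names (concept, entity, period), the sample values and the query function are all my own assumptions for illustration. A hypercube is modeled simply as a set of allowed values per dimension.

```python
from dataclasses import dataclass

# Hypothetical model, not the 28msec implementation: a cell is one value
# plus the dimension/value pairs that give it meaning.
@dataclass(frozen=True)
class Cell:
    dimensions: tuple  # sorted (dimension, value) pairs
    value: float

def make_cell(value, **dimensions):
    """Build a cell from a value and keyword dimensions."""
    return Cell(tuple(sorted(dimensions.items())), value)

def query(bag, **hypercube):
    """Return the cells that satisfy every constraint in the hypercube.

    Each constraint maps a dimension name to a set of allowed values.
    Dimensions not mentioned in the hypercube are left unconstrained.
    """
    result = []
    for cell in bag:
        dims = dict(cell.dimensions)
        if all(dims.get(d) in allowed for d, allowed in hypercube.items()):
            result.append(cell)
    return result

# Made-up cells, as if cut out of several spreadsheets and poured together.
bag = [
    make_cell(100.0, concept="Assets", entity="ACME", period="2013"),
    make_cell(120.0, concept="Assets", entity="ACME", period="2014"),
    make_cell(40.0, concept="Equity", entity="ACME", period="2014"),
    make_cell(75.0, concept="Assets", entity="Initech", period="2014"),
]

# A hypercube defined on the fly: Assets for 2014, across all entities.
assets_2014 = query(bag, concept={"Assets"}, period={"2014"})
```

The point of the sketch is only that the same pool of cells can answer any hypercube a user cares to define, without a table layout fixed in advance.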

The promise of a better user interface alone should be enough to attract attention to this proposal. Yet, so far as I can find, there hasn’t been a lot of use/discussion of it.

I do wonder about this statement in the paper:

When many people define their own taxonomy, this often ends up in redundant terminology. For example, someone might use the term Equity and somebody else Capital. When either querying cells with a hypercube, or loading cells into a spreadsheet, a mapping can be applied so that this redundant terminology is transparent. This way, when a user asks for Equity, (i) she will also get the cells having the concept Capital, (ii) and it will be transparent to her because the Capital concept is overridden with the expected value Equity.

In part because it omits the obvious case of conflicting terminology. That is, we both want to use “German” as a term, but I mean the language and you mean the nationality. In one well-known graph database the answer depends on which one of us gets there first. Poor form in my opinion.

Mapping can handle different terms for the same subject but how do we maintain that? Where do I look to discover the reason(s) underlying the mapping? Moreover, in the conflicting case, how do I distinguish otherwise opaque terms that are letter for letter identical?
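
A rough sketch of what I have in mind, in Python. Nothing here comes from the paper: the subject identifiers, the idea of qualifying a term with a context, and the recorded reasons are all hypothetical. It only shows that a mapping can merge different terms for one subject, keep letter-for-letter identical terms apart, and still say why.

```python
# Hypothetical sketch: terms are resolved to subject identifiers, and every
# mapping records why it exists. None of these names come from the paper.
MAPPINGS = {
    # (term, context) -> (subject id, reason for the mapping)
    ("Equity", "finance"): ("subj:equity", "canonical concept"),
    ("Capital", "finance"): ("subj:equity", "filer taxonomy uses Capital for Equity"),
    ("German", "language"): ("subj:german-language", "ISO 639 code de"),
    ("German", "nationality"): ("subj:german-nationality", "citizen of Germany"),
}

def resolve(term, context):
    """Return the subject id for a term in a context, or None if unmapped."""
    entry = MAPPINGS.get((term, context))
    return entry[0] if entry else None

def same_subject(a, b):
    """Two (term, context) pairs match only if they resolve to one subject."""
    sa, sb = resolve(*a), resolve(*b)
    return sa is not None and sa == sb

def why(term, context):
    """Look up the recorded reason behind a mapping, or None if unmapped."""
    entry = MAPPINGS.get((term, context))
    return entry[1] if entry else None
```

With this, same_subject(("Equity", "finance"), ("Capital", "finance")) holds, while the two uses of “German” stay distinct, and why("Capital", "finance") returns the reason someone recorded for the mapping.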

There may be answers as I delve deeper into the documentation but those are some topic map issues that stood out for me on a first read.

Comments?

New Non-Meaningful NoSQL Benchmark

Tuesday, April 14th, 2015

New NoSQL benchmark: Cassandra, MongoDB, HBase, Couchbase by Jon Jensen.

From the post:

Today we are pleased to announce the results of a new NoSQL benchmark we did to compare scale-out performance of Apache Cassandra, MongoDB, Apache HBase, and Couchbase. This represents work done over 8 months by Josh Williams, and was commissioned by DataStax as an update to a similar 3-way NoSQL benchmark we did two years ago.

If you can guess the NoSQL database used by DataStax, then you already know the results of the benchmark test.

Amazing how that works isn’t it? I can’t think of a single benchmark test sponsored by a vendor that shows a technology option, other than their own, would be the better choice.

Technology vendors aren’t like Progressive where you can get competing quotes for automobile insurance.

Technology vendors are convinced that with just enough effort, your problem can be tamed to be met by their solution.

I won’t bother to list the one hundred and forty odd (140+) NoSQL databases that did not appear in this benchmark or use cases that would challenge the strengths and weaknesses of each one. Unless benchmarking is one of your use cases, ask vendors for performance characteristics based on your use cases. You will be less likely to be disappointed.

Figures Don’t Lie, But Liars Can Figure

Thursday, March 12th, 2015

A pair of posts that you may find amusing on the question of “free” and “cheaper.”

HBase is Free but Oracle NoSQL Database is cheaper

When does “free” challenge that old adage, “You get what you pay for”?

Two brief quotes from the first post set the stage:

How can Oracle NoSQL Database be cheaper than “free”? There’s got to be a catch. And of course there is, but it’s not where you are expecting. The problem in that statement isn’t with “cheaper” it’s with “free”.

An HBase solution isn’t really free because you do need hardware to run your software. And when you need to scale out, you have to look at how well the software scales. Oracle NoSQL Database scales much better than HBase which translated in this case to needing much less hardware. So, yes, it was cheaper than free. Just be careful when somebody says software is free.

The second post tries to remove the vendor (Oracle) from the equation:

Read-em and weep …. NOT according to Oracle, HBase does not take advantage of SSD’s anywhere near the extent with which Oracle NoSQL does … couldn’t even use the same scale on the vertical bar.

SanDisk on HBase with SSD

SanDisk on Oracle NoSQL with SSD

And so the question remains, when does “free” challenge the old adage “you get what you pay for”, because in this case, the adage continues to hold up.

And as the second post notes, Oracle has committed code back to the HBase product so it isn’t unfamiliar to them.

First things first, the difficulty that leads to these spats is using “cheap,” “free,” “scalable,” “NoSQL,” etc. as the basis for IT marketing or decision making. That may work with poorer IT decision makers, but however happy it makes the marketing department, it is just noise. Noise that is a disservice to IT consumers.

Take “cheaper,” and “free” as used in these posts. Is hardware really the only cost associated with HBase or Oracle installations? If it is, I have been severely misled.

On the HBase expense side I would expect to find HBase DBAs, maintenance of those personnel, hardware (+maintenance), programmers, along with use case requirements that must be met.

On the Oracle expense side I would expect to find Oracle DBAs, maintenance of those personnel, Oracle software licensing, hardware (+maintenance), programmers, along with use case requirements that must be met.

Before you jump on my listing of “Oracle software licensing,” consider how choosing HBase instead will impact the availability of appropriate personnel, the amount of training needed to introduce new IT staff to HBase, etc.

Not to come down too hard on the side of Oracle: Oracle DBAs and their maintenance aren’t cheap, nor are some of the “features” of Oracle software.

Truth be told there is a role for project requirements, experience of current IT personnel, influence IT has over the decision makers, and personal friendships of decision makers in any IT decision making.

To be very blunt, IT decision making is just as political as any other enterprise decision.

Numbers are a justification for a course chosen for other reasons. As a user I am always more concerned with my use cases being met than numbers. Aren’t you?

Hot Cloud Swap: Migrating a Database Cluster with Zero Downtime

Tuesday, December 23rd, 2014

Hot Cloud Swap: Migrating a Database Cluster with Zero Downtime by Jennifer Rullmann.

By now, you may have heard about, seen, or even tried your hand against the fault tolerance of our database. The Key-Value Store, and the layers that turn it into a multi-model database, handle a wide variety of disasters with ease. In this real-time demo video, we show off the ability to migrate a cluster to a new set of machines with zero downtime.

We’re calling this feature ‘hot cloud swap’, because although you can use it on your own machines, it’s particularly interesting to those who run their database in the cloud and may want to switch providers. And that’s exactly what I do in the video. Watch me migrate a database cluster from Digital Ocean to Amazon Web Services in under 7 minutes, real-time!

It’s been years but I can remember as a sysadmin switching out “hot swappable” drives. Never lost any data but there was always that moment of doubt during the rebuild.

Personally I would have more than one complete and tested backup, to the extent that is possible, before trying a “hot cloud swap.” That may be overly cautious but better cautious than crossing into the “Sony Zone.”

At one point Jennifer says:

“…a little bit of hesitation but it worked it out.”

Difficult to capture but if you look at time marker 06.52.85 on the clock below the left hand window, writes start failing.

It recovers but it is not the case that the application never stops. At least in the sense of writes. Depends on your definition of “stops” I suppose.

I am sure that the fault tolerance built into FoundationDB made this less scary but the “hot swap” part should be doable with any clustering solution. Yes?

That is, you add “new” machines to the cluster, then exclude the “old” machines from the cluster, which results in a complete transfer of data to the “new” machines. At that point you create new coordinators and eventually shut down the “old” machines. Is there something about that process that is unique to FoundationDB?
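
To show how generic those steps are, here is a toy simulation in Python. It is not FoundationDB’s tooling or API: the Cluster class, the machine names, the one-copy-per-key placement, and the choice to make the new machines the coordinators are all assumptions made up for the sketch.

```python
def hash_key(key):
    """Deterministic stand-in for a placement hash (hypothetical)."""
    return sum(key.encode())

# Toy model of the generic procedure, not any real database's tooling.
class Cluster:
    def __init__(self, machines):
        self.data = {m: set() for m in machines}  # machine -> keys it holds
        self.excluded = set()
        self.coordinators = set(machines)

    def active(self):
        """Machines that are still allowed to receive data."""
        return sorted(m for m in self.data if m not in self.excluded)

    def put(self, key):
        # Place each key on one active machine (no replication in this toy).
        targets = self.active()
        self.data[targets[hash_key(key) % len(targets)]].add(key)

    def add(self, machine):
        self.data[machine] = set()

    def exclude(self, machine):
        """Stop placing data on a machine and move what it holds elsewhere."""
        self.excluded.add(machine)
        moving, self.data[machine] = self.data[machine], set()
        for key in moving:
            self.put(key)

    def set_coordinators(self, machines):
        self.coordinators = set(machines)

    def remove(self, machine):
        """Shut a machine down; only safe once it is empty and not a coordinator."""
        assert not self.data[machine], "machine still holds data"
        assert machine not in self.coordinators, "machine is still a coordinator"
        del self.data[machine]
        self.excluded.discard(machine)

    def all_keys(self):
        return set().union(*self.data.values())

def hot_swap(cluster, old, new):
    for m in new:            # 1. add the "new" machines
        cluster.add(m)
    for m in old:            # 2. exclude the "old" ones; data drains off them
        cluster.exclude(m)
    cluster.set_coordinators(new)  # 3. create new coordinators
    for m in old:            # 4. shut down the "old" machines
        cluster.remove(m)

old, new = ["do-1", "do-2", "do-3"], ["aws-1", "aws-2", "aws-3"]
cluster = Cluster(old)
for i in range(100):
    cluster.put(f"key-{i}")
before = cluster.all_keys()
hot_swap(cluster, old, new)
```

The toy ignores everything that makes the real thing hard (replication, in-flight writes, failures mid-transfer), which is exactly where I would expect products to differ.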

Don’t get me wrong, I am hoping to learn a great deal more about FoundationDB in the new year but I intensely dislike distinctions between software packages that have no basis in fact.

NoSQL Data Modelling (Jan Steemann)

Monday, December 22nd, 2014

From the description:

Learn about data modelling in a NoSQL environment in this half-day class.

Even though most NoSQL databases follow the “schema-free” data paradigma, what a database is really good at is determined by its underlying architecture and storage model.

It is therefore important to choose a matching data model to get the best out of the underlying database technology. Application requirements such as consistency demands also need to be considered.

During the half-day, attendees will get an overview of different data storage models available in NoSQL databases. There will also be hands-on examples and experiments using key/value, document, and graph data structures.

No prior knowledge of NoSQL databases is required. Some basic experience with relational databases (like MySQL) or data modelling will be helpful but is not essential. Participants will need to bring their own laptop (preferably Linux or MacOS). Installation instructions for the required software will be sent out prior to the class.

Great lecture on beginning data modeling for NoSQL.

What I haven’t encountered is a war story approach to data modeling. That is, a book or series of lectures that iterates over data modeling problems encountered in practice, what considerations were taken into account, and the solution decided upon. A continuing series of annual volumes with great indexing would make a must-have series for any SQL or NoSQL DBA.

Jan mentions http://www.nosql-database.org/ as a nearly comprehensive NoSQL database information site. And it nearly is. Nearly because it currently omits Weaver (Graph Store) under graph databases. If you notice other omissions, please forward them to edlich@gmail.com. Maintaining a current list of resources is exhausting work.

Pinned Tabs: myNoSQL

Thursday, October 30th, 2014

Alex Popescu & Ana-Maria Bacalu have added a new feature at myNoSQL called “Pinned Tabs.”

The feature started on 28 Oct. 2014 and consists of very short descriptions (2-3 sentences) with links on NoSQL, BigData, etc. topics.

Today’s “pinned tabs” included:

03: If you don’t test for the possible failures, you might be in for a surprise. Stripe has tried a more organized chaos monkey attack and discovered a scenario in which their Redis cluster is losing all the data. They’ll move to Amazon RDS PostgreSQL. From an in-memory smart key-value engine to a relational database.

Game Day Exercises at Stripe: Learning from kill -9

04: How a distributed database should really behave in front of massive failures. Netflix recounts their recent experience of having 218 Cassandra nodes rebooted without losing availability. At all.

How Netflix Handled the Reboot of 218 Cassandra Nodes

Curated news saves time and attention span!

Enjoy!

Understanding weak isolation is a serious problem

Wednesday, September 17th, 2014

Understanding weak isolation is a serious problem by Peter Bailis.

From the post:

Modern transactional databases overwhelmingly don’t operate under textbook “ACID” isolation, or serializability. Instead, these databases—like Oracle 11g and SAP HANA—offer weaker guarantees, like Read Committed isolation or, if you’re lucky, Snapshot Isolation. There’s a good reason for this phenomenon: weak isolation is faster—often much faster—and incurs fewer aborts than serializability. Unfortunately, the exact behavior of these different isolation levels is difficult to understand and is highly technical. One of 2008 Turing Award winner Barbara Liskov’s Ph.D. students wrote an entire dissertation on the topic, and, even then, the definitions we have still aren’t perfect and can vary between databases.

To put this problem in perspective, there’s a flood of interesting new research that attempts to better understand programming models like eventual consistency. And, as you’re probably aware, there’s an ongoing and often lively debate between transactional adherents and more recent “NoSQL” upstarts about related issues of usability, data corruption, and performance. But, in contrast, many of these transactional adherents and the research community as a whole have effectively ignored weak isolation—even in a single server setting and despite the fact that literally millions of businesses today depend on weak isolation and that many of these isolation levels have been around for almost three decades.

That debates are occurring without full knowledge of the issues at hand isn’t all that surprising. Or as Job 38:2 (KJV) puts it: “Who is this that darkeneth counsel by words without knowledge?”

Peter raises a number of questions and points to resources that are good starting points for investigation of weak isolation.
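
For a taste of why this matters, here is a toy simulation in Python of write skew, a well-known anomaly that Snapshot Isolation permits and serializability does not. This is not how any particular database implements isolation: the Store and Txn classes and the on-call rule are hypothetical, and the only conflict check is first-committer-wins on rows written by both transactions.

```python
# Toy snapshot isolation: each transaction reads from a private snapshot and
# commit fails only on a write-write conflict (first committer wins).
class Store:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.version = {k: 0 for k in rows}  # bumped on every committed write

    def begin(self):
        return Txn(self)

class Txn:
    def __init__(self, store):
        self.store = store
        self.snapshot = dict(store.rows)   # values as of transaction start
        self.seen = dict(store.version)    # versions as of transaction start
        self.writes = {}

    def read(self, key):
        return self.writes.get(key, self.snapshot[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        for key in self.writes:
            if self.store.version[key] != self.seen[key]:
                return False  # someone else committed a write to this row
        for key, value in self.writes.items():
            self.store.rows[key] = value
            self.store.version[key] += 1
        return True

def go_off_call(txn, doctor):
    """Application rule: a doctor may leave only if another is still on call."""
    on_call = [d for d in ("alice", "bob") if txn.read(d)]
    if len(on_call) > 1:
        txn.write(doctor, False)

store = Store({"alice": True, "bob": True})
t1, t2 = store.begin(), store.begin()
go_off_call(t1, "alice")   # sees both on call, so alice may leave
go_off_call(t2, "bob")     # same snapshot, so bob may leave too
committed = (t1.commit(), t2.commit())
```

Both transactions commit because they wrote different rows, yet the store ends up with nobody on call. Run one after the other, the second would have seen the first’s write and stayed put.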

What sort of weak isolation does your topic map storage mechanism use?

I first saw this in a tweet by Justin Sheehy.

On Lowered Expectations:…

Monday, June 9th, 2014

On Lowered Expectations: Transactions, Scaling, and Honesty by Jennifer Rullmann.

Jennifer reviews Ted Dunning’s list of what developers should demand from database vendors and then adds one more:

But I think the most important thing that developers need from database vendors is missing: honesty. I spent 30 minutes yesterday on a competitor’s website just trying to figure out if they really do ACID, and after four months in the industry, I know quite a bit more about what to look for than most application developers. It’s ridiculous how hard it is to figure out even the most basic things about the vast majority of databases on the market. I feel really strongly about this, so I’ll say it again:

The number one thing we need from database vendors is honesty.

(emphasis in original)

I am sure there are vendors who invent definitions of “hyperedge” and claim to support Unicode when they really support “tick-Unicode,” that is a Unicode character preceded by a “`.”

Beyond basic honesty, I read Jennifer’s complaint as being about the lack of good documentation for database offerings. A lack that is well known.

I doubt developers are suddenly going to start writing high quality documentation for their software. Or at least after decades of not doing so, it seems unlikely.

But that doesn’t mean we are doomed to bad documentation. What if a database vendor decided to document databases comparable to their own? Not complete, not a developer’s guide, but ferreting out and documenting basic information about comparable databases.

Like support for ACID.

Would take time to develop the data and credibility, but in the long run, whose product would you trust more?

A vendor whose database capabilities are hidden behind smoke and mirrors or a vendor who is honest about themselves and others?

Back to the future of databases

Saturday, May 10th, 2014

Back to the future of databases by Yin Wang.

From the post:

Why do we need databases? What a stupid question. I already heard some people say. But it is a legitimate question, and here is an answer that not many people know.

First of all, why can’t we just write programs that operate on objects? The answer is, obviously, we don’t have enough memory to hold all the data. But why can’t we just swap out the objects to disk and load them back when needed? The answer is yes we can, but not in Unix, because Unix manages memory as pages, not as objects. There are systems who lived before Unix that manage memory as objects, and perform object-granularity persistence. That is a feature ahead of its time, and is until today far more advanced than the current state-of-the-art. Here are some pictures of such systems:

Certainly thought provoking, but how much of an advantage would object-granularity persistence have to offer before it could make headway against the installed base of Unix?

The database field is certainly undergoing rapid research and development, with no clear path to a winner.

Will the same happen with OSes?

Phoenix: Incubating at Apache!

Sunday, January 12th, 2014

Phoenix: Incubating at Apache!

From the webpage:

Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.

Tired of reading already and just want to get started? Take a look at our FAQs, listen to the Phoenix talks from Hadoop Summit 2013 and HBaseCon 2013, and jump over to our quick start guide here.

To see what’s supported, go to our language reference. It includes all typical SQL query statement clauses, including SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, etc. It also supports a full set of DML commands as well as table creation and versioned incremental alterations through our DDL commands. We try to follow the SQL standards wherever possible.

Incubating at Apache is no guarantee of success but it does mean sane licensing and a merit based organization/process.

If you are interested in non-NSA corrupted software, consider supporting the Apache Software Foundation.

Codd’s Relational Vision…

Thursday, December 12th, 2013

Codd’s Relational Vision – Has NoSQL Come Full Circle? by Doug Turnbull.

From the post:

Recently, I spoke at NoSQL Matters in Barcelona about database history. As somebody with a history background, I was pretty excited to dig into the past, beyond the hype and marketing fluff, and look specifically at what technical problems each generation of database solved and where they in-turn fell short.

However, I got stuck at one moment in time I found utterly fascinating: the original development of relational databases. So much of the NoSQL movement feels like a rebellion against the “old timey” feeling relational databases. So I thought it would be fascinating to be a contrarian, to dig into what value relational databases have added to the world. Something everyone thinks is obvious but nobody really understands.

It’s very easy and popular to criticize relational databases. What folks don’t seem to do is go back and appreciate how revolutionary relational databases were when they came out. We forget what problems they solved. We forget how earlier databases fell short, and how relational databases solved the problems of the first generation of databases. In short, relational databases were the noSomething, and I aimed to find out what that something was.

And from that apply those lessons to today’s NoSQL databases. Are today’s databases repeating mistakes of the past? Or are they filling an important niche (or both?).

This is a must read article if you are not choosing databases based on marketing hype.

It’s nice to hear IT history taken seriously.

UnQLite

Thursday, December 12th, 2013

UnQLite

From the webpage: http://unqlite.org/features.html#self_contained

UnQLite is an in-process software library which implements a self-contained, serverless, zero-configuration, transactional NoSQL database engine. UnQLite is a document store database similar to MongoDB, Redis, CouchDB etc. as well as a standard Key/Value store similar to BerkeleyDB, LevelDB, etc.

UnQLite is an embedded NoSQL (Key/Value store and Document-store) database engine. Unlike most other NoSQL databases, UnQLite does not have a separate server process. UnQLite reads and writes directly to ordinary disk files. A complete database with multiple collections, is contained in a single disk file. The database file format is cross-platform, you can freely copy a database between 32-bit and 64-bit systems or between big-endian and little-endian architectures. UnQLite features includes:

Does this have the look and feel of a “…just like a Camry…” commercial? 😉

In case you have been under a rock: 2013 Toyota Camry TV Commercial – “Remote”

Still, you may find it meets your requirements better than others.

InfluxDB

Thursday, December 5th, 2013

InfluxDB

From the webpage:

An open-source, distributed, time series, events, and metrics database with no external dependencies.

Time Series

Everything in InfluxDB is a time series that you can perform standard functions on like min, max, sum, count, mean, median, percentiles, and more.

Metrics

Scalable metrics that you can collect on any interval, computing rollups on the fly later. Track 100 metrics or 1 million, InfluxDB scales horizontally.

Events

InfluxDB’s data model supports arbitrary event data. Just write in a hash of associated data and count events, uniques, or grouped columns on the fly later.

The overview page gives some greater detail:

When we built Errplane, we wanted the data model to be flexible enough to store events like exceptions along with more traditional metrics like response times and server stats. At the same time we noticed that other companies were also building custom time series APIs on top of a database for analytics and metrics. Depending on the requirements these APIs would be built on top of a regular SQL database, Redis, HBase, or Cassandra.

We thought the community might benefit from the work we’d already done with our scalable backend. We wanted something that had the HTTP API built in that would scale out to billions of metrics or events. We also wanted something that would make it simple to query for downsampled data, percentiles, and other aggregates at scale. Our hope is that once there’s a standard API, the community will be able to build useful tooling around it for data collection, visualization, and analysis.

While phrased as tracking server stats and events, I suspect InfluxDB would be just as happy tracking other types of stats or events.

Say, for instance, the “I’m alive” messages your cellphone sends to the local towers.
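
The “rollups on the fly” and downsampling described above can be sketched in a few lines of Python. This is not InfluxDB’s query language or HTTP API: the rollup function, its parameters, and the sample data are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean, median

def rollup(points, interval, fn=mean):
    """Downsample (timestamp, value) points into fixed-width time buckets.

    Returns a list of (bucket_start, aggregate) pairs sorted by bucket start.
    Any function that takes a list of values can serve as the aggregate.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % interval].append(value)
    return [(start, fn(values)) for start, values in sorted(buckets.items())]

# Response times in ms, one reading every 20 seconds (made-up data).
points = [(0, 110), (20, 130), (40, 120), (60, 300), (80, 100), (100, 200)]

per_minute_mean = rollup(points, 60)
per_minute_max = rollup(points, 60, max)
per_minute_median = rollup(points, 60, median)
```

Nothing in the function cares whether the values are response times, exception counts, or tower pings, which is the point of treating everything as a time series.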

I first saw this in Nat Torkington’s Four short links: 5 November 2013.

N*SQL Matters @Barcelona, Spain Slides!

Thursday, December 5th, 2013

N*SQL Matters @Barcelona, Spain Slides!

Slides for today but videos are said to be coming soon!

By Title:

  • API Analytics with Redis and Bigquery, Javier Ramirez view the slides
  • ArangoDB – a different approach to NoSQL, Lucas Dohmen view the slides
  • Big Memory Scale-in vs. Scale-out, Niklas Bjorkman view the slides
  • Bringing NoSQL to your mobile!, Patrick Heneise view the slides
  • Building information systems using rapid application development methods, Michel Müller view the slides
  • A call for sanity in NoSQL, Nathan Marz view the slides
  • Cicerone: A Real-Time social venue recommender, Daniel Villatoro view the slides
  • Database History from Codd to Brewer and Beyond, Doug Turnbull view the slides
  • DynamoDB – on-demand NoSQL scaling as a service, Steffen Krause view the slides
  • Getting down and dirty with Elasticsearch, Clinton Gormley view the slides
  • Harnessing the Internet of Things with NoSQL, Michael Hausenblas view the slides
  • How to survive in a BASE world, Uwe Friedrichsen view the slides
  • Introduction to Graph Databases, Stefan Armbruster view the slides
  • A Journey through the MongoDB Internals, Christian Kvalheim view the slides
  • Killing pigs and saving Danish bacon with Riak, Joel Jacobsen view the slides
  • Lambdoop, a framework for easy development of Big Data applications, Rubén Casado view the slides
  • NoSQL Infrastructure, David Mytton view the slides
  • Realtime visitor analysis with Couchbase and Elasticsearch, Jeroen Reijn view the slides
  • SAMOA: A Platform for Mining Big Data Streams, Gianmarco De Francisci Morales view the slides
  • Splout SQL: Web-latency SQL View for Hadoop, Iván de Prado view the slides
  • Sprayer: low latency, reliable multichannel messaging for Telefonica Digital, Pablo Enfedaque and Javier Arias view the slides

By Presenter:

    • Armbruster, Stefan – Introduction to Graph Databases view the slides
    • Bjorkman, Niklas – Big Memory – Scale-in vs. Scale-out view the slides
    • Casado, Rubén – Lambdoop, a framework for easy development of Big Data applications view the slides
    • Dohmen, Lucas – ArangoDB – a different approach to NoSQL view the slides
    • Enfedaque, Pablo and Javier Arias – Sprayer: low latency, reliable multichannel messaging for Telefonica Digital view the slides
    • Friedrichsen, Uwe – How to survive in a BASE world view the slides
    • Gormley, Clinton – Getting down and dirty with Elasticsearch view the slides
    • Hausenblas, Michael – Harnessing the Internet of Things with NoSQL view the slides
    • Heneise, Patrick – Bringing NoSQL to your mobile! view the slides
    • Jacobsen, Joel – Killing pigs and saving Danish bacon with Riak view the slides
    • Krause, Steffen – DynamoDB – on-demand NoSQL scaling as a service view the slides
    • Kvalheim, Christian – A Journey through the MongoDB Internals view the slides
    • Marz, Nathan – A call for sanity in NoSQL view the slides
    • Morales, Gianmarco De Francisci – SAMOA: A Platform for Mining Big Data Streams view the slides
    • Müller, Michel – Building information systems using rapid application development methods view the slides
    • Mytton, David – NoSQL Infrastructure view the slides
    • Prado, Iván de – Splout SQL: Web-latency SQL View for Hadoop view the slides
    • Ramirez, Javier – API Analytics with Redis and Bigquery view the slides
    • Reijn, Jeroen – Realtime visitor analysis with Couchbase and Elasticsearch view the slides
    • Turnbull, Doug – Database History from Codd to Brewer and Beyond view the slides
    • Villatoro, Daniel – Cicerone: A Real-Time social venue recommender view the slides

    I will update these with the videos when they are posted.

    Enjoy!

    HyperDex 1.0RC5

    Wednesday, November 20th, 2013

    HyperDex 1.0RC5 by Robert Escriva.

    From the post:

    We are proud to announce HyperDex 1.0.rc5, the next generation NoSQL data store that provides ACID transactions, fault-tolerance, and high-performance. This new release has a number of exciting features:

    • Improved cluster management. The cluster will automatically grow as new nodes are added.
    • Backup support. Take backups of the coordinator and daemons in a consistent state and be able to restore the cluster to the point when the backup was taken.
    • An admin library which exposes performance counters for tracking cluster-wide statistics relating to HyperDex
    • Support for HyperLevelDB. This is the first HyperDex release to use HyperLevelDB, which brings higher performance than Google’s LevelDB.
    • Secondary indices. Secondary indices improve the speed of search without the overhead of creating a subspace for the indexed attributes.
    • New atomic operations. Most key-based operations now have conditional atomic equivalents.
    • Improved coordinator stability. This release introduces an improved coordinator that fixes a few stability problems reported by users.

    Binary packages for Debian 7, Ubuntu 12.04-13.10, Fedora 18-19, and CentOS 6 are available on the HyperDex Download page, as well as source tarballs for other Linux platforms.

    BTW, HyperDex has a cool logo:

    HyperDex

    Good logos are like good book covers, they catch the eye of potential customers.

    A book sale starts when a customer picks a book up, hence the need for a good cover.

    What sort of cover does your favorite semantic application have?

    OhmDB

    Friday, November 15th, 2013

    OhmDB

    Billed as:

    The Irresistible Database for Java Combining Great RDBMS and NoSQL Features.

    Supposed to appear by the end of November 2013, so it isn’t clear yet whether SQL and NoSQL are about to be joined by Irresistible as a database category. 😉

    The following caught my eye:

    Very fast joins with graph-based relations

    A single join has O(1) time complexity. A combination of multiple joins is internally processed as graph traversal with smart query optimization.

    Without details, “very fast” has too wide a range of meanings to be very useful.
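
For what it is worth, here is one possible reading of the “O(1) join” claim, sketched in Python. I have no details on how OhmDB does it, so this is an assumption and not its documented design: each record stores a direct reference to its related record, and following the relation is a single dereference instead of a search. The classes and sample data are hypothetical.

```python
# One possible reading of "O(1) join", not OhmDB's documented design.
class Author:
    def __init__(self, name):
        self.name = name

class Book:
    def __init__(self, title, author):
        self.title = title
        self.author = author  # direct reference, fixed when the row is stored

def join_by_reference(book):
    """Follow the stored reference: constant time per row."""
    return (book.title, book.author.name)

def join_by_scan(title, author_id, authors):
    """Value-based join without an index: linear in the number of authors."""
    for aid, name in authors:
        if aid == author_id:
            return (title, name)
    return None

codd = Author("E. F. Codd")
book = Book("The Relational Model for Database Management", codd)
```

Both functions return the same pair for the same data. Note that the cost has not disappeared in the first case: it was paid when the reference was stored, and a relational system with a suitable index is not stuck with the scan either.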

    I don’t agree with the evaluation of Performance for RDBMS as “Limited.” People keep repeating that as a truism, when the performance of any data store depends upon the architecture, data model, caching, etc.

    I saw a performance test recently that depended upon (hopefully) a misunderstanding of one of the subjects of comparison. No surprise that it did really poorly in the comparison.

    On the other hand, I am looking forward to the release of OhmDB as an early holiday surprise!

    PS: I did subscribe to the newsletter on the theory that enough legitimate email might drown out the spam I get.

    List of NoSQL Databases (150 at present count)

    Wednesday, October 30th, 2013

    List of NoSQL Databases (150 at present count)

    A tweet by John Troon pointed me to the current NoSQL listing at NoSQL-database.org with 150 entries.

    Is there a betting pool on how many more will appear by May 1, 2014?

    Just curious.

    An In-Depth Look at Modern Database Systems

    Sunday, October 27th, 2013

    An In-Depth Look at Modern Database Systems by C. Mohan.

    Abstract:

    This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source, commercial and research systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented.

    This is a revised version of a tutorial presented first at the 39th International Conference on Very Large Databases (VLDB2013) in Riva del Garda, Italy in August 2013. This is also a follow up to my EDBT2013 keynote talk “History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla” (see the paper at http://bit.ly/NoSQLp)

    Latest Bibliography.

    The one thing I have not found for this tutorial is a video!

    While the tutorial is highly enjoyable (from my perspective), a detailed analysis of the database platforms and the ideas they missed or incorporated would be even more valuable.

    It is one thing to say generally that an idea was missed and quite another to obtain agreement on that point.

    A series of workshops documenting the intellectual history of databases would go a long way to hastening progress, as opposed to proliferation of wheels.

    WhiteDB

    Friday, October 25th, 2013

    WhiteDB

    From the webpage:

    WhiteDB is a lightweight NoSQL database library written in C, operating fully in main memory. There is no server process. Data is read and written directly from/to shared memory, no sockets are used between WhiteDB and the application program.

    This looks like fun!

    Not to mention being a reason to bump up to more memory. 😉

    Design Patterns for Distributed…

    Sunday, September 29th, 2013

    Design Patterns for Distributed Non-Relational Databases by Todd Lipcon.

    A bit dated (2009) but true design patterns should find refinement, not retirement.

    Covers:

    • Consistent Hashing
    • Consistency Models
    • Data Models
    • Storage Layouts
    • Log-Structured Merge Trees

    Curious if you would suggest substantial changes to these patterns some four (4) years later?
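
    As a refresher on the first pattern in that list, here is a minimal sketch of consistent hashing with virtual nodes. It is my own illustration, not code from the slides; the class name `HashRing`, the `vnodes` parameter, and the choice of SHA-256 are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: a key belongs to the first node position
    found clockwise from the key's own hash."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual positions per physical node
        self._keys = []        # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
        return int(digest, 16)

    def add(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owner:
                continue  # position already taken; skip (vanishingly rare)
            bisect.insort(self._keys, pos)
            self._owner[pos] = node

    def remove(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if self._owner.get(pos) == node:
                del self._owner[pos]
                self._keys.remove(pos)

    def lookup(self, key):
        if not self._keys:
            raise LookupError("ring is empty")
        # First position strictly after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._owner[self._keys[idx]]
```

    The property that makes the pattern worth keeping: when a node joins, the only keys that change owner are the ones that move to the new node.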

    Sparkey

    Thursday, September 19th, 2013

    Sparkey

    From the webpage:

    Sparkey is an extremely simple persistent key-value store. You could think of it as a read-only hashtable on disk and you wouldn’t be far off. It is designed and optimized for some server side usecases at Spotify but it is written to be completely generic and makes no assumptions about what kind of data is stored.

    Some key characteristics:

    • Supports data sizes up to 2^63 – 1 bytes.
    • Supports iteration, get, put, delete
    • Optimized for bulk writes.
    • Immutable hash table.
    • Any amount of concurrent independent readers.
    • Only allows one writer at a time per storage unit.
    • Cross platform storage file.
    • Low overhead per entry.
    • Constant read startup cost
    • Low number of disk seeks per read
    • Support for block level compression.
    • Data agnostic, it just maps byte arrays to byte arrays.

    What it’s not:

    • It’s not a distributed key value store – it’s just a hash table on disk.
    • It’s not a compacted data store, but that can be implemented on top of it, if needed.
    • It’s not robust against data corruption.

    The usecase we have for it at Spotify is serving data that rarely gets updated to users or other services. The fast and efficient bulk writes makes it feasible to periodically rebuild the data, and the fast random access reads makes it suitable for high throughput low latency services. For some services we have been able to saturate network interfaces while keeping cpu usage really low.
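
    To make the “read-only hashtable on disk” idea concrete, here is a rough sketch of a bulk-written log file plus an index of byte offsets. It is not Sparkey’s file format or API: the function names and the record layout are my own assumptions, and the index here lives in memory rather than in a file of its own.

```python
import os
import struct

# Assumed record layout: 4-byte key length, 4-byte value length, key, value.
_HEADER = struct.Struct("<II")


def write_log(path, items):
    """Bulk-write (key, value) byte pairs; return a key -> offset index."""
    index = {}
    with open(path, "wb") as f:
        for key, value in items:
            index[key] = f.tell()  # remember where this record starts
            f.write(_HEADER.pack(len(key), len(value)))
            f.write(key)
            f.write(value)
    return index


def read_value(path, index, key):
    """Random-access read: one index lookup, one seek, one short read."""
    offset = index.get(key)
    if offset is None:
        return None
    with open(path, "rb") as f:
        f.seek(offset)
        key_len, value_len = _HEADER.unpack(f.read(_HEADER.size))
        f.seek(key_len, os.SEEK_CUR)  # skip over the stored key
        return f.read(value_len)
```

    Because the log is never modified after it is written, any number of readers can share it without locking, which fits the “rebuild periodically, read constantly” use described above.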

    If you are looking for a very high-performance key-value store with little to no frills, your search may be over.

    Its origin at Spotify and its ability to saturate network interfaces bode well for those needing pure performance.

    I first saw this in Nat Torkington’s Four short links: 10 September 2013.

    Havalo [NoSQL for Small Data]

    Tuesday, September 17th, 2013

    Havalo

    From the webpage:

    A zero configuration, non-distributed NoSQL key-value store that runs in any Servlet 3.0 compatible container.

    Sometimes you just need fast NoSQL storage, but don’t need full redundancy and scalability (that’s right, localhost will do just fine). With Havalo, simply drop havalo.war into your favorite Servlet 3.0 compatible container and with almost no configuration you’ll have access to a fast and lightweight K,V store backed by any local mount point for persistent storage. And, Havalo has a pleasantly simple RESTful API for your added enjoyment.

    Havalo is perfect for testing, maintaining fast indexes of data stored “elsewhere”, and almost any other deployment scenario where relational databases are just too heavy.

    The latest stable version of Havalo is 1.4.

    Interesting move toward the shallow end of the data pool for NoSQL.

    I don’t know of any reason why small data could not benefit from NoSQL flexibility.

    Lowering the overhead of NoSQL for small data may introduce more people to NoSQL earlier in their data careers.

    Which means when they move up the ladder to “big data,” they won’t be easily impressed.

    Are there other “small data” friendly NoSQL solutions you would recommend?

    Cassandra – A Decentralized Structured Storage System [Annotated]

    Monday, September 16th, 2013

    Cassandra – A Decentralized Structured Storage System by Avinash Lakshman, Facebook and Prashant Malik, Facebook.

    Abstract:

    Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers). At this scale, small and large components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.

    Annotated version of the original 2009 Cassandra paper.

    Not a guide to future technology but a very interesting read about how Cassandra arrived at the present.

    Aerospike 3

    Tuesday, September 10th, 2013

    Aerospike 3 by Alex Popescu.

    From the post:

    Aerospike 3 database builds off of Aerospike’s legacy of speed, scale, and reliability, adding an extensible data model that supports complex data types, large data types, queries using secondary indexes, user defined functions (UDFs) and distributed aggregations. Process more data faster to create the richest, most relevant real-time interactions.

    Aerospike 3 Community Edition is a free unlimited license designed for a single cluster of up to two nodes and storage of up to 200GB of data. Enterprise version is available upon request.

    Try the FREE version now.

    Alex has picked up a new sponsor that merits your attention!

    From the community download page:

    Free Aerospike 3 Community Edition is a full copy of Aerospike Database, in a 2-node cluster configuration that supports a database up to 200 GB in size. For example, if you have 125 million records at 1.5 K bytes/object, you can do 16k reads/sec and 8k writes/sec with data on SSD. Or, if you are deploying an in-memory database, you can handle 60k reads/sec and 30k writes/sec. This product includes:

    • Unlimited license to use the software forever. No fees, no strings attached.
    • Access to online forums and documentation
    • Tools for setting up and managing two Aerospike Servers in a single Aerospike Cluster
    • Aerospike Server software and Aerospike SDK for developing your database client application
    • When scale demands, easy upgrade to the Enterprise Edition without stopping your service!
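
    The sizing example in that excerpt is easy to check. Assuming “1.5 K bytes” means 1,500 bytes and ignoring replication and index overhead (the page does not say how either is counted), 125 million records come to about 187.5 GB, just under the 200 GB limit.

```python
def dataset_size_gb(records, bytes_per_record):
    """Raw payload size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return records * bytes_per_record / 1e9


def hours_to_read_all(records, reads_per_sec):
    """Time to read every record once at a steady read rate."""
    return records / reads_per_sec / 3600


size_gb = dataset_size_gb(125_000_000, 1_500)       # 187.5
ssd_hours = hours_to_read_all(125_000_000, 16_000)  # a little over 2 hours
```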

    The in-memory performance numbers look particularly impressive!

    Google goes back to the future…

    Saturday, August 31st, 2013

    Google goes back to the future with SQL F1 database by Jack Clark.

    From the post:

    The tech world is turning back toward SQL, bringing to a close a possibly misspent half-decade in which startups courted developers with promises of infinite scalability and the finest imitation-Google tools available, and companies found themselves exposed to unstable data and poor guarantees.

    The shift has been going on quietly for some time, and tech leader Google has been tussling with the drawbacks of non-relational and non ACID-compliant systems for years. That struggle has demanded the creation of a new system to handle data at scale, and on Tuesday at the Very Large Data Base (VLDB) conference, Google delivered a paper outlining its much-discussed “F1” system, which has replaced MySQL as the distributed heart of the company’s hugely lucrative AdWords platform.

    The AdWords system includes “100s of applications and 1000s of users,” which all share a database over 100TB serving up “hundreds of thousands of requests per second, and runs SQL queries that scan tens of trillions of data rows per day,” Google said. And it’s got five nines of availability.

    (…)

    F1 uses some of Google’s most advanced technologies, such as BigTable and the planet-spanning “Spanner” database, which F1 servers are co-located with for optimum use. Google describes it as “a hybrid, combining the best aspects of traditional relational databases and scalable NoSQL systems”.
    (…)

    I wonder what the “RDBMS doesn’t do X well” parrots are going to say now.

    The authors admit up front “trade-offs and sacrifices” were made. But when you meet your requirements while processing trillions of rows of data daily, you are entitled to “trade-offs and sacrifices.”
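
    For a sense of scale, the figures quoted above can be turned into everyday units. The sketch assumes “five nines” means 99.999% availability measured over a 365-day year, and takes “tens of trillions” at its lower bound of 10 trillion rows per day.

```python
def allowed_downtime_minutes(availability, days=365):
    """Minutes of downtime permitted per period at a given availability."""
    return (1 - availability) * days * 24 * 60


def rows_per_second(rows_per_day):
    """Average scan rate implied by a daily row count."""
    return rows_per_day / 86_400


downtime = allowed_downtime_minutes(0.99999)  # roughly 5.3 minutes a year
scan_rate = rows_per_second(10e12)            # over 100 million rows a second
```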

    A very deep paper that will require background reading for most of us.

    Looking forward to it.

    NoSQL Listener

    Wednesday, August 28th, 2013

    NoSQL Listener

    From the webpage:

    Aggregating NoSQL news from Twitter, from your friends at Cloudant

    What Twitter streams do you want to capture and post online (or process into a topic map)?

    You can fork this project at GitHub.

    Here’s a research idea:

    Capture tweets on a possible U.S.-led conflict and separate out those originating from the geographic area around the Pentagon.

    Do the tweet levels or tone track U.S. military action?

    FoundationDB: Version 1.0 and Pricing Announced!

    Tuesday, August 20th, 2013

    FoundationDB: Version 1.0 and Pricing Announced!

    From the post:

    After a successful 18-month Alpha and Beta testing program involving more than 2,000 participants, we’re very excited to announce that we’ve released version 1.0 of FoundationDB and general availability pricing!

    Built on a distributed shared-nothing architecture, FoundationDB is a unique database technology that combines the time-proven power of ACID transactions with the scalability, fault tolerance, and operational elegance of distributed NoSQL databases.

    You can download FoundationDB and use it under our Community License today and run as many server processes as you’d like to in non-production use, and use up to six processes in production for free! You don’t even have to sign up – just go to our download page for instant access. You’ll get all the technical goodness of FoundationDB – exceptional fault tolerance, high performance distributed ACID transactions, and access to our growing catalog of open source layers – regardless of whether you’re a community user or a paying customer.

    Have a big application that needs more than six processes in production, or want your FoundationDB cluster supported? We’re also offering commercial licensing and support priced starting at $99 per server process per month. Check out our commercial license and support plans on our pricing page.

    I don’t know if FoundationDB will meet your requirements but I can say their business model should set the standard for software offerings.

    High quality software with aggressive pricing and no registration required for the community edition.

    I am downloading the community version now.

    When are you going to grab a copy?

    Welcome BigCouch

    Friday, July 26th, 2013

    Welcome BigCouch

    From the post:

    Good news! Cloudant has announced the completion of the BigCouch merge. This is a huge step forward for CouchDB. So thank you to Cloudant, and thank you to the committers (particularly Robert Newson and Paul Davis) who slogged (and travelled the world to pair with each other) to make this happen.

    What does this mean? Well, right now, the code is merged, but not released. So hold your clicks just a moment! Once the code has been tested, we will include it in one of our regular releases. (If you want to help us test, hop on to the dev@ mailing list!)

    What’s new? The key accomplishment of the merged code is that BigCouch’s clustering capability, along with the rest of Cloudant’s other enhancements to CouchDB’s code base, will now be available in Apache CouchDB. This also includes improvements in compaction and replication speed, as well as boosts for high-concurrency access performance.

    Painless replication has always been CouchDB’s biggest feature. Now we get to take advantage of Cloudant’s experience running large distributed clusters in production for four years. With BigCouch merged in, CouchDB will be able to replicate data at a much larger scale.

    But wait! That’s not all! Cloudant has decided to terminate their BigCouch fork of CouchDB, and instead focus future development on Apache CouchDB. This is excellent news for CouchDB, even more excellent news for the CouchDB community.

    Just a quick reminder about the CouchTM project that used CouchDB as its backend.