Accumulo: Why The World Needs Another NoSQL Database by Jeff Kelly.
From the post:
If you’ve been unable to keep up with all the competing NoSQL databases that have hit the market over the last several years, you’re not alone. To name just a few, there’s HBase, Cassandra, MongoDB, Riak, CouchDB, Redis, and Neo4J.
To that list you can add Accumulo, an open source database originally developed at the National Security Agency. You may be wondering why the world needs yet another database to handle large volumes of multi-structured data. The answer is, of course, that no one of these NoSQL databases has yet checked all the feature/functionality boxes that most enterprises require before deploying a new technology.
In the Big Data world, that means the ability to handle the three V’s (volume, variety and velocity) of data, the ability to process multiple types of workloads (analytical vs. transactional), and the ability to maintain ACID (atomicity, consistency, isolation and durability) compliance at scale. With each new NoSQL entrant, hope springs eternal that this one will prove the NoSQL messiah.
So what makes Accumulo different than all the rest? According to proponents, Accumulo is capable of maintaining consistency even as it scales to thousands of nodes and petabytes of data; it can both read and write data in near real-time; and, most importantly, it was built from the ground up with cell-level security functionality.
It’s the third feature – cell-level security – that has the Big Data community most excited. Accumulo is being positioned as an all-purpose Hadoop database and a competitor to HBase. While HBase, like Accumulo, is able to scale to thousands of machines while maintaining a relatively high level of consistency, it was not designed with any security, let alone cell-level security, in mind.
The current security documentation on Accumulo reads (in part):
Accumulo extends the BigTable data model to implement a security mechanism known as cell-level security. Every key-value pair has its own security label, stored under the column visibility element of the key, which is used to determine whether a given user meets the security requirements to read the value. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while preserving data confidentiality.
…
Security labels consist of a set of user-defined tokens that are required to read the value the label is associated with. The set of tokens required can be specified using syntax that supports logical AND and OR combinations of tokens, as well as nesting groups of tokens together.
If that sounds impressive, realize that:
- Users can overwrite data they cannot see, unless you set the table visibility constraint.
- Users can avoid the table visibility constraint, using the bulk import method. (Which you can also disable.)
More secure than a completely insecure solution but nothing to write home about, yet.
Can you imagine the complexity that is likely to be exhibited in an inter-agency context for security labels?
BTW, how do I determine the semantics of a proposed security label? What if it conflicts with another security label?
Helpful links: Apache Accumulo.
I first saw this at Alex Popescu’s myNoSQL.