Archive for the ‘Accumulo’ Category

Literature Survey of Graph Databases

Tuesday, February 19th, 2013

Literature Survey of Graph Databases by Bryan Thompson.

I can understand Danny Bickson, Literature survey of graph databases, being excited about the coverage of GraphChi in this survey.

However, there are other names you will recognize as well (TOC order):

  • RDF3X
  • Diplodocus
  • GraphChi
  • YARS2
  • 4store
  • Virtuoso
  • Bigdata
  • SHARD
  • Graph partitioning
  • Accumulo
  • Urika
  • Scalable RDF query processing on clusters and supercomputers (a system with no name at Rensselaer Polytechnic)

As you can tell from the system names, the survey focuses on processing of RDF.

In reviewing one system, Bryan remarks:

Only small data sets were considered (100s of millions of edges). (emphasis added)

I think that captures the focus of the paper better than any comment I can make.

A must read for graph heads!

Understanding User Authentication and Authorization in Apache HBase

Wednesday, September 26th, 2012

Understanding User Authentication and Authorization in Apache HBase by Matteo Bertozzi.

From the post:

With the default Apache HBase configuration, everyone is allowed to read from and write to all tables available in the system. For many enterprise setups, this kind of policy is unacceptable.

Administrators can set up firewalls that decide which machines are allowed to communicate with HBase. However, machines that can pass the firewall are still allowed to read from and write to all tables. This kind of mechanism is effective but insufficient because HBase still cannot differentiate between multiple users that use the same client machines, and there is still no granularity with regard to HBase table, column family, or column qualifier access.

In this post, we will discuss how Kerberos is used with Hadoop and HBase to provide User Authentication, and how HBase implements User Authorization to grant users permissions for particular actions on a specified set of data.

When you think about security, remember: Accumulo: Why The World Needs Another NoSQL Database. Accumulo was written to provide cell level security.

Nice idea but the burden of administering cell level authorizations is going to lead to sloppy security practices. Or granting higher level permissions, inadvisedly, to some users.

Not to mention the truck sized security hole in Accumulo for imported data changing access tokens.

You can get a lot of security mileage out of HBase and Kerberos, long before you get to cell level security permissions.

Accumulo: Why The World Needs Another NoSQL Database

Tuesday, September 4th, 2012

Accumulo: Why The World Needs Another NoSQL Database by Jeff Kelly.

From the post:

If you’ve been unable to keep up with all the competing NoSQL databases that have hit the market over the last several years, you’re not alone. To name just a few, there’s HBase, Cassandra, MongoDB, Riak, CouchDB, Redis, and Neo4J.

To that list you can add Accumulo, an open source database originally developed at the National Security Agency. You may be wondering why the world needs yet another database to handle large volumes of multi-structured data. The answer is, of course, that no one of these NoSQL databases has yet checked all the feature/functionality boxes that most enterprises require before deploying a new technology.

In the Big Data world, that means the ability to handle the three V’s (volume, variety and velocity) of data, the ability to process multiple types of workloads (analytical vs. transactional), and the ability to maintain ACID (atomicity, consistency, isolation and durability) compliance at scale. With each new NoSQL entrant, hope springs eternal that this one will prove the NoSQL messiah.

So what makes Accumulo different than all the rest? According to proponents, Accumulo is capable of maintaining consistency even as it scales to thousands of nodes and petabytes of data; it can both read and write data in near real-time; and, most importantly, it was built from the ground up with cell-level security functionality.

It’s the third feature – cell-level security – that has the Big Data community most excited. Accumulo is being positioned as an all-purpose Hadoop database and a competitor to HBase. While HBase, like Accumulo, is able to scale to thousands of machines while maintaining a relatively high level of consistency, it was not designed with any security, let alone cell-level security, in mind.

The current security documentation on Accumulo reads (in part):

Accumulo extends the BigTable data model to implement a security mechanism known as cell-level security. Every key-value pair has its own security label, stored under the column visibility element of the key, which is used to determine whether a given user meets the security requirements to read the value. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while preserving data confidentiality.

Security labels consist of a set of user-defined tokens that are required to read the value the label is associated with. The set of tokens required can be specified using syntax that supports logical AND and OR combinations of tokens, as well as nesting groups of tokens together.

If that sounds impressive, realize that:

  • Users can overwrite data they cannot see, unless you set the table visibility constraint.
  • Users can avoid the table visibility constraint, using the bulk import method. (Which you can also disable.)

More secure than a completely insecure solution but nothing to write home about, yet.

Can you imagine the complexity that is likely to be exhibited in an inter-agency context for security labels?

BTW, how do I determine the semantics of a proposed security label? What if it conflicts with another security label?

Helpful links: Apache Accumulo.

I first saw this at Alex Popescu’s myNoSQL.

U.S. Senate vs. Apache Accumulo: Whose side are you on?

Wednesday, July 18th, 2012

Jack Park sent a link to NSA Mimics Google, Pisses Off Senate this morning. If you are unfamiliar with the software, see: Apache Accumulo.

Long story made short:

The bill bars the DoD from using the database unless the department can show that the software is sufficiently different from other databases that mimic BigTable. But at the same time, the bill orders the director of the NSA to work with outside organizations to merge the Accumulo security tools with alternative databases, specifically naming HBase and Cassandra.

At issue is:

The bill indicates that Accumulo may violate OMB Circular A-130, a government policy that bars agencies from building software if it’s less expensive to use commercial software that’s already available. And according to one congressional staffer who worked on the bill, this is indeed the case. He asked that his name not be used in this story, as he’s not authorized to speak with the press.

On its face, OMB Circular A-130 sounds like a good idea. Don’t build your own if it is cheaper to buy commercial.

But here the Senate trying to play favorites.

I have a suggestion: Let’s disappoint them.

Let’s contribute to all three projects:

Apache Accumulo

Apache Cassandra

Apache HBase

Would you look at that! All three of these projects are based at Apache!

Let’s push all three projects forward in terms of working on releases, documentation, testing, etc.

But more than that, let’s build applications based on all three projects that analyze political contributions, patterns of voting, land transfers, stock purchases, virtually every fact than can be known about members of the Senate and the Senate Armed Services Committee in particular.

They are accustomed to living in a gold fish bowl.

Let’s move them into a frying pan.

PS: +1 if the NSA is ordered to contribute to open source projects, if the projects are interested. Direction from the U.S. Senate is not part of the Apache governance model.

Accumulo

Tuesday, April 17th, 2012

Accumulo

From the webpage:

The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google’s BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process. Other notable improvements and feature are outlined here.

We mentioned Accumulo here but missed its graduation from the incubator. Apologies.

Accumulo Proposal

Wednesday, September 7th, 2011

Accumulo Proposal

From the Apache incubator:

Abstract

Accumulo is a distributed key/value store that provides expressive, cell-level access labels.

Proposal

Accumulo is a sorted, distributed key/value store based on Google’s BigTable design. It is built on top of Apache Hadoop, Zookeeper, and Thrift. It features a few novel improvements on the BigTable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

Background

Google published the design of BigTable in 2006. Several other open source projects have implemented aspects of this design including HBase, CloudStore, and Cassandra. Accumulo began its development in 2008.

Rationale

There is a need for a flexible, high performance distributed key/value store that provides expressive, fine-grained access labels. The communities we expect to be most interested in such a project are government, health care, and other industries where privacy is a concern. We have made much progress in developing this project over the past 3 years and believe both the project and the interested communities would benefit from this work being openly available and having open development.

Further explanation of access labels and iterators:

Access Labels

Accumulo has an additional portion of its key that sorts after the column qualifier and before the timestamp. It is called column visibility and enables expressive cell-level access control. Authorizations are passed with each query to control what data is returned to the user. The column visibilities are boolean AND and OR combinations of arbitrary strings (such as “(A&B)|C”) and authorizations are sets of strings (such as {C,D}).

Iterators

Accumulo has a novel server-side programming mechanism that can modify the data written to disk or returned to the user. This mechanism can be configured for any of the scopes where data is read from or written to disk. It can be used to perform joins on data within a single tablet.

The use case for modifying data written to disk is unclear to me but I suppose the data “returned to the user” involves modification of data for security reasons.

Sponsored in part by the NSA, National Security Agency of the United States.

The access label line of thinking has implications for topic map merging. What if a similar mechanism were fashioned to permit or prevent “merging” based on the access of the user? (Where merging isn’t a file based activity.)