From the Apache incubator:
Abstract
Accumulo is a distributed key/value store that provides expressive, cell-level access labels.
Proposal
Accumulo is a sorted, distributed key/value store based on Google’s BigTable design. It is built on top of Apache Hadoop, ZooKeeper, and Thrift. It features a few novel improvements on the BigTable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.
Background
Google published the design of BigTable in 2006. Several other open source projects have implemented aspects of this design including HBase, CloudStore, and Cassandra. Accumulo began its development in 2008.
Rationale
There is a need for a flexible, high performance distributed key/value store that provides expressive, fine-grained access labels. The communities we expect to be most interested in such a project are government, health care, and other industries where privacy is a concern. We have made much progress in developing this project over the past 3 years and believe both the project and the interested communities would benefit from this work being openly available and having open development.
Further explanation of access labels and iterators:
Access Labels
Accumulo has an additional portion of its key that sorts after the column qualifier and before the timestamp. It is called column visibility and enables expressive cell-level access control. Authorizations are passed with each query to control what data is returned to the user. The column visibilities are boolean AND and OR combinations of arbitrary strings (such as “(A&B)|C”) and authorizations are sets of strings (such as {C,D}).
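To make the matching rule concrete, here is a minimal Python sketch of how a visibility expression such as “(A&B)|C” could be checked against a set of authorizations such as {C,D}. It is an illustration, not Accumulo's own code: the names `tokenize` and `is_visible` are hypothetical, and the sketch assumes that labels are plain alphanumeric strings, that an empty expression is visible to everyone, and that mixing & and | at the same level without parentheses is rejected.

```python
import re

# Assumption: labels are alphanumeric/underscore strings; operators are & | ( ).
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expression):
    """Split a visibility expression into labels, operators, and parentheses."""
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"unexpected character {expression[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if the authorizations satisfy the visibility expression."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # assumption: an unlabeled cell is visible to everyone
    auths = set(authorizations)
    pos = 0

    def parse_term():
        # A term is either a single label or a parenthesized sub-expression.
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in auths  # a label is true if the user holds it

    def parse_expr():
        # An expression is a chain of terms joined by one kind of operator.
        nonlocal pos
        value = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            rhs = parse_term()
            value = (value and rhs) if operator == "&" else (value or rhs)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result
```

With the example from the text, `is_visible("(A&B)|C", {"C", "D"})` is true because C alone satisfies the OR, while `is_visible("(A&B)|C", {"A"})` is false because neither branch is satisfied.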
Iterators
Accumulo has a novel server-side programming mechanism that can modify the data written to disk or returned to the user. This mechanism can be configured for any of the scopes where data is read from or written to disk. It can be used to perform joins on data within a single tablet.
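As a rough illustration of the idea, the following Python sketch models an iterator as a function that consumes a sorted stream of key/value pairs and yields a modified stream. This is not Accumulo's iterator API: the function names, the `(row, column)` key layout, and the sample data are all hypothetical. The point is only that the same transformation could in principle be applied wherever such a stream flows, whether it is being returned to a user or rewritten to disk.

```python
from itertools import groupby


def summing_iterator(sorted_pairs):
    """Collapse consecutive entries sharing a key into one entry with the summed value.

    Assumes the input is already sorted by key, as in a sorted key/value store.
    """
    for key, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield key, sum(value for _, value in group)


def filtering_iterator(sorted_pairs, keep):
    """Pass through only the entries for which keep(key, value) is true."""
    for key, value in sorted_pairs:
        if keep(key, value):
            yield key, value


# Hypothetical sample data: keys are (row, column) tuples in sorted order.
entries = [
    (("row1", "clicks"), 1),
    (("row1", "clicks"), 4),
    (("row1", "views"), 10),
    (("row2", "clicks"), 2),
]

# Iterators can be stacked: filter first, then aggregate what remains.
clicks_only = filtering_iterator(entries, lambda key, value: key[1] == "clicks")
totals = list(summing_iterator(clicks_only))
# totals == [(("row1", "clicks"), 5), (("row2", "clicks"), 2)]
```

Because each function takes a stream and returns a stream, they can be composed in any order, which is one way to picture configuring the mechanism for different scopes.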
The use case for modifying data written to disk is unclear to me, but I suppose that modifying the data “returned to the user” is done for security reasons.
Sponsored in part by the National Security Agency (NSA) of the United States.
The access-label line of thinking has implications for topic map merging. What if a similar mechanism were fashioned to permit or prevent “merging” based on the user's access? (Here, merging isn’t a file-based activity.)