Archive for the ‘Accumulo’ Category

SecureGraph Slides!

Friday, June 13th, 2014

Open Source Graph Analysis and Visualization by Jeff Kunkle.

From the description:

Lumify is a relatively new open source platform for big data analysis and visualization, designed to help organizations derive actionable insights from the large volumes of diverse data flowing through their enterprise. Utilizing popular big data tools like Hadoop, Accumulo, and Storm, it ingests and integrates many kinds of data, from unstructured text documents and structured datasets, to images and video. Several open source analytic tools (including Tika, OpenNLP, CLAVIN, OpenCV, and ElasticSearch) are used to enrich the data, increase its discoverability, and automatically uncover hidden connections. All information is stored in a secure graph database implemented on top of Accumulo to support cell-level security of all data and metadata elements. A modern, browser-based user interface enables analysts to explore and manipulate their data, discovering subtle relationships and drawing critical new insights. In addition to full-text search, geospatial mapping, and multimedia processing, Lumify features a powerful graph visualization supporting sophisticated link analysis and complex knowledge representation.

The full story of SecureGraph isn’t here but the slides are enough to tempt me into finding out more.

You?

I first saw this in a tweet by Stephen Mallette.

SecureGraph

Thursday, June 12th, 2014

SecureGraph

From the webpage:

SecureGraph is an API to manipulate graphs, similar to Blueprints. Unlike Blueprints, every Secure graph method requires authorizations and visibilities. SecureGraph also supports multivalued properties as well as property metadata.

The SecureGraph API was designed to be generic, allowing for multiple implementations. The only implementation provided currently is built on top of Apache Accumulo for data storage and Elasticsearch for indexing.

According to the readme file, this is definitely “beta” software, but interesting software nonetheless.

Are you using insecure graph software?

Might be time to find out!

I first saw this in a tweet by Marko A. Rodriguez.

Accumulo Comes to CDH

Saturday, December 21st, 2013

Accumulo Comes to CDH by Sean Busbey, Bill Havanki, and Mike Drob.

From the post:

Cloudera is pleased to announce the immediate availability of its first release of Accumulo packaged to run under CDH, our open source distribution of Apache Hadoop and related projects and the foundational infrastructure for Enterprise Data Hubs.

Accumulo is an open source project that provides the ability to store data in massive tables (billions of rows, millions of columns) for fast, random access. Accumulo was created and contributed to the Apache Software Foundation by the National Security Agency (NSA), and it has quickly gained adoption as a Hadoop-based key/value store for applications that require access to sensitive data sets. Cloudera provides enterprise support with the RTD Accumulo add-on subscription for Cloudera Enterprise.

This release provides Accumulo 1.4.3 tested for use under CDH 4.3.0. The release includes a significant number of backports and fixes to allow use with CDH 4’s highly available, production-ready packaging of HDFS. As a part of our commitment to the open source community, these changes have been submitted back upstream.

At least with Accumulo, you know you are getting NSA vetted software.

Can’t say the same thing for RSA software.

Enterprise customers need to demand open source software that reserves commercial distribution rights to its source.

For self-preservation if no other reason.

Cloudera now supports Accumulo…

Tuesday, October 1st, 2013

Cloudera now supports Accumulo, the NSA’s take on HBase by Derrick Harris.

From the post:

Cloudera will be integrating with the Apache Accumulo database and, according to a press release, “devoting significant internal engineering resources to speed Accumulo’s development.” The National Security Agency created Accumulo and built in fine-grained authentication to ensure only authorized individuals could see any given piece of data. Cloudera’s support could be bittersweet for Sqrrl, an Accumulo startup comprised of former NSA engineers and intelligence experts, which should benefit from a bigger ecosystem but whose sales might suffer if Accumulo makes its way into Cloudera’s Hadoop distribution.

I would think the bittersweet part would be the NSA’s support of a design that leaves them with document level security.

It’s great that they can control access to how many saucers are stolen from White House dinners every year but document security, other than at the grossest level, goes wanting.

Maybe they haven’t heard of SGML or XML?

If you don’t mind, mention XML in your phone calls every now and again. Maybe if enough people say it, then it will come up on the “big board.”

Sqrrl Enterprise…

Monday, July 15th, 2013

Sqrrl Enterprise = 3 Databases in 1 (Column + Document + Graph)

From the post:

When looking across the NoSQL landscape, most folks partition NoSQL databases into 4 categories:

  • Key Value Stores (e.g., Riak, Redis)
  • Column Stores (e.g., HBase, Cassandra, Accumulo)
  • Document Stores (e.g., MongoDB, CouchDB)
  • Graph Stores (e.g., Neo4j, TitanDB)

In addition to being creators of the Accumulo database, the team here at Sqrrl can also appreciate the benefits of other databases in the NoSQL landscape. For this reason, when we began architecting Sqrrl Enterprise, we decided to not limit ourselves to just Accumulo’s column store data structure. Sqrrl Enterprise features Document and Graph Store functionality in addition to being a Column Store at its core.

Sqrrl Enterprise is built using open source Apache Accumulo, giving it its column store core. However, we love the ease of use of document stores, so when we ingest data, we convert that data from Accumulo’s native key/value format into hierarchical JSON documents (giving Sqrrl Enterprise document store functionality).

At ingest we also extract all of the graph relationships in the datasets and store them as sets of nodes and edges, giving Sqrrl Enterprise a variety of graph capabilities.

Interesting.

If you have experience with this “enhanced” version of Accumulo, will you share your experience with “a variety of graph capabilities”?

[Updated November 7, 2013. Changed the link to the post. To one that works.]

Accumulo Notes

Sunday, June 16th, 2013

Quick Accumulo Install (Sqrrl Blog)

Accumulo Installation and Configuration Steps on a Ubuntu VirtualBox Instance

In case you are curious about how your data is being stored. 😉

I first saw this in a tweet by Mandar Chandorkar.

An NSA Big Graph Experiment [On Non-Real World Data]

Friday, June 7th, 2013

An NSA Big Graph Experiment by Paul Burkhardt and Chris Waring.

Slide presentation on processing graphs with Apache Accumulo.

Which has some impressive numbers:

[Image: Graph500]

Except that if you review the Graph 500 Benchmark Specification,

N: the total number of vertices, 2^SCALE. An implementation may use any set of N distinct integers to number the vertices, but at least 48 bits must be allocated per vertex number. Other parameters may be assumed to fit within the natural word of the machine. N is derived from the problem’s scaling parameter.

You find that all the nodes are normalized (no duplicates).

Moreover, the Graph 500 Benchmark cites:

The graph generator is a Kronecker generator similar to the Recursive MATrix (R-MAT) scale-free graph generation algorithm [Chakrabarti, et al., 2004].

Which provides:

There is a subtle point here: we may have duplicate edges (i.e., edges which fall into the same cell in the adjacency matrix), but we only keep one of them. (R-MAT: A Recursive Model for Graph Mining by Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos.)

By design, the Graph 500 Benchmark operates on completely normalized graphs.
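To make the “normalized” point concrete, here is a minimal Python sketch of an R-MAT style generator. It follows the R-MAT passage quoted above, not the Graph 500 reference code. The function names, the quadrant probabilities, the edge factor, and the tiny SCALE are all assumptions chosen for illustration. Vertices are simply the integers 0 to 2^SCALE - 1, and an edge that lands in an already occupied cell of the adjacency matrix is kept only once.

```python
import random

def rmat_edge(scale, rng, a=0.57, b=0.19, c=0.19):
    """Pick one (row, col) cell by choosing a quadrant at each of `scale` levels.

    a, b, c are assumed quadrant probabilities; the fourth is 1 - a - b - c.
    """
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row <<= 1
        col <<= 1
        if r < a:
            pass                 # top-left quadrant: both bits stay 0
        elif r < a + b:
            col |= 1             # top-right quadrant
        elif r < a + b + c:
            row |= 1             # bottom-left quadrant
        else:
            row |= 1             # bottom-right quadrant
            col |= 1
    return row, col

def rmat_graph(scale, edge_factor, seed=0):
    """Return (vertex count, edges attempted, set of distinct edges)."""
    rng = random.Random(seed)
    n = 2 ** scale                       # N = 2^SCALE vertices, numbered 0..N-1
    attempts = edge_factor * n
    edges = set()                        # a set keeps one copy of a duplicate edge
    for _ in range(attempts):
        edges.add(rmat_edge(scale, rng))
    return n, attempts, edges

n, attempts, edges = rmat_graph(scale=6, edge_factor=16)
print(n, attempts, len(edges))
```

Every vertex has exactly one identifier and every edge appears once. Nothing in this data needs reconciling, which is the part real-world graphs do not give you.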

I mention that because the graphs from Verizon, credit bureaus, Facebook, Twitter, etc. are anything but normalized: some are not normalized internally, and none are normalized with respect to each other.

Scaling Big Data Mining Infrastructure: The Twitter Experience by Jimmy Lin and Dmitriy Ryaboy is a chilling outline of semantic impedance in data within a single organization. Semantic impedance that would be reflected in graph processing of that data.

How much more semantic impedance will be encountered when graphs are built from diverse data sources?

Bottom line: The NSA gets great performance from Accumulo on normalized graphs, graphs that do not reflect real-world, non-normalized data.

I first saw this NSA presentation at Here’s how the NSA analyzes all that call data by Derrick Harris.

Literature Survey of Graph Databases

Tuesday, February 19th, 2013

Literature Survey of Graph Databases by Bryan Thompson.

I can understand Danny Bickson (Literature survey of graph databases) being excited about the coverage of GraphChi in this survey.

However, there are other names you will recognize as well (TOC order):

  • RDF3X
  • Diplodocus
  • GraphChi
  • YARS2
  • 4store
  • Virtuoso
  • Bigdata
  • SHARD
  • Graph partitioning
  • Accumulo
  • Urika
  • Scalable RDF query processing on clusters and supercomputers (a system with no name at Rensselaer Polytechnic)

As you can tell from the system names, the survey focuses on processing of RDF.

In reviewing one system, Bryan remarks:

Only small data sets were considered (100s of millions of edges). (emphasis added)

I think that captures the focus of the paper better than any comment I can make.

A must read for graph heads!

Understanding User Authentication and Authorization in Apache HBase

Wednesday, September 26th, 2012

Understanding User Authentication and Authorization in Apache HBase by Matteo Bertozzi.

From the post:

With the default Apache HBase configuration, everyone is allowed to read from and write to all tables available in the system. For many enterprise setups, this kind of policy is unacceptable.

Administrators can set up firewalls that decide which machines are allowed to communicate with HBase. However, machines that can pass the firewall are still allowed to read from and write to all tables. This kind of mechanism is effective but insufficient because HBase still cannot differentiate between multiple users that use the same client machines, and there is still no granularity with regard to HBase table, column family, or column qualifier access.

In this post, we will discuss how Kerberos is used with Hadoop and HBase to provide User Authentication, and how HBase implements User Authorization to grant users permissions for particular actions on a specified set of data.
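As a rough illustration of the granularity the post describes (table, column family, column qualifier), here is a toy Python model of scoped grants. It is a sketch under my own assumptions, not HBase’s access controller: `Grant`, `is_allowed`, the single-letter actions, and the sample users and tables are all hypothetical.

```python
from typing import NamedTuple, Optional

class Grant(NamedTuple):
    user: str
    actions: frozenset                 # hypothetical action codes, e.g. {"R", "W"}
    table: str
    family: Optional[str] = None       # None means every column family in the table
    qualifier: Optional[str] = None    # None means every qualifier in the family

def is_allowed(grants, user, action, table, family, qualifier):
    """Return True if any grant for this user covers the requested column and action."""
    for g in grants:
        if g.user != user or action not in g.actions or g.table != table:
            continue
        if g.family is not None and g.family != family:
            continue
        if g.qualifier is not None and g.qualifier != qualifier:
            continue
        return True
    return False

grants = [
    Grant("alice", frozenset("RW"), "patients"),                    # whole table
    Grant("bob", frozenset("R"), "patients", "billing"),            # one column family
    Grant("carol", frozenset("R"), "patients", "billing", "total"), # one qualifier
]

print(is_allowed(grants, "bob", "R", "patients", "billing", "total"))
print(is_allowed(grants, "bob", "R", "patients", "clinical", "notes"))
```

Note that the narrowest scope in this model is a column qualifier. Two cells in the same column cannot carry different permissions.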

When you think about security, remember: Accumulo: Why The World Needs Another NoSQL Database. Accumulo was written to provide cell-level security.

Nice idea, but the burden of administering cell-level authorizations is going to lead to sloppy security practices. Or to granting higher-level permissions, inadvisedly, to some users.

Not to mention the truck-sized security hole in Accumulo for imported data changing access tokens.

You can get a lot of security mileage out of HBase and Kerberos, long before you get to cell-level security permissions.

Accumulo: Why The World Needs Another NoSQL Database

Tuesday, September 4th, 2012

Accumulo: Why The World Needs Another NoSQL Database by Jeff Kelly.

From the post:

If you’ve been unable to keep up with all the competing NoSQL databases that have hit the market over the last several years, you’re not alone. To name just a few, there’s HBase, Cassandra, MongoDB, Riak, CouchDB, Redis, and Neo4J.

To that list you can add Accumulo, an open source database originally developed at the National Security Agency. You may be wondering why the world needs yet another database to handle large volumes of multi-structured data. The answer is, of course, that no one of these NoSQL databases has yet checked all the feature/functionality boxes that most enterprises require before deploying a new technology.

In the Big Data world, that means the ability to handle the three V’s (volume, variety and velocity) of data, the ability to process multiple types of workloads (analytical vs. transactional), and the ability to maintain ACID (atomicity, consistency, isolation and durability) compliance at scale. With each new NoSQL entrant, hope springs eternal that this one will prove the NoSQL messiah.

So what makes Accumulo different than all the rest? According to proponents, Accumulo is capable of maintaining consistency even as it scales to thousands of nodes and petabytes of data; it can both read and write data in near real-time; and, most importantly, it was built from the ground up with cell-level security functionality.

It’s the third feature – cell-level security – that has the Big Data community most excited. Accumulo is being positioned as an all-purpose Hadoop database and a competitor to HBase. While HBase, like Accumulo, is able to scale to thousands of machines while maintaining a relatively high level of consistency, it was not designed with any security, let alone cell-level security, in mind.

The current security documentation on Accumulo reads (in part):

Accumulo extends the BigTable data model to implement a security mechanism known as cell-level security. Every key-value pair has its own security label, stored under the column visibility element of the key, which is used to determine whether a given user meets the security requirements to read the value. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while preserving data confidentiality.

Security labels consist of a set of user-defined tokens that are required to read the value the label is associated with. The set of tokens required can be specified using syntax that supports logical AND and OR combinations of tokens, as well as nesting groups of tokens together.

If that sounds impressive, realize that:

  • Users can overwrite data they cannot see, unless you set the table visibility constraint.
  • Users can avoid the table visibility constraint, using the bulk import method. (Which you can also disable.)
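Both points are easier to see in a toy model. The Python sketch below is not Accumulo code. Labels are simplified to an OR of AND-groups, and `Table`, `put`, `bulk_import`, and `scan` are hypothetical names of mine. What it shows is the behavior described in the two bullets: without the constraint a writer can replace a cell they cannot read, with it the write is rejected, and the bulk path skips the check.

```python
SECRET = (frozenset({"secret"}),)   # one AND-group: reader needs the "secret" token

def can_read(label, auths):
    """label is a tuple of AND-groups joined by OR; auths is a set of tokens."""
    return any(group <= auths for group in label)

class Table:
    def __init__(self, visibility_constraint=False):
        self.cells = {}   # (row, label) -> value; the label is part of the key
        self.visibility_constraint = visibility_constraint

    def put(self, row, label, value, auths):
        # With the constraint on, a writer may only write labels they could read back.
        if self.visibility_constraint and not can_read(label, auths):
            raise PermissionError("writer cannot read the label being written")
        self.cells[(row, label)] = value

    def bulk_import(self, cells):
        # Per the second bullet: this path performs no constraint check at all.
        self.cells.update(cells)

    def scan(self, auths):
        return {row: value for (row, label), value in self.cells.items()
                if can_read(label, auths)}

open_table = Table()
open_table.put("r1", SECRET, "real", auths={"secret"})
open_table.put("r1", SECRET, "forged", auths={"public"})   # accepted: unseen data replaced

guarded = Table(visibility_constraint=True)
guarded.put("r1", SECRET, "real", auths={"secret"})
try:
    guarded.put("r1", SECRET, "forged", auths={"public"})
    blocked = False
except PermissionError:
    blocked = True
```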

More secure than a completely insecure solution but nothing to write home about, yet.

Can you imagine the complexity that is likely to be exhibited in an inter-agency context for security labels?

BTW, how do I determine the semantics of a proposed security label? What if it conflicts with another security label?

Helpful links: Apache Accumulo.

I first saw this at Alex Popescu’s myNoSQL.

U.S. Senate vs. Apache Accumulo: Whose side are you on?

Wednesday, July 18th, 2012

Jack Park sent a link to NSA Mimics Google, Pisses Off Senate this morning. If you are unfamiliar with the software, see: Apache Accumulo.

Long story made short:

The bill bars the DoD from using the database unless the department can show that the software is sufficiently different from other databases that mimic BigTable. But at the same time, the bill orders the director of the NSA to work with outside organizations to merge the Accumulo security tools with alternative databases, specifically naming HBase and Cassandra.

At issue is:

The bill indicates that Accumulo may violate OMB Circular A-130, a government policy that bars agencies from building software if it’s less expensive to use commercial software that’s already available. And according to one congressional staffer who worked on the bill, this is indeed the case. He asked that his name not be used in this story, as he’s not authorized to speak with the press.

On its face, OMB Circular A-130 sounds like a good idea. Don’t build your own if it is cheaper to buy commercial.

But here the Senate is trying to play favorites.

I have a suggestion: Let’s disappoint them.

Let’s contribute to all three projects:

Apache Accumulo

Apache Cassandra

Apache HBase

Would you look at that! All three of these projects are based at Apache!

Let’s push all three projects forward in terms of working on releases, documentation, testing, etc.

But more than that, let’s build applications based on all three projects that analyze political contributions, patterns of voting, land transfers, stock purchases, virtually every fact that can be known about members of the Senate and the Senate Armed Services Committee in particular.

They are accustomed to living in a goldfish bowl.

Let’s move them into a frying pan.

PS: +1 if the NSA is ordered to contribute to open source projects, if the projects are interested. Direction from the U.S. Senate is not part of the Apache governance model.

Accumulo

Tuesday, April 17th, 2012

Accumulo

From the webpage:

The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google’s BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process. Other notable improvements and features are outlined here.

We mentioned Accumulo here but missed its graduation from the incubator. Apologies.

Accumulo Proposal

Wednesday, September 7th, 2011

Accumulo Proposal

From the Apache incubator:

Abstract

Accumulo is a distributed key/value store that provides expressive, cell-level access labels.

Proposal

Accumulo is a sorted, distributed key/value store based on Google’s BigTable design. It is built on top of Apache Hadoop, Zookeeper, and Thrift. It features a few novel improvements on the BigTable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

Background

Google published the design of BigTable in 2006. Several other open source projects have implemented aspects of this design including HBase, CloudStore, and Cassandra. Accumulo began its development in 2008.

Rationale

There is a need for a flexible, high performance distributed key/value store that provides expressive, fine-grained access labels. The communities we expect to be most interested in such a project are government, health care, and other industries where privacy is a concern. We have made much progress in developing this project over the past 3 years and believe both the project and the interested communities would benefit from this work being openly available and having open development.

Further explanation of access labels and iterators:

Access Labels

Accumulo has an additional portion of its key that sorts after the column qualifier and before the timestamp. It is called column visibility and enables expressive cell-level access control. Authorizations are passed with each query to control what data is returned to the user. The column visibilities are boolean AND and OR combinations of arbitrary strings (such as “(A&B)|C”) and authorizations are sets of strings (such as {C,D}).
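The matching rule is small enough to sketch. The Python below is an illustration under stated assumptions, not Accumulo’s parser: the function names are hypothetical, tokens are limited to letters, digits, and underscores, an empty label is treated as readable by anyone, and `&` binds tighter than `|` when the two are mixed without parentheses (a simplification of mine). It evaluates a column visibility such as “(A&B)|C” against a set of authorizations such as {C,D}.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[()&|])")

def tokenize(expr):
    """Split a visibility expression into tokens, operators, and parentheses."""
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError("bad character at position %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def visible(expr, auths):
    """Return True if the authorization set satisfies the visibility expression."""
    tokens = tokenize(expr)
    if not tokens:
        return True                    # assumption: an empty label hides nothing
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()          # always parse the right side, then combine
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            result = result and rhs
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if tok in (")", "&", "|"):
            raise ValueError("unexpected %r" % tok)
        return tok in auths            # a bare token is true if the user holds it

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing input after expression")
    return result

print(visible("(A&B)|C", {"C", "D"}))
print(visible("(A&B)|C", {"A", "D"}))
```

With the proposal’s own example, the authorizations {C,D} satisfy “(A&B)|C” through the C branch, while {A,D} do not, since B is missing and C is absent.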

Iterators

Accumulo has a novel server-side programming mechanism that can modify the data written to disk or returned to the user. This mechanism can be configured for any of the scopes where data is read from or written to disk. It can be used to perform joins on data within a single tablet.

The use case for modifying data written to disk is unclear to me but I suppose the data “returned to the user” involves modification of data for security reasons.
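One possibility, offered as a guess rather than anything the proposal states: aggregation. If several values are written under the same key, a server-side step could collapse them into one as the data is read or written to disk. A minimal Python sketch of that idea follows. `summing_iterator` and the sample keys are hypothetical, and this is not the Accumulo iterator API.

```python
from itertools import groupby

def summing_iterator(sorted_entries):
    """Collapse consecutive entries that share a key into one entry holding their sum.

    sorted_entries yields ((row, column), value) pairs already sorted by key,
    roughly the way a tablet would hand sorted data to a server-side step.
    """
    for key, group in groupby(sorted_entries, key=lambda entry: entry[0]):
        yield key, sum(value for _, value in group)

entries = [
    (("page1", "hits"), 3),
    (("page1", "hits"), 4),
    (("page2", "hits"), 1),
]
compacted = list(summing_iterator(entries))
print(compacted)
```

Applied where data is written to disk, a step like this would shrink what is stored. Applied at read time, it would change only what the user sees.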

Sponsored in part by the NSA, National Security Agency of the United States.

The access label line of thinking has implications for topic map merging. What if a similar mechanism were fashioned to permit or prevent “merging” based on the access of the user? (Where merging isn’t a file based activity.)