Archive for the ‘Cell Stores’ Category

Cell Stores

Monday, May 18th, 2015

Cell Stores by Ghislain Fourny.

Abstract:

Cell stores provide a relational-like, tabular level of abstraction to business users while leveraging recent database technologies, such as key-value stores and document stores. This allows to scale up and out the efficient storage and retrieval of highly dimensional data. Cells are the primary citizens and exist in different forms, which can be explained with an analogy to the state of matter: as a gas for efficient storage, as a solid for efficient retrieval, and as a liquid for efficient interaction with the business users. Cell stores were abstracted from, and are compatible with the XBRL standard for importing and exporting data. The first cell store repository contains roughly 200GB of SEC filings data, and proves that retrieving data cubes can be performed in real time (the threshold acceptable by a human user being at most a few seconds).

Github: http://github.com/28msec/cellstore

Demonstration with 200 GB of SEC data.

Tutorial: An Introduction To The Cell Store REST API.

From the tutorial:

Cell stores are a new paradigm of databases. It is decoupled from XBRL and has a data model of its own, yet it natively support XBRL as a file format to exchange data between cell stores.

Traditional relational databases are focused on tables. Document stores are focused on trees. Triple stores are focused on graphs. Well, cell stores are focused on cells. Cells are units of data and also called facts, measures, etc. Think of taking an Excel spreadsheet and a pair of scissors, and of splitting the sheet into its cells. Put these cells in a bag. Pour some more cells that come from other spreadsheets. Many. Millions of cells. Billions of cells. Trillions of cells. You have a cell store.

Why is it so important to store all these cell in a single, big bag? That’s because the main use case for cell stores is the ability to query data across filings. Cell stores are very good at this. They were designed from day one to do this.

Cell stores are very good at reconstructing tables in the presence of highly dimensional data. The idea behind this is based on hypercubes and is called NoLAP (NoSQL Online Analytical Processing). NoLAP extends the OLAP paradigm by removing hypercube rigidity and letting users generate their own hypercubes on the fly on the same pool of cells.

For business users, all of this is completely transparent and hidden. The look and feel of a cell store, in the end, is that of a spreadsheet like Excel. If you are familiar with the pivot table functionality of Excel, cell stores will be straightforward to understand. Also the underlying XBRL is hidden.

XBRL is to cell store what the inside format of .xsls files are to Excel. How many of us have tried to unzip and open an Excel file with a text editor for any other reason than mere curiosity? The same goes for cell stores.

Forget about the complexity of XBRL. Get things done with your data.

The promise of a better user interface alone should be enough to attract attention to this proposal. Yet, so far as I can find, there hasn’t been a lot of use/discussion of it.

I do wonder about this statement in the paper:

When many people define their own taxonomy, this often ends up in redundant terminology. For example, someone might use the term Equity and somebody else Capital. When either querying cells with a hypercube, or loading cells into a spreadsheet, a mapping can be applied so that this redundant terminology is transparent. This way, when a user asks for Equity, (i) she will also get the cells having the concept Capital, (ii) and it will be transparent to her because the Capital concept is overridden with the expected value Equity.

In part because it omits the obvious case of conflicting terminology, that is we both want to use “German” as a term and I mean the language and you mean nationality. In one well known graph database the answer depends on which one of us gets there first. Poor form in my opinion.

Mapping can handle different terms for the same subject but how do we maintain that? Where do I look to discover the reason(s) underlying the mapping? Moreover, in the conflicting case, how do I distinguish otherwise opaque terms that are letter for letter identical?

There may be answers as I delve deeper into the documentation but those are some topic map issues that stood out for me on a first read.

Comments?