## Archive for the ‘NoSQL’ Category

### NoSQL Bibliographic Records:…

Tuesday, October 30th, 2012

From the background:

Using the Library of Congress Bibliographic Framework for the Digital Age as the starting point for software development requirements, the FRBR-Redis-Datastore project is a proof-of-concept for a next-generation bibliographic NoSQL system within the context of improving upon the current MARC catalog and digital repository of a small academic library at a top-tier liberal arts college.

The FRBR-Redis-Datastore project starts with a basic understanding of MARC, MODS, and FRBR, implemented using a NoSQL technology called Redis.

This presentation guides you through the theories and technologies behind one such proof-of-concept bibliographic framework for the 21st century.

Hadoop was just too complicated compared to the simple three-step Redis server set-up.

Refreshing.

Simply because a technology is popular doesn’t mean it meets your requirements, such as administration by non-full-time technical experts.

An Oracle database could support an application that manages garden club finances, but that would be a poor choice under most circumstances.

The Redis part of the presentation is apparently not working (I get Python errors) as of today and I have sent a note with the error messages.

A “proof-of-concept” that merits your attention!

### SPARQL and Big Data (and NoSQL) [Identifying Winners and Losers - Cui Bono?]

Saturday, October 27th, 2012

SPARQL and Big Data (and NoSQL) by Bob DuCharme.

From the post:

How to pursue the common ground?

I think it’s obvious that SPARQL and other RDF-related technologies have plenty to offer to the overlapping worlds of Big Data and NoSQL, but this doesn’t seem as obvious to people who focus on those areas. For example, the program for this week’s Strata conference makes no mention of RDF or SPARQL. The more I look into it, the more I see that this flexible, standardized data model and query language align very well with what many of those people are trying to do.

But, we semantic web types can’t blame them for not noticing. If you build a better mouse trap, the world won’t necessarily beat a path to your door, because they have to find out about your mouse trap and what it does better. This requires marketing, which requires talking to those people in language that they understand, so I’ve been reading up on Big Data and NoSQL in order to better appreciate what they’re trying to do and how.

A great place to start is the excellent (free!) booklet Planning for Big Data by Edd Dumbill. (Others contributed a few chapters.) For a start, he describes data that “doesn’t fit the strictures of your database architectures” as a good candidate for Big Data approaches. That’s a good start for us. Here are a few longer quotes that I found interesting, starting with these two paragraphs from the section titled “Ingesting and Cleaning” after a discussion about collecting data from multiple different sources (something else that RDF and SPARQL are good at):

Bob has a very good point: marketing “…requires talking to those people in language that they understand….”

That is, no matter how “good” we think a solution may be, it won’t interest others until we explain it in terms they “get.”

But “marketing” requires more than a lingua franca.

Once an offer is made and understood, it must interest the other person. Or it is very poor marketing.

We may think that any sane person would jump at the chance to reduce the time and expense of data cleaning. But that isn’t necessarily the case.

I once made a proposal that would substantially reduce the time and expense of maintaining membership records, hard-copy records that spanned decades and were growing every year. I made the proposal thinking it would be well received.

Hardly. I was called into my manager’s office and got a lecture on how the department in question had more staff, a larger budget, etc., than any other department. They had no interest whatsoever in my proposal, and I should not presume to offer further advice. (Years later my suggestion was adopted when budget issues forced the issue.)

Efficient information flow interested me but not management.

Bob and the rest of us need to ask the traditional question: Cui bono? (To whose benefit?)

Semantic technologies, just like any other, have winners and losers.

To effectively market our wares, we need to identify both.

### Metamarkets open sources distributed database Druid

Friday, October 26th, 2012

Metamarkets open sources distributed database Druid by Elliot Bentley.

From the post:

It’s no secret that the latest challenge for the ‘big data’ movement is moving from batch processing to real-time analysis. Metamarkets, who provide “Data Science-as-a-Service” business analytics, last year revealed details of in-house distributed database Druid – and have this week released it as an open source project.

Druid was designed to solve the problem of a database which allows multi-dimensional queries on data as and when it arrives. The company originally experimented with both relational and NoSQL databases, but concluded they were not fast enough for their needs and so rolled out their own.

The company claims that Druid’s scan speed is “33M rows per second per core”, able to ingest “up to 10K incoming records per second per node”. An earlier blog post outlines how the company managed to achieve scan speeds of 26B records per second using horizontal scaling. It does this via a distributed architecture, column orientation and bitmap indices.
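The bitmap indices mentioned in the quote can be sketched with one bitset per dimension value, so that a filter becomes a bitwise OR/AND over those bitsets. A toy Python version of the idea follows (Druid's real indices add compression such as CONCISE; nothing here is Druid's actual code):

```python
# Toy sketch of a bitmap index, the kind of structure Druid uses for
# fast filtering. One Python int serves as a bitset: bit i set means
# row i has that value, so filters are cheap bitwise operations.

from collections import defaultdict

class BitmapIndex:
    def __init__(self):
        self.bitmaps = defaultdict(int)  # value -> bitset over row ids
        self.rows = 0

    def add(self, value):
        self.bitmaps[value] |= 1 << self.rows
        self.rows += 1

    def matching_rows(self, *values):
        """Rows whose value is any of `values` (bitwise OR of bitmaps)."""
        bits = 0
        for v in values:
            bits |= self.bitmaps[v]
        return [i for i in range(self.rows) if bits >> i & 1]

idx = BitmapIndex()
for city in ["SF", "NY", "SF", "LA", "NY"]:
    idx.add(city)

print(idx.matching_rows("SF"))        # [0, 2]
print(idx.matching_rows("SF", "LA"))  # [0, 2, 3]
```

Scanning millions of rows then reduces to AND/OR over machine words, which is where column orientation plus bitmaps earn their keep.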

Now to see how exciting Druid is in practice!

Source code: https://github.com/metamx/druid

### Redis 2.6.2 Released!

Friday, October 26th, 2012

Redis 2.6.2 Released!

From the introduction to Redis:

Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.

You can run atomic operations on these types, like appending to a string; incrementing the value in a hash; pushing to a list; computing set intersection, union and difference; or getting the member with highest ranking in a sorted set.
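To make the "data structure server" idea concrete, here is an in-process Python toy that mirrors a few of the commands behind the operations just listed (APPEND, HINCRBY, SINTER, ZADD/ZREVRANGE). It is a sketch of the semantics only, not a Redis client:

```python
# In-process toy mirroring a few Redis commands. Method names follow
# Redis commands; a real deployment would issue these over a client
# library such as redis-py against a running server.

class MiniRedis:
    def __init__(self):
        self.data = {}

    def append(self, key, s):               # APPEND: append to a string
        self.data[key] = self.data.get(key, "") + s
        return len(self.data[key])

    def hincrby(self, key, field, n):       # HINCRBY: increment hash field
        h = self.data.setdefault(key, {})
        h[field] = h.get(field, 0) + n
        return h[field]

    def sinter(self, k1, k2):               # SINTER: set intersection
        return self.data.get(k1, set()) & self.data.get(k2, set())

    def zadd(self, key, score, member):     # ZADD: sorted set insert
        self.data.setdefault(key, {})[member] = score

    def zrevrange(self, key, start, stop):  # ZREVRANGE: highest rank first
        z = self.data.get(key, {})
        ordered = sorted(z, key=z.get, reverse=True)
        return ordered[start:stop + 1]

r = MiniRedis()
r.append("log", "hello ")
r.append("log", "world")
r.data["a"] = {1, 2, 3}
r.data["b"] = {2, 3, 4}
r.zadd("scores", 10, "alice")
r.zadd("scores", 25, "bob")
print(r.data["log"])                # hello world
print(r.sinter("a", "b"))           # {2, 3}
print(r.zrevrange("scores", 0, 0))  # ['bob']
```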

In order to achieve its outstanding performance, Redis works with an in-memory dataset. Depending on your use case, you can persist it either by dumping the dataset to disk every once in a while, or by appending each command to a log.
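The second persistence option, appending each command to a log, can be sketched in a few lines: every write is appended before it is acknowledged, and replaying the log rebuilds the in-memory dataset after a restart. (Redis's real AOF logs the actual commands and controls fsync behavior; this is only a toy model of the principle.)

```python
# Toy append-only-file (AOF) persistence: every write is logged, and
# replaying the log rebuilds the in-memory dataset after a restart.

import io

class Store:
    def __init__(self, log):
        self.data, self.log = {}, log
        for line in log.getvalue().splitlines():  # replay on startup
            k, _, v = line.partition("=")
            self.data[k] = v

    def set(self, k, v):
        self.data[k] = v
        self.log.write(f"{k}={v}\n")  # append before acknowledging

log = io.StringIO()          # stands in for the on-disk log file
s1 = Store(log)
s1.set("lang", "C")
s1.set("name", "redis")

s2 = Store(log)              # simulate a restart: replay recovers state
print(s2.data)               # {'lang': 'C', 'name': 'redis'}
```

The tradeoff the quote alludes to is exactly visible here: snapshot dumps are cheap but lossy between dumps, while the log preserves every write at the cost of logging on the write path.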

Redis also supports trivial-to-setup master-slave replication, with very fast non-blocking first synchronization, auto-reconnection on net split and so forth.

Other features include a simple check-and-set mechanism, pub/sub and configuration settings to make Redis behave like a cache.
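The check-and-set mechanism (WATCH/MULTI/EXEC in Redis terms) is optimistic concurrency: read a value, compute, and apply only if nothing changed in between, retrying otherwise. A minimal in-process sketch of that pattern, using a version counter (this models the idea, not the Redis wire protocol):

```python
# Minimal optimistic check-and-set in the spirit of Redis WATCH/MULTI/
# EXEC: a write succeeds only if the watched version is unchanged;
# otherwise the caller retries with fresh state.

class CASStore:
    def __init__(self):
        self.value, self.version = 0, 0

    def read(self):
        return self.value, self.version

    def compare_and_set(self, expected_version, new_value):
        if self.version != expected_version:
            return False             # someone wrote in between: abort
        self.value, self.version = new_value, self.version + 1
        return True

def atomic_increment(store):
    while True:                      # retry loop around the CAS
        v, ver = store.read()
        if store.compare_and_set(ver, v + 1):
            return

store = CASStore()
for _ in range(5):
    atomic_increment(store)
print(store.value)  # 5
```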

You can use Redis from most programming languages out there.

Redis is written in ANSI C and works in most POSIX systems like Linux, *BSD, and OS X without external dependencies. Linux and OS X are the two operating systems where Redis is developed and most tested, and we recommend using Linux for deploying. Redis may work in Solaris-derived systems like SmartOS, but the support is best effort. There is no official support for Windows builds, although you may have some options.

The “in-memory” nature of Redis will be a good excuse for more local RAM.

I noticed the most recent release of Redis at Alex Popescu’s myNoSQL.

### Spanner – …SQL Semantics at NoSQL Scale

Monday, October 22nd, 2012

From the post:

A lot of people seem to passionately dislike the term NewSQL, or pretty much any newly coined term for that matter, but after watching Alex Lloyd, Senior Staff Software Engineer Google, give a great talk on Building Spanner, that’s the term that fits Spanner best.

Spanner wraps the SQL + transaction model of OldSQL around the reworked bones of a globally distributed NoSQL system. That seems NewSQL to me.

As Spanner is a not so distant cousin of BigTable, the NoSQL component should be no surprise. Spanner is charged with spanning millions of machines inside any number of geographically distributed datacenters. What is surprising is how OldSQL has been embraced. In an earlier 2011 talk given by Alex at the HotStorage conference, the reason for embracing OldSQL was the desire to make it easier and faster for programmers to build applications. The main ideas will seem quite familiar:

• There’s a false dichotomy between little complicated databases and huge, scalable, simple ones. We can have features and scale them too.
• Complexity is conserved, it goes somewhere, so if it’s not in the database it’s pushed to developers.
• Push complexity down the stack so developers can concentrate on building features, not databases, not infrastructure.
• Keys for creating a fast-moving app team: ACID transactions; global Serializability; code a 1-step transaction, not 10-step workflows; write queries instead of code loops; joins; no user defined conflict resolution functions; standardized sync; pay as you go, get what you pay for predictable performance.

Spanner did not start out with the goal of becoming a NewSQL star. Spanner started as a BigTable clone, with a distributed file system metaphor. Then Spanner evolved into a global ProtocolBuf container. Eventually Spanner was pushed by internal Google customers to become more relational and application programmer friendly.

If you can’t stay for the full show, Todd provides a useful summary of the video. But if you have the time, take the time to enjoy the presentation!

### Big data cube

Sunday, October 14th, 2012

Big data cube by John D. Cook.

From the post:

Erik Meijer’s paper Your Mouse is a Database has an interesting illustration of “The Big Data Cube” using three axes to classify databases.

Enjoy John’s short take, then spend some time with Erik’s paper.

Some serious time with Erik’s paper.

You won’t be disappointed.

### Distributed Algorithms in NoSQL Databases

Wednesday, October 10th, 2012

Distributed Algorithms in NoSQL Databases by Ilya Katsov.

From the post:

Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and many other capabilities. It sounds like a big umbrella, and it is. Although it can hardly be said that the NoSQL movement brought fundamentally new techniques into distributed data processing, it triggered an avalanche of practical studies and real-life trials of different combinations of protocols and algorithms. These developments gradually highlight a system of relevant database building blocks with proven practical efficiency. In this article I’m trying to provide a more or less systematic description of techniques related to distributed operations in NoSQL databases.

In the rest of this article we study a number of distributed activities, like replication or failure detection, that could happen in a database. These activities, highlighted in bold below, are grouped into three major sections:

• Data Consistency. Historically, NoSQL paid a lot of attention to tradeoffs between consistency, fault-tolerance and performance to serve geographically distributed systems, low-latency or highly available applications. Fundamentally, these tradeoffs spin around data consistency, so this section is devoted to data replication and data repair.
• Data Placement. A database should accommodate itself to different data distributions, cluster topologies and hardware configurations. In this section we discuss how to distribute or rebalance data in such a way that failures are handled rapidly, persistence guarantees are maintained, queries are efficient, and system resources like RAM or disk space are used evenly throughout the cluster.
• System Coordination. Coordination techniques like leader election are used in many databases to implement fault-tolerance and strong data consistency. However, even decentralized databases typically track their global state, detect failures and topology changes. This section describes several important techniques that are used to keep the system in a coherent state.
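To make one of the data placement techniques concrete, here is a minimal consistent-hash ring in Python, a placement scheme many NoSQL systems use so that adding a node relocates only a fraction of the keys rather than reshuffling everything (a sketch of the idea; production rings add virtual nodes and replication):

```python
# Consistent-hash ring: keys map to the first node clockwise from
# their hash point. Adding a node moves only the keys that fall in
# its new arc, which is why rebalancing stays cheap.

import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        hashes = [p for p, _ in self.points]
        i = bisect.bisect(hashes, h(key)) % len(self.points)
        return self.points[i][1]

keys = [f"user:{i}" for i in range(1000)]
small = Ring(["node-a", "node-b", "node-c"])
large = Ring(["node-a", "node-b", "node-c", "node-d"])

# Count how many keys change owner when node-d joins the ring.
moved = sum(small.node_for(k) != large.node_for(k) for k in keys)
print(moved < 1000)  # only a fraction of the keys relocate
```

With naive `hash(key) % n` placement, by contrast, almost every key would move when `n` changes from 3 to 4.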

Slow going but well worth the effort.

Not the issues discussed in the puff-piece webinars extolling NoSQL solutions to “big data.”

I first saw this at Christophe Lalanne’s A bag of tweets / September 2012.

### Could Cassandra be the first breakout NoSQL database?

Thursday, October 4th, 2012

Could Cassandra be the first breakout NoSQL database? by Chris Mayer.

From the post:

Years of misunderstanding haven’t been kind to the NoSQL database. Aside from the confusing name (generally understood to mean ‘not only SQL’), there’s always been an air of reluctance from the enterprise world to move away from Oracle’s steady relational database, until there was a definite need to switch from tables to documents.

The emergence of Big Data in the past few years has been the kickstart NoSQL distributors needed. Relational databases cannot cope with the sheer amount of data coming in and can’t provide the immediacy large-scale enterprises need to obtain information.

Open source offerings have been lurking in the background for a while, with the highly-tunable Apache Cassandra becoming a community favourite quickly. Emerging from the incubator in October 2011, Cassandra’s beauty lies in its flexible schema, its hybrid data model (lying somewhere between a key-value and tabular database) and also through its high availability. Being from the Apache Software Foundation, there’s also intrinsic links to the big data ‘kernel’ Apache Hadoop, and search server Apache Solr giving users an extra dimension to their data processing and storage.

Using NoSQL on cheap servers for processing and querying data is proving an enticing option for companies of all sizes, especially in combination with MapReduce technology to crunch it all.

One company that appears to be leading this data-driven charge is DataStax, who this week announced the completion of a $25 million C round of funding. Having already permeated the environments of some large companies (notably Netflix), the San Mateo startup are making big noises about their enterprise platform, melding the worlds of Cassandra and Hadoop together. Netflix is a client worth crowing about, with DataStax’s enterprise option being used as one of their primary data stores.

Chris mentions some other potential players; MongoDB comes to mind, along with the Hadoop crowd.

I take the move from tables to documents as a symptom of a deeper issue. Relational databases rely on normalization to achieve their performance and reliability. So what happens if data is too large or arriving too quickly to be normalized?

Relational databases remain the weapon of choice for normalized data, but that doesn’t mean they work well with “dirty” data. “Dirty data,” as opposed to “documents,” seems to catch the real shift for which NoSQL solutions are better adapted.

Your results are only as good as the data, but you know that up front. Not when you realize your “normalized” data wasn’t. That has to be a sinking feeling.

### Accumulo: Why The World Needs Another NoSQL Database

Tuesday, September 4th, 2012

Accumulo: Why The World Needs Another NoSQL Database by Jeff Kelly.

From the post:

If you’ve been unable to keep up with all the competing NoSQL databases that have hit the market over the last several years, you’re not alone. To name just a few, there’s HBase, Cassandra, MongoDB, Riak, CouchDB, Redis, and Neo4J. To that list you can add Accumulo, an open source database originally developed at the National Security Agency. You may be wondering why the world needs yet another database to handle large volumes of multi-structured data.
The answer is, of course, that no one of these NoSQL databases has yet checked all the feature/functionality boxes that most enterprises require before deploying a new technology. In the Big Data world, that means the ability to handle the three V’s (volume, variety and velocity) of data, the ability to process multiple types of workloads (analytical vs. transactional), and the ability to maintain ACID (atomicity, consistency, isolation and durability) compliance at scale.

With each new NoSQL entrant, hope springs eternal that this one will prove the NoSQL messiah.

So what makes Accumulo different than all the rest? According to proponents, Accumulo is capable of maintaining consistency even as it scales to thousands of nodes and petabytes of data; it can both read and write data in near real-time; and, most importantly, it was built from the ground up with cell-level security functionality.

It’s the third feature – cell-level security – that has the Big Data community most excited. Accumulo is being positioned as an all-purpose Hadoop database and a competitor to HBase. While HBase, like Accumulo, is able to scale to thousands of machines while maintaining a relatively high level of consistency, it was not designed with any security, let alone cell-level security, in mind.

The current security documentation on Accumulo reads (in part):

Accumulo extends the BigTable data model to implement a security mechanism known as cell-level security. Every key-value pair has its own security label, stored under the column visibility element of the key, which is used to determine whether a given user meets the security requirements to read the value. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while preserving data confidentiality.

Security labels consist of a set of user-defined tokens that are required to read the value the label is associated with.
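A security label of this kind is, in effect, a boolean expression over the user's tokens. A toy evaluator for Accumulo-style expressions (assuming the `&` and `|` operators and parentheses the documentation describes; Accumulo's real parser also handles quoting and stricter precedence rules):

```python
# Toy evaluator for Accumulo-style visibility expressions: a value is
# readable iff its label expression is satisfied by the user's tokens.
# Syntax: tokens combined with & (AND), | (OR) and parentheses.

def visible(expr, tokens):
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(expr) and expr[pos] == "|":
            pos += 1
            result = parse_and() or result
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(expr) and expr[pos] == "&":
            pos += 1
            result = parse_term() and result
        return result

    def parse_term():
        nonlocal pos
        if expr[pos] == "(":
            pos += 1                      # consume "("
            result = parse_or()
            pos += 1                      # consume ")"
            return result
        start = pos
        while pos < len(expr) and expr[pos] not in "&|)":
            pos += 1
        return expr[start:pos] in tokens  # token satisfied by the user?

    return parse_or()

print(visible("secret&(NATO|US)", {"secret", "US"}))  # True
print(visible("secret&(NATO|US)", {"US"}))            # False
```

Note that `parse_term()` is always evaluated before the `and`/`or` combination, so the full expression is consumed even when the result is already decided.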
The set of tokens required can be specified using syntax that supports logical AND and OR combinations of tokens, as well as nesting groups of tokens together.

If that sounds impressive, realize that:

• Users can overwrite data they cannot see, unless you set the table visibility constraint.
• Users can avoid the table visibility constraint, using the bulk import method. (Which you can also disable.)

More secure than a completely insecure solution, but nothing to write home about, yet.

Can you imagine the complexity that is likely to be exhibited in an inter-agency context for security labels?

BTW, how do I determine the semantics of a proposed security label? What if it conflicts with another security label?

Helpful links: Apache Accumulo.

I first saw this at Alex Popescu’s myNoSQL.

### MongoDB-as-a-service for private clouds rolled out by ScaleGrid, in MongoDirector

Monday, July 23rd, 2012

MongoDB-as-a-service for private clouds rolled out by ScaleGrid, in MongoDirector by Chris Mayer.

From the post:

Of all the NoSQL databases emerging at the moment, there appears to be one constant discussion taking place – are you using MongoDB? It appears to be the open source, document-oriented NoSQL database solution of choice, mainly due to its high-performance nature, its dynamism and its similarities to the JSON data structure (in BSON). Despite being written in C++, it is attracting attention from developers of different creeds. Its enterprise-level features have helped a fair bit in its charge up the rankings to leading NoSQL database, with it being the ideal datastore for highly scalable environments. Just a look at the latest in-demand skills on Indeed.com shows you that 10gen’s flagship product has infiltrated the enterprise well and truly.

Quite often, an enterprise can find the switch from SQL to NoSQL daunting and needs a helping hand.
Due to this, many MongoDB-related products are arriving just as quickly as MongoDB converts. The latest to launch as a public beta is MongoDirector from Seattle start-up ScaleGrid. MongoDirector offers an end-to-end lifecycle manager for MongoDB to guide newcomers along.

I don’t have anything negative to say about MongoDB, but I’m not sure the discussion of NoSQL solutions is quite as one-sided as Chris seems to think.

The Indeed.com site is a fun one to play around with, but I would not take the numbers all that seriously. For one thing, it doesn’t appear to control for duplicate job ads posted in different sources. But that’s a nitpicking objection.

A more serious one is when you start to explore the site and discover the top three job titles for IT. Care to guess what they are? Would you believe they don’t have anything to do with databases or MongoDB? At least as of today, and I am sure it changes over time, Graphic Designer, Technical Writer, and Project Manager all rank higher than Data Analyst, where you would hope to find some MongoDB jobs. (Information Technology Industry – 23 July 2012)

BTW, for your amusement, when I was looking for information on database employment, I encountered Database Administrators, from the Bureau of Labor Statistics in the United States. The data is available for download as XLS files. The site says blanks on the maps are from lack of data. I suspect the truth is there are no database administrators in Wyoming. Or at least I could point to the graphic as some evidence for my claim.

I think you need to consider the range of database options, from very traditional SQL vendors to bleeding-edge No/New/Maybe/SQL solutions, including MongoDB. The question is which one meets your requirements, flavor of the month or no.

### Thinking in Datomic: Your data is not square

Tuesday, July 10th, 2012

Thinking in Datomic: Your data is not square by Pelle Braendgaard.
From the post:

Datomic is so different from regular databases that your average developer will probably choose to ignore it. But for the developer and startup who takes the time to understand it properly I think it can be a real unfair advantage as a choice for a data layer in your application. In this article I will deal with the core fundamental definition of how data is stored in Datomic. This is very different from all other databases so before we even deal with querying and transactions I think it’s a good idea to look at it.

Yawn, “your data is not square.”

Just teasing. But we have all heard the criticism of relational tables. I think writers can assume that much, at least in technical forums.

The lasting value of the NoSQL movement (in addition to whichever software packages survive) will be its emphasis on analysis of your data. Your data may fit perfectly well into a square, but you need to decide that after looking at your data, not before.

The same can be said about the various NoSQL offerings. Your data may or may not be suited for a particular NoSQL option. The data analysis “cat being out of the bag,” it should be applied to NoSQL options as well.

True, almost any option will work, but your question should be: why is option X the best option for my data/use case?

### Intro to HBase Internals and Schema Design

Tuesday, July 10th, 2012

Intro to HBase Internals and Schema Design by Alex Baranau.

You will be disappointed by the slide that reads:

HBase will not adjust cluster settings to optimal based on usage patterns automatically.

Sorry, but we just aren’t quite to drag-n-drop software that optimizes to arbitrary data without user intervention. Not sure we could keep that secret from management very long in any case, so perhaps all for the best.

Once you get over your chagrin at having to still work (a little, anyway), you will find Alex’s presentation a high-level peek at the internals of HBase. Should be enough to get you motivated to learn more on your own.
Not guaranteeing that, but that should be the average result.

### Intro to HBase [Augmented Marketing Opportunities]

Tuesday, July 10th, 2012

Intro to HBase by Alex Baranau.

Slides from a presentation Alex did on HBase for a meetup in New York City. Fairly high-level overview but one of the better ones. Should leave you with a good orientation to HBase and its capabilities.

Just in case you are looking for a project, it would be interesting to point into a slide deck like this one with links into tutorials and documentation for the product. Thinking of the old “hub” document concept from HyTime, so you would not have to hard-code links in the source but could update them as newer material comes along.

Just in case you need some encouragement, think of every slide deck as an augmented marketing opportunity. Where you are leveraging not only the presentation but the other documentation and materials created by your group.

### Implementing Aggregation Functions in MongoDB

Tuesday, June 26th, 2012

Implementing Aggregation Functions in MongoDB by Arun Viswanathan and Shruthi Kumar.

From the post:

With the amount of data that organizations generate exploding from gigabytes to terabytes to petabytes, traditional databases are unable to scale up to manage such big data sets. Using these solutions, the cost of storing and processing data will significantly increase as the data grows. This is resulting in organizations looking for other economical solutions such as NoSQL databases that provide the required data storage and processing capabilities, scalability and cost effectiveness. NoSQL databases do not use SQL as the query language. There are different types of these databases such as document stores, key-value stores, graph databases, object databases, etc. Typical use cases for NoSQL databases include archiving old logs, event logging, ecommerce application logs, gaming data, social data, etc. due to their fast read-write capability.
The stored data would then need to be processed to gain useful insights on customers and their usage of the applications.

The NoSQL database we use in this article is MongoDB, which is an open source document-oriented NoSQL database system written in C++. It provides high-performance document-oriented storage as well as support for writing MapReduce programs to process data stored in MongoDB documents. It is easily scalable and supports auto partitioning. MapReduce can be used for aggregation of data through batch processing. MongoDB stores data in BSON (Binary JSON) format, supports a dynamic schema and allows for dynamic queries. The Mongo Query Language is expressed as JSON and is different from the SQL queries used in an RDBMS. MongoDB provides an Aggregation Framework that includes utility functions such as count, distinct and group. However, more advanced aggregation functions such as sum, average, max, min, variance and standard deviation need to be implemented using MapReduce.

This article describes the method of implementing common aggregation functions like sum, average, max, min, variance and standard deviation on a MongoDB document using its MapReduce functionality. Typical applications of aggregations include business reporting of sales data, such as calculation of total sales by grouping data across geographical locations, financial reporting, etc.

Not terribly advanced but enough to get you started with creating aggregation functions. Includes “testing” of the aggregation functions that are written in the article.

If Python is more your cup of tea, see: Aggregation in MongoDB (Part 1) and Aggregation in MongoDB (Part 2).

### NoSQL Standards [query languages - tuples anyone?]

Sunday, June 10th, 2012

Andrew Oliver writes at InfoWorld: The time for NoSQL standards is now – Like Larry Ellison’s yacht, the RDBMS is sailing into the sunset. But if NoSQL is to take its place, a standard query language and APIs must emerge soon.
A bit dramatic for my taste but a good overview of possible areas for standardization for NoSQL.

Problem: NoSQL query languages are tied to the base format/data structure of their implementation. For that matter, you could say the same thing about SQL. The query language is tied to the data structure.

I am not sure how you can have a query language that isn’t tied to a notion of structure. Even a very abstract one. One that a NoSQL implementation could map against its data structure. Tuples anyone?

Pointers and resources welcome!

### Working with NoSQL Databases [MS TechNet]

Saturday, June 9th, 2012

Working with NoSQL Databases

From Microsoft’s TechNet, an outline listing of NoSQL links and resources. Has the advantage (over similar resources) of being in English, Deutsch, Italian and Português.

### NoSQL Databases

Saturday, June 9th, 2012

NoSQL Databases by Christof Strauch, Stuttgart Media University. (PDF, 149 pages)

An overview and introduction to NoSQL databases. According to a post on High Scalability, Paper: NoSQL Databases – NoSQL Introduction and Overview, the paper was written between 2010-06 and 2011-02. As High Scalability notes, the paper is a bit dated but it remains a good general overview of the area.

It does omit graph databases entirely (except for some further reading in the bibliography). To be fair, even a summary of the work on graph databases would be at least as long as this paper, if not longer.

### Riak Handbook, Second Edition [$29 for 154 pages of content]

Friday, June 8th, 2012

Riak Handbook, Second Edition, by Mathias Meyer.

From the post:

Basho Technologies today announced the immediate availability of the second edition of Riak Handbook. The significantly updated Riak Handbook includes more than 43 pages of new content covering many of the latest feature enhancements to Riak, Basho’s industry-leading, open-source, distributed database. Riak Handbook is authored by former Basho developer and advocate, Mathias Meyer.

Riak Handbook is a comprehensive, hands-on guide to Riak. The initial release of Riak Handbook focused on the driving forces behind Riak, including Amazon Dynamo, eventual consistency and CAP Theorem. Through a collection of examples and code, Mathias’ Riak Handbook explores the mechanics of Riak, such as storing and retrieving data, indexing, searching and querying data, and sheds a light on Riak in production. The updated handbook expands on previously covered key concepts and introduces new capabilities, including the following:

• An overview of Riak Control, a new Web-based operations management tool
• Full coverage on pre- and post-commit hooks, including JavaScript and Erlang examples
• An entirely new section on deploying Erlang code in a Riak cluster
• Additional details on secondary indexes
• Insight into load balancing Riak nodes
• An introduction to network node planning
• An introduction to Riak CS, which includes Amazon S3 API compatibility

The updated Riak Handbook includes an entirely new section dedicated to popular use cases and is full of examples and code from real-time usage scenarios.

Mathias Meyer is an experienced software developer, consultant and coach from Berlin, Germany. He has worked with database technology leaders such as Sybase and Oracle. He entered into the world of NoSQL in 2008 and joined Basho Technologies in 2010.

I haven’t ordered a copy. The $29.00 for 154-odd pages of content seems a bit steep to me.

### Using MongoDB’s New Aggregation Framework in Python (MongoDB Aggregation Part 2)

Monday, June 4th, 2012

From the post:

Continuing on in my series on MongoDB and Python, this article will explore the new aggregation framework introduced in MongoDB 2.1. If you’re just getting started with MongoDB, you might want to read the previous articles in the series first:

And now that you’re all caught up, let’s jump right in….

Why a new framework?

If you’ve been following along with this article series, you’ve been introduced to MongoDB’s mapreduce command, which up until MongoDB 2.1 has been the go-to aggregation tool for MongoDB. (There’s also the group() command, but it’s really no more than a less-capable and un-shardable version of mapreduce(), so we’ll ignore it here.) So if you already have mapreduce() in your toolbox, why would you ever want something else?

Mapreduce is hard; let’s go shopping

The first motivation behind the new framework is that, while mapreduce() is a flexible and powerful abstraction for aggregation, it’s really overkill in many situations, as it requires you to re-frame your problem into a form that’s amenable to calculation using mapreduce(). For instance, when I want to calculate the mean value of a property in a series of documents, trying to break that down into appropriate map, reduce, and finalize steps imposes some extra cognitive overhead that we’d like to avoid. So the new aggregation framework is (IMO) simpler.
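The mean example from this paragraph shows the contrast well. Below, the same mean-per-group is computed twice in plain Python: once in the map/reduce/finalize shape that `mapreduce()` forces on you, and once via the single `$group` stage you would hand to the new aggregation framework. (The `$avg` stage is simulated in-process here; against a real server you would pass `pipeline` to pymongo's `collection.aggregate()`.)

```python
# Mean-per-group computed two ways: (1) the map/reduce/finalize shape
# mapreduce() imposes, and (2) a single declarative $group stage from
# the new aggregation framework, interpreted in-process for this demo.

from collections import defaultdict

docs = [{"region": "east", "sales": 10},
        {"region": "east", "sales": 20},
        {"region": "west", "sales": 30}]

# --- (1) mapreduce(): re-frame "mean" as partial aggregates ---------
def map_doc(doc):                      # emit (key, partial aggregate)
    yield doc["region"], {"n": 1, "sum": doc["sales"]}

def reduce_vals(vals):                 # combine partials (associative)
    out = {"n": 0, "sum": 0}
    for v in vals:
        out["n"] += v["n"]
        out["sum"] += v["sum"]
    return out

def finalize(agg):                     # derive the mean at the very end
    return agg["sum"] / agg["n"]

groups = defaultdict(list)
for d in docs:
    for k, v in map_doc(d):
        groups[k].append(v)
mr_means = {k: finalize(reduce_vals(vs)) for k, vs in groups.items()}

# --- (2) aggregation framework: one declarative $group stage --------
pipeline = [{"$group": {"_id": "$region", "avg": {"$avg": "$sales"}}}]

def run_group_avg(docs, stage):        # tiny $group/$avg interpreter
    spec = stage["$group"]
    key_f = spec["_id"].lstrip("$")
    val_f = spec["avg"]["$avg"].lstrip("$")
    acc = defaultdict(lambda: [0, 0])  # key -> [total, count]
    for d in docs:
        acc[d[key_f]][0] += d[val_f]
        acc[d[key_f]][1] += 1
    return {k: t / n for k, (t, n) in acc.items()}

print(mr_means)                        # {'east': 15.0, 'west': 30.0}
print(run_group_avg(docs, pipeline[0]))  # same result, far less machinery
```

The cognitive overhead the author describes is visible in the line count alone: the mapreduce version forces you to invent partial aggregates, while the pipeline just declares the grouping.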

Other than the obvious utility of the new aggregation framework in MongoDB, there is another reason to mention this post: you should use only as much aggregation (or, in topic map terminology, “merging”) as you need.

It isn’t possible to create a system that will correctly aggregate/merge all possible content. Take that as a given.

That is in part because new semantics emerge every day, and too many earlier semantics are poorly documented or unknown.

What we can do is establish requirements for particular semantics for given tasks and document those to facilitate their possible re-use in the future.

### Aggregation in MongoDB (Part 1)

Monday, June 4th, 2012

Aggregation in MongoDB (Part 1) by Rick Copeland.

From the post:

In some previous posts on mongodb and python, pymongo, and gridfs, I introduced the NoSQL database MongoDB, how to use it from Python, and how to use it to store large (more than 16 MB) files. Here, I’ll be showing you a few of the features that the current (2.0) version of MongoDB includes for performing aggregation. In a future post, I’ll give you a peek into the new aggregation framework included in MongoDB version 2.1.

An index “aggregates” information about a subject (called an ‘entry’), where the information is traditionally found between the covers of a book.

MongoDB offers predefined as well as custom “aggregations,” where the information field can be larger than a single book.

Good introduction to aggregation in MongoDB, although you (and I) really should get around to reading the MongoDB documentation.

### CUBRID

Tuesday, May 29th, 2012

I stumbled upon CUBRID via its Important Facts to Know about CUBRID page, where the first entry reads:

Naming Conventions:

The name of this DBMS is CUBRID, written in capital letters, and not Cubrid. We would appreciate it much if you followed this naming convention. It should be fairly simple to remember, isn’t it!?

Got my attention!

Not for a lack of projects with “attitude” on the Net, but because this is a project with “attitude” that expresses it cleverly, not just offensively.

Features of CUBRID:

Here are the key features that make CUBRID the most optimized open source database management system:

First time I have seen CUBRID.

Does promise a release supporting sharding in June 2012.

The documentation posits extensions to the relational data model:

Extending the Relational Data Model

Collection

The relational data model does not allow a single column to have multiple values. In CUBRID, however, you can create a column with several values. For this purpose, CUBRID provides collection data types. The collection data types are SET, MULTISET and LIST; the types are distinguished by whether duplicates are allowed and whether entry order is maintained.

• SET : A collection type that does not allow the duplication of elements. Elements are stored without duplication after being sorted regardless of their order of entry.
• MULTISET : A collection type that allows the duplication of elements. The order of entry is not considered.
• LIST : A collection type that allows the duplication of elements. Unlike with SET and MULTISET, the order of entry is maintained.
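The three collection types map onto familiar container semantics. A rough Python analogy (not CUBRID syntax, just an illustration of the duplicate/ordering rules):

```python
from collections import Counter

entries = ["b", "a", "b", "c"]  # order of entry, with a duplicate

# SET: no duplicates, stored sorted regardless of entry order
cubrid_set = sorted(set(entries))

# MULTISET: duplicates allowed, entry order not considered
cubrid_multiset = Counter(entries)

# LIST: duplicates allowed, entry order maintained
cubrid_list = list(entries)

print(cubrid_set)            # ['a', 'b', 'c']
print(cubrid_multiset["b"])  # 2
print(cubrid_list)           # ['b', 'a', 'b', 'c']
```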

Inheritance

Inheritance is a concept to reuse columns and methods of a parent table in those of child tables. CUBRID supports reusability through inheritance. By using inheritance provided by CUBRID, you can create a parent table with some common columns and then create child tables inherited from the parent table with some unique columns added. In this way, you can create a database model which can minimize the number of columns.
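Table inheritance here works much like class inheritance in object-oriented languages: common columns live in the parent, child tables add their own. A rough Python analogy (again not CUBRID syntax; the table names are made up):

```python
from dataclasses import dataclass

@dataclass
class Person:            # "parent table": the common columns
    name: str
    address: str

@dataclass
class Student(Person):   # "child table": inherits columns, adds its own
    student_id: int

s = Student(name="Ada", address="14 Main St", student_id=7)
print(s.name, s.student_id)  # inherited column plus the child's own
```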

Composition

In a relational database, the reference relationship between tables is defined as a foreign key. If the foreign key consists of multiple columns or the size of the key is significantly large, the performance of join operations between tables will be degraded. However, CUBRID allows the direct use of the physical address (OID) where the records of the referred table are located, so you can define the reference relationship between tables without using join operations.

That is, in an object-oriented database, you can create a composition relation where one record has a reference value to another by using the column displayed in the referred table as a domain (type), instead of referring to the primary key column from the referred table.

Suggestions/comments on what to try first?

### AvocadoDB becomes ArangoDB

Monday, May 28th, 2012

From the post:

to avoid legal issues with some other Avocado lovers we have to change the name of our database. We want to stick to Avocados and selected a variety from Mexico/Guatemala called “Arango”.

So in short words: AvocadoDB will become ArangoDB in the next days, everything else remains the same.

We are making great progress towards version 1 (deadline is end of May). The simple query language is finished and documented and the more complex ArangoDB query language (AQL) is mostly done. So stay tuned. And: in case you know someone who is a node.js user and interested in writing an API for ArangoDB: let me know!

We will all shop with more confidence knowing the “avocado” at Kroger isn’t a noSQL database masquerading as a piece of fruit.

Another topic map type issue: there are blogs and emails (public and private), all of which refer to “AvocadoDB.” Hard to pretend those aren’t “facts.” The question will be how to index “ArangoDB” so that we pick up prior traffic on “AvocadoDB.”

Such as design or technical choices discussed under “AvocadoDB” that are the answers to issues raised about “ArangoDB.”
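One minimal way to keep the old name findable is an alias-aware index, where both names resolve to the same topic. A hypothetical sketch (names and helpers invented for illustration):

```python
# Both names resolve to one canonical key, so queries for either
# name hit the same set of documents.
aliases = {"avocadodb": "arangodb", "arangodb": "arangodb"}

index = {}  # canonical name -> list of documents

def add_doc(name, doc):
    index.setdefault(aliases[name.lower()], []).append(doc)

def lookup(name):
    key = aliases.get(name.lower(), name.lower())
    return index.get(key, [])

add_doc("AvocadoDB", "old blog post")
add_doc("ArangoDB", "new release note")
print(lookup("AvocadoDB"))  # ['old blog post', 'new release note']
```

A query under either name returns the prior “AvocadoDB” traffic alongside the new “ArangoDB” material.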

### Berkeley DB at Yammer: Application Specific NoSQL Data Stores for Everyone

Saturday, May 26th, 2012

Berkeley DB at Yammer: Application Specific NoSQL Data Stores for Everyone

Alex Popescu calls attention to Ryan Kennedy of Yammer presenting on transitioning from PostgreSQL to Berkeley DB.

Is that the right direction?

Watch the presentation and see what you think.

### Solr 4 preview: SolrCloud, NoSQL, and more

Monday, May 21st, 2012

Solr 4 preview: SolrCloud, NoSQL, and more

From the post:

The first alpha release of Solr 4 is quickly approaching, bringing powerful new features to enhance existing Solr powered applications, as well as enabling new applications by further blurring the lines between full-text search and NoSQL.

The largest set of features goes by the development code-name “Solr Cloud” and involves bringing easy scalability to Solr. Distributed indexing with no single points of failure has been designed from the ground up for near real-time (NRT) and NoSQL features such as realtime-get, optimistic locking, and durable updates.

We’ve incorporated Apache ZooKeeper, the rock-solid distributed coordination project that is immune to issues like split-brain syndrome that tend to plague other hand-rolled solutions. ZooKeeper holds the Solr configuration, and contains the cluster meta-data such as hosts, collections, shards, and replicas, which are core to providing an elastic search capability.

When a new node is brought up, it will automatically be assigned a role such as becoming an additional replica for a shard. A bounced node can do a quick “peer sync” by exchanging updates with its peers in order to bring itself back up to date. New nodes, or those that have been down too long, recover by replicating the whole index of a peer while concurrently buffering any new updates.
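The optimistic locking mentioned above boils down to compare-and-swap on a document version: a write carries the version the writer last read, and the store rejects it if the document has moved on. A generic sketch of that mechanism (not Solr’s actual API):

```python
class VersionConflict(Exception):
    pass

store = {}  # doc_id -> (version, doc)

def update(doc_id, new_doc, expected_version):
    # Reject the write if someone else updated the doc since we read it.
    version, _ = store.get(doc_id, (0, None))
    if version != expected_version:
        raise VersionConflict(f"expected {expected_version}, found {version}")
    store[doc_id] = (version + 1, new_doc)

update("a", {"x": 1}, expected_version=0)
update("a", {"x": 2}, expected_version=1)
try:
    update("a", {"x": 3}, expected_version=1)  # stale version
except VersionConflict:
    print("conflict detected")
```

The losing writer re-reads the document and retries, instead of holding a lock for the duration of the transaction.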

Run, don’t walk, to learn about the new features for Solr 4.

You won’t be disappointed.

Interested to see the “…blurring [of] the lines between full-text search and NoSQL.”

Would be even more interested to see the “…blurring of indexing and data/data formats.”

That is to say that data, along with its format, is always indexed in digital media.

So why can’t I see the data as a table, as a graph, as a …., depending upon my requirements?

No ETL, JVD – Just View Differently.

Suspect I will have to wait a while for that, but in the meantime, enjoy Solr 4 alpha.

### Big Game Hunting in the Database Jungle

Thursday, May 17th, 2012

If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.

Alexander Thomson and Daniel Abadi write:

In the last decade, database technology has arguably progressed furthest along the scalability dimension. There have been hundreds of research papers, dozens of open-source projects, and numerous startups attempting to improve the scalability of database technology. Many of these new technologies have been extremely influential—some papers have earned thousands of citations, and some new systems have been deployed by thousands of enterprises.

So let’s ask a simple question: If all these new technologies are so scalable, why on earth are Oracle and DB2 still on top of the TPC-C standings? Go to the TPC-C Website with the top 10 results in raw transactions per second. As of today (May 16th, 2012), Oracle 11g is used for 3 of the results (including the top result), 10g is used for 2 of the results, and the rest of the top 10 is filled with various versions of DB2. How is technology designed decades ago still dominating TPC-C? What happened to all these new technologies with all these scalability claims?

The surprising truth is that these new DBMS technologies are not listed in the TPC-C top ten results not because they do not care enough to enter, but rather because they would not win if they did.

Preview of a paper that Alex is presenting at SIGMOD next week. Introducing “Calvin,” a new approach to database processing.

So where does Calvin fall in the OldSQL/NewSQL/NoSQL trichotomy?

Actually, nowhere. Calvin is not a database system itself, but rather a transaction scheduling and replication coordination service. We designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access).

What I find exciting about this report (and the paper) is the re-thinking of current assumptions concerning data processing. May be successful or may not be. But the exciting part is the attempt to transcend decades of acceptance of the maxims of our forefathers.

BTW, Calvin is reported to support 500,000 transactions a second.

Big game hunting anyone?*

* I don’t mean that as an expression of preference for or against Oracle.

I suspect Calvin will be a wake-up call to R&D at Oracle to redouble their own efforts at groundbreaking innovation.

Breakthroughs in matching up multi-dimensional indexes would be attractive to users who need to match up disparate data sources.

Speed is great but a useful purpose attracts customers.

### Progressive NoSQL Tutorials

Wednesday, May 16th, 2012

Have you ever gotten an advertising email with clean links in it? I mean a link without all the marketing crap appended to the end. The stuff you have to clean off before using it in a post or sending it to a friend?

Got my first one today. From Skills Matter on the free videos for their Progressive NoSQL Tutorials that just concluded.

High quality presentations, videos freely available after presentation, friendly links in email, just a few of the reasons to support Skills Matter.

The tutorials:

### Announcing Apache Hive 0.9.0

Saturday, May 5th, 2012

Announcing Apache Hive 0.9.0 by Carl Steinbach.

From the post:

This past Monday marked the official release of Apache Hive 0.9.0. Users interested in taking this release of Hive for a spin can download a copy from the Apache archive site. The following post is a quick summary of new features and improvements users can expect to find in this update of the popular data warehousing system for Hadoop.

The 0.9.0 release continues the trend of extending Hive’s SQL support. Hive now understands the BETWEEN operator and the NULL-safe equality operator, plus several new user defined functions (UDF) have now been added. New UDFs include printf(), sort_array(), and java_method(). Also, the concat_ws() function has been modified to support input parameters consisting of arrays of strings.

This Hive release also includes several significant improvements to the query compiler and execution engine. HIVE-2642 improved Hive’s ability to optimize UNION queries, HIVE-2881 made the map-side JOIN algorithm more efficient, and Hive’s ability to generate optimized execution plans for queries that contain multiple GROUP BY clauses was significantly improved in HIVE-2621.
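The NULL-safe equality operator (written `<=>` in HiveQL) differs from plain `=` only in how NULLs compare. A Python sketch of the semantics, with `None` standing in for SQL NULL:

```python
def eq(a, b):
    """Plain SQL '=': any comparison involving NULL yields NULL (unknown)."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_eq(a, b):
    """NULL-safe '<=>': NULL <=> NULL is true, NULL <=> x is false."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

print(eq(None, None))            # None (unknown, so the row is filtered out)
print(null_safe_eq(None, None))  # True
print(null_safe_eq(None, 1))     # False
```

This matters in joins and filters: `WHERE col = NULL` matches nothing, while the NULL-safe form lets NULLs match each other.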

The database world just keeps getting better!

### Why Every NoSQL Deployment Should Be Paired with Hadoop (webinar)

Monday, April 30th, 2012

Why Every NoSQL Deployment Should Be Paired with Hadoop (webinar)

May 9, 2012 at 10am Pacific

From the webinar registration page:

In this webinar you will hear from Dr. Amr Awadallah, Co-Founder and CTO of Cloudera and James Phillips, Co-Founder and Senior VP of Products at Couchbase.

Frequently the terms NoSQL and Big Data are conflated – many view them as synonyms. It’s understandable – both technologies eschew the relational data model and spread data across clusters of servers, versus relational database technology which favors centralized computing. But the “problems” these technologies address are quite different. Hadoop, the Big Data poster child, is focused on data analysis – gleaning insights from large volumes of data. NoSQL databases are transactional systems – delivering high-performance, cost-effective data management for modern real-time web and mobile applications; this is the Big User problem. Of course, if you have a lot of users, you are probably going to generate a lot of data. IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that this number will double every two years. The proliferation of user-generated data from interactive web and mobile applications are key contributors to this growth. In this webinar, we will explore why every NoSQL deployment should be paired with a Big Data analytics solution.

In this session you will learn:

• Why NoSQL and Big Data are similar, but different
• The categories of NoSQL systems, and the types of applications for which they are best suited
• How Couchbase and Cloudera’s Distribution Including Apache Hadoop can be used together to build better applications
• Explore real-world use cases where NoSQL and Hadoop technologies work in concert

Have you ever wanted to suggest a survey to Gartner or the technology desk at the Wall Street Journal?

Asking c-suite types at Fortune 500 firms the following questions among others:

• Is there a difference between NoSQL and Big Data?
• What percentage of software projects failed at your company last year?

Could go a long way to explaining the persistent and high failure rate of software projects.

Catch the webinar. Always the chance you will learn how to communicate with c-suite types. Maybe.

### Git: the NoSQL Database

Thursday, April 26th, 2012

Git: the NoSQL Database

Brandon Keepers has a nice slide deck on using Git as a NoSQL database.

If you have one of his use cases, consider Git.

I recommend the slidedeck more for his analysis of what is or is not possible with Git.

All too often the shortcomings of a database or of ten-year-old code are seen as fundamental rather than accidental.

Accidents, like mistakes, can be corrected.

### Faster Apache CouchDB

Wednesday, April 25th, 2012

Kay Ewbank reports:

Apache has announced the release of CouchDB 1.2.0. It brings lots of improvements, some of which mean apps written for older versions of CouchDB will no longer work.

According to the blog post from its developers, the changes start with improved performance and security. The performance is better because the developers have added a native JSON parser whose performance-critical portions are implemented in C, so latency and throughput for all database and view operations are improved. JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write and for machines to parse and generate. The CouchDB team is using the yajl library for its JSON parser.

The new version of CouchDB also has optional file compression for database and view index files, with all storage operations being passed through Google’s snappy compressor. This means less data has to be transferred, so access is faster.

Alongside these headline changes for performance, the team has also made other changes that take the Erlang runtime system into account to improve concurrency when writing data to databases and view index files.

Grab a copy here, or see Kay’s post for more details.