Archive for the ‘NoSQL’ Category

MongoDB: The Definitive Guide 2nd Edition is Out!

Thursday, May 23rd, 2013

MongoDB: The Definitive Guide 2nd Edition is Out! by Kristina Chodorow.

From the webpage:

The second edition of MongoDB: The Definitive Guide is now available from O’Reilly! It covers both developing with and administering MongoDB. The book is language-agnostic: almost all of the examples are in JavaScript.

Looking forward to enjoying the second edition as much as the first!

Although I am not really sure that always using JavaScript means you are “language-agnostic.” ;-)

Apache Drill

Sunday, May 19th, 2013

Michael Hausenblas gave a great lecture on Apache Drill at NoSQL Matters 2013.

Slides.

Google’s Dremel Paper

He projects a “beta” for Apache Drill by the second quarter and GA by the end of the year.

Apache Drill User.

From the rationale:

There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel.

How do you handle ad hoc exploration of data sets as part of planning a topic map?

Being able to “test” merging against data prior to implementation sounds like a good idea.

Warp: Multi-Key Transactions for Key-Value Stores

Saturday, May 18th, 2013

Warp: Multi-Key Transactions for Key-Value Stores by Robert Escriva, Bernard Wong and Emin Gün Sirer.

Abstract:

Implementing ACID transactions has been a longstanding challenge for NoSQL systems. Because these systems are based on a sharded architecture, transactions necessarily require coordination across multiple servers. Past work in this space has relied either on heavyweight protocols such as Paxos or clock synchronization for this coordination.

This paper presents a novel protocol for coordinating distributed transactions with ACID semantics on top of a sharded data store. Called linear transactions, this protocol achieves scalability by distributing the coordination task to only those servers that hold relevant data for each transaction. It achieves high performance by serializing only those transactions whose concurrent execution could potentially yield a violation of ACID semantics. Finally, it naturally integrates chain-replication and can thus tolerate faults of both clients and servers. We have fully implemented linear transactions in a commercially available data store. Experiments show that this system achieves 1-9× more throughput than MongoDB, Cassandra and HyperDex on the Yahoo! Cloud Serving Benchmark, even though none of the latter systems provide transactional guarantees.

Warp looks wicked cool!

Of particular interest is the non-ordering of transactions that have no impact on other transactions. That alone would be interesting for a topic map merging situation.

For more details, see the Warp page, or

Download Warp

Warp Tutorial

Warp Performance Benchmarks

I first saw this at High Scalability.

Solr 4, the NoSQL Search Server [Webinar]

Friday, May 17th, 2013

Solr 4, the NoSQL Search Server by Yonik Seeley

Date: Thursday, May 30, 2013
Time: 10:00am Pacific Time

From the description:

The long awaited Solr 4 release brings a large amount of new functionality that blurs the line between search engines and NoSQL databases. Now you can have your cake and search it too with Atomic updates, Versioning and Optimistic Concurrency, Durability, and Real-time Get!

Learn about new Solr NoSQL features and implementation details of how the distributed indexing of Solr Cloud was designed from the ground up to accommodate them.
Featured Presenter:

Yonik Seeley – Creator of Apache Solr and the Chief Open Source Architect and Co-Founder at LucidWorks. Mr. Seeley is an Apache Lucene/Solr PMC member and committer and an expert in distributed search systems architecture and performance. His work experience includes CNET Networks, BEA and Telcordia. He earned his M.S. in Computer Science from Stanford University.

This could be a real treat!

Notes on the webinar to follow.
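
In the meantime, here is a rough sketch of what real-time get, atomic updates, and optimistic concurrency look like from the client side. It assumes a local Solr 4 instance with a core named collection1 and an existing document with id book-1; the price field is made up for illustration.

```python
import json
import requests

base = "http://localhost:8983/solr/collection1"

# Real-time get: fetch the latest version of a document, even if not yet committed.
doc = requests.get(base + "/get", params={"id": "book-1"}).json()["doc"]

# Atomic update with optimistic concurrency: the "set" modifier changes one field,
# and the _version_ value makes Solr reject the update (HTTP 409) if the document
# has changed since we read it.
update = [{"id": "book-1",
           "price": {"set": 9.99},
           "_version_": doc["_version_"]}]
resp = requests.post(base + "/update?commit=true",
                     data=json.dumps(update),
                     headers={"Content-Type": "application/json"})
print(resp.status_code)
```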

Strange Loop 2013

Saturday, April 27th, 2013

Strange Loop 2013

Dates:

  • Call for presentation opens: Apr 15th, 2013
  • Call for presentation ends: May 9, 2013
  • Speakers notified by: May 17, 2013
  • Registration opens: May 20, 2013
  • Conference dates: Sept 18-20th, 2013

From the webpage:

Below is some guidance on the kinds of topics we are seeking and have historically accepted.

  • Frequently accepted or desired topics: functional programming, logic programming, dynamic/scripting languages, new or emerging languages, data structures, concurrency, database internals, NoSQL databases, key/value stores, big data, distributed computing, queues, asynchronous or dataflow concurrency, STM, web frameworks, web architecture, performance, virtual machines, mobile frameworks, native apps, security, biologically inspired computing, hardware/software interaction, historical topics.
  • Sometimes accepted (depends on topic): Java, C#, testing frameworks, monads
  • Rarely accepted (nothing wrong with these, but other confs cover them well): Agile, JavaFX, J2EE, Spring, PHP, ASP, Perl, design, layout, entrepreneurship and startups, game programming

It isn’t clear why Strange Loop claims to have “archives:”

2009, 2010, 2011, 2012

As far as I can tell, these are listings with bios of prior presentations, but no substantive content.

Am I missing something?

Aerospike

Friday, April 19th, 2013

Aerospike

From the architecture overview:

Aerospike is a fast Key Value Store or Distributed Hash Table architected to be a flexible NoSQL platform for today’s high scale Apps. Designed to meet the reliability or ACID requirements of traditional databases, there is no single point of failure (SPOF) and data is never lost. Aerospike can be used as an in-memory database and is uniquely optimized to take advantage of the dramatic cost benefits of flash storage. Written in C, Aerospike runs on Linux.

Based on our own experiences developing mission-critical applications with high scale databases and our interactions with customers, we’ve developed a general philosophy of operational efficiency that guides product development. Three principles drive Aerospike architecture: NoSQL flexibility, traditional database reliability, and operational efficiency.

Technical details first published in Proceeding of the VLDB (Very Large Databases), Citrusleaf: A Real-Time NoSQL DB which Preserves ACID by V. Srinivasan and Brian Bulkowski.

You can guess why they changed the name. ;-)

There is a free community edition, along with an SDK and documentation.
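
For a feel of the key-value API, here is a minimal sketch using the Aerospike Python client. It assumes the client library is installed, a node is running locally on the default port, and the stock “test” namespace exists; the set and bin names are mine.

```python
import aerospike

# Connect to a single local node.
config = {"hosts": [("127.0.0.1", 3000)]}
client = aerospike.client(config).connect()

# Records live at (namespace, set, key) and hold named bins (like columns).
key = ("test", "users", "alice")
client.put(key, {"name": "Alice", "visits": 1})

(key, meta, bins) = client.get(key)
print(bins)          # {'name': 'Alice', 'visits': 1}

client.close()
```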

Relies on RAM and SSDs.

Timo Elliott was speculating about entirely RAM-based computing in: In-Memory Computing.

Imagine losing all the special coding tricks to get performance despite disk storage.

Simpler code and fewer operations should result in higher speed.

How to Compare NoSQL Databases

Friday, April 19th, 2013

How to Compare NoSQL Databases by Ben Engber. (video)

From the description:

Ben Engber, CEO and founder of Thumbtack Technology, will discuss how to perform tuned benchmarking across a number of NoSQL solutions (Couchbase, Aerospike, MongoDB, Cassandra, HBase, others) and to do so in a way that does not artificially distort the data in favor of a particular database or storage paradigm. This includes hardware and software configurations, as well as ways of measuring to ensure repeatable results.

We also discuss how to extend benchmarking tests to simulate different kinds of failure scenarios to help evaluate the maintainability and recoverability of different systems. This requires carefully constructed tests and significant knowledge of the underlying databases — the talk will help evaluators overcome the common pitfalls and time sinks involved in trying to measure this.

Lastly we discuss the YCSB benchmarking tool, its significant limitations, and the significant extensions and supplementary tools Thumbtack has created to provide distributed load generation and failure simulation.

Ben makes a very good case for understanding the details of your use case versus the characteristics of particular NoSQL solutions.

Where you will find “better” performance depends on non-obvious details.

Watch the use of terms like “consistency” in this presentation.

The paper Ben refers to: Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs.

Forty-three pages of analysis and charts.

Slow but interesting reading.

If you are into the details of performance and NoSQL databases.

Mumps: The Proto-Database…

Wednesday, March 27th, 2013

Mumps: The Proto-Database (Or How To Build Your Own NoSQL Database) by Rob Tweed.

From the post:

I think that one of the problems with Mumps as a database technology, and something that many people don’t like about the Mumps database is that it is a very basic and low-level engine, without any of the frills and value-added things that people expect from a database these days. A Mumps database doesn’t provide built-in indexing, for example, nor does it have any high-level query language (eg SQL, Map/Reduce) built in, though there are add-on products that can provide such capabilities.

On the other hand, a raw Mumps database, such as GT.M, is actually an interesting beast, as it turns out to provide everything you need to design and create your own NoSQL (or pretty much any other kind of) database. As I’ve discussed and mentioned a number of times in these articles, it’s a Universal NoSQL engine.

Why, you might ask, would you want to create your own NoSQL database? I’d possibly agree, but there hardly seems to be a week go by without someone doing exactly that and launching yet another NoSQL database. So, there’s clearly a perceived need or desire to do so.

I first saw this at Mumps: The Proto-Database by Alex Popescu.

Alex asks:

The question I’d ask myself is not “why would I build another NoSQL database”, but rather “why none of the popular ones are built using Mumps?”.

I suspect the answer is the same as the answer to why popular NoSQL databases, such as MongoDB, are re-inventing text indexing. (see MongoDB 2.4 Release)
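
To make the “build your own” point concrete, here is a small sketch (a Python dict standing in for a persistent Mumps global) of the kind of thing GT.M leaves to you: the engine only gives you keys and values, so a secondary index is just more keys that you maintain yourself. The key layout and field names are mine, purely for illustration.

```python
store = {}   # stand-in for a persistent, ordered global

def save(doc_id, doc):
    store[("docs", doc_id)] = doc
    # Hand-rolled secondary index: one key per (tag, doc_id) pair.
    for tag in doc.get("tags", []):
        store[("ix_tag", tag, doc_id)] = True

def by_tag(tag):
    hits = [k for k in store if k[0] == "ix_tag" and k[1] == tag]
    return [store[("docs", doc_id)] for (_, _, doc_id) in hits]

save(1, {"title": "Universal NoSQL", "tags": ["mumps", "nosql"]})
save(2, {"title": "GT.M notes", "tags": ["mumps"]})
print(by_tag("mumps"))   # both documents come back
```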

Database Landscape Map – February 2013

Wednesday, March 27th, 2013

Database Landscape Map – February 2013 by 451 Research.

Database map

A truly awesome map of available databases.

Originated from Neither fish nor fowl: the rise of multi-model databases by Matthew Aslett.

Matthew writes:

One of the most complicated aspects of putting together our database landscape map was dealing with the growing number of (particularly NoSQL) databases that refuse to be pigeon-holed in any of the primary databases categories.

I have begun to refer to these as “multi-model databases” in recognition of the fact that they are able to take on the characteristics of multiple databases. In truth though there are probably two different groups of products that could be considered “multi-model”:

I think I understand the grouping from the key to the map but the ordering within groups, if meaningful, escapes me.

I am sure you will recognize most of the names but equally sure there will be some you can’t quite describe.

Enjoy!

MongoDB 2.4 Release

Tuesday, March 19th, 2013

MongoDB 2.4 Release

From the webpage:

Developer Productivity

  • Capped Arrays simplify development by making it easy to incorporate fixed, sorted lists for features like leaderboards and logging.
  • Geospatial Enhancements enable new use cases with support for polygon intersections and analytics based on geospatial data.
  • Text Search provides a simplified, integrated approach to incorporating search functionality into apps (Note: this feature is currently in beta release).

Operations

  • Hash-Based Sharding simplifies deployment of large MongoDB systems.
  • Working Set Analyzer makes capacity planning easier for ops teams.
  • Improved Replication increases resiliency and reduces administration.
  • Mongo Client creates an intuitive, consistent feature set across all drivers.

Performance

  • Faster Counts and Aggregation Framework Refinements make it easier to leverage real-time, in-place analytics.
  • V8 JavaScript Engine offers better concurrency and faster performance for some operations, including MapReduce jobs.

Monitoring

  • On-Prem Monitoring provides comprehensive monitoring, visualization and alerting on more than 100 operational metrics of a MongoDB system in real time, based on the same application that powers 10gen’s popular MongoDB Monitoring Service (MMS). On-Prem Monitoring is only available with MongoDB Enterprise.



Security

  • Kerberos Authentication enables enterprise and government customers to integrate MongoDB into existing enterprise security systems. Kerberos support is only available in MongoDB Enterprise.
  • Role-Based Privileges allow organizations to assign more granular security policies for server, database and cluster administration.

You can read more about the improvements to MongoDB 2.4 in the Release Notes. Also, MongoDB 2.4 is available for download on MongoDB.org.

Lots to look at in MongoDB 2.4!

But I am curious about the beta text search feature.

MongoDB Text Search: Experimental Feature in MongoDB 2.4 says:

Text search (SERVER-380) is one of the most requested features for MongoDB. 10gen is working on an experimental text-search feature, to be released in v2.4, and we’re already seeing some talk in the community about the native implementation within the server. We view this as an important step towards fulfilling a community need.

MongoDB text search is still in its infancy and we encourage you to try it out on your datasets. Many applications use both MongoDB and Solr/Lucene, but realize that there is still a feature gap. For some applications, the basic text search that we are introducing may be sufficient. As you get to know text search, you can determine when MongoDB has crossed the threshold for what you need. (emphasis added)

So, why isn’t MongoDB incorporating Solr/Lucene instead of a home grown text search feature?

Seems like users could leverage their Solr/Lucene skills with their MongoDB installations.

Yes?
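
If you want to kick the tires on the beta yourself, here is a rough sketch with pymongo. It assumes a 2.4 server started with the textSearchEnabled=true parameter and a reasonably recent driver; the collection and field names are made up.

```python
from pymongo import MongoClient, TEXT

client = MongoClient()
db = client.test

# Build the (beta) text index and add a document to search.
db.articles.ensure_index([("body", TEXT)])
db.articles.insert({"body": "MongoDB 2.4 ships text search as a beta feature"})

# In 2.4, text search is exposed as a database command rather than a query operator.
result = db.command("text", "articles", search="text search")
for hit in result["results"]:
    print(hit["obj"]["body"], hit["score"])
```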

Why FoundationDB Might Be All It’s Cracked Up To Be

Friday, March 8th, 2013

Why FoundationDB Might Be All It’s Cracked Up To Be by Doug Turnbull.

From the post:

When I first heard about FoundationDB, I couldn’t imagine how it could be anything but vaporware. Seemed like Unicorns crapping happy rainbows to solve all your problems. As I’m learning more about it though, I realize it could actually be something ground breaking.

NoSQL: Lets Review…

So, I need to step back and explain one reason NoSQL databases have been revolutionary. In the days of yore, we used to normalize all our data across multiple tables on a single database living on a single machine. Unfortunately, Moore’s law eventually crapped out and maybe more importantly hard drive space stopped increasing massively. Our data and demands on it only kept growing. We needed to start trying to distribute our database across multiple machines.

Turns out, it’s hard to maintain transactionality in a distributed, heavily normalized SQL database. As such, a lot of NoSQL systems have emerged with simpler features, many promoting a model based around some kind of single row/document/value that can be looked up/inserted with a key. Transactionality for these systems is limited to a single key value entry (“row” in Cassandra/HBase or “document” in Mongo/Couch — we’ll just call them rows here). Rows are easily stored in a single node, although we can replicate this row to multiple nodes. Despite being replicated, it turns out transactionally working with single rows in distributed NoSQL is easier than guaranteeing transactionality of an SQL query visiting potentially many SQL tables in a distributed system.

There are deep design ramifications/limitations to the transactional nature of rows. First, you always try to cram a lot of data related to the row’s key into a single row, ending up with massive rows of hierarchical or flat data that all relates to the row key. This lets you cover as much data as possible under the row-based transactionality guarantee. Second, as you only have a single key to use from the system, you must choose very wisely what your key will be. You may need to think hard about how your data will be looked up through its whole life; it can be hard to go back. Additionally, if you need to lookup on a secondary value, you better hope that your database is friendly enough to have a secondary key feature or otherwise you’ll need to maintain a secondary row for storing the relationship. Then you have the problem of working across two rows, which doesn’t fit in the transactionality guarantee. Third, you might lose the ability to perform a join across multiple rows. In most NoSQL data stores, joining is discouraged and denormalization into large rows is the encouraged best practice.

FoundationDB Is Different

FoundationDB is a distributed, sorted key-value store with support for arbitrary transactions across multiple key-values — multiple “rows” — in the database.

As Doug points out, there is much left to be known.

Still, exciting to have something new to investigate.
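
Here is a minimal sketch of what those multi-key transactions look like from the Python bindings. It assumes a running local cluster; the api_version call should match whatever client version you have installed, and the key names are mine.

```python
import fdb

fdb.api_version(200)        # use the API version that matches the installed client
db = fdb.open()             # assumes a running cluster and the default cluster file

@fdb.transactional
def signup(tr, user, email):
    # One read and two writes across different keys; FoundationDB commits them
    # atomically, retrying the whole function on conflicts.
    if tr[b"email:" + email].present():
        raise ValueError("email already taken")
    tr[b"user:" + user] = email
    tr[b"email:" + email] = user

signup(db, b"alice", b"alice@example.com")
```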

FoundationDB

Monday, March 4th, 2013

FoundationDB

FoundationDB Beta 1 is now available!

It will take a while to sort out all of its features, etc.

I should mention that it is refreshing that the documentation contains Known Limitations.

All software has limitations but few ever acknowledge them up front.

You have to encounter one before one of the technical folks says: “…yes, we have been meaning to work on that.”

I would rather know up front what the limitations are.

Whether FoundationDB meets your requirements or not, it is good to see that kind of transparency.

NoSQL is Great, But You Still Need Indexes [MongoDB for example]

Wednesday, February 20th, 2013

NoSQL is Great, But You Still Need Indexes by Martin Farach-Colton.

From the post:

I’ve said it before, and, as is the nature of these things, I’ll almost certainly say it again: your database performance is only as good as your indexes.

That’s the grand thesis, so what does that mean? In any DB system — SQL, NoSQL, NewSQL, PostSQL, … — data gets ingested and organized. And the system answers queries. The pain point for most users is around the speed to answer queries. And the query speed (both latency and throughput, to be exact) depend on how the data is organized. In short: Good Indexes, Fast Queries; Poor Indexes, Slow Queries.

But building indexes is hard work, or at least it has been for the last several decades, because almost all indexing is done with B-trees. That’s true of commercial databases, of MySQL, and of most NoSQL solutions that do indexing. (The ones that don’t do indexing solve a very different problem and probably shouldn’t be confused with databases.)

It’s not true of TokuDB. We build Fractal Tree Indexes, which are much easier to maintain but can still answer queries quickly. So with TokuDB, it’s Fast Indexes, More Indexes, Fast Queries. TokuDB is usually thought of as a storage engine for MySQL and MariaDB. But it’s really a B-tree substitute, so we’re always on the lookout for systems where we can improve the indexing.

Enter MongoDB. MongoDB is beloved because it makes deployment fast. But when you peel away the layers, you get down to a B-tree, with all the performance headaches and workarounds that they necessitate.

That’s the theory, anyway. So we did some testing. We ripped out the part of MongoDB that takes care of secondary indices and plugged in TokuDB. We’ve posted the blogs before, but here they are again, the greatest hits of TokuDB+MongoDB: we show a 10x insertion performance, a 268x query performance, and a 532x (or 53,200% if you prefer) multikey index insertion performance. We also discussed covered indexes vs. clustered Fractal Tree Indexes.

Did somebody declare February 20th to be performance release day?

Did I miss that memo? ;-)

Like every geek, I like faster. But, here’s my question:

Have there been any studies on the impact of faster systems on searching and decision making by users?

My assumption is the faster I get a non-responsive result from a search, the sooner I can improve it.

But that’s an assumption on my part.

Is that really true?

Hypertable Has Reached A Major Milestone

Thursday, February 14th, 2013

Hypertable Has Reached A Major Milestone by Doug Judd.

From the post:

RangeServer Failover

With the release of Hypertable version 0.9.7.0 comes support for automatic RangeServer failover. Hypertable will now detect when a RangeServer has failed, logically remove it from the system, and automatically re-assign the ranges that it was managing to other RangeServers. This represents a major milestone for Hypertable and allows for very large scale deployments. We have been actively working on this feature, full-time, for 1 1/2 years. To give you an idea of the magnitude of the change, here are the commit statistics:

  • 441 changed files
  • 17,522 line additions
  • 6,384 line deletions

The reason that this feature has been a long time in the making is because we placed a very high standard of quality for this feature so that under no circumstance would a RangeServer failure lead to consistency problems or data loss. We’re confident that we’ve achieved 100% correctness under every conceivable circumstance. The two primary goals for the feature, robustness and application transparency, are described below.

That is a major milestone!

High-end data processing is becoming as crowded with viable options as low-end data processing. And the “low-end” of data processing keeps getting bigger.

MarkLogic Announces Free Developer License for Enterprise [With Odd Condition]

Wednesday, February 13th, 2013

MarkLogic Announces Free Developer License for Enterprise

From the post:

MarkLogic Corporation today announced the availability of a free Developer License for MarkLogic Enterprise Edition.

The Developer License provides access to the features available in MarkLogic Enterprise Edition, including integrated search, government-grade security, clustering, replication, failover, alerting, geospatial indexing, conversion, and a suite of application development tools. MarkLogic also announced the Mongo2MarkLogic converter, a Java-based tool for importing data from MongoDB into MarkLogic providing developers immediate access to features needed to build out enterprise-ready big data solutions.

“By providing a free Developer License we enable developers to quickly deliver reliable, scalable and secure information and analytic applications that are production-ready,” said Gary Bloom, CEO and President of MarkLogic. “Many of our customers first experimented with other free NoSQL products, but turned to MarkLogic when they recognized the need for search, security, support for ACID transactions and other features necessary for enterprise environments. Our goal is to eliminate the cost barrier for developers and give them access to the best enterprise NoSQL platform from the start.”

The Developer License for MarkLogic Enterprise Edition includes tools for faster application development, business intelligence (BI) tool integration, analytic functions and visualization tools, and the ability to create user-defined functions for fast and flexible analysis of huge volumes of data.

You would think that story would merit at least one link to the free developer program.

For your convenience: Developer License for Enterprise Edition. BTW, MarkLogic homepage.

That wasn’t hard. Two links and you have direct access to the topic of the story and the company.

One odd licensing condition:

Q. Can I publish my work done with MarkLogic Server?

A. We encourage you to share your work publicly, but note that you can not disclose, without MarkLogic prior written consent, any performance or capacity statistics or the results of any benchmark test performed on MarkLogic Server.

That sounds just a tad defensive, doesn’t it?

I haven’t looked at MarkLogic for a couple of iterations but earlier versions had no need to fear statistics or benchmark tests.

Results vary depending on how testing is done but anyone authorized to recommend or sign acquisition orders should know that.

If they don’t, your organization has more serious problems than needing a MarkLogic server.

Oracle’s MySQL 5.6 released

Wednesday, February 6th, 2013

Oracle’s MySQL 5.6 released

From the post:

Just over two years after the release of MySQL 5.5, the developers at Oracle have released a GA (General Availability) version of Oracle MySQL 5.6, labelled MySQL 5.6.10. In MySQL 5.5, the developers replaced the old MyISAM backend and used the transactional InnoDB as the default for database tables. With 5.6, the retrofitting of full-text search capabilities has enabled InnoDB to now take on the position of default storage engine for all purposes.

Accelerating the performance of sub-queries was also a focus of development; they are now run using a process of semi-joins and materialise much faster; this means it should not be necessary to replace subqueries with joins. Many operations that change the data structures, such as ALTER TABLE, are now performed online, which avoids long downtimes. EXPLAIN also gives information about the execution plans of UPDATE, DELETE and INSERT commands. Other optimisations of queries include changes which can eliminate table scans where the query has a small LIMIT value.

MySQL’s row-oriented replication now supports “row image control” which only logs the columns needed to identify and make changes on each row rather than all the columns in the changing row. This could be particularly expensive if the row contained BLOBs, so this change not only saves disk space and other resources but it can also increase performance. “Index Condition Pushdown” is a new optimisation which, when resolving a query, attempts to use indexed fields in the query first, before applying the rest of the WHERE condition.

MySQL 5.6 also introduces a “NoSQL interface” which uses the memcached API to offer applications direct access to the InnoDB storage engine while maintaining compatibility with the relational database engine. That underlying InnoDB engine has also been enhanced with persistent optimisation statistics, multithreaded purging and more system tables and monitoring data available.
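
That memcached interface means any memcached client can read and write InnoDB rows directly. A rough sketch with the python-memcached library, assuming the plugin is enabled on the server and listening on the default port; the key and value are made up.

```python
import memcache

# The InnoDB memcached plugin speaks the memcached protocol (port 11211 by default)
# and maps get/set onto an InnoDB table configured via the innodb_memcache schema.
mc = memcache.Client(["127.0.0.1:11211"])

mc.set("user:42", "alice")      # written straight into InnoDB, skipping the SQL parser
print(mc.get("user:42"))        # the same row remains visible to ordinary SQL queries
```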

Download MySQL 5.6.

I mentioned Oracle earlier today (When Oracle bought MySQL [Humor]) so it’s only fair that I point out their most recent release of MySQL.

SQL, NoSQL =? CoSQL? Category Theory to the Rescue

Wednesday, January 30th, 2013

A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman.

I missed this when it appeared in March of 2011.

From the conclusion:

The nascent noSQL market is extremely fragmented, with many competing vendors and technologies. Programming, deploying, and managing noSQL solutions requires specialized and low-level knowledge that does not easily carry over from one vendor’s product to another.

A necessary condition for the network effect to take off in the noSQL database market is the availability of a common abstract mathematical data model and an associated query language for noSQL that removes product differentiation at the logical level and instead shifts competition to the physical and operational level. The availability of such a common mathematical underpinning of all major noSQL databases can provide enough critical mass to convince businesses, developers, educational institutions, etc. to invest in noSQL.

In this article we developed a mathematical data model for the most common form of noSQL—namely, key-value stores as the mathematical dual of SQL’s foreign-/primary-key stores. Because of this deep and beautiful connection, we propose changing the name of noSQL to coSQL. Moreover, we show that monads and monad comprehensions (i.e., LINQ) provide a common query mechanism for both SQL and coSQL and that many of the strengths and weaknesses of SQL and coSQL naturally follow from the mathematics.

The ACM Digital Library reports only 3 citations, which is unfortunate for such an interesting proposal.

I have heard about key/value pairs somewhere else. I will have to think about that and get back to you. (Hint for the uninitiated, try the Topic Maps Reference Model (TMRM). A new draft of the TMRM is due to appear in a week or so.)
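
If the category theory feels distant, here is a loose illustration (mine, not the paper’s) of the duality it describes: in SQL, child rows point at their parent through a foreign key; in a key-value store, the parent is addressed by a single key and points at (or embeds) its children.

```python
# SQL-style: children carry a foreign key that references the parent.
customers = {7: {"name": "Ada"}}
orders = {1: {"customer_id": 7, "item": "book"},
          2: {"customer_id": 7, "item": "pen"}}
ada_orders = [o for o in orders.values() if o["customer_id"] == 7]

# coSQL-style: one key, one value; the parent holds its children directly.
kv_store = {"customer:7": {"name": "Ada",
                           "orders": [{"item": "book"}, {"item": "pen"}]}}
ada_orders_kv = kv_store["customer:7"]["orders"]
```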

11 Interesting Releases From the First Weeks of January

Thursday, January 24th, 2013

11 Interesting Releases From the First Weeks of January by Alex Popescu.

Alex has collected links for eleven (11) interesting NoSQL releases in January 2013!

Visit Alex’s post. You won’t be disappointed.

Static and Dynamic Semantics of NoSQL Languages […Combining Operators…]

Tuesday, December 25th, 2012

Static and Dynamic Semantics of NoSQL Languages (PDF) by Véronique Benzaken, Giuseppe Castagna, Kim Nguyễn and Jérôme Siméon.

Abstract:

We present a calculus for processing semistructured data that spans differences of application area among several novel query languages, broadly categorized as “NoSQL”. This calculus lets users define their own operators, capturing a wider range of data processing capabilities, whilst providing a typing precision so far typical only of primitive hard-coded operators. The type inference algorithm is based on semantic type checking, resulting in type information that is both precise, and flexible enough to handle structured and semistructured data. We illustrate the use of this calculus by encoding a large fragment of Jaql, including operations and iterators over JSON, embedded SQL expressions, and co-grouping, and show how the encoding directly yields a typing discipline for Jaql as it is, namely without the addition of any type definition or type annotation in the code.

From the conclusion:

On the structural side, the claim is that combining recursive records and pairs by unions, intersections, and negations suffices to capture all possible structuring of data, covering a palette ranging from comprehensions, to heterogeneous lists mixing typed and untyped data, through regular expression types and XML schemas. Therefore, our calculus not only provides a simple way to give a formal semantics to, reciprocally compare, and combine operators of different NoSQL languages, but also offers a means to equip these languages, in their current definition (ie, without any type definition or annotation), with precise type inference.

With lots of work in between the abstract and conclusion.

The capacity to combine operators of different NoSQL languages sounds relevant to a topic maps query language.

Yes?

I first saw this in a tweet by Computer Science.

ArangoDB

Wednesday, December 12th, 2012

ArangoDB

From the webpage:

A universal open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient sql-like query language or JavaScript/Ruby extensions.

Design considerations:

In a nutshell:

  • Schema-free schemas with shapes: Inherent structures at hand are automatically recognized and subsequently optimized.
  • Querying: ArangoDB is able to accomplish complex operations on the provided data (query-by-example and query-language).
  • Application Server: ArangoDB is able to act as application server on Javascript-devised routines.
  • Mostly memory/durability: ArangoDB is memory-based including frequent file system synchronizing.
  • AppendOnly/MVCC: Updates generate new versions of a document; automatic garbage collection.
  • ArangoDB is multi-threaded.
  • No indices on file: Only raw data is written on hard disk.
  • ArangoDB supports single nodes and small, homogenous clusters with zero administration.

I have mentioned this before but ran across it again at: An experiment with Vagrant and Neo4J by Patrick Mulder.
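
If you want a quick look without learning a driver, the HTTP API is enough. A rough sketch with requests, assuming a local server on the default port and an existing collection named people (the collection and field names are mine):

```python
import json
import requests

base = "http://127.0.0.1:8529"

# Store a document in the "people" collection.
requests.post(base + "/_api/document?collection=people",
              data=json.dumps({"name": "Ada", "born": 1815}))

# Run an AQL query through the cursor API.
query = {"query": "FOR p IN people FILTER p.born < 1900 RETURN p.name"}
result = requests.post(base + "/_api/cursor", data=json.dumps(query)).json()
print(result["result"])
```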

NuoDB [Everything in NuoDB is an Atom]

Saturday, November 24th, 2012

NuoDB

I last wrote about NuoDB in February of 2012, when it was still in private beta.

You can now download a free community edition with a limit of two nodes for the usual platforms.

The “under the hood” page reads (in part):

Everything in NuoDB is an Atom

Under the hood, NuoDB is an asynchronous, decentralized, peer-to-peer database. The NuoDB system is also object-oriented. Objects in NuoDB know how to perform various actions that create specific behaviors in the overall database. And at the heart of every object in NuoDB is the Atom. An Atom in NuoDB is like a single bird in a flock.

Atoms are self-describing objects (data and metadata) that together comprise the database. Everything in the NuoDB database is an Atom, including the schema, the indexes, and even the underlying data. For example, each table is an Atom that describes the metadata for the table and can reference other Atoms; such as Atoms that describe ranges of records in the table and their versions.

Atoms are Powerful

Atoms are intelligent, powerful, self-describing objects that together form the NuoDB database. Atoms know how to perform many actions, like these:

  • Atoms know how to make copies of themselves.
  • Atoms keep all copies of themselves up to date.
  • Atoms can broadcast messages. Atoms listen for events and changes from other Atoms.
  • Atoms can request data from other Atoms.
  • Atoms can serialize themselves to persistent storage.
  • Atoms can retrieve data from storage.

The Atoms are the Database

Everything in the database is an Atom, and the Atoms are the database. The Atoms work in concert to form both the Transaction (or Compute) Tier, and the Storage Tier.

A NuoDB Transaction Engine is a process that executes the SQL layer and is comprised completely of Atoms. The Transaction Engine operates on Atoms, listens for changes, and communicates changes with other Transaction Engines in the database.

A NuoDB Storage Manager is simply a special kind of Transaction Engine that allows Atoms to serialize themselves to permanent storage (such as a local disk or Amazon S3, for example).

A NuoDB database can be as simple as a single Transaction Engine and a single Storage Manager, or can be as complex as tens of Transaction Engines and Storage Managers distributed across dozens of computer hosts.

Some wag, in a report that reminded me to look at NuoDB again, was whining about how NuoDB would perform in query-intensive environments. I guess downloading a free copy to find out was too much effort.

Of course, you would have to define a “query intensive” environment and that would be no easy task. Lots of users with simple queries? (Define “lots” and “simple.”)

Just a suspicion as I wait for my download URL to arrive: “query” in an atom-based system may not have the same internal processes as in a traditional relational database. Or perhaps not entirely.

For example, what if the notion of “retrieval” from a location in memory is no longer operative? That is, a query is composed of atoms that begin messaging as they are created, and so receive information before the user reaches the end of a query string.

And more than that, query atoms that occur frequently could be persisted so the creation cost is not incurred in subsequent queries.

Hard to say without knowing more about it but it definitely should be on your short list of products to watch.

10 things never to do with a relational database

Saturday, November 17th, 2012

10 things never to do with a relational database (The data explosion demands new solutions, yet the hoary old RDBMS still rules. Here’s where you really shouldn’t use it) by Andrew C. Oliver.

From the post:

I am a NoSQLer and a big data guy. That’s a nice coincidence, because as you may have heard, data growth is out of control.

Old habits die hard. The relational DBMS still reigns supreme. But even if you’re a dyed-in-the-wool, Oracle-loving, PL/SQL-slinging glutton for the medieval RAC, think twice, think many times, before using your beloved technology for the following tasks.

[ If you aren’t going to use an RDBMS, which freaking database should you use? | See InfoWorld’s comparative review of NoSQL databases. | Keep up with the latest developer news with InfoWorld’s Developer World newsletter. ]

If you guessed this post is from InfoWorld and that it’s rather ranty, you are right on both counts.

Andrew’s 10 things:

  1. Search
  2. Recommendations
  3. High-frequency trading
  4. Product cataloguing
  5. Users/groups and ACLs
  6. Log analysis
  7. Media repository
  8. Email
  9. Classified ads
  10. Time-series/forecasting

Andrew ducks and covers in his conclusion with:

Can you use the RDBMS for some or many of these? Sure — I have and people continue to. However, is it a good fit? Not really. I expect the cranky old men to disagree, but tradition alone is not a good reason to stick with the old way of doing things.

If you disagree with his assessment, you are by definition a “cranky old man,” and no one wants to be seen as a cranky old man.

Being a “cranky old man,” the label doesn’t sting so I feel free to disagree. ;-)

Andrew is right that tradition alone isn’t “a good reason to stick with the old way of doing things.”

On the other hand, the fact that something is new, or that venture capitalists have parted with cash for it, isn’t a reason to find a new way of doing things.

Your requirements aren’t only technical questions but also questions of IT competence to deploy a new solution, training of staff to use it, costs of retraining and construction, and others.

Ignoring the non-technical side of requirements is a step toward acquiring a white elephant to sleep in the middle of your office, interfering with day to day operations.

CodernityDB [Origin of the 3 V’s of Big Data]

Sunday, November 11th, 2012

CodernityDB

From the webpage:

CodernityDB pure Python, NoSQL, fast database

CodernityDB is opensource, pure Python (no 3rd party dependency), fast (really fast check Speed if you don’t believe in words), multiplatform, schema-less, NoSQL database. It has optional support for HTTP server version (CodernityDB-HTTP), and also Python client library (CodernityDB-PyClient) that aims to be 100% compatible with embedded version.
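
Being pure Python, it is easy to try. A minimal sketch (assuming pip install CodernityDB) using just the built-in id index; the path and document fields are mine:

```python
from CodernityDB.database import Database

db = Database('/tmp/codernity_example')
db.create()                                   # creates the storage files and the default id index

doc_id = db.insert({'title': 'NoSQL note', 'tags': ['python', 'embedded']})['_id']
print(db.get('id', doc_id))                   # fetch it back through the id index

for doc in db.all('id'):                      # iterate over everything
    print(doc)

db.close()
```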

“The hills are alive, with the sound of NoSQL databases….”

Sorry, I usually only sing in the shower. ;-)

I haven’t done a statistical survey (that may be in the offing) but it does seem like the stream of NoSQL databases continues unabated.

What I don’t know and you might: Has there always been a rumble of alternative databases, and does looking make them appear larger/more numerous? As in a side view mirror.

If we can discover what makes NoSQL databases popular now, that may apply to semantic integration.

I don’t buy the 3 V’s, Velocity, Volume, Variety, as an explanation for NoSQL database adoption.

Doug Laney, now of Gartner, Inc., then of Meta Group coined that phrase in “3D Data Management: Controlling Data Volume, Velocity and Variety“, Date: 6 February 2001:*



E-Commerce, in particular, has exploded data management challenges along three dimensions: volumes, velocity, and variety.


I don’t recall the current level of interest in NoSQL databases when faced with the same problems in 2001.

So what else has changed? (I don’t know or I would say.)

Comments/suggestions/pointers?


I was alerted to the origin of the three V’s by a reference to Doug Laney by Stephen Swoyer in Big Data — Why the 3Vs Just Don’t Make Sense and then followed a reference in Big Data (Wikipedia) to find the link I reproduce above.

G-Store: [Multi key Access]

Friday, November 2nd, 2012

G-Store: A Scalable Data Store for Transactional Multi key Access in the Cloud by Sudipto Das, Divyakant Agrawal and Amr El Abbadi.

Abstract:

Cloud computing has emerged as a preferred platform for deploying scalable web-applications. With the growing scale of these applications and the data associated with them, scalable data management systems form a crucial part of the cloud infrastructure. Key-Value stores, such as Bigtable, PNUTS, Dynamo, and their open source analogues, have been the preferred data stores for applications in the cloud. In these systems, data is represented as Key-Value pairs, and atomic access is provided only at the granularity of single keys. While these properties work well for current applications, they are insufficient for the next generation web applications, such as online gaming, social networks, collaborative editing, and many more, which emphasize collaboration. Since collaboration by definition requires consistent access to groups of keys, scalable and consistent multi key access is critical for such applications. We propose the Key Group abstraction that defines a relationship between a group of keys and is the granule for on-demand transactional access. This abstraction allows the Key Grouping protocol to collocate control for the keys in the group to allow efficient access to the group of keys. Using the Key Grouping protocol, we design and implement G-Store which uses a key-value store as an underlying substrate to provide efficient, scalable, and transactional multi key access. Our implementation using a standard key-value store and experiments using a cluster of commodity machines show that G-Store preserves the desired properties of key-value stores, while providing multi key access functionality at a very low overhead.

I encountered this while researching the intrusive keys found in: KitaroDB [intrusive-keyed database]

The notion of a key group that consists of multiple keys, assigned to a single node, seems similar to the collection of item identifiers in the TMDM (13250-2) after merging.

I haven’t gotten into the details but supporting collections of item identifiers on top of scalable key-value stores is an interesting prospect.
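
A rough sketch (mine, not G-Store’s protocol) of the Key Group idea: declare a group, route every member key to the node that owns the group’s leader key, and then a multi-key update only ever touches that one node.

```python
nodes = [dict() for _ in range(4)]            # stand-ins for storage nodes

def node_for(key):
    return nodes[hash(key) % len(nodes)]

# The group collocates control for all member keys at the leader's node.
group = {"leader": "user:42",
         "members": {"user:42", "user:42:inbox", "user:42:settings"}}
home = node_for(group["leader"])

def group_update(updates):
    # All keys live on one node, so the update needs no cross-node coordination.
    for key, value in updates.items():
        assert key in group["members"]
        home[key] = value

group_update({"user:42:inbox": ["hello"], "user:42:settings": {"theme": "dark"}})
print(home)
```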

I created a category for G-Store but had to add “(Multikey)” because in looking for more details on it, I encountered another G-Store that you will be interested in.

KitaroDB [intrusive-keyed database]

Thursday, November 1st, 2012

KitaroDB

From the “What is KitaroDB” page:

KitaroDB is a fast, efficient and scalable NoSQL database that operates natively in the WinRT (Windows 8 UI), the Win32 (x86 and x64) and .NET environments. It features:

  • Easy-to-use interfaces for C#, VB, C++, C and HTML5/JavaScript developers;
  • A proven commercial database system;
  • Support of large-sector drives;
  • Minimal overhead, consuming less than a megabyte of memory resources;
  • Durability as on-disk data store;
  • High performance, capable of handling tens of thousands of operations per second;
  • Asynchronous and synchronous operations;
  • Storage limit of 100 TB;
  • Flexibility as either a key/value data store or an intrusive key database with segmented key support.

The phrase “intrusive-keyed database” was unfamiliar.

Keys can be segmented into up to 255 segments with what appears to be a fairly limited range of data types. Some of the KitaroDB documentation on intrusive-keyed databases is found here: Creating a KitaroDB database.

Segmented keys aren’t unique to KitraoDB and they offer some interesting possibilities. More to follow on that score.
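
A generic illustration of why segmented keys are interesting (this is not KitaroDB’s API): pack several typed segments into one ordered byte key, and a prefix scan gives you cheap “everything under this partial key” queries.

```python
import struct

def make_key(*segments):
    parts = []
    for seg in segments:
        if isinstance(seg, int):
            parts.append(struct.pack(">I", seg))      # big-endian preserves numeric order
        else:
            parts.append(seg.encode() + b"\x00")      # null-terminated string segment
    return b"".join(parts)

store = {}                                            # stand-in for any ordered key/value store
store[make_key("order", 2013, 42)] = b"widgets"
store[make_key("order", 2013, 43)] = b"gadgets"
store[make_key("order", 2012, 7)] = b"old stuff"

prefix = make_key("order", 2013)
for key in sorted(store):                             # range scan over one segment prefix
    if key.startswith(prefix):
        print(store[key])
```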

Storage being limited to 100 TB should not be an issue for “middling” data sized applications. ;-)

NoSQL Bibliographic Records:…

Tuesday, October 30th, 2012

NoSQL Bibliographic Records: Implementing a Native FRBR Datastore with Redis by Jeremy Nelson.

From the background:

Using the Library of Congress Bibliographic Framework for the Digital Age as the starting point for software development requirements; the FRBR-Redis-Datastore project is a proof-of-concept for a next-generation bibliographic NoSQL system within the context of improving upon the current MARC catalog and digital repository of a small academic library at a top-tier liberal arts college.

The FRBR-Redis-Datastore project starts with a basic understanding of the MARC, MODS, and FRBR implemented using a NoSQL technology called Redis.

This presentation guides you through the theories and technologies behind one such proof-of-concept bibliographic framework for the 21st century.

I found the answer to “Well, Why Not Hadoop?”

Hadoop was just too complicated compared to the simple three-step Redis server set-up.

refreshing.

Simply because a technology is popular doesn’t mean it meets your requirements. Such as administration by non-full time technical experts.

An Oracle database supports applications that could manage garden club finances but that’s a poor choice under most circumstances.

The Redis part of the presentation is apparently not working (I get Python errors) as of today and I have sent a note with the error messages.

A “proof-of-concept” that merits your attention!
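
If you want to see why the Redis route is so approachable, here is a tiny sketch of a FRBR-ish record in redis-py; the key layout and field names are mine, not the project’s.

```python
import redis

r = redis.StrictRedis()

# A Work as a hash, plus a set linking it to its Expressions.
r.hset("frbr:work:1", "title", "Moby Dick")
r.hset("frbr:work:1", "creator", "Herman Melville")
r.sadd("frbr:work:1:expressions", "frbr:expression:7")

print(r.hgetall("frbr:work:1"))
print(r.smembers("frbr:work:1:expressions"))
```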

SPARQL and Big Data (and NoSQL) [Identifying Winners and Losers – Cui Bono?]

Saturday, October 27th, 2012

SPARQL and Big Data (and NoSQL) by Bob DuCharme.

From the post:

How to pursue the common ground?

I think it’s obvious that SPARQL and other RDF-related technologies have plenty to offer to the overlapping worlds of Big Data and NoSQL, but this doesn’t seem as obvious to people who focus on those areas. For example, the program for this week’s Strata conference makes no mention of RDF or SPARQL. The more I look into it, the more I see that this flexible, standardized data model and query language align very well with what many of those people are trying to do.

But, we semantic web types can’t blame them for not noticing. If you build a better mouse trap, the world won’t necessarily beat a path to your door, because they have to find out about your mouse trap and what it does better. This requires marketing, which requires talking to those people in language that they understand, so I’ve been reading up on Big Data and NoSQL in order to better appreciate what they’re trying to do and how.

A great place to start is the excellent (free!) booklet Planning for Big Data by Edd Dumbill. (Others contributed a few chapters.) For a start, he describes data that “doesn’t fit the strictures of your database architectures” as a good candidate for Big Data approaches. That’s a good start for us. Here are a few longer quotes that I found interesting, starting with these two paragraphs from the section titled “Ingesting and Cleaning” after a discussion about collecting data from multiple different sources (something else that RDF and SPARQL are good at):

Bob has a very good point: marketing “…requires talking to those people in language that they understand….”

That is, no matter how “good” we think a solution may be, it won’t interest others until we explain it in terms they “get.”

But “marketing” requires more than a lingua franca.

Once an offer is made and understood, it must interest the other person. Or it is very poor marketing.

We may think that any sane person would jump at the chance to reduce the time and expense of data cleaning. But that isn’t necessarily the case.

I once made a proposal that would substantially reduce the time and expense for maintaining membership records. Records that spanned decades and were growing every year (hard copy). I made the proposal, thinking it would be well received.

Hardly. I was called into my manager’s office and got a lecture on how the department in question had more staff, a larger budget, etc., than any other department. They had no interest whatsoever in my proposal and that I should not presume to offer further advice. (Years later my suggestion was adopted when budget issues forced the issue.)

Efficient information flow interested me but not management.

Bob and the rest of us need to ask the traditional question: Cui bono? (To whose benefit?)

Semantic technologies, just like any other, have winners and losers.

To effectively market our wares, we need to identify both.

Metamarkets open sources distributed database Druid

Friday, October 26th, 2012

Metamarkets open sources distributed database Druid by Elliot Bentley.

From the post:

It’s no secret that the latest challenge for the ‘big data’ movement is moving from batch processing to real-time analysis. Metamarkets, who provide “Data Science-as-a-Service” business analytics, last year revealed details of in-house distributed database Druid – and have this week released it as an open source project.

Druid was designed to solve the problem of a database which allows multi-dimensional queries on data as and when it arrives. The company originally experimented with both relational and NoSQL databases, but concluded they were not fast enough for their needs and so rolled out their own.

The company claims that Druid’s scan speed is “33M rows per second per core”, able to ingest “up to 10K incoming records per second per node”. An earlier blog post outlines how the company managed to achieve scan speeds of 26B records per second using horizontal scaling. It does this via a distributed architecture, column orientation and bitmap indices.

It was exciting to read about Druid last year.

Now to see how exciting Druid is in fact!

Source code: https://github.com/metamx/druid
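
The column orientation and bitmap indices are doing a lot of the work in those scan numbers. A toy illustration of the bitmap-index idea (nothing to do with Druid’s actual code): keep one bitmap per dimension value, and a filter becomes a bitwise AND.

```python
rows = [
    {"country": "US", "device": "mobile"},
    {"country": "DE", "device": "desktop"},
    {"country": "US", "device": "desktop"},
]

# One integer-as-bitmap per (dimension, value): bit i is set if row i has that value.
bitmaps = {}
for i, row in enumerate(rows):
    for dim, value in row.items():
        bitmaps[(dim, value)] = bitmaps.get((dim, value), 0) | (1 << i)

# WHERE country = 'US' AND device = 'desktop' is just an AND of two bitmaps.
matches = bitmaps[("country", "US")] & bitmaps[("device", "desktop")]
print([rows[i] for i in range(len(rows)) if matches & (1 << i)])
```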

Redis 2.6.2 Released!

Friday, October 26th, 2012

Redis 2.6.2 Released!

From the introduction to Redis:

Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.

You can run atomic operations on these types, like appending to a string; incrementing the value in a hash; pushing to a list; computing set intersection, union and difference; or getting the member with highest ranking in a sorted set.

In order to achieve its outstanding performance, Redis works with an in-memory dataset. Depending on your use case, you can persist it either by dumping the dataset to disk every once in a while, or by appending each command to a log.

Redis also supports trivial-to-setup master-slave replication, with very fast non-blocking first synchronization, auto-reconnection on net split and so forth.

Other features include a simple check-and-set mechanism, pub/sub and configuration settings to make Redis behave like a cache.

You can use Redis from most programming languages out there.

Redis is written in ANSI C and works in most POSIX systems like Linux, *BSD, OS X without external dependencies. Linux and OSX are the two operating systems where Redis is developed and more tested, and we recommend using Linux for deploying. Redis may work in Solaris-derived systems like SmartOS, but the support is best effort. There is no official support for Windows builds, although you may have some options.

The “in-memory” nature of Redis will be a good excuse for more local RAM. ;-)

I noticed the most recent release of Redis at Alex Popescu’s myNoSQL.
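
The “data structure server” idea is easiest to see in a few lines of redis-py. A quick sketch of the operations the description mentions (key names are mine):

```python
import redis

r = redis.StrictRedis()

r.append("log", "hello ")                    # append to a string
r.hincrby("page:home", "visits", 1)          # increment a value inside a hash

r.sadd("likes:film", "alice", "bob")
r.sadd("likes:music", "bob", "carol")
print(r.sinter("likes:film", "likes:music")) # set intersection -> {'bob'}
```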

Spanner – …SQL Semantics at NoSQL Scale

Monday, October 22nd, 2012

Spanner – It’s About Programmers Building Apps Using SQL Semantics at NoSQL Scale by Todd Hoff.

From the post:

A lot of people seem to passionately dislike the term NewSQL, or pretty much any newly coined term for that matter, but after watching Alex Lloyd, Senior Staff Software Engineer Google, give a great talk on Building Spanner, that’s the term that fits Spanner best.

Spanner wraps the SQL + transaction model of OldSQL around the reworked bones of a globally distributed NoSQL system. That seems NewSQL to me.

As Spanner is a not so distant cousin of BigTable, the NoSQL component should be no surprise. Spanner is charged with spanning millions of machines inside any number of geographically distributed datacenters. What is surprising is how OldSQL has been embraced. In an earlier 2011 talk given by Alex at the HotStorage conference, the reason for embracing OldSQL was the desire to make it easier and faster for programmers to build applications. The main ideas will seem quite familiar:

  • There’s a false dichotomy between little complicated databases and huge, scalable, simple ones. We can have features and scale them too.
  • Complexity is conserved, it goes somewhere, so if it’s not in the database it’s pushed to developers.
  • Push complexity down the stack so developers can concentrate on building features, not databases, not infrastructure.
  • Keys for creating a fast-moving app team: ACID transactions; global Serializability; code a 1-step transaction, not 10-step workflows; write queries instead of code loops; joins; no user defined conflict resolution functions; standardized sync; pay as you go, get what you pay for predictable performance.

Spanner did not start out with the goal of becoming a NewSQL star. Spanner started as a BigTable clone, with a distributed file system metaphor. Then Spanner evolved into a global ProtocolBuf container. Eventually Spanner was pushed by internal Google customers to become more relational and application programmer friendly.

If you can’t stay for the full show, Todd provides a useful summary of the video. But if you have the time, take the time to enjoy the presentation!