Archive for the ‘Database’ Category

InfoWord Bossie 2017 Awards Databases & Analytics

Wednesday, September 27th, 2017

InfoWorld’s Bossie awards for 2017 for databases and analytics.

In true InfoWorld fashion, the winners were in no particular order, one per slide and presented as images to prevent copy-n-paste.

Let’s re-arrange those “facts” for the time-pressed reader:

Hyperlinks are to the projects, the best information you will find for each one.

Enjoy!

Google Spanner Inspires CockroachDB To Outrun It

Thursday, February 23rd, 2017

Google Spanner Inspires CockroachDB To Outrun It by Timothy Prickett Morgan.

From the post:

There is an old joke that in the post-apocalyptic world that comes about because of plague or nuclear war, only two things will be left alive: cockroaches and Keith Richards, the guitarist for the Rolling Stones. As it hails from New York City, you can understand why Cockroach Labs, the upstart software company that is cloning Google’s Spanner distributed relational database, chose that particular bug to epitomize a system that will stay alive no matter what. But, they could have just as easily called it RichardsDB.

When discussing Google’s cloud implementation of Spanner, which launched in beta earlier this week, we promised that we would have a sit down with the people behind the CockroachDB effort, who not coincidentally all hail from the search engine giant and who all worked on key parts of the software infrastructure that sets Google apart. We think that CockroachDB has the potential to shake up the database market, and not just because the people behind it understand deeply what Google did but more importantly because they have come to understand the very different approach that is necessary to commercialize the ideas that are embodied in Spanner and that are being coded into a free and open source CockroachDB.

Cockroach Labs has the potential to be Yahoo’s Hadoop and HDFS to Google’s MapReduce and Google File System, but in a database market that is ripe for change and among enterprise customers who want simplicity from complex systems and they want them to scale instantly, easily, and geographically.

“The backbreaking work is making CockroachDB deployable,” Kimball explains. “In fact, CockroachDB is more deployable than any other distributed database out there. It is an incredibly simple model in that every single node – we call them roach nodes – is symmetric and self-contained with a single binary. There are no external dependencies, so you don’t need Zookeeper running, you don’t need etcd or a distributed or shared file system underneath like Oracle needs with its Real Application Clusters. It is simple as taking one statically linked binary with no dependencies and putting that out on some VM somewhere and pointing it at any other node in a CockroachDB cluster and it joins up and they self organize. So making that work with a single binary was a huge challenge and a big departure from the way Google chose to build Spanner.”

Another big difference is support for SQL, and Kimball had plenty to say about this.

“This is a really interesting point,” he says excitedly. “We have had to build CockroachDB to be deployable as open source in a way that doesn’t need TrueTime and that meets all of the needs of these companies. With open source and there being such a low barrier to try it out and kick the tires, you really need to have a very comprehensive, compatible interface with the outside world. You can’t really cut corners. So we decided to make CockroachDB look like Postgres because it is a better thought out database than MySQL. The nice thing about making it look like Postgres is that all of the applications and languages have drivers for it.”

There are other capabilities built into CockroachDB but being deployable (compared to other distributed databases) and looking like Postgres are keys to a large marketplace share.

The gap between software available to governments/industry and individuals/small groups continues to shrink.

Only you can answer the question of your skill versus theirs.

Merry Christmas To All Astronomers! (Pan-STARRS)

Tuesday, December 20th, 2016

The Panoramic Survey Telescopes & Rapid Response System (Pan-STARRS) dropped its data release on December 19, 2016.

Realizing you want to jump straight to the details, check out: PS1 Data Processing procedures.

There is far more to be seen but here’s a shot of the sidebar:

ps1-sidebar-460

Jim Gray favored the use of astronomical data because it was “big” (this was before “big data” became marketing hype) and it is free.

Enjoy!

RDBL – manipulate data in-database with R code only

Monday, August 29th, 2016

RDBL – manipulate data in-database with R code only by David Kun.

From the post:

In this post I introduce our own package RDBL, the R DataBase Layer. With this package you can manipulate data in-database without writing SQL code. The package interprets the R code and sends out the corresponding SQL statements to the database, fully transparently. To minimize overhead, the data is only fetched when absolutely necessary, allowing the user to create the relevant joins (merge), filters (logical indexing) and groupings (aggregation) in R code before the SQL is run on the database. The core idea behind RDBL is to let R users with little or no SQL knowledge to utilize the power of SQL database engines for data manipulation.

It is important to note that the SQL statements generated in the background are not executed unless explicitly requested by the command as.data.frame. Hence, you can merge, filter and aggregate your dataset on the database side and load only the result set into memory for R.

In general the design principle behind RDBL is to keep the models as close as possible to the usual data.frame logic, including (as shown later in detail) commands like aggregate, referencing columns by the \($\) operator and features like logical indexing using the \([]\) operator.

RDBL supports a connection to any SQL-like data source which supports a DBI interface or an ODBC connection, including but not limited to Oracle, MySQL, SQLite, SQL Server, MS Access and more.

Not as much fun as surfing mall wifi for logins/passwords, but it is something you can use at work.

The best feature is that you load resulting data sets only. RDBL uses databases for what they do well. Odd but efficient practices do happen from time to time.

I first saw this in a tweet by Christophe Lalanne.
Enjoy!

Databases are categories

Tuesday, April 19th, 2016

Databases are categories by David I. Spivak.

Slides from a presentation 2010/06/03.

If you are more comfortable with databases than category theory, you may want to give these a spin.

I looked but was unable to locate video of the presentation. That would be a nice addition.

Enjoy!

Readings in Database Systems, 5th Edition (Kindle Stuffer)

Tuesday, December 15th, 2015

Readings in Database Systems, 5th Edition, Peter Bailis, Joseph M. Hellerstein, Michael Stonebraker, editors.

From the webpage:

  1. Preface [HTML] [PDF]
  2. Background introduced by Michael Stonebraker [HTML] [PDF]
  3. Traditional RDBMS Systems introduced by Michael Stonebraker [HTML] [PDF]
  4. Techniques Everyone Should Know introduced by Peter Bailis [HTML] [PDF]
  5. New DBMS Architectures introduced by Michael Stonebraker [HTML] [PDF]
  6. Large-Scale Dataflow Engines introduced by Peter Bailis [HTML] [PDF]
  7. Weak Isolation and Distribution introduced by Peter Bailis [HTML] [PDF]
  8. Query Optimization introduced by Joe Hellerstein [HTML] [PDF]
  9. Interactive Analytics introduced by Joe Hellerstein [HTML] [PDF]
  10. Languages introduced by Joe Hellerstein [HTML] [PDF]
  11. Web Data introduced by Peter Bailis [HTML] [PDF]
  12. A Biased Take on a Moving Target: Complex Analytics
    by Michael Stonebraker [HTML] [PDF]
  13. A Biased Take on a Moving Target: Data Integration
    by Michael Stonebraker [HTML] [PDF]

Complete Book: [HTML] [PDF]

Readings Only: [HTML] [PDF]

Previous Editions: [HTML]

Citations to the “reading” do not present themselves as hyperlinks but they are.

If you are giving someone a Kindle this Christmas, consider pre-loading Readings in Database Systems, along with the readings as a Kindle stuffer.

A Certain Tendency Of The Database Community

Tuesday, October 27th, 2015

A Certain Tendency Of The Database Community by Christopher Meiklejohn.

From the post:

Abstract

We posit that striving for distributed systems that provide “single system image” semantics is fundamentally flawed and at odds with how systems operate in the physical world. We realize the database as an optimization of this system: a required, essential optimization in practice that facilitates central data placement and ease of access to participants in a system. We motivate a new model of computation that is designed to address the problems of computation over “eventually consistent” information in a large-scale distributed system.

Eventual Consistency

When we think about the world we live in, we do not usually say it is eventually consistent, for this is a term usually applied to computing systems, made up of multiple machines, that have to operate with shared information.

Eventual consistency is a consistency model for replicated, shared state. A consistency model is a contract between an application developer and a system that application will run on. A contract between a developer and a system states the following: given the developer follows the rules defined by the system, certain outcomes from the system are guaranteed. This makes it possible for developers to build successful applications, for without this contract, applications would have no guarantee that the actions they perform would have a correct outcome.

(italics in original)

A very accessible and great read on “eventual consistency.”

Christopher points out that any “state” of knowledge is a snapshot under a given set of constraints:

For instance, if the leading researchers on breast cancer were to document the state-of-the-art in a book, as the document is being written it would no longer reflect the state-of-the-art. The collective knowledge of this group is always changing, and as long as we continue to rewrite the document it will only be approaching the combined knowledge of the group. We can think of this somewhat formally: if we had a way to view the group’s knowledge as a omniscient observer and we represent that knowledge as a linear function, the recorded text would be asymptotic to function of the sum of global knowledge.

He concludes with this question:

…Can we build computational abstractions that allow devices to communicate peer-to-peer, acknowledging the true source of truth for a particular piece of information and scale to the amount of information that exists, not only between all computers in a planetary-scale distributed system, but all entities in the universe[?]

I’m not sure about “all entities in the universe,” or even a “planetary-scale distributed system,” but we do know that Netware Directory Services (NDS) (now eDirectory) was a replicated, distributed, sharded database with eventual convergence that was written in 1993.

We have had the computational abstractions for a replicated, distributed, sharded database with eventual convergence for a number of years.

I would adjust Christopher’s “true source of truth,” for “source of truth as defined by users,” to avoid the one-world-truth position that crippled the Semantic Web even before FOL and RDF syntax arrived.

OpenTSDB 2.0.1

Sunday, March 29th, 2015

OpenTSDB 2.0.1

From the homepage:

Store

  • Data is stored exactly as you give it
  • Write with millisecond precision
  • Keep raw data forever

Scale

  • Runs on Hadoop and HBase
  • Scales to millions of writes per second
  • Add capacity by adding nodes

Read

  • Generate graphs from the GUI
  • Pull from the HTTP API
  • Choose an open source front-end

If that isn’t impressive enough, check out the features added for the 2.0 release:

OpenTSDB has a thriving community who contributed and requested a number of new features. 2.0 has the following new features:

  • Lock-less UID Assignment – Drastically improves write speed when storing new metrics, tag names, or values
  • Restful API – Provides access to all of OpenTSDB’s features as well as offering new options, defaulting to JSON
  • Cross Origin Resource Sharing – For the API so you can make AJAX calls easily
  • Store Data Via HTTP – Write data points over HTTP as an alternative to Telnet
  • Configuration File – A key/value file shared by the TSD and command line tools
  • Pluggable Serializers – Enable different inputs and outputs for the API
  • Annotations – Record meta data about specific time series or data points
  • Meta Data – Record meta data for each time series, metrics, tag names, or values
  • Trees – Flatten metric and tag combinations into a single name for navigation or usage with different tools
  • Search Plugins – Send meta data to search engines to delve into your data and figure out what’s in your database
  • Real-Time Publishing Plugin – Send data to external systems as they arrive to your TSD
  • Ingest Plugins – Accept data points in different formats
  • Millisecond Resolution – Optionally store data with millisecond precision
  • Variable Length Encoding – Use less storage space for smaller integer values
  • Non-Interpolating Aggregation Functions – For situations where you require raw data
  • Rate Counter Calculations – Handle roll-over and anomaly supression
  • Additional Statistics – Including the number of UIDs assigned and available

I suppose traffic patterns (license plates) are a form of time series data. Yes?

The Theory of Relational Databases

Sunday, March 29th, 2015

The Theory of Relational Databases by David Maier.

From the webpage:

This text has been long out of print, but I still get requests for it. The copyright has reverted to me, and you have permission to reproduce it for personal or academic use, but not for-profit purposed. Please include “Copyright 1983 David Maier, used with permission” on anything you distribute.

Out of date, 1983, if you are looking for the latest work but not if you are interested in where we have been. Sometimes the later is more important than the former.

Enjoy!

Figures Don’t Lie, But Liars Can Figure

Thursday, March 12th, 2015

A pair of posts that you may find amusing on the question of “free” and “cheaper.”

HBase is Free but Oracle NoSQL Database is cheaper

When does “free” challenge that old adage, “You get what you pay for”?

Two brief quotes from the first post set the stage:

How can Oracle NoSQL Database be cheaper than “free”? There’s got to be a catch. And of course there is, but it’s not where you are expecting. The problem in that statement isn’t with “cheaper” it’s with “free”.

An HBase solution isn’t really free because you do need hardware to run your software. And when you need to scale out, you have to look at the how well the software scales. Oracle NoSQL Database scales much better than HBase which translated in this case to needing much less hardware. So, yes, it was cheaper than free. Just be careful when somebody says software is free.

The second post tries to remove the vendor (Oracle) from the equation:

Read-em and weep …. NOT according to Oracle, HBase does not take advantage of SSD’s anywhere near the extent with which Oracle NoSQL does … couldn’t even use the same scale on the vertical bar.

SanDisk on HBase with SSD

SanDisk on Oracle NoSQL with SSD

And so the question remains, when does “free” challenge the old adage “you get what you pay for”, because in this case, the adage continues to hold up.

And as the second post notes, Oracle has committed code back to the HBase product so it isn’t unfamiliar to them.

First things first, the difficulty that leads to these spats is using “cheap,” “free,” “scalable,” “NoSQL,” etc. as the basis for IT marketing or decision making. That may work with poorer IT decision makers and however happy it makes the marketing department, it is just noise. Noise that is a disservice to IT consumers.

Take “cheaper,” and “free” as used in these posts. Is hardware really the only cost associated with HBase or Oracle installations? If it is, I have been severely misled.

On the Hbase expense side I would expect to find HBase DBAs, maintenance of those personnel, hardware (+maintenance), programmers, along with use case requirements that must be met.

On the Oracle expense side I would expect to find Oracle DBAs, maintenance of those personnel, Oracle software licensing, hardware (+maintenance), programmers, along with use case requirements that must be met.

Before you jump to my listing “Oracle software licensing,” consider how that will impact the availability of appropriate personnel, the amount of training needed to introduce new IT staff to HBase, etc.

Not to come down too hard for Oracle, Oracle DBAs and their maintenance aren’t cheap, nor are some of the “features” of Oracle software.

Truth be told there is a role for project requirements, experience of current IT personnel, influence IT has over the decision makers, and personal friendships of decision makers in any IT decision making.

To be very blunt, IT decision making is just as political as any other enterprise decision.

Numbers are a justification for a course chosen for other reasons. As a user I am always more concerned with my use cases being met than numbers. Aren’t you?

Fifty Words for Databases

Saturday, March 7th, 2015

Fifty Words for Databases by Phil Factor

From the post:

Almost every human endeavour seems simple from a distance: even database deployment. Reality always comes as a shock, because the closer you get to any real task, the more you come to appreciate the skills that are necessary to accomplish it.

One of the big surprises I have when I attend developer conferences is to be told by experts how easy it is to take a database from development and turn it into a production system, and then implement the processes that allow it to be upgraded safely. Occasionally, I’ve been so puzzled that I’ve drawn the speakers to one side after the presentation to ask them for the details of how to do it so effortlessly, mentioning a few of the tricky aspects I’ve hit. Invariably, it soon becomes apparent from their answers that their experience, from which they’ve extrapolated, is of databases the size of a spreadsheet with no complicated interdependencies, compliance issues, security complications, high-availability mechanisms, agent tasks, alerting systems, complex partitioning, queuing, replication, downstream analysis dependencies and so on about which you, the readers, know more than I. At the vast international enterprise where I once worked in IT, we had a coded insult for such people: ‘They’ve catalogued their CD collection in a database’. Unfair, unkind, but even a huge well-used ‘Big Data’ database dealing in social media is a tame and docile creature compared with a heavily- used OLTP trading system where any downtime or bug means figures for losses where you have to count the trailing zeros. The former has unique problems, of course, but the two types of database are so different.

I wonder if the problem is one of language. Just as the English have fifty ways of describing rainfall, and the Inuit have many ways of describing pack ice, it is about time that we created the language for a variety of databases from a mild drizzle (‘It is a soft morning to be sure’) to a cloud-burst. Until anyone pontificating about the database lifecycle can give their audience an indication of the type of database they’re referring to, we will continue to suffer the sort of misunderstandings that so frustrate the development process. Though I’m totally convinced that the development culture should cross-pollinate far more with the science of IT operations, It will need more than a DevOps group-hug; it will require a change in the technical language so that it can accurately describe the rich variety of databases in operational use and their widely- varying requirements. The current friction is surely due more to misunderstandings on both sides, because it is so difficult to communicate these requirements. Any suggestions for suitable descriptive words for types of database? (emphasis added)

If you have “descriptive words” to suggest to Phil, comment on his post.

With the realization that your “descriptive words” may be different from my “descriptive words” for the same database or mean a different database altogether or have nothing to do with databases at all (when viewed by others).

Yes, I have been thinking about identifiers, again, and will start off the coming week with a new series of posts on subject identification. I hope to include a proposal for a metric of subject identification.

Indexed Database API Proposed Recommendation Published (Review Opportunity)

Thursday, November 20th, 2014

Indexed Database API Proposed Recommendation Published

From the post:

The Web Applications Working Group has published a Proposed Recommendation of Indexed Database API. This document defines APIs for a database of records holding simple values and hierarchical objects. Each record consists of a key and some value. Moreover, the database maintains indexes over records it stores. An application developer directly uses an API to locate records either by their key or by using an index. A query language can be layered on this API. An indexed database can be implemented using a persistent B-tree data structure. Comments are welcome through 18 December. Learn more about the Rich Web Client Activity.

If you have the time between now and 18 December, this is a great opportunity to “get your feet wet” reviewing W3C recommendations.

The Indexed Database API document isn’t long (43 pages approximately) and you are no doubt already familiar with databases in general.

An example to get you started:

3.2 APIs

The API methods return without blocking the calling thread. All asynchronous operations immediately return an IDBRequest instance. This object does not initially contain any information about the result of the operation. Once information becomes available, an event is fired on the request and the information becomes available through the properties of the IDBRequest instance.

  1. When you read:

    The API methods return without blocking the calling thread.

    How do you answer the question: What is being returned by an API method?

  2. What is your answer after reading the next sentence?

    All asynchronous operations immediately return an IDBRequest instance.

  3. Does this work?

    API methods return an IDBRequest instance without blocking the calling thread.

    (Also correcting the unnecessary definite article “The.”)

  4. One more issue with the first sentence is:

    …without blocking the calling thread.

    If you search the document, there is no other mention of calling threads.

    I suspect this is unnecessary and would ask for its removal.

    So, revised the first sentence would read:

    API methods return an IDBRequest instance.

  5. Maybe, except that the second sentence says “All asynchronous operations….”

    When you see a statement of “All …. operations…,” you should look for a definition of those operations.

    I have looked and while “asynchronous” is used thirty-four (34) times, “asynchronous operations” is used only once.

    (more comments on “asynchronous” below)

  6. I am guessing you caught the “immediately” problem on your own. Undefined and what other response would there be?

    If we don’t need “asynchronous” and the first sentence is incomplete, is this a good suggestion for 3.2 APIs?

    (Proposed Edit)

    3.2 APIs

    API methods return an IDBRequest instance. This object does not initially contain any information about the result of the operation. Once information becomes available, an event is fired on the request and the information becomes available through the properties of the IDBRequest instance.

    There, your first edit to a W3C Recommendation removed twelve (12) words and made the text clearer.

    Plus there are the other thirty-three (33) instances of “asynchronous” to investigate.

    Not bad for your first effort!


    After looking around the rest of the proposed recommendation, I suspect that “asynchronous” is used to mean that results and requests can arrive and be processed in any order (except for some cases of overlapping scope of operations). It’s simpler just to say that once and not go about talking about “asynchronous requests,” “opening databases asynchronously,” etc.

How does SQLite work? Part 2: btrees!…

Friday, October 3rd, 2014

How does SQLite work? Part 2: btrees! (or: disk seeks are slow don’t do them!) by Julia Evans.

From the post:

Welcome back to fun with databases! In Part 1 of this series, we learned that:

  • SQLite databases are organized into fixed-size pages. I made an example database which had 1k pages.
  • The pages are all part of a kind of tree called a btree.
  • There are two kinds of pages: interior pages and leaf pages. Data is only stored in leaf pages.

I mentioned last time that I put in some print statements to tell me every time I read a page, like this:

I suspect Chris Granger would consider this as “plumbing” that prevents some users from using computation.

Chris would be right, to a degree, but Julia continues to lower the bar that “plumbing” poses to users.

Looking forward to more untangling and clarifying of SQLite plumbing!

Dynamic Columns Tutorial – Part 1: Introduction

Friday, October 3rd, 2014

Dynamic Columns Tutorial – Part 1: Introduction by Max Mether.

From the post:

For certain situations, the static structure of tables in a relational database can be very limited. Each column is statically defined, has a pre-defined type and you can only enter a value of that type into the column.You can be creative and list multiple values in one column, but then those values are not generally easily accessed and manipulated with other functions. You have to use an API or contortions of a function like SUBSTRING() to pull out a value you want. Even then, you have to know what is contained in the column to be able to manipulate it properly. These methods can require too much manual intervention to assess and access the data contained in the column.

If you want to add columns as the information stored in your tables grows and your needs change you need to do fairly expensive ALTER TABLE operations. These have traditionally been very expensive in MySQL and MariaDB although the performance is a bit better starting with MariaDB 10.0 and MySQL 5.6.

The other option for having a flexible structure is to use something like Anchor Modeling . This allows you to have a very flexible schema as adding an attribute basically just means adding a table. The problem with this approach is that you’ll end up with a lot of tables which means a lot of joins when looking for results which can easily become un-manageable, or at least hard to manage.

This is where dynamic columns steps into the picture. A good solution to the static structure problem is to use dynamic columns provided in MariaDB. It allows flexibility within a defined structure, within a column. A Dynamic Column is defined as a BLOB on the DDL level. However, within the BLOB column, you may set arbitrarily and dynamically defined attributes and values–for a maximum of 64k.

Dynamic columns are not in isolation: The usual functions will work fine with the values contained within them. And they can be used as join points for joining to other table as you would normally. This allows you to retain the power of Relational SQL while still mainting a flexibility with regards to your attributes for specific tables where it makes sense.

Probably channeling topic maps when I observe that dynamic columns are associating multiple properties with a subject. 😉

Very interested in seeing how joins are performed using dynamic columns, but that awaits in a future post.

I first saw this in a tweet by MariaDB

Readings in Databases

Thursday, October 2nd, 2014

Readings in Databases by Reynold Xin.

From the webpage:

A list of papers essential to understanding databases and building new data systems. The list is curated and maintained by Reynold Xin (@rxin)

Not a comprehensive list but it is an annotated one, which should enable you to make better choices.

Concludes with reading lists from several major computer science programs.

How does SQLite work? Part 1: pages!

Monday, September 29th, 2014

How does SQLite work? Part 1: pages! by Julia Evans.

From the post:

This evening the fantastic Kamal and I sat down to learn a little more about databases than we did before.

I wanted to hack on SQLite, because I’ve used it before, it requires no configuration or separate server process, I’d been told that its source code is well-written and approachable, and all the data is stored in one file. Perfect!

Following Julia down a rabbit hole to program internals encourages you to venture out on your own!

I can’t say why her posts have that quality, but they do.

Enjoy!

I first saw this in a tweet by FoundationDB.

Understanding weak isolation is a serious problem

Wednesday, September 17th, 2014

Understanding weak isolation is a serious problem by Peter Bailis.

From the post:

Modern transactional databases overwhelmingly don’t operate under textbook “ACID” isolation, or serializability. Instead, these databases—like Oracle 11g and SAP HANA—offer weaker guarantees, like Read Committed isolation or, if you’re lucky, Snapshot Isolation. There’s a good reason for this phenomenon: weak isolation is faster—often much faster—and incurs fewer aborts than serializability. Unfortunately, the exact behavior of these different isolation levels is difficult to understand and is highly technical. One of 2008 Turing Award winner Barbara Liskov’s Ph.D. students wrote an entire dissertation on the topic, and, even then, the definitions we have still aren’t perfect and can vary between databases.

To put this problem in perspective, there’s a flood of interesting new research that attempts to better understand programming models like eventual consistency. And, as you’re probably aware, there’s an ongoing and often lively debate between transactional adherents and more recent “NoSQL” upstarts about related issues of usability, data corruption, and performance. But, in contrast, many of these transactional inherents and the research community as a whole have effectively ignored weak isolation—even in a single server setting and despite the fact that literally millions of businesses today depend on weak isolation and that many of these isolation levels have been around for almost three decades.2

That debates are occurring without full knowledge of the issues at hand isn’t all that surprising. Or as Job 38:2 (KJV) puts it: “Who is this that darkeneth counsel by words without knowledge?”

Peter raises a number of questions and points to resources that are good starting points for investigation of weak isolation.

What sort of weak isolation does your topic map storage mechanism use?

I first saw this in a tweet by Justin Sheehy.

The Lesser Known Normal Forms of Database Design

Friday, September 12th, 2014

The Lesser Known Normal Forms of Database Design by John Myles White.

A refreshing retake on normal forms of database design!

Enjoy!

VLDB – Volume 7, 2013-2014

Saturday, August 9th, 2014

Proceedings of the Very Large Data Bases, Volume 7, 2013-2014.

You are likely already aware of the VLDB proceedings but after seeing the basis for Summingbird:… [VLDB 2014], I was reminded that I should have a tickler to check updates on the VLDB proceedings every month. August of 2014 (Volume 7, No. 12) landed a few days ago and it looks quite good.

Two tidbits to tease you into visiting:

Akash Das Sarma, Yeye He, Surajit Chaudhuri: ClusterJoin: A Similarity Joins Framework using Map-Reduce. 1059 – 1070.

Norases Vesdapunt, Kedar Bellare, Nilesh Dalvi: Crowdsourcing Algorithms for Entity Resolution. 1071 – 1082.

I count twenty-six (26) articles in issue 12 and eighty (80) in issue 13.

Just in case you have run out of summer reading material. 😉

Aerospike goes Open Source

Wednesday, July 2nd, 2014

Aerospike goes Open Source

From the post:

We are excited to announce that the Aerospike database is now open source.

Aerospike’s mission is to disrupt the entire field of databases by offering an addictive proposition: a database literally ten times faster than existing NoSQL solutions, and one hundred times faster than existing SQL solutions. By offering as open source the battle-tested operational database that powers the largest and highest scale companies on the planet, Aerospike will change how applications are architected, and solutions are created!

The code for Aerospike clients and the Aerospike server is published on github. Clients are available under the Apache 2 license so you can use and modify with no restrictions. The server is available under AGPL V3 to protect the long term interests of the community – you are free to use it with no restrictions but if you change the server code, then those code changes must be contributed back.

Aerospike Community Edition has no limits on the number of servers, tps or terabytes of data, and is curated by Aerospike. Use is unlimited and the code is open. We cannot wait to see what you will do with it!

You will have to read the details to decide if Aerospike is appropriate for your requirements.

Among other things, I would focus on statements like:

This layer [Distribution Layer] scales linearly and implements many of the ACID guarantees.

That’s like reading a poorly written standards document. 😉 What does “many of the ACID guarantees” mean exactly?

From the ACID article at Wikipedia I read:

In computer science, ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably.

Jim Gray defined these properties of a reliable transaction system in the late 1970s and developed technologies to achieve them automatically.

I don’t think four (4) requirements count as “many” but my first question would be:

Which of the “many” ACID guarantees does Aerospike not implement? How hard that this be? It has to be one of the four. Yes?

Second question: So, more than three decades after Jim Gray demonstrated how to satisfy all four ACID guarantees, Aerospike doesn’t? Yes?

I’m not denying there may be valid reasons to ignore one or more of the ACID guarantees. But let’s be clear about which ones and the trade-offs that justify it.

I first saw this in a tweet by Charles Ditzel.

On Lowered Expectations:…

Monday, June 9th, 2014

On Lowered Expectations: Transactions, Scaling, and Honesty by Jennifer Rullmann.

Jennifer reviews Ted Dunning’s list of what developers should demand from database vendors and then adds one more:

But I think the most important thing that developers need from database vendors is missing: honesty. I spent 30 minutes yesterday on a competitor’s website just trying to figure out if they really do ACID, and after four months in the industry, I know quite a bit more about what to look for than most application developers. It’s ridiculous how hard it is to figure out even the most basic things about the vast majority of databases on the market. I feel really strongly about this, so I’ll say it again:

The number one thing we need from database vendors is honesty.

(emphasis in original)

I am sure there are vendors who invent definitions of “hyperedge” and claim to support Unicode when they really support “tick-Unicode,” that is a Unicode character preceded by a “`.”

Beyond basic honesty, I read Jennifer’s complaint as being about the lack of good documentation for database offerings. A lack that is well known.

I doubt developers are suddenly going to start writing high quality documentation for their software. Or at least after decades of not doing so, it seems unlikely.

But that doesn’t mean we are doomed to bad documentation. What if a database vendor decided to document databases comparable to their own? Not complete, not a developer’s guide but ferreting out and documenting basic information comparable databases.

Like support for ACID.

Would take time to develop the data and credibility, but in the long run, whose product would you trust more?

A vendor whose database capabilities are hidden behind smoke and mirrors or a vendor who is honest about themselves and others?

Categorical Databases

Thursday, May 29th, 2014

Categorical Databases by David I. Spivak.

From Slide 2 of 58:

There is a fundamental connection between databases and categories.

  • Category theory can simplify how we think about and use databases.
  • We can clearly see all the working parts and how they fit together.
  • Powerful theorems can be brought to bear on classical DB problems.

The slides are “text heavy” but I think you will find that helpful rather than a hindrance in this case. 😉

From David Spivak’s homepage:

Purpose: I study information and communication, working towards a mathematical foundation for interoperability.

If you are looking for more motivation to get into category theory, this could be the place to start.

I first saw this in a tweet by Jim Duey.

Postgres-XL

Tuesday, May 13th, 2014

Database vendor open sources Postgres-XL for scale-out workloads by Derrick Harris.

From the post:

Database software company TransLattice has rebuilt the popular PostgreSQL database as a clustered system designed for handling large datasets. The open-source product, called Postgres-XL, is designed to be just like regular Postgres, only more scalable and also functional as a massively parallel analytic database.

That’s interesting news but I puzzled over a comment that Derrick makes:

Postgres-XL is among a handful of recent attempts to turn Postgres into a more fully functional database.

True, there are projects to add features to Postgres that it previously did not have but I would not characterize them as making Postgres “…a more fully functional database.”

PostgreSQL is already a fully functional database, a rather advanced one. It may lack some feature someone imagines as useful, but the jury is still out on whether such added “features” are features or cruft.

Back to the future of databases

Saturday, May 10th, 2014

Back to the future of databases by Yin Wang.

From the post:

Why do we need databases? What a stupid question. I already heard some people say. But it is a legitimate question, and here is an answer that not many people know.

First of all, why can’t we just write programs that operate on objects? The answer is, obviously, we don’t have enough memory to hold all the data. But why can’t we just swap out the objects to disk and load them back when needed? The answer is yes we can, but not in Unix, because Unix manages memory as pages, not as objects. There are systems who lived before Unix that manage memory as objects, and perform object-granularity persistence. That is a feature ahead of its time, and is until today far more advanced than the current state-of-the-art. Here are some pictures of such systems:

Certainly thought provoking but how much of an advantage would object-granularity persistence have to offer before it could make headway against the install base of Unix?

The database field is certainly undergoing rapid research and development, with no clear path to a winner.

Will the same happen with OSes?

Building a Database-backed Clojure Web App…

Saturday, March 8th, 2014

Building a Database-backed Clojure Web App On Top of Heroku Cloud App Platform by Charles Ditzel.

From the post:

Some time ago I wrote a post about Java In the Auto-Scaling Cloud. In the post, I mentioned Heroku. In today’s post, I want to take time to point back to Heroku again, this time with the focus on building web applications. Heroku Dev Center recently posted a great tutorial on building a databased-backed Clojure web application. In this example, a twitter-like app is built that stores “shouts” to a PostgreSQL database. It covers a lot of territory, from connecting to PostgreSQL, to web bindings with Compujure, HTML tempting with Hiccup and assembling the application and testing it. Finally, deploying it.

If you aren’t working on a weekend project already, here is one for your consideration!

MetaModel

Monday, January 27th, 2014

MetaModel

From the Metamodel Wiki:

MetaModel is a library that encapsulates the differences and enhances the capabilities of different datastores. Rich querying abilities are offered to datastores that do not otherwise support advanced querying and a unified view of the datastore structure is offered through a single model of the schemas, tables, columns and relationships.

Also from the MetaModel Wiki, supported data formats:

Relational databases known to be working with MetaModel

Database Version JDBC driver
MySQL 5+ Connector/J
PostgreSQL 8+ PostgreSQL JDBC driver
Oracle 10g SQLJ/JDBC
Apache Derby 10+ Derby driver
Firebird SQL 2.0+ Jaybird driver
Hsqldb/HyperSQL 1.8+ Hsqldb driver
H2 1.2+ H2 driver
SQLite 3.6.0+ Xerial driver
Microsoft SQL Server 2005+ JTDS driver
Ingres

Ingres JDBC driver

Non-relational / NoSQL databases supported by MetaModel

  • MongoDB
  • CouchDB

Business application supported (through system services) by MetaModel

  • Salesforce.com
  • SugarCRM

File data formats supported by MetaModel

File format File extension Version
Comma separated file .csv
Microsoft Excel spreadsheet .xls Excel ’97-2003
. .xlsx Excel 2007+
OpenOffice.org database .odb OpenOffice 2.0+
XML file (SAX based) .xml
XML file (DOM based) .xml
Microsoft Access database .mdb Access ’97-2003
. .accdb Access 2007+
dBase database .dbf

Java object datastores (aka POJO datastores)

MetaModel also supports creating datastores built on top of plain Java objects. Either by using a collection of Java bean objects (with getters and setters) or by using collections of Maps or arrays. In the case of using collections of arrays, you will need to manually appoint column names to each index in the arrays.

Composite datastores

MetaModel supports an advanced feature called composite datastores. In short it means that it’s possible to access and query several datastores as if they where one. This involves transparent client-side joining, filtering, grouping etc. Composite datastores are typically not as performant in terms of querying but provides for a convenient way to combine data that is otherwise inherently separated from each other.

That’s an impressive list but who have they missed?

  • AllegroGraph
  • HBase
  • Hive
  • Neo4j
  • OrientDB

Just as a starter list. How many more can you name?

The Scalable Hyperlink Store

Monday, January 6th, 2014

The Scalable Hyperlink Store by Marc Najork.

Abstract:

This paper describes the Scalable Hyperlink Store, a distributed in-memory “database” for storing large portions of the web graph. SHS is an enabler for research on structural properties of the web graph as well as new link-based ranking algorithms. Previous work on specialized hyperlink databases focused on finding efficient compression algorithms for web graphs. By contrast, this work focuses on the systems issues of building such a database. Specifically, it describes how to build a hyperlink database that is fast, scalable, fault-tolerant, and incrementally updateable.

The design goals call for partitioning because:

…the maximum memory size on commodity machines is limited to a few tens of gigabytes….

So the paper is a bit dated but still instructive in terms of building a hyperlink store.

Consider this background to the notion of a hyperlink store that doesn’t offer a user transit to another site but could return the user the content pointed to by a hyperlink.

The Scalable Hyperlink Store at MS Research has more details and software.

Codd’s Relational Vision…

Thursday, December 12th, 2013

Codd’s Relational Vision – Has NoSQL Come Full Circle? by Doug Turnbull.

From the post:

Recently, I spoke at NoSQL Matters in Barcelona about database history. As somebody with a history background, I was pretty excited to dig into the past, beyond the hype and marketing fluff, and look specifically at what technical problems each generation of database solved and where they in-turn fell short.

However, I got stuck at one moment in time I found utterly fascinating: the original development of relational databases. So much of the NoSQL movement feels like a rebellion against the “old timey” feeling relational databases. So I thought it would be fascinating to be a contrarian, to dig into what value relational databases have added to the world. Something everyone thinks is obvious but nobody really understands.

It’s very easy and popular to criticize relational databases. What folks don’t seem to do is go back and appreciate how revolutionary relational databases were when they came out. We forget what problems they solved. We forget how earlier databases fell short, and how relational databases solved the problems of the first generation of databases. In short, relational databases were the noSomething, and I aimed to find out what that something was.

And from that apply those lessons to today’s NoSQL databases. Are today’s databases repeating mistakes of the past? Or are they filling an important niche (or both?).

This is a must read article if you are not choosing databases based on marketing hype.

It’s nice to hear IT history taken seriously.

UnQLite

Thursday, December 12th, 2013

UnQLite

From the webpage:http://unqlite.org/features.html#self_contained

UnQLite is a in-process software library which implements a self-contained, serverless, zero-configuration, transactional NoSQL database engine. UnQLite is a document store database similar to MongoDB,Redis, CouchDB etc. as well a standard Key/Value store similar to BerkeleyDB,LevelDB, etc.

UnQLite is an embedded NoSQL (Key/Value store and Document-store) database engine. Unlike most other NoSQL databases, UnQLite does not have a separate server process. UnQLite reads and writes directly to ordinary disk files. A complete database with multiple collections, is contained in a single disk file. The database file format is cross-platform, you can freely copy a database between 32-bit and 64-bit systems or between big-endian and little-endian architectures. UnQLite features includes:

Does this have the look and feel of a “…just like a Camry…” commercial? 😉

In case you have been under a rock: 2013 Toyota Camry TV Commercial – “Remote”

Still, you may find it meets your requirements better than others.

BayesDB

Sunday, December 8th, 2013

BayesDB (Alpha 0.1.0 Release)

From the webpage:

BayesDB, a Bayesian database table, lets users query the probable implications of their data as easily as a SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.

BayesDB is suitable for analyzing complex, heterogeneous data tables with up to tens of thousands of rows and hundreds of variables. No preprocessing or parameter adjustment is required, though experts can override BayesDB’s default assumptions when appropriate.

BayesDB’s inferences are based in part on CrossCat, a new, nonparametric Bayesian machine learning method, that automatically estimates the full joint distribution behind arbitrary data tables.

Now there’s an interesting idea!

Not sure if it is a good idea but it certainly is an interesting one.