Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 6, 2012

Introduction to Databases [MOOC, Stanford, January 2013]

Filed under: CS Lectures,Database — Patrick Durusau @ 11:39 am

Introduction to Databases (info/registration link) – Starts January 15, 2013.

From the webpage:

About the Course

“Introduction to Databases” had a very successful public offering in fall 2011, as one of Stanford’s inaugural three massive open online courses. Since then, the course materials have been improved and expanded, and we’re excited to be launching a second public offering of the course in winter 2013. The course includes video lectures and demos with in-video quizzes to check understanding, in-depth standalone quizzes, a wide variety of automatically-checked interactive programming exercises, midterm and final exams, a discussion forum, optional additional exercises with solutions, and pointers to readings and resources. Taught by Professor Jennifer Widom, the curriculum draws from Stanford’s popular Introduction to Databases course.

Why Learn About Databases?

Databases are incredibly prevalent — they underlie technology used by most people every day if not every hour. Databases reside behind a huge fraction of websites; they’re a crucial component of telecommunications systems, banking systems, video games, and just about any other software system or electronic device that maintains some amount of persistent information. In addition to persistence, database systems provide a number of other properties that make them exceptionally useful and convenient: reliability, efficiency, scalability, concurrency control, data abstractions, and high-level query languages. Databases are so ubiquitous and important that computer science graduates frequently cite their database class as the one most useful to them in their industry or graduate-school careers.

Course Syllabus

This course covers database design and the use of database management systems for applications. It includes extensive coverage of the relational model, relational algebra, and SQL. It also covers XML data including DTDs and XML Schema for validation, and the query and transformation languages XPath, XQuery, and XSLT. The course includes database design in UML, and relational design principles based on dependencies and normal forms. Many additional key database topics from the design and application-building perspective are also covered: indexes, views, transactions, authorization, integrity constraints, triggers, on-line analytical processing (OLAP), JSON, and emerging NoSQL systems. Working through the entire course provides comprehensive coverage of the field, but most of the topics are also well-suited for “a la carte” learning.

Biography

Jennifer Widom is the Fletcher Jones Professor and Chair of the Computer Science Department at Stanford University. She received her Bachelors degree from the Indiana University School of Music in 1982 and her Computer Science Ph.D. from Cornell University in 1987. She was a Research Staff Member at the IBM Almaden Research Center before joining the Stanford faculty in 1993. Her research interests span many aspects of nontraditional data management. She is an ACM Fellow and a member of the National Academy of Engineering and the American Academy of Arts & Sciences; she received the ACM SIGMOD Edgar F. Codd Innovations Award in 2007 and was a Guggenheim Fellow in 2000; she has served on a variety of program committees, advisory boards, and editorial boards.

Another reason to take the course:

The structure and capabilities of databases shape the way we create solutions.

Consider normalization: an investment of time and effort that may be needed for some problems, but not for others.

Absent alternative approaches, you see every data problem as requiring normalization.

(You may anyway after taking this course. Education cannot impart imagination.)
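
To make the normalization trade-off concrete, here is a minimal sketch (my own tables, not the course’s): the flat design repeats customer facts on every order; the normalized design stores them once and joins when needed.

  -- Denormalized: customer facts repeated on every order row.
  CREATE TABLE orders_flat (
      order_id      INT PRIMARY KEY,
      customer_name VARCHAR(100),
      customer_city VARCHAR(100),
      order_total   DECIMAL(10,2)
  );

  -- Normalized: customer facts stored once, referenced by key.
  CREATE TABLE customers (
      customer_id   INT PRIMARY KEY,
      customer_name VARCHAR(100),
      customer_city VARCHAR(100)
  );

  CREATE TABLE orders (
      order_id      INT PRIMARY KEY,
      customer_id   INT REFERENCES customers (customer_id),
      order_total   DECIMAL(10,2)
  );

For an update-heavy application the second design avoids anomalies; for a read-mostly reporting load the first may be perfectly serviceable. That is the “for some problems, but not others” part.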

November 20, 2012

Towards a Scalable Dynamic Spatial Database System [Watching Watchers]

Filed under: Database,Geographic Data,Geographic Information Retrieval,Spatial Index — Patrick Durusau @ 5:07 pm

Towards a Scalable Dynamic Spatial Database System by Joaquín Keller, Raluca Diaconu, and Mathieu Valero.

Abstract:

With the rise of GPS-enabled smartphones and other similar mobile devices, massive amounts of location data are available. However, no scalable solutions for soft real-time spatial queries on large sets of moving objects have yet emerged. In this paper we explore and measure the limits of actual algorithms and implementations regarding different application scenarios. And finally we propose a novel distributed architecture to solve the scalability issues.

At least in this version, you will find two copies of the same paper, the second copy sans the footnotes. So read the first twenty (20) pages and ignore the second eighteen (18) pages.

I thought the limitation of location to two dimensions was understandable for the use cases given, but I am less convinced that treating a third dimension as an extra attribute will always be suitable.

Still, the results here are impressive compared to current solutions, so an additional dimension can be a future improvement.

The use case that I see missing is an ad hoc network of users feeding geo-based information back to a collection point.

While the watchers are certainly watching us, technology may be on the cusp of answering the question: “Who watches the watchers?” (The answer may be us.)

I first saw this in a tweet by Stefano Bertolo.

November 1, 2012

SQL-99 Complete, Really

Filed under: Database,SQL — Patrick Durusau @ 5:39 pm

SQL-99 Complete, Really by Peter Gulutzan & Trudy Pelzer.

From the preface:

If you’ve ever used a relational database product, chances are that you’re already familiar with SQL — the internationally-accepted, standard programming language for databases which is supported by the vast majority of relational database management system (DBMS) products available today. You may also have noticed that, despite the large number of “reference” works that claim to describe standard SQL, not a single one provides a complete, accurate and example-filled description of the entire SQL Standard. This book was written to fill that void.

True, this is the SQL-99 standard.

I collect old IT standards and books about old IT standards. The standards we draft today address issues that have been seen before, just not dressed in current fashion.

By attempting to understand what worked and what perhaps didn’t in older standards, we can make new mistakes instead of repeating old ones.

October 30, 2012

Summary and Links for CAP Articles on IEEE Computer Issue

Filed under: CAP,Database — Patrick Durusau @ 4:11 am

Summary and Links for CAP Articles on IEEE Computer Issue by Alex Popescu.

From the post:

Daniel Abadi has posted a quick summary of the articles signed by Eric Brewer, Seth Gilbert and Nancy Lynch, Daniel Abadi, Raghu Ramakrishnan, Ken Birman, Daniel Freedman, Qi Huang, and Patrick Dowell for the IEEE Computer issue dedicated to the CAP theorem. Plus links to most of them:

Be sure to read Daniel’s comments as carefully as you read the IEEE articles.

October 14, 2012

Big data cube

Filed under: BigData,Database,NoSQL — Patrick Durusau @ 7:40 pm

Big data cube by John D. Cook.

From the post:

Erik Meijer’s paper Your Mouse is a Database has an interesting illustration of “The Big Data Cube” using three axes to classify databases.

Enjoy John’s short take, then spend some time with Erik’s paper.

Some serious time with Erik’s paper.

You won’t be disappointed.

October 5, 2012

JugglingDB

Filed under: Database,ORM — Patrick Durusau @ 2:48 pm

JugglingDB

From the webpage:

JugglingDB is a cross-db ORM, providing a common interface to access the most popular database formats. Currently supported are: mysql, mongodb, redis, neo4j and js-memory-storage (yep, self-written engine for test-usage only). You can add your favorite database adapter; check out one of the existing adapters to learn how, it’s super-easy, I guarantee.

For those of you communing with your favourite databases this weekend. 😉

October 4, 2012

PostgreSQL Database Modeler

Filed under: Database,Modeling,PostgreSQL — Patrick Durusau @ 2:22 pm

PostgreSQL Database Modeler

From the readme file at github:

PostgreSQL Database Modeler, or simply, pgModeler is an open source tool for modeling databases that merges the classical concepts of entity-relationship diagrams with specific features that only PostgreSQL implements. pgModeler translates the models created by the user into SQL code and applies them to database clusters from version 8.0 to 9.1.

What other modeling tools have you used, or are you likely to encounter, when writing topic maps?

When the output of diverse modeling tools or diverse output from the same modeling tool needs semantic reconciliation, I would turn to topic maps.

I first saw this at DZone.

September 29, 2012

Amazon RDS Now Supports SQL Server 2012

Filed under: Database,SQL Server — Patrick Durusau @ 3:32 pm

Amazon RDS Now Supports SQL Server 2012

From the post:

The Amazon Relational Database Service (RDS) now supports SQL Server 2012. You can now launch the Express, Web, and Standard Editions of this powerful database from the comfort of the AWS Management Console. SQL Server 2008 R2 is still available, as are multiple versions and editions of MySQL and Oracle Database.

If you are from the Microsoft world and haven't heard of RDS, here's the executive summary: You can run the latest and greatest offering from Microsoft in a fully managed environment. RDS will install and patch the database, make backups, and detect and recover from failures. It will also provide you with a point-and-click environment to make it easy for you to scale your compute resources up and down as needed.

What's New?
SQL Server 2012 supports a number of new features including contained databases, columnstore indexes, sequences, and user-defined roles:

  • A contained database is isolated from other SQL Server databases including system databases such as "master." This isolation removes dependencies and simplifies the task of moving databases from one instance of SQL Server to another.
  • Columnstore indexes are used for data warehouse style queries. Used properly, they can greatly reduce memory consumption and I/O requests for large queries.
  • Sequences are counters that can be used in more than one table.
  • The new user-defined role management system allows users to create custom server roles.

Read the SQL Server What's New documentation to learn more about these and other features.
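
A rough T-SQL sketch of two of those features, sequences and user-defined server roles, plus a columnstore index (my own example tables; syntax as I read the SQL Server 2012 docs, untested on RDS):

  -- A sequence: a counter shared by more than one table.
  CREATE SEQUENCE dbo.OrderNumbers AS INT START WITH 1000 INCREMENT BY 1;

  INSERT INTO dbo.WebOrders  (OrderId) VALUES (NEXT VALUE FOR dbo.OrderNumbers);
  INSERT INTO dbo.MailOrders (OrderId) VALUES (NEXT VALUE FOR dbo.OrderNumbers);

  -- A columnstore index for warehouse-style scans over a fact table.
  CREATE NONCLUSTERED COLUMNSTORE INDEX ix_sales_cs
      ON dbo.SalesFact (SaleDate, ProductId, Amount);

  -- A user-defined server role for a monitoring team.
  CREATE SERVER ROLE monitoring_team;
  GRANT VIEW SERVER STATE TO monitoring_team;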

I almost missed this!

It is about the only way I am going to get to play with SQL Server. I don’t have a local Windows sysadmin to maintain the server, etc.

September 23, 2012

The Cost of Strict Global Consistency [Or Rules for Eventual Consistency]

Filed under: Consistency,Database,Finance Services,Law,Law - Sources — Patrick Durusau @ 10:15 am

What if all transactions required strict global consistency? by Matthew Aslett.

Matthew quotes Basho CTO Justin Sheehy on eventual consistency and traditional accounting:

“Traditional accounting is done in an eventually-consistent way and if you send me a payment from your bank to mine then that transaction will be resolved in an eventually consistent way. That is, your bank account and mine will not have a jointly-atomic change in value, but instead yours will have a debit and mine will have a credit, each of which will be applied to our respective accounts.”

And Matthew comments:

The suggestion that bank transactions are not immediately consistent appears counter-intuitive. Comparing what happens in a transaction with a jointly atomic change in value, like buying a house, with what happens in normal transactions, like buying your groceries, we can see that for normal transactions this statement is true.

We don’t need to wait for the funds to be transferred from our accounts to a retailer before we can walk out the store. If we did we’d all waste a lot of time waiting around.

This highlights a couple of things that are true for both database transactions and financial transactions:

  • that eventual consistency doesn’t mean a lack of consistency
  • that different transactions have different consistency requirements
  • that if all transactions required strict global consistency we’d spend a lot of time waiting for those transactions to complete.

All of which is very true but misses an important point about financial transactions.

Financial transactions (involving banks, etc.) are eventually consistent according to the same rules.

That’s no accident. It didn’t just happen that banks adopted ad hoc rules that resulted in a uniform eventual consistency.

It didn’t happen overnight, but the current set of rules for the “uniform eventual consistency” of banking transactions is spelled out by the Uniform Commercial Code. (And other laws and regulations, but the UCC is a major part of it.)

Dare we say a uniform semantic for financial transactions was hammered out without the use of formal ontologies or web addresses? And that it supports billions of transactions on a daily basis? To become eventually consistent?

Think about the transparency (to you) of your next credit card transaction. Standards and eventual consistency make that possible.
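
To make Sheehy’s point concrete, here is a hypothetical sketch (mine, not Matthew’s or Justin’s) of a payment from your bank to mine: two local transactions at two different banks, no joint atomic commit between them, reconciled later under the agreed-upon rules.

  -- At your bank: the debit commits locally, immediately.
  BEGIN;
  UPDATE accounts SET balance = balance - 100.00 WHERE account_id = 42;
  INSERT INTO outgoing_transfers (account_id, amount, status)
       VALUES (42, 100.00, 'sent');
  COMMIT;

  -- At my bank, some time later: the credit commits locally.
  BEGIN;
  UPDATE accounts SET balance = balance + 100.00 WHERE account_id = 7;
  INSERT INTO incoming_transfers (account_id, amount, status)
       VALUES (7, 100.00, 'settled');
  COMMIT;

Neither bank ever holds locks on the other’s tables; the system as a whole becomes consistent only after both local transactions, and whatever exception handling the rules require, have run.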

September 22, 2012

The Stages of Database Development (video)

Filed under: Database,Design — Patrick Durusau @ 1:32 pm

The Stages of Database Development (video) by Jeremiah Peschka.

The description:

Strong development practices don’t spring up overnight; they take time, effort, and teamwork. Database development practices are doubly hard because they involve many moving pieces – unit testing, integration testing, and deploying changes that could have potential side effects beyond changing logic. In this session, Microsoft SQL Server MVP Jeremiah Peschka will discuss ways users can move toward a healthy cycle of database development using version control, automated testing, and rapid deployment.

Nothing you haven’t heard before in one form or another.

Question: How does your database environment compare to the one Jeremiah describes?

(Never mind that you have “reasons” (read excuses) for the current state of your database environment.)

Doesn’t just happen with databases or even servers.

What about your topic map development environment?

Or any other development environment.

Looking forward to a sequel (sorry) to this video.

September 16, 2012

Spanner : Google’s globally distributed database

Filed under: Database,Distributed Systems — Patrick Durusau @ 5:39 am

Spanner : Google’s globally distributed database

From the post:

This paper, whose co-authors include Jeff Dean and Sanjay Ghemawat of MapReduce fame, describes Spanner. Spanner is Google’s scalable, multi-version, globally distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. Finally the paper comes out! Really exciting stuff!

Abstract from the paper:

Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

Spanner: Google’s Globally Distributed Database (PDF File)

Facing user requirements, Google did not say: Suck it up and use tools already provided.

Google engineered new tools to meet their requirements.

Is there a lesson there for other software projects?

September 12, 2012

PostgreSQL 9.2 released

Filed under: Database,PostgreSQL — Patrick Durusau @ 7:12 pm

PostgreSQL 9.2 released

From the announcement:

The PostgreSQL Global Development Group announces PostgreSQL 9.2, the latest release of the leader in open source databases. Since the beta release was announced in May, developers and vendors have praised it as a leap forward in performance, scalability and flexibility. Users are expected to switch to this version in record numbers.

“PostgreSQL 9.2 will ship with native JSON support, covering indexes, replication and performance improvements, and many more features. We are eagerly awaiting this release and will make it available in Early Access as soon as it’s released by the PostgreSQL community,” said Ines Sombra, Lead Data Engineer, Engine Yard.

Links

  • Downloads, including packages and installers
  • Release Notes
  • Documentation
  • What’s New in 9.2
  • Press Kit

New features like range types:

Range types are used to store a range of data of a given type. There are a few pre-defined types. They are integer (int4range), bigint (int8range), numeric (numrange), timestamp without timezone (tsrange), timestamp with timezone (tstzrange), and date (daterange).

Ranges can be made of continuous (numeric, timestamp…) or discrete (integer, date…) data types. They can be open (the bound isn’t part of the range) or closed (the bound is part of the range). A bound can also be infinite.

Without these datatypes, most people solve the range problems by using two columns in a table. These range types are much more powerful, as you can use many operators on them.

Range types have captured my attention.
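
A small sketch of why (my own example, not from the release notes): with a daterange column you get overlap operators and even exclusion constraints, instead of juggling start/end column pairs.

  CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed for the constraint below

  CREATE TABLE room_bookings (
      room_id  int,
      booked   daterange
  );

  INSERT INTO room_bookings VALUES (101, '[2012-09-12,2012-09-15)');

  -- Overlap test with the && operator:
  SELECT * FROM room_bookings
   WHERE booked && daterange('2012-09-14', '2012-09-20');

  -- An exclusion constraint forbids double-booking the same room:
  ALTER TABLE room_bookings
    ADD CONSTRAINT no_double_booking
    EXCLUDE USING gist (room_id WITH =, booked WITH &&);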

Now to look at the other new features: index-only scans, replication improvements, and the JSON datatype.

September 4, 2012

The Spirit of XLDB (Extremely Large Databases) Past and Present

Filed under: Database,XLDB — Patrick Durusau @ 2:10 pm

The events page for XLDB has:

XLDB 2011 (Slides/Videos), as well as reports going back to the 1st XLDB workshop.

Check back to find later proceedings.

August 22, 2012

VLDB 2012 Advance Program

Filed under: CS Lectures,Database — Patrick Durusau @ 6:42 pm

VLDB 2012 Advance Program

I took this text from the conference homepage:

VLDB is a premier annual international forum for data management and database researchers, vendors, practitioners, application developers, and users. The conference will feature research talks, tutorials, demonstrations, and workshops. It will cover current issues in data management, database and information systems research. Data management and databases remain among the main technological cornerstones of emerging applications of the twenty-first century.

I can’t think of a better summary of the papers, tutorials, etc., that you will find here.

I could easily lose the better part of a week just skimming abstracts.

Suggestions/comments?

July 27, 2012

PostgreSQL’s place in the New World Order

Filed under: Cloud Computing,Database,Heroku,PostgreSQL — Patrick Durusau @ 4:22 am

PostgreSQL’s place in the New World Order by Matthew Soldo.

Description:

Mainstream software development is undergoing a radical shift. Driven by the agile development needs of web, social, and mobile apps, developers are increasingly deploying to platforms-as-a-service (PaaS). A key enabling technology of PaaS is cloud-services: software, often open-source, that is consumed as a service and operated by a third-party vendor. This shift has profound implications for the open-source world. It enables new business models, increases emphasis on user-experience, and creates new opportunities.

PostgreSQL is an excellent case study in this shift. The PostgreSQL project has long offered one of the most reliable open source databases, but has received less attention than competing technologies. But in the PaaS and cloud-services world, reliability and open-ness become increasingly important. As such, we are seeing the beginning of a shift in adoption towards PostgreSQL.

The datastore landscape is particularly interesting because of the recent attention given to the so-called NoSQL technologies. Data is suddenly sexy again. This attention is largely governed by the same forces driving developers to PaaS, namely the need for agility and scalability in building modern apps. Far from being a threat to PostgreSQL, these technologies present an amazing opportunity for showing the way towards making PostgreSQL more powerful and more widely adopted.

The presentation sounds great, but alas, the slidedeck is just a slidedeck. 🙁

I do recommend it for the next to last slide graphic. Very cool!

(And it may be time to take another look at PostgreSQL as well.)

July 26, 2012

Understanding Indexing [Webinar]

Filed under: Database,Indexing — Patrick Durusau @ 8:12 am

Understanding Indexing [Webinar]

July 31st 2012 Time: 2PM EDT / 11AM PDT

From the post:

Three rules on making indexes around queries to provide good performance

Application performance often depends on how fast a query can respond and query performance almost always depends on good indexing. So one of the quickest and least expensive ways to increase application performance is to optimize the indexes. This talk presents three simple and effective rules on how to construct indexes around queries that result in good performance.

This webinar is a general discussion applicable to all databases using indexes and is not specific to any particular MySQL® storage engine (e.g., InnoDB, TokuDB®, etc.). The rules are explained using a simple model that does NOT rely on understanding B-trees, Fractal Tree® indexing, or any other data structure used to store the data on disk.
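
As a taste of what “constructing indexes around queries” looks like, a hypothetical MySQL example (mine, not from the webinar):

  -- The query to serve:
  SELECT order_id, total
    FROM orders
   WHERE customer_id = 42
     AND status = 'shipped'
   ORDER BY order_date DESC;

  -- An index built around that query: equality columns first,
  -- then the sort column, then the selected columns so the index
  -- alone can answer it.
  CREATE INDEX idx_orders_cust_status_date
      ON orders (customer_id, status, order_date, total, order_id);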

Indexing is one of those “overloaded” terms in information technologies.

Indexing can refer to:

  1. Database indexing
  2. Search engine indexing
  3. Human indexing

just to name a few of the more obvious uses.

To be sure, you need to be aware of, if not proficient at, all three, and this webinar should be a start on #1.

PS: If you know of a more complete typology of indexing, perhaps with pointers into the literature, please give a shout!

July 21, 2012

Announcing TokuDB v6.1

Filed under: Database,TokuDB — Patrick Durusau @ 4:56 pm

Announcing TokuDB v6.1

From the post:

TokuDB v6.1 is now generally available and can be downloaded here.

New features include:

  • Added support for MariaDB 5.5 (5.5.25)
    • The TokuDB storage engine is now available with all the additional functionality of MariaDB 5.5.
  • Added HCAD support to our MySQL 5.5 version (5.5.24)
    • Hot column addition/deletion was present in TokuDB v6.0 for MySQL 5.1 and MariaDB 5.2, but not in MySQL 5.5. This feature is now present in all MySQL and MariaDB versions of TokuDB.
  • Improved in-memory point query performance via lock/latch refinement
    • TokuDB has always been a great performer on range scans and workloads where the size of the working data set is significantly larger than RAM. TokuDB v6.0 improved the performance of in-memory point queries at low levels of concurrency. TokuDB v6.1 further increased the performance at all concurrency levels.
    • The following graph shows our sysbench.oltp.uniform performance on an in-memory data set (16 x 5 million row tables, server is 2 x Xeon 5520, 72GB RAM, Centos 5.8)

Go to the post to see impressive performance numbers.

I do wonder, when do performance numbers cease to be meaningful for the average business application?

Like a car that can go from 0 to 60 in under 3 seconds. (Yes, there is such a car, 2011 Bugatti.)

Nice to have, but where are you going to drive it?

As you can tell from this blog, I am all for the latest algorithms, software, hardware, but at the same time, the latest may not be the best for your application.

It may be that simpler, lower-performance solutions will not only be more appropriate but also more robust.

July 8, 2012

MicrobeDB: a locally maintainable database of microbial genomic sequences

Filed under: Bioinformatics,Biomedical,Database,Genome,MySQL — Patrick Durusau @ 3:54 pm

MicrobeDB: a locally maintainable database of microbial genomic sequences by Morgan G. I. Langille, Matthew R. Laird, William W. L. Hsiao, Terry A. Chiu, Jonathan A. Eisen, and Fiona S. L. Brinkman. (Bioinformatics (2012) 28 (14): 1947-1948. doi: 10.1093/bioinformatics/bts273)

Abstract

Summary: Analysis of microbial genomes often requires the general organization and comparison of tens to thousands of genomes both from public repositories and unpublished sources. MicrobeDB provides a foundation for such projects by the automation of downloading published, completed bacterial and archaeal genomes from key sources, parsing annotations of all genomes (both public and private) into a local database, and allowing interaction with the database through an easy to use programming interface. MicrobeDB creates a simple to use, easy to maintain, centralized local resource for various large-scale comparative genomic analyses and a back-end for future microbial application design.

Availability: MicrobeDB is freely available under the GNU-GPL at: http://github.com/mlangill/microbedb/

No doubt a useful project but the article seems to be at war with itself:

Although many of these centers provide genomic data in a variety of static formats such as Genbank and Fasta, these are often inadequate for complex queries. To carry out these analyses efficiently, a relational database such as MySQL (http://mysql.com) can be used to allow rapid querying across many genomes at once. Some existing data providers such as CMR allow downloading of their database files directly, but these databases are designed for large web-based infrastructures and contain numerous tables that demand a steep learning curve. Also, addition of unpublished genomes to these databases is often not supported. A well known and widely used system is the Generic Model Organism Database (GMOD) project (http://gmod.org). GMOD is an open-source project that provides a common platform for building model organism databases such as FlyBase (McQuilton et al., 2011) and WormBase (Yook et al., 2011). GMOD supports a variety of options such as GBrowse (Stein et al., 2002) and a variety of database choices including Chado (Mungall and Emmert, 2007) and BioSQL (http://biosql.org). GMOD provides a comprehensive system, but for many researchers such a complex system is not needed.

On one hand, current solutions are “…often inadequate for complex queries” and just a few lines later, “…such a complex system is not needed.”

I have no doubt that using unfamiliar and complex table structures is a burden on any user. Not to mention lacking the ability to add “unpublished genomes” or to fix versions of data for analysis.

What concerns me is the “solution” being seen as yet another set of “local” options, which impedes the future use of the now “localized” data.

The issues raised here need to be addressed, but one-off solutions seem like a particularly poor choice.

June 27, 2012

Introducing new Fusion Tables API [Deprecation – SQL API]

Filed under: Database,Fusion Tables,SQL — Patrick Durusau @ 10:03 am

Introducing new Fusion Tables API by Warren Shen.

The post in its entirety:

We are very pleased to announce the public availability of the new Fusion Tables API. The new API includes all of the functionality of the existing SQL API, plus the ability to read and modify table and column metadata as well as the definitions of styles and templates for data visualization. This API is also integrated with the Google APIs console which lets developers manage all their Google APIs in one place and take advantage of built-in reporting and authentication features.

With this launch, we are also announcing a six month deprecation period for the existing SQL API. Since the new API includes all of the functionality of the existing SQL API, developers can easily migrate their applications using our migration guide.

For a detailed description of the features in the new API, please refer to the API documentation.

BTW, if you go to the Migration Guide, be aware that as of 27 June 2012, the following links aren’t working (404):

This Migration Guide documents how to convert existing code using the SQL API to code using the Fusion Tables API 1.0. This information is discussed in more detail in the Getting Started and Using the API developer guides.

I have discovered the error:

https://developers.google.com/fusiontables/docs/v1/v1/getting_started.html – Wrong – note the successive “/v1.”

https://developers.google.com/fusiontables/docs/v1/getting_started – Correct – From the left side nav. bar.

https://developers.google.com/fusiontables/docs/v1/v1/using.html – Wrong – note the successive “/v1.”

https://developers.google.com/fusiontables/docs/v1/using – Correct – From the left side nav. bar.

The summary material appears to be useful but you will need the more detailed information as well.

For example, under HTTP Methods (in the Migration Guide), the SQL API is listed as having:

GET for SHOW TABLES, DESCRIBE TABLE, SELECT

And the equivalent in the Fusion API:

GET for SELECT

No equivalent of SHOW TABLES, DESCRIBE TABLE using GET.

If you find and read Using the API you will find:

Retrieving a list of tables

Listing tables is useful because it provides the table ID and column names of tables that are necessary for other calls. You can retrieve the list of tables a user owns by sending an HTTP GET request to the URI with the following format:

https://www.googleapis.com/fusiontables/v1/tables

Tables are listed along with column ids, names and datatypes.

That may be too much for the migration document, but implying that all you have with GET is SELECT is misleading.

Rather: GET for TABLES (SHOW + DESCRIBE), SELECT

Yes?

June 21, 2012

SchemafreeDB

Filed under: Database,Web Applications — Patrick Durusau @ 3:24 pm

SchemafreeDB (the FAQ)

From the blog page:

A Database for Web Applications

It’s been about 2 weeks since we announced the preview of SchemafreeDB. The feedback was loud and clear, we need to work on the site design. We listened and now have what we think is a big improvement in design and message.

What’s The Message

In redesigning the site we thought more about what we wanted to say and how we should better convey that message. We realized we were focusing primarily on features. Features make a product but they do not tell the product’s story. The main character in the SchemafreeDB story is web development. We are web developers and we created SchemafreeDB out of necessity and desire. With that in mind we have created a more “web application” centric message. Below is our new messaging in various forms.

The FAQ says that SchemafreeDB is different from every other type of DB and better/faster, etc.

I would appreciate any insight you may have into statements like:

What is “join-free SQL” and why is this a good thing?

With SchemafreeDB, you can query deeply across complex structures via a simple join-free SQL syntax. e.g: WHERE $s:person.address.city=’Rochester’ AND $n:person.income>50000

This simplicity gives you new efficiencies when working with complex queries, thus increasing your overall productivity as a developer.

The example isn’t a complex query, nor do I know anyone who would think so.
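
For comparison, roughly the same predicate against a conventionally normalized schema (my sketch, with invented table names) needs exactly one join:

  SELECT p.*
    FROM person  p
    JOIN address a ON a.person_id = p.person_id
   WHERE a.city   = 'Rochester'
     AND p.income > 50000;

Dropping that one join is a convenience, but it is hardly a new efficiency for “complex queries.”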

Are any of you using this service?

June 11, 2012

Monday Fun: Seven Databases in Song

Filed under: Database,Humor — Patrick Durusau @ 4:27 pm

Monday Fun: Seven Databases in Song

From the post:

If you understand things best when they’re formatted as a musical, this video is for you. It teaches the essentials of PostgreSQL, Riak, HBase, MongoDB, CouchDB, Neo4J and Redis in the style of My Fair Lady. And for a change, it’s very SFW.

This is a real hoot!

It went by a little too quickly to make sure it covered everything, but it covered a lot. 😉

All kidding aside, there have been memorization techniques that relied upon rhyme and song.

Not saying you will have a gold record with an album of Hadoop commands with options, but you might gain some notoriety.

If you start setting *nix commands to song, I don’t think Stairway to Heaven is long enough for sed and all its options.

June 4, 2012

Where’s your database’s ER Diagram?

Filed under: Database,Documentation — Patrick Durusau @ 4:32 pm

Where’s your database’s ER Diagram? by Scott Selikoff.

From the post:

I was recently training a new software developer, explaining the joys of three-tier architecture and the importance of the proper black-box encapsulation, when the subject switched to database design and ER diagrams. For those unfamiliar with the subject, entity-relationship diagrams, or ER diagrams for short, are a visual technique for modelling entities, aka tables in relational databases, and the relationships between the entities, such as foreign key constraints, 1-to-many relationships, etc. Below is a sample of such a diagram.
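
(The sample diagram is in Scott’s post. As a stand-in, here is the kind of schema such a diagram depicts, a generic sketch of my own: the 1-to-many edge in the picture is nothing more than a foreign key constraint in the DDL.)

  CREATE TABLE departments (
      dept_id   INT PRIMARY KEY,
      dept_name VARCHAR(100)
  );

  -- One department has many employees: this foreign key is the
  -- 1-to-many relationship an ER diagram draws as a single edge.
  CREATE TABLE employees (
      emp_id    INT PRIMARY KEY,
      emp_name  VARCHAR(100),
      dept_id   INT NOT NULL REFERENCES departments (dept_id)
  );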

Scott’s post is particularly appropriate since we were talking about documentation of your aggregation strategy in MongoDB.

My experience is that maintenance of documentation in general, not just E-R diagrams, is a very low priority.

Which means that migration of databases and other information resources is far more expensive and problematic than necessary.

There is a solution to the absence of current documentation.

No, it isn’t topic maps, at least not necessarily, although a topic map could be part of a solution to the documentation problem.

What could make a difference would be the tracking of changes to the system/schema/database/etc. with relationships to the people who made them.

So that at the end of each week, for example, it would be easy to tell who had or had not created the necessary documentation for the changes they had made.

Think of it as bringing accountability to change tracking. It isn’t enough to track a change or to know who made it, if we lack the documentation necessary to understand the change.

When I said you would not necessarily have to use a topic map, I was thinking of JIRA, which has ample opportunities for documentation of changes. (Insert your favorite solution; JIRA happens to be one that is familiar.) It does require the discipline to enter the documentation.

April 23, 2012

ICDM 2012

ICDM 2012 Brussels, Belgium | December 10 – 13, 2012

From the webpage:

The IEEE International Conference on Data Mining series (ICDM) has established itself as the world’s premier research conference in data mining. It provides an international forum for presentation of original research results, as well as exchange and dissemination of innovative, practical development experiences. The conference covers all aspects of data mining, including algorithms, software and systems, and applications.

ICDM draws researchers and application developers from a wide range of data mining related areas such as statistics, machine learning, pattern recognition, databases and data warehousing, data visualization, knowledge-based systems, and high performance computing. By promoting novel, high quality research findings, and innovative solutions to challenging data mining problems, the conference seeks to continuously advance the state-of-the-art in data mining. Besides the technical program, the conference features workshops, tutorials, panels and, since 2007, the ICDM data mining contest.

Important Dates:

ICDM contest proposals: April 30
Conference full paper submissions: June 18
Demo and tutorial proposals: August 10
Workshop paper submissions: August 10
PhD Forum paper submissions: August 10
Conference paper, tutorial, demo notifications: September 18
Workshop paper notifications: October 1
PhD Forum paper notifications: October 1
Camera-ready copies and copyright forms: October 15

April 12, 2012

Drizzle: An Open Source Microkernel DBMS for High Performance Scale-Out Applications

Filed under: Database,Drizzle,MySQL — Patrick Durusau @ 7:07 pm

Drizzle: An Open Source Microkernel DBMS for High Performance Scale-Out Applications

From the webpage:

The Global Drizzle Development Team is pleased to announce the immediate availability of Drizzle 7.1.33-stable. The first stable release of Drizzle 7.1 and the result of 12 months of hard work from contributors around the world.

Improvements in Drizzle 7.1 compared to 7.0

  • Xtrabackup is included (in-tree) by Stewart Smith
  • Multi-source replication by David Shrewsbury
  • Improved execute parser by Brian Aker and Vijay Samuel
  • Servers are identified with UUID in replication by Joe Daly
  • HTTP JSON API (experimental) by Stewart Smith
  • Percona Innodb patches merged by Laurynas Biveinis
  • JS plugin: execute JavaScript code as a Drizzle function by Henrik Ingo
  • IPV6 data type by Muhammad Umair
  • Improvements to libdrizzle client library by Andrew Hutchings and Brian Aker
  • Query log plugin and auth_schema by Daniel Nichter
  • ZeroMQ plugin by Markus Eriksson
  • Ability to publish transactions to zeromq and rabbitmq by Marcus Eriksson
  • Replication Dictionary by Brian Aker
  • Log output to syslog is enabled by default by Brian Aker
  • Improvements to logging stats plugin
  • Removal of drizzleadmin utility (you can now do all administration from drizzle client itself) by Andrew Hutchings
  • Improved Regex Plugin by Clint Byrum
  • Improvements to pandora build by Monty Taylor
  • New version numbering system and support for it in pandora-build by Henrik Ingo
  • Updated DEB and RPM packages, by Henrik Ingo
  • Revamped testing system Kewpie all-inclusive with suites of randgen, sysbench, sql-bench, and crashme tests by Patrick Crews
  • Removal of HailDB engine by Stewart Smith
  • Removal of PBMS engine
  • Continued code refactoring by Olaf van der Spek, Brian Aker and others
  • many bug fixes
  • Brian Aker, Mark Atwood – Continuous Integration
  • Vijay Samuel – Release Manager

From the documentation page:

Drizzle is a transactional, relational, community-driven open-source database that is forked from the popular MySQL database.

The Drizzle team has removed non-essential code, has re-factored the remaining code, and has converted the code to modern C++ and modern libraries.

Charter

  • A database optimized for Cloud infrastructure and Web applications
  • Design for massive concurrency on modern multi-CPU architectures
  • Optimize memory use for increased performance and parallelism
  • Open source, open community, open design

Scope

  • Re-designed modular architecture providing plugins with defined APIs
  • Simple design for ease of use and administration
  • Reliable, ACID transactional

If you like databases and data structure research, now is a wonderful time to be active.

April 9, 2012

The Database Nirvana (And an Alternative)

Filed under: Database,Open Source — Patrick Durusau @ 4:31 pm

The Database Nirvana

Alex Popescu of myNoSQL sides with Jim Webber in thinking we need to avoid a “winner-takes-it-all-war” among database advocates.

Saying that people should pick the best store for their data model is a nice sentiment, but I rather doubt it will change long- or short-term outcomes between competing data stores.

I don’t know that anything will, but I do have a concrete suggestion that might stand a chance, in the short run at any rate.

We are all familiar with “to many eyes all bugs are shallow” and other Ben Franklin-like sayings.

OK, so rather than seeing another dozen, two dozen, or more data stores this year (2012), why not pick an existing store, learn its community, and offer your talents: writing code, tests, debugging, creating useful documentation, creating tutorials, etc.?

The data store community, if you look for database projects at Sourceforge for example, is like a professional sports league with too many teams. The talent is so spread out that there are only one or two very successful teams and the others, well, are not so great.

If all of the existing data store projects picked up another 100 volunteers each, there would be enough good code, documentation and other resources to hold off both major/minor vendors and other store projects.

The various store projects would have to welcome volunteers. That means doing more than protesting that the way it is done is the best possible way for whatever needs to be done.

If we don’t continue to have a rich ecosystem of store projects, it won’t be entirely the fault of vendors nor winner-take-it-all-wars. A lack of volunteers and acceptance of volunteers will share part of the blame.

April 3, 2012

Apache Sqoop Graduates from Incubator

Filed under: Database,Hadoop,Sqoop — Patrick Durusau @ 4:18 pm

Apache Sqoop Graduates from Incubator by Arvind Prabhakar.

From the post:

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.

In its monthly meeting in March of 2012, the board of Apache Software Foundation (ASF) resolved to grant a Top-Level Project status to Apache Sqoop, thus graduating it from the Incubator. This is a significant milestone in the life of Sqoop, which has come a long way since its inception almost three years ago.

For moving data in and out of Hadoop, Sqoop is your friend. Drop by and say hello.

March 21, 2012

A graphical overview of your MySQL database

Filed under: Data,Database,MySQL — Patrick Durusau @ 3:30 pm

A graphical overview of your MySQL database by Christophe Ladroue.

From the post:

If you use MySQL, there’s a default schema called ‘information_schema‘ which contains lots of information about your schemas and tables among other things. Recently I wanted to know whether a table I use for storing the results of a large number experiments was any way near maxing out. To cut a brief story even shorter, the answer was “not even close” and could be found in ‘information_schema.TABLES‘. Not being one to avoid any opportunity to procrastinate, I went on to write a short script to produce a global overview of the entire database.

information_schema.TABLES contains the following fields: TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, AVG_ROW_LENGTH and MAX_DATA_LENGTH (and a few others). We can first have a look at the relative sizes of the schemas with the MySQL query “SELECT TABLE_SCHEMA,SUM(DATA_LENGTH) SCHEMA_LENGTH FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema' GROUP BY TABLE_SCHEMA”.
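
If you want a similar overview without leaving the mysql client, a variation on that query breaks the numbers down per table (my variant, not Christophe’s):

  SELECT TABLE_SCHEMA,
         TABLE_NAME,
         TABLE_ROWS,
         ROUND(DATA_LENGTH  / 1024 / 1024, 1) AS data_mb,
         ROUND(INDEX_LENGTH / 1024 / 1024, 1) AS index_mb
    FROM information_schema.TABLES
   WHERE TABLE_SCHEMA NOT IN ('information_schema', 'mysql')
   ORDER BY DATA_LENGTH DESC
   LIMIT 20;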

Christophe includes R code to generate graphics that you will find useful in managing (or just learning about) MySQL databases.

While the parts of the schema Christophe is displaying graphically are obviously subjects, the graphical display pushed me in another direction.

If we can visualize the schema of a MySQL database, then shouldn’t we be able to visualize the database structures a bit closer to the metal?

And if we can visualize those database structures, shouldn’t we be able to represent them and the relationships between them as a graph?

Or perhaps better, can we “view” those structures and relationships “on demand” as a graph?

That is in fact what is happening when we display a table at the command prompt for MySQL. It is a “display” of information, it is not a report of information.

I don’t know enough about the internal structures of MySQL or PostgreSQL to start such a mapping. But ignorance is curable, at least that is what they say. 😉

I have another post today that suggests a different take on conversion methodology.

March 20, 2012

Worst-case Optimal Join Algorithms

Filed under: Algorithms,Database,Joins — Patrick Durusau @ 3:52 pm

Worst-case Optimal Join Algorithms by Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra.

Abstract:

Efficient join processing is one of the most fundamental and well-studied tasks in database research. In this work, we examine algorithms for natural join queries over many relations and describe a novel algorithm to process these queries optimally in terms of worst-case data complexity. Our result builds on recent work by Atserias, Grohe, and Marx, who gave bounds on the size of a full conjunctive query in terms of the sizes of the individual relations in the body of the query. These bounds, however, are not constructive: they rely on Shearer’s entropy inequality which is information-theoretic. Thus, the previous results leave open the question of whether there exist algorithms whose running time achieve these optimal bounds. An answer to this question may be interesting to database practice, as it is known that any algorithm based on the traditional select-project-join style plans typically employed in an RDBMS are asymptotically slower than the optimal for some queries. We construct an algorithm whose running time is worst-case optimal for all natural join queries. Our result may be of independent interest, as our algorithm also yields a constructive proof of the general fractional cover bound by Atserias, Grohe, and Marx without using Shearer’s inequality. This bound implies two famous inequalities in geometry: the Loomis-Whitney inequality and the Bollobás-Thomason inequality. Hence, our results algorithmically prove these inequalities as well. Finally, we discuss how our algorithm can be used to compute a relaxed notion of joins.

With reference to the optimal join problem the authors say:

Implicitly, this problem has been studied for over three decades: a modern RDBMS use decades of highly tuned algorithms to efficiently produce query results. Nevertheless, as we described above, such systems are asymptotically suboptimal – even in the above simple example of (1). Our main result is an algorithm that achieves asymptotically optimal worst-case running times for all conjunctive join queries.

The authors’ strategy involves evaluating the keys in a join and dividing those keys into separate sets. The information used by the authors has always been present, just not used in join processing. (pp. 2-3 of the article)
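
The canonical illustration for this line of work is the “triangle” query, a natural join of three binary relations (my SQL rendering; I believe it is essentially the paper’s example (1), but check the paper):

  -- Any plan that starts with a pairwise join (R with S, say) can build
  -- an intermediate result far larger than the final output, which is
  -- why traditional select-project-join plans are asymptotically
  -- suboptimal here.
  SELECT R.a, R.b, S.c
    FROM R
    JOIN S ON S.b = R.b
    JOIN T ON T.a = R.a
          AND T.c = S.c;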

There are a myriad of details to be mastered in the article but I suspect this line of thinking may be profitable in many situations where “join” operations are relevant.

March 11, 2012

Keyword Searching and Browsing in Databases using BANKS

Filed under: Database,Keywords,Searching — Patrick Durusau @ 8:09 pm

Keyword Searching and Browsing in Databases using BANKS

From the post:

BANKS is a system that enables keyword based searches on a relational database. As a paper that was published 10 years ago in ICDE 2002, it has won the most influential paper award for the past decade this year at ICDE. Hearty congrats to the team from IIT Bombay’s CSE department.

Abstract:

With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results.

BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.

The paper: http://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/BanksICDE2002.pdf.

It is a very interesting paper.

BTW, can someone point me to the ICDE proceedings where it was voted best paper of the decade? I am assuming that ICDE = International Conference on Data Engineering. I am sure I am just overlooking the award and would like to include a pointer to it in this post. Thanks!
