Archive for the ‘Schema’ Category

Schema on Read? [The virtues of schema on write]

Friday, April 19th, 2013

Apache Hadoop and Data Agility by Ofer Mendelevitch.

From the post:

In a recent blog post I mentioned the 4 reasons for using Hadoop for data science. In this blog post I would like to dive deeper into the last of these reasons: data agility.

In most existing data architectures, based on relational database systems, the data schema is of central importance, and needs to be designed and maintained carefully over the lifetime of the project. Furthermore, whatever data fits into the schema will be stored, and everything else typically gets ignored and lost. Changing the schema is a significant undertaking, one that most IT organizations don’t take lightly. In fact, it is not uncommon for a schema change in an operational RDBMS system to take 6-12 months if not more.

Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.

If a schema is supplied “on read,” how is data validation accomplished?

I don’t mean in terms of datatypes such as string, integer, double, etc. Those are trivial forms of data validation.

How do we validate the semantics of data when a schema is supplied on read?

Mistakes do happen in RDBMS systems but with a schema, which defines data semantics, applications can attempt to police those semantics.

I don’t doubt that schema “on read” supplies a lot of useful flexibility, but how do we limit the damage that flexibility can cause?

For example, many years ago, area codes (for telephones) in the USA were tied to geographic exchanges. That is no longer true in many cases, but data from that era still exists in the bowels of some data stores.

Let’s assume I have older data that has area codes tied to geographic areas and newer data that has area codes that are not. Without a schema to define the area code data in both cases, how would I know to treat the area code data differently?
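To make the question concrete, here is a minimal sketch of what "schema on read" would need in order to treat the two eras differently. Everything in it is an assumption for illustration: the `schema_version` tag, the reader functions, and the tiny lookup table are hypothetical names, not part of Hadoop or any library. The point is only that some marker has to be written with the data, or the reader has nothing to choose a schema by.

```python
# Hypothetical raw records stored without an enforced schema. The
# "schema_version" tag is an assumption: without some such marker written
# alongside the data, a reader cannot tell the two eras apart.
RAW_RECORDS = [
    {"schema_version": "geo", "phone": "404-555-0100"},
    {"schema_version": "portable", "phone": "404-555-0199"},
]

# Illustrative lookup only; not a real numbering-plan table.
AREA_CODE_REGIONS = {"404": "Atlanta, GA"}


def read_geo(record):
    """Older data: the area code is assumed to imply a geographic region."""
    area_code = record["phone"].split("-")[0]
    return {"area_code": area_code, "region": AREA_CODE_REGIONS.get(area_code)}


def read_portable(record):
    """Newer data: the area code says nothing reliable about location."""
    area_code = record["phone"].split("-")[0]
    return {"area_code": area_code, "region": None}


# One reader (one "schema") per era, chosen at read time.
READERS = {"geo": read_geo, "portable": read_portable}


def apply_schema_on_read(record):
    """Pick the reader by the record's tag; refuse to guess when it is absent."""
    version = record.get("schema_version")
    if version not in READERS:
        raise ValueError(f"no schema known for record: {record!r}")
    return READERS[version](record)


if __name__ == "__main__":
    for rec in RAW_RECORDS:
        print(apply_schema_on_read(rec))
```

Note that the tag itself is a small piece of schema supplied on write, which is rather the point.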

I concede that schema “on read” can be quite flexible.

On the other hand, let’s not discount the value of schema “on write” as well.

Schemaless Data Structures

Saturday, January 12th, 2013

Schemaless Data Structures by Martin Fowler.

From the first slide:

In recent years, there’s been an increasing amount of talk about the advantages of schemaless data. Being schemaless is one of the main reasons for interest in NoSQL databases. But there are many subtleties involved in schemalessness, both with respect to databases and in-memory data structures. These subtleties are present both in the meaning of schemaless and in the advantages and disadvantages of using a schemaless approach.

Martin points out that “schemaless” does not mean the lack of a schema but rather the lack of an explicit schema.

Sounds a great deal like the implicit subjects that topic maps have the ability to make explicit.

Is there a continuum of explicitness for any given subject/schema?

Starting from entirely implied, followed by an explicit representation, then further explication as in a data dictionary, and at some distance from the start, a subject defined as a set of properties, which are themselves defined as sets of properties, in relationships with other sets of properties.

How far you go down that road depends on your requirements.

Intro to HBase Internals and Schema Design

Tuesday, July 10th, 2012

Intro to HBase Internals and Schema Design by Alex Baranau.

You will be disappointed by the slide that reads:

HBase will not adjust cluster settings to optimal based on usage patterns automatically.

Sorry, but we just aren’t quite to drag-n-drop software that optimizes to arbitrary data without user intervention.

Not sure we could keep that secret from management very long in any case so perhaps all for the best.

Once you get over your chagrin at still having to work, a little anyway, you will find Alex’s presentation a high-level peek at the internals of HBase. It should be enough to get you motivated to learn more on your own. Not guaranteeing that, but that should be the average result.

Improving Schema Matching with Linked Data (Flushing the Knowledge Toilet)

Tuesday, May 15th, 2012

Improving Schema Matching with Linked Data by Ahmad Assaf, Eldad Louw, Aline Senart, Corentin Follenfant, Raphaël Troncy, and David Trastour.


With today’s public data sets containing billions of data items, more and more companies are looking to integrate external data with their traditional enterprise data to improve business intelligence analysis. These distributed data sources however exhibit heterogeneous data formats and terminologies and may contain noisy data. In this paper, we present a novel framework that enables business users to semi-automatically perform data integration on potentially noisy tabular data. This framework offers an extension to Google Refine with novel schema matching algorithms leveraging Freebase rich types. First experiments show that using Linked Data to map cell values with instances and column headers with types improves significantly the quality of the matching results and therefore should lead to more informed decisions.

Personally I don’t find mapping Airport -> Airport Code all that convincing a demonstration.

The other problem I have is what happens after a user “accepts” a mapping?

Now what?

I can contribute my expertise to mappings between diverse schemas all day, even public ones.

What happens to all that human effort?

It is what I call the “knowledge toilet” approach to information retrieval/integration.

Software runs (I can’t count the number of times integration software has been run on Citeseer. Can you?) and a user corrects the results as best they are able.

Now what?

Oh, yeah, the next user or group of users does it all over again.


Because the user before them flushed the knowledge toilet.

The information had been mapped. Possibly even hand corrected by one or more users. Then it is just tossed away.

That has to seem wrong at some very fundamental level, whatever semantic technology you choose to use.

I’m open to suggestions.

How do we stop flushing the knowledge toilet?
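One modest possibility, offered as a sketch rather than an answer: keep the accepted mappings, along with who accepted them and why, so the next user starts from them instead of from zero. The `MappingStore` class, its JSON file, and the field names below are all hypothetical; nothing here is taken from the paper or from Google Refine.

```python
import json
from pathlib import Path


class MappingStore:
    """Hypothetical persistent record of user-accepted schema mappings."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload whatever earlier users accepted, if the file already exists.
        if self.path.exists():
            self.mappings = json.loads(self.path.read_text())
        else:
            self.mappings = {}

    def accept(self, source, target, user, reason):
        """Record a mapping with who accepted it and why, then save to disk."""
        self.mappings[source] = {
            "target": target,
            "accepted_by": user,
            "reason": reason,
        }
        self.path.write_text(json.dumps(self.mappings, indent=2))

    def lookup(self, source):
        """Return the previously accepted target, or None if nobody mapped it."""
        entry = self.mappings.get(source)
        return entry["target"] if entry else None


if __name__ == "__main__":
    store = MappingStore("accepted_mappings.json")
    store.accept("Airport", "Airport Code", "first_user", "column holds codes")
    # A later session, or a later user, sees the earlier work.
    print(MappingStore("accepted_mappings.json").lookup("Airport"))
```

Keeping the reason next to the mapping matters as much as the mapping itself, since a later user can only trust or correct what they can inspect.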

On Schemas and Lucene

Friday, April 20th, 2012

On Schemas and Lucene

Chris Male writes:

One of the very first thing users encounter when using Apache Solr is its schema. Here they configure the fields that their Documents will contain and the field types which define amongst other things, how field data will be analyzed. Solr’s schema is often touted as one of its major features and you will find it used in almost every Solr component. Yet at the same time, users of Apache Lucene won’t encounter a schema. Lucene is schemaless, letting users index Documents with any fields they like.

To me this schemaless flexibility comes at a cost. For example, Lucene’s QueryParsers cannot validate that a field being queried even exists or use NumericRangeQuerys when a field is numeric. When indexing, there is no way to automate creating Documents with their appropriate fields and types from a series of values. In Solr, the most optimal strategies for faceting and grouping different fields can be chosen based on field metadate retrieved from its schema.

Consequently as part of the modularisation of Solr and Lucene, I’ve always wondered whether it would be worth creating a schema module so that Lucene users can benefit from a schema, if they so choose. I’ve talked about this with many people over the last 12 months and have had a wide variety of reactions, but inevitably I’ve always come away more unsure. So in this blog I’m going ask you a lot of questions and I hope you can clarify this issue for me.

What follows is a deeply thoughtful examination of the pros and cons of schemas for Lucene and/or their role in Solr.
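The cost Chris describes is easy to picture with a small sketch. This is plain Python, not the Lucene or Solr API: the `SCHEMA` dictionary, the `plan_query` function and the tuple "plans" it returns are hypothetical stand-ins. It shows the two things a schema buys a query parser: rejecting unknown fields, and choosing a numeric range query when the field is numeric.

```python
# Hypothetical field metadata, standing in for what a schema would declare.
SCHEMA = {
    "title": "text",
    "price": "numeric",
}


def plan_query(field, value):
    """Return a simple query plan, using the schema to validate and to choose."""
    if field not in SCHEMA:
        # Without a schema there is nothing to check the field name against.
        raise KeyError(f"unknown field: {field}")
    if SCHEMA[field] == "numeric":
        # Numeric fields get a range query over (low, high).
        low, high = value
        return ("range", field, float(low), float(high))
    # Everything else falls back to a plain term query.
    return ("term", field, str(value))


if __name__ == "__main__":
    print(plan_query("price", (10, 20)))
    print(plan_query("title", "lucene"))
```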

If you are using Lucene, take the time to review Chris’s questions and voice your questions or concerns.

The Lucene you improve will be your own.

If you are interested in either Lucene or Solr, now would be a good time to speak up.

Percona Toolkit 2.1 with New Online Schema Change Tool

Friday, April 13th, 2012

Percona Toolkit 2.1 with New Online Schema Change Tool by Baron Schwartz.

From the post:

I’m proud to announce the GA release of version 2.1 of Percona Toolkit. Percona Toolkit is the essential suite of administrative tools for MySQL.

With this release we introduce a new version of pt-online-schema-change, a tool that enables you to ALTER large tables with no blocking or downtime. As you know, MySQL locks tables for most ALTER operations, but pt-online-schema-change performs the ALTER without any locking. Client applications can continue reading and writing the table with no interruption.

With this new version of the tool, one of the most painful things anyone experiences with MySQL is significantly alleviated. If you’ve ever delayed a project’s schedule because the release involved an ALTER, which had to be scheduled in the dead of the night on Sunday, and required overtime and time off, you know what I mean. A schema migration is an instant blocker in the critical path of your project plan. No more!

Certainly a useful feature for MySQL users.

Not to mention being another step towards data models being a matter of how you choose to view the data for some particular purpose. Not quite there, yet, but that day is coming.

In a very real sense, the “normalization” of data and the data models we have built into SQL systems were compensation for the short-comings of our computing platforms. That we have continued to do so in the face of increases in computing resources that make it unnecessary, is evidence of short-comings on our part.

Combining Heterogeneous Classifiers for Relational Databases (Of Relational Prisons and such)

Sunday, January 22nd, 2012

Combining Heterogeneous Classifiers for Relational Databases by Geetha Manjunatha, M Narasimha Murty and Dinkar Sitaram.


Most enterprise data is distributed in multiple relational databases with expert-designed schema. Using traditional single-table machine learning techniques over such data not only incur a computational penalty for converting to a ‘flat’ form (mega-join), even the human-specified semantic information present in the relations is lost. In this paper, we present a practical, two-phase hierarchical meta-classification algorithm for relational databases with a semantic divide and conquer approach. We propose a recursive, prediction aggregation technique over heterogeneous classifiers applied on individual database tables. The proposed algorithm was evaluated on three diverse datasets, namely TPCH, PKDD and UCI benchmarks and showed considerable reduction in classification time without any loss of prediction accuracy.

When I read:

So, a typical enterprise dataset resides in such expert-designed multiple relational database tables. On the other hand, as known, most traditional classification algorithms still assume that the input dataset is available in a single table – a flat representation of data attributes. So, for applying these state-of-art single-table data mining techniques to enterprise data, one needs to convert the distributed relational data into a flat form.

a couple of things dropped into place.

First, the problem being described, the production of a flat form for analysis reminds me of the problem of record linkage in the late 1950’s (predating relational databases). There records were regularized to enable very similar analysis.

Second, as the authors state in a paragraph or so, conversion to such a format is not possible in most cases. Interesting that the choice of relational database table design has the impact of limiting the type of analysis that can be performed on the data.

Therefore, knowledge mining over real enterprise data using machine learning techniques is very valuable for what is called an intelligent enterprise. However, application of state-of-art pattern recognition techniques in the mainstream BI has not yet taken off [Gartner report] due to lack of in-memory analytics among others. The key hurdle to make this possible is the incompatibility between the input data formats used by most machine learning techniques and the formats used by real enterprises.

If freeing data from its relational prison is a key aspect to empowering business intelligence (BI), what would you suggest as a solution?

Web Schemas Task Force

Saturday, October 1st, 2011

Web Schemas Task Force, chaired by R.V. Guha (Google).

Here is your opportunity to participate in some very important work at the W3C without a W3C membership.

From the wiki page:

This is the main Wiki page for W3C’s Semantic Web Interest Group Web Schemas task force.

The taskforce chair is R.V.Guha (Google).

In scope include collaborations on mappings, tools, extensibility and cross-syntax interoperability. An HTML Data group is nearby; detailed discussion about Web data syntax belongs there.

See the charter for more details.

The group uses the mailing list

  • See archives
  • To subscribe, send a message to with Subject: subscribe (see for more details).
  • If you are new to the W3C community, you will need to go through the archive approval process before your posts show up in the archives.
  • To edit this wiki, you’ll need a W3C account; these are available to all

Groups who maintain Web Schemas are welcome to use this forum as a feedback channel, in additional to whatever independent mechanisms they also offer.

The following from the charter makes me think that topic maps may be relevant to the task at hand:

Participants are encouraged to use the group to take practical steps towards interoperability amongst diverse schemas, e.g. through development of mappings, extensions and supporting tools. Those participants who maintain vocabularies in any format designed for wide-scale public Web use are welcome to also to participate in the group as a ‘feedback channel’, including practicalities around syntax, encoding and extensibility (which will be relayed to other W3C groups as appropriate).

MongoDB Schema Design Basics

Friday, July 29th, 2011

MongoDB Schema Design Basics

From Alex Popescu’s myNoSQL:

For NoSQL databases there are no clear rules like the Boyce-Codd Normal Form database normalization. Data modeling and analysis of data access patterns are two fundamental activities. While over the last 2 years we’ve gather some recipes, it’s always a good idea to check what are the recommended ways to model your data with your choice of NoSQL database.

After the break, watch 10gen’s Richard Kreuter’s presentation on MongoDB schema design.

A must see video!

Designing and Refining Schema Mappings via Data Examples

Monday, June 20th, 2011

Designing and Refining Schema Mappings via Data Examples by Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang-Chiew Tan, from SIGMOD ’11.


A schema mapping is a specification of the relationship between a source schema and a target schema. Schema mappings are fundamental building blocks in data integration and data exchange and, as such, obtaining the right schema mapping constitutes a major step towards the integration or exchange of data. Up to now, schema mappings have typically been specified manually or have been derived using mapping-design systems that automatically generate a schema mapping from a visual specification of the relationship between two schemas. We present a novel paradigm and develop a system for the interactive design of schema mappings via data examples. Each data example represents a partial specification of the semantics of the desired schema mapping. At the core of our system lies a sound and complete algorithm that, given a finite set of data examples, decides whether or not there exists a GLAV schema mapping (i.e., a schema mapping specified by Global-and-Local-As-View constraints) that “fits” these data examples. If such a fitting GLAV schema mapping exists, then our system constructs the “most general” one. We give a rigorous computational complexity analysis of the underlying decision problem concerning the existence of a fitting GLAV schema mapping, given a set of data examples. Specifically, we prove that this problem is complete for the second level of the polynomial hierarchy, hence, in a precise sense, harder than NP-complete. This worst-case complexity analysis notwithstanding, we conduct an experimental evaluation of our prototype implementation that demonstrates the feasibility of interactively designing schema mappings using data examples. In particular, our experiments show that our system achieves very good performance in real-life scenarios.

Two observations:

1) The use of data examples may help overcome the difficulty of getting users to articulate “why” a particular mapping should occur.

2) Data examples that support mappings, if preserved, could be used to illustrate for subsequent users “why” particular mappings were made or even should be followed in mappings to additional schemas.

Mapping across revisions of a particular schema or across multiple schemas at a particular time is likely to benefit from this technique.

Why Will Win

Monday, June 13th, 2011

It isn’t hard to see why is going to win out over “other” semantic web efforts.

The first paragraph at the website says why:

This site provides a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers. Search engines including Bing, Google and Yahoo! rely on this markup to improve the display of search results, making it easier for people to find the right web pages.

  • Easy: Uses HTML tags
  • Immediate Utility: Recognized by Bing, Google and Yahoo!
  • Immediate Payoff: People can find the right web pages (your web pages)

Ironic that when HTML came on the scene, any number of hypertext engines offered more complex and useful approaches to hypertext.

But the advantages of HTML were:

  • Easy: Used simple tags
  • Immediate Utility: Useful to the author
  • Immediate Payoff: Joins hypertext network for others to find (your web pages)

I think the third advantage in each case is the crucial one. We are vain enough that making our information more findable is a real incentive, if there is a reasonable expectation of it being found. Today or tomorrow. Not ten years from now.

Schema Design for Riak (Take 2)

Thursday, December 9th, 2010

Schema Design for Riak (Take 2)

Useful exercise in schema design in a NoSQL context.

No great surprise that focus on data and application requirements are the keys (sorry) to a successful deployment.

Amazing how often that gets repeated, at least in presentations.

Equally amazing how often that gets ignored in implementations (at least to judge from how often it is repeated in presentations).

Still, we all need reminders so it is worth the time to review the slides.