Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 11, 2013

PostgreSQL 9.3 released!

Filed under: Database,PostgreSQL,SQL — Patrick Durusau @ 5:13 pm

PostgreSQL 9.3 released!

From the post:

The PostgreSQL Global Development Group announces the release of PostgreSQL 9.3, the latest version of the world’s leading open source relational database system. This release expands PostgreSQL’s reliability, availability, and ability to integrate with other databases. Users are already finding that they can build applications using version 9.3 which would not have been possible before.

“PostgreSQL 9.3 provides features that as an app developer I can use immediately: better JSON functionality, regular expression indexing, and easily federating databases with the Postgres foreign data wrapper. I have no idea how I completed projects without 9.3,” said Jonathan S. Katz, CTO of VenueBook.

From the what’s new page, an item of particular interest:

Writeable Foreign Tables:

“Foreign Data Wrappers” (FDW) were introduced in PostgreSQL 9.1, providing a way of accessing external data sources from within PostgreSQL using SQL. The original implementation was read-only, but 9.3 will enable write access as well, provided the individual FDW drivers have been updated to support this. At the time of writing, only the Redis and PostgreSQL drivers have write support (need to verify this).

I haven’t gotten through the documentation on FDW but for data integration it sounds quite helpful.

Assuming you document the semantics of the data you are writing back and forth. 😉

This is a use case either for a topic map that spans both the local and “foreign” data sources, or for separate topic maps over the local and “foreign” data sources that can then be merged together.
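To make the write-back scenario concrete, here is a minimal sketch of what a writeable foreign table might look like with the postgres_fdw driver in 9.3; the server, credentials, and table names are assumptions for illustration.

  -- Load the PostgreSQL foreign data wrapper (shipped with 9.3).
  CREATE EXTENSION postgres_fdw;

  -- Register the "foreign" database (hypothetical host and names).
  CREATE SERVER remote_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'warehouse');

  CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_pg
    OPTIONS (user 'integrator', password 'secret');

  -- Declare the remote table locally; columns must match the remote definition.
  CREATE FOREIGN TABLE remote_customers (
    id     integer,
    name   text,
    region text
  ) SERVER remote_pg OPTIONS (schema_name 'public', table_name 'customers');

  -- With 9.3, writes can flow back to the remote source as well as reads.
  INSERT INTO remote_customers (id, name, region) VALUES (42, 'Acme', 'EU');
  SELECT region, count(*) FROM remote_customers GROUP BY region;

The semantics of id, name, and region on the remote side are exactly what the local definition claims they are, which is why documenting them (or mapping them in a topic map) matters as soon as two systems start writing to the same table.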

September 5, 2013

Stinger Phase 2:…

Filed under: Hive,Hortonworks,SQL,STINGER — Patrick Durusau @ 6:28 pm

Stinger Phase 2: The Journey to 100x Faster Hive on Hadoop by Carter Shanklin.

From the post:

The Stinger Initiative is Hortonworks’ community-facing roadmap laying out the investments Hortonworks is making to improve Hive performance 100x and evolve Hive to SQL compliance to simplify migrating SQL workloads to Hive.

We launched the Stinger Initiative along with Apache Tez to evolve Hadoop beyond its MapReduce roots into a data processing platform that satisfies the need for both interactive query AND petabyte scale processing. We believe it’s more feasible to evolve Hadoop to cover interactive needs rather than move traditional architectures into the era of big data.

If you don’t think SQL is all that weird, ;-), this is a status update for you!

Serious progress is being made by a broad coalition of more than 60 developers.

Take the challenge and download HDP 2.0 Beta.

You can help build the future of SQL-IN-Hadoop.

But only if you participate.

August 31, 2013

Google goes back to the future…

Filed under: NoSQL,SQL — Patrick Durusau @ 3:31 pm

Google goes back to the future with SQL F1 database by Jack Clark.

From the post:

The tech world is turning back toward SQL, bringing to a close a possibly misspent half-decade in which startups courted developers with promises of infinite scalability and the finest imitation-Google tools available, and companies found themselves exposed to unstable data and poor guarantees.

The shift has been going on quietly for some time, and tech leader Google has been tussling with the drawbacks of non-relational and non ACID-compliant systems for years. That struggle has demanded the creation of a new system to handle data at scale, and on Tuesday at the Very Large Data Base (VLDB) conference, Google delivered a paper outlining its much-discussed “F1” system, which has replaced MySQL as the distributed heart of the company’s hugely lucrative AdWords platform.

The AdWords system includes “100s of applications and 1000s of users,” which all share a database over 100TB serving up “hundreds of thousands of requests per second, and runs SQL queries that scan tens of trillions of data rows per day,” Google said. And it’s got five nines of availability.

(…)

F1 uses some of Google’s most advanced technologies, such as BigTable and the planet-spanning “Spanner” database, which F1 servers are co-located with for optimum use. Google describes it as “a hybrid, combining the best aspects of traditional relational databases and scalable NoSQL systems”.
(…)

I am wondering what the “RDBMS doesn’t do X well” parrots are going to say now?

The authors admit up front “trade-offs and sacrifices” were made. But when you meet your requirements while processing trillions of rows of data daily, you are entitled to “trade-offs and sacrifices.”

A very deep paper that will require background reading for most of us.

Looking forward to it.

August 21, 2013

Simple Hive ‘Cheat Sheet’ for SQL Users

Filed under: Hive,SQL — Patrick Durusau @ 4:51 pm

Simple Hive ‘Cheat Sheet’ for SQL Users by Marc Holmes.

From the post:

If you’re already familiar with SQL then you may well be thinking about how to add Hadoop skills to your toolbelt as an option for data processing.

From a querying perspective, using Apache Hive provides a familiar interface to data held in a Hadoop cluster and is a great way to get started. Apache Hive is data warehouse infrastructure built on top of Apache Hadoop for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).

Naturally, there are a bunch of differences between SQL and HiveQL, but on the other hand there are a lot of similarities too, and recent releases of Hive bring that SQL-92 compatibility closer still.

To highlight that – and as a bit of fun to get started – below is a simple ‘cheat sheet’ (based on a simple MySQL reference such as this one) for getting started with basic querying for Hive. Here, we’ve done a direct comparison to MySQL, but given the simplicity of these particular functions, then it should be the same in essentially any SQL dialect.

Of course, if you really want to get to grips with Hive, then take a look at the full language manual.
(…)

Definitely going to print this cheat sheet out and put it in plastic.

A top of the desk sort of reference.
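For a flavor of the side-by-side entries the cheat sheet contains, here is a minimal sketch of a query that reads almost identically in MySQL and HiveQL, plus a Hive-flavored table declaration; the table and columns are made up.

  -- Works essentially unchanged in MySQL and in Hive (HiveQL):
  SELECT region, count(*) AS order_count
  FROM   orders
  WHERE  order_date >= '2013-01-01'
  GROUP  BY region
  ORDER  BY order_count DESC
  LIMIT  10;

  -- Hive flavor: tables are commonly declared over delimited files in HDFS.
  CREATE TABLE orders (
    order_id   INT,
    region     STRING,
    order_date STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;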

July 18, 2013

Mapping Wikipedia – Update

Filed under: MySQL,SQL,Topic Maps,Wikipedia — Patrick Durusau @ 8:45 pm

I have spent a good portion of today trying to create an image of the MediaWiki table structure.

While I think the SQLFairy (aka SQL Translator) is going to work quite well, it has rather cryptic error messages.

For instance, if the SQL syntax isn’t supported by its internal parser, the error message references the start of the table.

Which means, of course, that you have to compare statements in the table to the subset of SQL that is supported.

I am rapidly losing my SQL parsing skills as the night wears on so I am stopping with a little over 50% of the MediaWiki schema parsing.

Hopefully will finish correcting the SQL file tomorrow and will post the image of the MediaWiki schema.

Plus notes on what I found to not be recognized in SQLFairy to ease your use of it on other SQL schemas.

May 24, 2013

Postgres Demystified

Filed under: PostgreSQL,SQL — Patrick Durusau @ 2:12 pm

From the description:

Postgres has long been known as a stable database product that reliably stores your data. However, in recent years it has picked up many features, allowing it to become a much sexier database.

This video covers a whirlwind of Postgres features, which highlight why you should consider it for your next project. These include:

  • Datatypes
  • Using other languages within Postgres
  • Extensions, including NoSQL inside your SQL database
  • Accessing your non-Postgres data (Redis, Oracle, MySQL) from within Postgres
  • Window Functions

Craig Kerstiens does a fast-paced overview of Postgres.

May 18, 2013

Apache Hive 0.11: Stinger Phase 1 Delivered

Filed under: Hadoop,Hive,SQL,STINGER — Patrick Durusau @ 3:47 pm

Apache Hive 0.11: Stinger Phase 1 Delivered by Owen O’Malley.

From the post:

In February, we announced the Stinger Initiative, which outlined an approach to bring interactive SQL-query into Hadoop. Simply put, our choice was to double down on Hive to extend it so that it could address human-time use cases (i.e. queries in the 5-30 second range). So, with input and participation from the broader community we established a fairly audacious goal of 100X performance improvement and SQL compatibility.

Introducing Apache Hive 0.11 – 386 JIRA tickets closed

As representatives of this open, community led effort we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, of which there were 28 new features and 276 bug fixes. There were FIFTY-FIVE developers involved in this and I would like to thank every one of them. See below for a full list.

Delivering on the promise of Stinger Phase 1

As promised we have delivered phase 1 of the Stinger Initiative in late spring. This release is another proof point that the open community can innovate at a rate unequaled by any proprietary vendor. As part of phase 1 we promised windowing, new data types, the optimized RC (ORC) file and base optimizations to the Hive Query engine and the community has delivered these key features.

Stinger

Welcome news for the Hive and SQL communities alike!

May 15, 2013

PostgreSQL 9.3 Beta 1 Released

Filed under: PostgreSQL,SQL — Patrick Durusau @ 3:22 pm

PostgreSQL 9.3 Beta 1 Released

From the post:

The first beta release of PostgreSQL 9.3, the latest version of the world’s best open source database, is now available. This beta contains previews of all of the features which will be available in version 9.3, and is ready for testing by the worldwide PostgreSQL community. Please download, test, and report what you find.

Major Features

The major features available for testing in this beta include:

  • Writeable Foreign Tables, enabling pushing data to other databases
  • pgsql_fdw driver for federation of PostgreSQL databases
  • Automatically updatable VIEWs
  • MATERIALIZED VIEW declaration
  • LATERAL JOINs
  • Additional JSON constructor and extractor functions
  • Indexed regular expression search
  • Disk page checksums to detect filesystem failures

In 9.3, PostgreSQL has greatly reduced its requirement for SysV shared memory, changing to mmap(). This allows easier installation and configuration of PostgreSQL, but means that we need our users to rigorously test and ensure that no memory management issues have been introduced by the change. We also request that users spend extra time testing the improvements to Foreign Key locks.

If that isn’t enough features for you to test, see the full announcement! 😉
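If you want something quick to try against the beta, here is a minimal sketch exercising two of the listed features, the new JSON extractor operators and LATERAL joins; the tables and JSON shape are assumptions for illustration.

  -- JSON extractors (9.3): pull fields out of a json column.
  SELECT doc ->> 'name'                 AS name,
         (doc -> 'address') ->> 'city'  AS city
  FROM   events
  WHERE  doc ->> 'type' = 'signup';

  -- LATERAL join (9.3): the subquery may reference columns of the outer table.
  SELECT c.id, recent.order_id, recent.total
  FROM   customers c,
         LATERAL (SELECT o.order_id, o.total
                  FROM   orders o
                  WHERE  o.customer_id = c.id
                  ORDER  BY o.created_at DESC
                  LIMIT  3) AS recent;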

April 24, 2013

Fast Database Emerges from MIT Class… [Think TweetMap]

Filed under: GPU,MapD,SQL — Patrick Durusau @ 4:39 pm

Fast Database Emerges from MIT Class, GPUs and Student’s Invention by Ian B. Murphy.

Details the invention of MapD by Todd Mostak.

From the post:

MapD, At A Glance:

MapD is a new database in development at MIT, created by Todd Mostak.

  • MapD stands for “massively parallel database.”
  • The system uses graphics processing units (GPUs) to parallelize computations. Some statistical algorithms run 70 times faster compared to CPU-based systems like MapReduce.
  • A MapD server costs around $5,000 and runs on the same power as five light bulbs.
  • MapD runs at between 1.4 and 1.5 teraflops, roughly equal to the fastest supercomputer in 2000.
  • MapD uses SQL to query data.
  • Mostak intends to take the system open source sometime in the next year.

Sam Madden (MIT) describes MapD this way:

Madden said there are three elements that make Mostak’s database a disruptive technology. The first is the millisecond response time for SQL queries across “huge” datasets. Madden, who was a co-creator of the Vertica columnar database, said MapD can do in milliseconds what Vertica can do in minutes. That difference in speed is everything when doing iterative research, he said.

The second is the very tight coupling between data processing and visually rendering the data; this is a byproduct of building the system from GPUs from the beginning. That adds the ability to visualize the results of the data processing in under a second. Third is the cost to build the system. MapD runs in a server that costs around $5,000.

“He can do what a 1000 node MapReduce cluster would do on a single processor for some of these applications,” Madden said.

Not a lot of technical detail but you could start learning CUDA while waiting for the open source release.

At 1.4 to 1.5 teraflops on $5,000 worth of hardware, how will clusters retain their customer base?

Welcome to TweetMap ALPHA

Filed under: GPU,Maps,SQL,Tweets — Patrick Durusau @ 3:57 pm

Welcome to TweetMap ALPHA

From the introduction popup:

TweetMap is an instance of MapD, a massively parallel database platform being developed through a collaboration between Todd Mostak, (currently a researcher at MIT), and the Harvard Center for Geographic Analysis (CGA).

The tweet database presented here starts on 12/10/2012 and ends 12/31/2012. Currently 95 million tweets are available to be queried by time, space, and keyword. This could increase to billions and we are working on real time streaming from tweet-tweeted to tweet-on-the-map in under a second.

MapD is a general purpose SQL database that can be used to provide real-time visualization and analysis of just about any very large data set. MapD makes use of commodity Graphic Processing Units (GPUs) to parallelize hard compute jobs such as that of querying and rendering very large data sets on-the-fly.

This is a real treat!

Try something popular, like “gaga,” without the quotes.

Remember this is running against 95 million tweets.

Impressive! Yes?

April 19, 2013

Schema on Read? [The virtues of schema on write]

Filed under: BigData,Database,Hadoop,Schema,SQL — Patrick Durusau @ 3:48 pm

Apache Hadoop and Data Agility by Ofer Mendelevitch.

From the post:

In a recent blog post I mentioned the 4 reasons for using Hadoop for data science. In this blog post I would like to dive deeper into the last of these reasons: data agility.

In most existing data architectures, based on relational database systems, the data schema is of central importance, and needs to be designed and maintained carefully over the lifetime of the project. Furthermore, whatever data fits into the schema will be stored, and everything else typically gets ignored and lost. Changing the schema is a significant undertaking, one that most IT organizations don’t take lightly. In fact, it is not uncommon for a schema change in an operational RDBMS system to take 6-12 months if not more.

Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.

If a schema is supplied “on read,” how is data validation accomplished?

I don’t mean in terms of datatypes such as string, integer, double, etc. Those are trivial forms of data validation.

How do we validate the semantics of data when a schema is supplied on read?

Mistakes do happen in RDBMS systems but with a schema, which defines data semantics, applications can attempt to police those semantics.

I don’t doubt that schema “on read” supplies a lot of useful flexibility, but how do we limit the damage that flexibility can cause?

For example, many years ago, area codes (for telephones) in the USA were tied to geographic exchanges. Data from the era still exists in the bowels of some data stores. That is no longer true in many cases.

Let’s assume I have older data that has area codes tied to geographic areas and newer data that has area codes that are not. Without a schema to define the area code data in both cases, how would I know to treat the area code data differently?

I concede that schema “on read” can be quite flexible.

On the other hand, let’s not discount the value of schema “on write” as well.
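To make the contrast concrete, here is a minimal sketch of schema on read in Hive: the files already sit in HDFS, and the table definition is only an interpretation applied at query time. Nothing in it records whether area_code carries the old geographic semantics or the newer portable ones; that knowledge has to live somewhere else, in documentation, a companion column, or a topic map. The paths and columns are made up.

  -- Schema on read: this statement moves no data, it only declares how
  -- existing files in HDFS should be interpreted when queried.
  CREATE EXTERNAL TABLE call_records (
    phone_number STRING,
    area_code    STRING,
    called_at    STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/telecom/call_records';

  -- The semantics of area_code are invisible to the query engine; one crude
  -- workaround is to carry the vintage of the data explicitly in the query.
  SELECT area_code, count(*)
  FROM   call_records
  WHERE  called_at < '1998-01-01'   -- pre-portability data, per external documentation
  GROUP  BY area_code;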

April 14, 2013

Planform:… [Graph vs. SQL?]

Filed under: Bioinformatics,Graphs,SQL,SQLite — Patrick Durusau @ 3:16 pm

Planform: an application and database of graph-encoded planarian regenerative experiments by Daniel Lobo, Taylor J. Malone and Michael Levin. Bioinformatics (2013) 29 (8): 1098-1100. doi: 10.1093/bioinformatics/btt088

Abstract:

Summary: Understanding the mechanisms governing the regeneration capabilities of many organisms is a fundamental interest in biology and medicine. An ever-increasing number of manipulation and molecular experiments are attempting to discover a comprehensive model for regeneration, with the planarian flatworm being one of the most important model species. Despite much effort, no comprehensive, constructive, mechanistic models exist yet, and it is now clear that computational tools are needed to mine this huge dataset. However, until now, there is no database of regenerative experiments, and the current genotype–phenotype ontologies and databases are based on textual descriptions, which are not understandable by computers. To overcome these difficulties, we present here Planform (Planarian formalization), a manually curated database and software tool for planarian regenerative experiments, based on a mathematical graph formalism. The database contains more than a thousand experiments from the main publications in the planarian literature. The software tool provides the user with a graphical interface to easily interact with and mine the database. The presented system is a valuable resource for the regeneration community and, more importantly, will pave the way for the application of novel artificial intelligence tools to extract knowledge from this dataset.

Availability: The database and software tool are freely available at http://planform.daniel-lobo.com.

Watch the video tour for an example of a domain specific authoring tool.

It does not use any formal graph notation/terminology or attempt a new form of ASCII art.

Users can enter data about worms with four (4) heads. That bodes well for new techniques to author topic maps.

On the use of graphs, the authors write:

We have created a formalism based on graphs to encode the resultant morphologies and manipulations of regenerative experiments (Lobo et al., 2013). Mathematical graphs are ideal to encode relationships between individuals and have been previously used to encode morphologies (Lobo et al., 2011). The formalism divided a morphology into adjacent regions (graph nodes) connected to each other (graph edges). The geometrical characteristics of the regions (connection angles, distances, shapes, type, etc.) are stored as node and link labels. Importantly, the formalism permits automatic comparisons between morphologies: we implemented a metric to quantify the difference between two morphologies based on the graph edit distance algorithm.

The experiment manipulations are encoded in a tree structure. Nodes represent specific manipulations (cuts, irradiation and transplantations) where links define the order and relations between manipulations. This approach permits encoding the majority of published planarian regenerative experiments.

The graph vs. relational crowd will be disappointed to learn the project uses SQLite (“the most widely deployed SQL database engine in the world”) for the storage/access to its data. 😉

You were aware that hypergraphs were used to model relational databases in the “old days.” Yes?

I will try to pull together some of those publications in the near future.

April 7, 2013

Phoenix in 15 Minutes or Less

Filed under: HBase,Phoenix,SQL — Patrick Durusau @ 3:50 pm

Phoenix in 15 Minutes or Less by Justin Kestelyn.

An amusing FAQ by “James Taylor of Salesforce, which recently open-sourced its Phoenix client-embedded JDBC driver for low-latency queries over HBase.”

From the post:

What is this new Phoenix thing I’ve been hearing about?
Phoenix is an open source SQL skin for HBase. You use the standard JDBC APIs instead of the regular HBase client APIs to create tables, insert data, and query your HBase data.

Doesn’t putting an extra layer between my application and HBase just slow things down?
Actually, no. Phoenix achieves as good or likely better performance than if you hand-coded it yourself (not to mention with a heck of a lot less code) by:

  • compiling your SQL queries to native HBase scans
  • determining the optimal start and stop for your scan key
  • orchestrating the parallel execution of your scans
  • bringing the computation to the data by
    • pushing the predicates in your where clause to a server-side filter
    • executing aggregate queries through server-side hooks (called co-processors)

In addition to these items, we’ve got some interesting enhancements in the works to further optimize performance:

  • secondary indexes to improve performance for queries on non row key columns
  • stats gathering to improve parallelization and guide choices between optimizations
  • skip scan filter to optimize IN, LIKE, and OR queries
  • optional salting of row keys to evenly distribute write load

…..

Sounds authentic to me!

You?
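For a sense of what the SQL skin looks like in practice, here is a minimal sketch of Phoenix-style statements issued through its JDBC driver; the table and columns are made up, and note that Phoenix uses UPSERT rather than INSERT.

  -- Phoenix maps this DDL onto an HBase table; the primary key becomes the row key.
  CREATE TABLE page_views (
    host       VARCHAR NOT NULL,
    view_date  DATE    NOT NULL,
    views      BIGINT
    CONSTRAINT pk PRIMARY KEY (host, view_date)
  );

  UPSERT INTO page_views (host, view_date, views)
    VALUES ('example.com', TO_DATE('2013-04-01 00:00:00'), 42);

  -- Compiled to native HBase scans, with the WHERE clause pushed to server-side filters.
  SELECT host, SUM(views)
  FROM   page_views
  WHERE  view_date >= TO_DATE('2013-04-01 00:00:00')
  GROUP  BY host;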

March 27, 2013

Apache Tajo

Filed under: Apache Tajo,HDFS,SQL — Patrick Durusau @ 12:31 pm

Apache Tajo

From the webpage:

Introduction

Tajo is a relational and distributed data warehouse system for Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation and ETL on large-data sets by leveraging advanced database techniques. It supports SQL standards. Tajo uses HDFS as a primary storage layer and has its own query engine which allows direct control of distributed execution and data flow. As a result, Tajo has a variety of query evaluation strategies and more optimization opportunities. In addition, Tajo will have a native columnar execution and its optimizer.

Features

  • Fast and low-latency query processing on SQL queries including projection, filter, group-by, sort, and join.
  • Rudimentary ETL that transforms one data format to another data format.
  • Support various file formats, such as CSV, RCFile, RowFile (a row store file), and Trevni.
  • Command line interface to allow users to submit SQL queries
  • Java API to enable clients to submit SQL queries to Tajo

If you ever wanted to get in on the ground floor of a data warehouse project, this could be your chance!

I first saw this at Apache Incubator: Tajo – a Relational and Distributed Data Warehouse for Hadoop by Alex Popescu.

Database Landscape Map – February 2013

Filed under: Database,Graph Databases,Key-Value Stores,NoSQL,Software,SQL — Patrick Durusau @ 11:55 am

Database Landscape Map – February 2013 by 451 Research.

Database map

A truly awesome map of available databases.

Originated from Neither fish nor fowl: the rise of multi-model databases by Matthew Aslett.

Matthew writes:

One of the most complicated aspects of putting together our database landscape map was dealing with the growing number of (particularly NoSQL) databases that refuse to be pigeon-holed in any of the primary databases categories.

I have begun to refer to these as “multi-model databases” in recognition of the fact that they are able to take on the characteristics of multiple databases. In truth though there are probably two different groups of products that could be considered “multi-model”:

I think I understand the grouping from the key to the map but the ordering within groups, if meaningful, escapes me.

I am sure you will recognize most of the names but equally sure there will be some you can’t quite describe.

Enjoy!

March 25, 2013

HOWTO use Hive to SQLize your own Tweets…

Filed under: Hive,SQL,Tweets — Patrick Durusau @ 2:59 am

HOWTO use Hive to SQLize your own Tweets – Part One: ETL and Schema Discovery by Russell Jurney.

HOWTO use Hive to SQLize your own Tweets – Part Two: Loading Hive, SQL Queries

Russell walks you through extracting your tweets, discovering their schema, loading them into Hive and querying the result.

I just requested my tweets on Friday so expect to see them tomorrow or Tuesday.

Will be a bit more complicated than Russell’s example because I re-post tweets about older posts on my blog.

I will have to delete those, although I may want to know when a particular tweet appeared, which means I will need to capture the date(s) on which each tweet was posted.
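As a rough idea of the query that would flag my re-posted tweets, here is a minimal sketch assuming the archive has already been loaded into a Hive table along the lines Russell describes; the table and column names are assumptions.

  -- Find tweet texts that appear more than once, keeping every date they appeared.
  SELECT tweet_text,
         min(created_at)          AS first_posted,
         collect_set(created_at)  AS all_post_dates,
         count(*)                 AS times_posted
  FROM   my_tweets
  GROUP  BY tweet_text
  HAVING count(*) > 1;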

BTW, if you do obtain your tweet archive, consider donating it to #Tweets4Science.

March 15, 2013

YSmart: Yet Another SQL-to-MapReduce Translator

Filed under: MapReduce,SQL — Patrick Durusau @ 4:30 pm

YSmart: Yet Another SQL-to-MapReduce Translator by Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, Xiaodong Zhang.

Abstract:

MapReduce has become an effective approach to big data analytics in large cluster systems, where SQL-like queries play important roles to interface between users and systems. However, based on our Facebook daily operation results, certain types of queries are executed at an unacceptably low speed by Hive (a production SQL-to-MapReduce translator). In this paper, we demonstrate that existing SQL-to-MapReduce translators that operate in a one-operation-to-one-job mode and do not consider query correlations cannot generate high-performance MapReduce programs for certain queries, due to the mismatch between complex SQL structures and simple MapReduce framework. We propose and develop a system called YSmart, a correlation aware SQL-to-MapReduce translator. YSmart applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query. YSmart can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators. We have implemented YSmart with intensive evaluation for complex queries on two Amazon EC2 clusters and one Facebook production cluster. The results show that YSmart can outperform Hive and Pig, two widely used SQL-to-MapReduce translators, by more than four times for query execution.

Just in case you aren’t picking the videos for this weekend.

Alex Popescu points this paper out at: Paper: YSmart – Yet Another SQL-to-MapReduce Translator.

March 8, 2013

Why FoundationDB Might Be All It’s Cracked Up To Be

Filed under: FoundationDB,NoSQL,SQL — Patrick Durusau @ 3:30 pm

Why FoundationDB Might Be All It’s Cracked Up To Be by Doug Turnbull.

From the post:

When I first heard about FoundationDB, I couldn’t imagine how it could be anything but vaporware. Seemed like Unicorns crapping happy rainbows to solve all your problems. As I’m learning more about it though, I realize it could actually be something ground breaking.

NoSQL: Lets Review…

So, I need to step back and explain one reason NoSQL databases have been revolutionary. In the days of yore, we used to normalize all our data across multiple tables on a single database living on a single machine. Unfortunately, Moore’s law eventually crapped out and maybe more importantly hard drive space stopped increasing massively. Our data and demands on it only kept growing. We needed to start trying to distribute our database across multiple machines.

Turns out, it’s hard to maintain transactionality in a distributed, heavily normalized SQL database. As such, a lot of NoSQL systems have emerged with simpler features, many promoting a model based around some kind of single row/document/value that can be looked up/inserted with a key. Transactionality for these systems is limited to a single key value entry (“row” in Cassandra/HBase or “document” in Mongo/Couch — we’ll just call them rows here). Rows are easily stored in a single node, although we can replicate this row to multiple nodes. Despite being replicated, it turns out transactionally working with single rows in distributed NoSQL is easier than guaranteeing transactionality of an SQL query visiting potentially many SQL tables in a distributed system.

There are deep design ramifications/limitations to the transactional nature of rows. First you always try to cram a lot of data related to the row’s key into a single row, ending up with massive rows of hierarchical or flat data that all relates to the row key. This lets you cover as much data as possible under the row-based transactionality guarantee. Second, as you only have a single key to use from the system, you must choose very wisely what your key will be. You may need to think hard how your data will be looked up through its whole life, it can be hard to go back. Additionally, if you need to lookup on a secondary value, you better hope that your database is friendly enough to have a secondary key feature or otherwise you’ll need to maintain a secondary row for storing the relationship. Then you have the problem of working across two rows, which doesn’t fit in the transactionality guarantee. Third, you might lose the ability to perform a join across multiple rows. In most NoSQL data stores, joining is discouraged and denormalization into large rows is the encouraged best practice.

FoundationDB Is Different

FoundationDB is a distributed, sorted key-value store with support for arbitrary transactions across multiple key-values — multiple “rows” — in the database.

As Doug points out, there is much left to be known.

Still, exciting to have something new to investigate.

March 6, 2013

PolyBase

Filed under: Hadoop,HDFS,MapReduce,PolyBase,SQL,SQL Server — Patrick Durusau @ 11:20 am

PolyBase

From the webpage:

PolyBase is a fundamental breakthrough in data processing used in SQL Server 2012 Parallel Data Warehouse to enable truly integrated query across Hadoop and relational data.

Complementing Microsoft’s overall Big Data strategy, PolyBase is a breakthrough new technology on the data processing engine in SQL Server 2012 Parallel Data Warehouse designed as the simplest way to combine non-relational data and traditional relational data in your analysis. While customers would normally burden IT to pre-populate the warehouse with Hadoop data or undergo an extensive training on MapReduce in order to query non-relational data, PolyBase does this all seamlessly giving you the benefits of “Big Data” without the complexities.

I must admit I had my hopes up for the videos labeled: “Watch informative videos to understand PolyBase.”

But the first one was only 2:52 in length and the second was about the Jim Gray Systems Lab (2:13).

So, fair to say it was short on details. 😉

The closest thing I found to a clue was in the PolyBase datasheet (under PolyBase Use Cases, if you are reading along), which says:

PolyBase introduces the concept of external tables to represent data residing in HDFS. An external table defines a schema (that is, columns and their types) for data residing in HDFS. The table’s metadata lives in the context of a SQL Server database and the actual table data resides in HDFS.

I assume that means the same data in HDFS could have multiple external tables defined over it, depending upon the query?

Curious if the external tables and/or data types are going to have MapReduce capabilities built-in? To take advantage of parallel processing of the data?

BTW, for topic map types, subject identities for the keys and data types would be the same as with more traditional “internal” tables. In case you want to merge data.

Just out of curiosity, any thoughts on possible IP on external schemas being applied to data?
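Purely as a sketch of the external-table concept (I have not been able to verify the exact PDW syntax from the videos or the datasheet, so the option names below are assumptions on my part), the declaration and a cross-source join might look roughly like this:

  -- Hypothetical sketch: the schema lives in SQL Server, the data stays in HDFS.
  CREATE EXTERNAL TABLE dbo.web_clicks (
    user_id    INT,
    url        NVARCHAR(400),
    clicked_at DATETIME
  )
  WITH (
    LOCATION = 'hdfs://namenode:8020/data/clicks/',   -- assumed option name
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')           -- assumed option name
  );

  -- Once declared, the HDFS-resident data can be joined to ordinary relational tables.
  SELECT c.url, count(*) AS clicks
  FROM   dbo.web_clicks AS c
  JOIN   dbo.customers  AS k ON k.user_id = c.user_id
  GROUP  BY c.url;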

I first saw this at Alex Popescu’s Microsoft PolyBase: Unifying Relational and Non-Relational Data.

March 3, 2013

Project Panthera…

Filed under: Hadoop,SQL — Patrick Durusau @ 1:38 pm

Project Panthera: Better Analytics with SQL and Hadoop

Another Intel project focused on Hadoop.

From the project page:

We have worked closely with many enterprise users over the past few years to enhance their new data analytics platforms using the Hadoop stack. Increasingly, these platforms have evolved from a batch-style, custom-built system for unstructured data, to become an integral component of the enterprise application framework. While the Hadoop stack provides a solid foundation for these platforms, gaps remain; in particular, enterprises are looking for full SQL support to seamlessly integrate these new platforms into their existing enterprise data analytics infrastructure. Project Panthera is our open source efforts to provide efficient support of standard SQL features on Hadoop, so as to enable many important, advanced use cases not supported by Hadoop today, including:

  • Exploring data with complex and sophisticated SQL queries (such as nested subqueries with aggregation functions) – for instance, about half of the queries in TPC-H (a standard decision support benchmark) use subqueries
  • Efficient storage engine for high-update rate SQL query workloads – while HBase is often used to support such workloads, query processing (e.g., Hive) on HBase can incur significant overheads as the storage engine completely ignores the SQL relational model
  • Utilizations of new hardware platform technologies (e.g., new flash technologies and large RAM capacities available in modern servers) for efficient SQL query processing

The objective of Project Panthera is to collaborate with the larger Hadoop community in enhancing the SQL support of the platform for a broader set of use cases. We are building these new capabilities on top of the Hadoop stack, and contributing necessary improvements of the underlying stack back to the existing Apache Hadoop projects. Our initial goals are:

SQL is still alive! Who knew? 😉

A good example of new technologies not replacing old ones, but being grafted onto them.

With that grafting, semantic impedance between the systems remains.

You can remap over that impedance on an ad hoc and varying basis.

Or, you can create mapping today that can be re-used tomorrow.

Which sounds like a better option to you?
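For a concrete picture of the first use case quoted above, nested subqueries with aggregation functions, here is a sketch of the TPC-H-style query shape involved; the table and column names are illustrative, not taken from the project.

  -- Correlated subquery with an aggregate: parts ordered in quantities
  -- more than twice their own average ordered quantity.
  SELECT l.partkey, l.quantity
  FROM   lineitem l
  WHERE  l.quantity > (SELECT 2 * avg(l2.quantity)
                       FROM   lineitem l2
                       WHERE  l2.partkey = l.partkey);

Hive of the day could not run this form directly; it had to be rewritten by hand into joins, which is the sort of gap Panthera aims to close.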

March 1, 2013

Pig Eye for the SQL Guy

Filed under: Hadoop,MapReduce,Pig,SQL — Patrick Durusau @ 5:33 pm

Pig Eye for the SQL Guy by Cat Miller.

From the post:

For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.

As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, often the amounts of data are so large that working with them on a traditional SQL database is so non-performant as to be impractical.

Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)

This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.

Do you speak SQL?

Want to learn to speak Pig?

This is the right post for you!
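As a taste of the correspondence the guide walks through, here is a minimal sketch pairing an everyday SQL aggregation with a rough Pig Latin equivalent (shown as comments, since the mapping is the point); the data file and fields are made up.

  -- SQL: orders per customer.
  SELECT customer_id, count(*) AS order_count
  FROM   orders
  GROUP  BY customer_id;

  -- Rough Pig Latin equivalent: each step names a relation rather than
  -- producing a final result set, and schemas are declared at load time.
  --   orders  = LOAD '/data/orders.csv' USING PigStorage(',')
  --             AS (order_id:int, customer_id:int, total:float);
  --   grouped = GROUP orders BY customer_id;
  --   counts  = FOREACH grouped GENERATE group AS customer_id, COUNT(orders) AS order_count;
  --   DUMP counts;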

February 21, 2013

Teradata Announces Aster Discovery Platform

Filed under: SQL,Teradata,Visualization — Patrick Durusau @ 8:53 pm

Teradata Announces Aster Discovery Platform

Teradata didn’t get the memo about February 20th being performance day either. So that makes two of us. 😉

From the post:

Teradata today introduced Teradata Aster Discovery Platform 5.10, a discovery solution with more than 20 new big data analytic capabilities, including purpose-built visualizations.

The platform was designed for customers to acquire, prepare, analyze, and visualize petabyte-sized volumes of multi-structured data in a single platform with a single structured query language (SQL) command.

“Existing and aspiring data scientists should take note. The Teradata Aster Discovery Platform is full of new capabilities that can empower them to accelerate their innovation and supply new options to their business users,” said Scott Gnau , president, Teradata Labs.

Teradata’s open platform is a suite of integrated hardware, software, and best-of-breed partner solutions, using business intelligence (BI), data integration, analytics, and visualization tools. It was built for use by any SQL-savvy analyst or business user, while being powerful and flexible enough for the most sophisticated data scientists.

“With newly added analytics and visualization functionality, the Teradata Aster Discovery Platform offers the convenience of a ‘data scientist in a box,'” said Dan Vesset, program vice president of business analytics and big data, IDC. “Much of the market attention has been on vendors trying to build SQL engines on Hadoop. Teradata Aster Discovery Platform already provides an ANSI SQL-compliant method with its SQL-MapReduce framework to acquire, prepare, analyze, and visualize data from any data source including Hadoop. Without the need to integrate multiple point solutions, customers using this Teradata technology are able to accelerate the discovery process and visualize information in new and exciting ways, and to focus the scarce expertise of data scientists on highest value- added tasks.”

Or maybe they did.

Aster Discovery Platform 5.10 will appear by the end of the 2nd quarter 2013.

See the post for a nice summary of coming features.

BTW, I hope you still have your SQL books. Looks like SQL is making a serious comeback. 😉

February 20, 2013

Cascading into Hadoop with SQL

Filed under: Cascading,Hadoop,Lingual,SQL — Patrick Durusau @ 9:24 pm

Cascading into Hadoop with SQL by Nicole Hemsoth.

From the post:

Today Concurrent, the company behind the Cascading Hadoop abstraction framework, announced a new trick to help developers tame the elephant.

The company, which is focused on simplifying Hadoop, has introduced a SQL parser that sits on top of Cascading with a JDBC Interface. Concurrent says that they’ll be pushing it out over the next couple of weeks with hopes that developers will take it under their wing and support the project.

According to the company’s CTO and founder, Chris Wensel, the goal is to get the community to rally around a new way to let non-programmers make use of data that’s locked in Hadoop clusters and let them more easily move applications onto Hadoop clusters.

The newly-announced approach to extending the abstraction is called Lingual, which is aimed at putting Hadoop within closer sights for those familiar with SQL, JDBC and traditional BI tools. It provides what the company calls “true SQL for Cascading and Hadoop” to enable easier creation and running of applications on Hadoop and again, to tap into that growing pool of Hadoop-seekers who lack the expertise to back mission-critical apps on the platform.

Wensel says that Lingual’s goal is to provide an ANSI-standard SQL interface that is designed to play well with all of the big name distros running on site or in cloud environments. This will allow a “cut and paste” capability for existing ANSI SQL code from traditional data warehouses so users can access data that’s locked away on a Hadoop cluster. It’s also possible to query and export data from Hadoop right into a wide range of BI tools.

Another example of meeting a large community of users where they are, not where you would like for them to be.

Targeting a market that already exists is easier than building a new one from the ground up.

February 19, 2013

Really Large Queries: Advanced Optimization Techniques, Feb. 27

Filed under: MySQL,Performance,SQL — Patrick Durusau @ 11:10 am

Percona MySQL Webinar: Really Large Queries: Advanced Optimization Techniques, Feb. 27 by Peter Boros.

From the post:

Do you have a query you never dared to touch?
Do you know it’s bad, but it’s needed?
Does it fit your screen?
Does it really have to be that expensive?
Do you want to do something about it?

During the next Percona webinar on February 27, I will present some techniques that can be useful when troubleshooting such queries. We will go through case studies (each case study is made from multiple real-world cases). In these cases we were often able to reduce query execution time from 10s of seconds to a fraction of a second.

If you have SQL queries in your work flow, this will definitely be of interest.

January 30, 2013

SQL, NoSQL =? CoSQL? Category Theory to the Rescue

Filed under: Category Theory,Database,NoSQL,SQL,TMRM — Patrick Durusau @ 8:44 pm

A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman.

I missed this when it appeared in March of 2011.

From the conclusion:

The nascent noSQL market is extremely fragmented, with many competing vendors and technologies. Programming, deploying, and managing noSQL solutions requires specialized and low-level knowledge that does not easily carry over from one vendor’s product to another.

A necessary condition for the network effect to take off in the noSQL database market is the availability of a common abstract mathematical data model and an associated query language for noSQL that removes product differentiation at the logical level and instead shifts competition to the physical and operational level. The availability of such a common mathematical underpinning of all major noSQL databases can provide enough critical mass to convince businesses, developers, educational institutions, etc. to invest in noSQL.

In this article we developed a mathematical data model for the most common form of noSQL—namely, key-value stores as the mathematical dual of SQL’s foreign-/primary-key stores. Because of this deep and beautiful connection, we propose changing the name of noSQL to coSQL. Moreover, we show that monads and monad comprehensions (i.e., LINQ) provide a common query mechanism for both SQL and coSQL and that many of the strengths and weaknesses of SQL and coSQL naturally follow from the mathematics.

The ACM Digital Library reports only 3 citations, which is unfortunate for such an interesting proposal.

I have heard about key/value pairs somewhere else. I will have to think about that and get back to you. (Hint for the uninitiated, try the Topic Maps Reference Model (TMRM). A new draft of the TMRM is due to appear in a week or so.)

December 2, 2012

A Rickety Stairway to SQL Server Data Mining, Part 0.1: Data In, Data Out

Filed under: Data Mining,SQL,SQL Server — Patrick Durusau @ 7:46 pm

A Rickety Stairway to SQL Server Data Mining, Part 0.1: Data In, Data Out

A rather refreshing if anonymous take on statistics and data mining.

Since I can access SQL Servers in the cloud (without the necessity of maintaining a local Windows Server box), thought I should look at data mining for SQL Servers.

This was one of the first posts I encountered.

In the first of a series of amateur tutorials on SQL Server Data Mining (SSDM), I promised to pull off an impossible stunt: explaining the broad field of statistics in a few paragraphs without the use of equations. What other SQL Server blog ends with a cliffhanger like that? Anyone who aims at incorporating data mining into their IT infrastructure or skill set in any substantial way is going to have to learn to interpret equations, but it is possible to condense a few key statistical concepts in a way that will help those who aren’t statisticians – like me – to make productive use of SSDM without them. These crude Cliff’s Notes can at least familiarize DBAs, programmers and other readers of these tutorials with the minimal bare bones concepts they will need to know in order to interpret the data output by SSDM’s nine algorithms, as well as to illuminate the inner workings of the algorithms themselves. Without that minimal foundation, it will be more difficult to extract useful meaning from your data mining efforts.

The first principle to keep in mind is so absurdly obvious that it is often half-consciously forgotten – perhaps because it is right before our noses – but it is indispensable to understanding both the field of statistics and the stats output by SSDM. To wit, the numbers signify something. Some intelligence assigned meaning to them. One of the biggest hurdles when interpreting statistical data, reading equations or learning a foreign language is the subtle, almost subconscious error of forgetting that these symbols reflect ideas in the head of another conscious human being, which probably correspond to ideas that you also have in your head, but simply lack the symbols to express. An Englishman learning to read or write Spanish, Portuguese, Russian or Polish may often forget that the native speakers of these languages are trying to express the exact same concepts that an English speaker would; they have the exact same ideas in their heads as we do, but communicate them quite differently. Quite often, the seemingly incoherent quirks and rules of a particular foreign language may actually be part of a complex structure designed to convey identical, ordinary ideas in a dissimilar, extraordinary way. It is the same way with mathematical equations: the scientists and mathematicians who use them are trying to convey ideas in the most succinct way they know. It is often easier for laymen to understand the ideas and supporting evidence that those equations are supposed to express, when they’re not particularly well-versed in the detailed language that equations represent. I’m a layman, like some of my readers probably are. My only claim to expertise in this area is that when I was in fourth grade, I learned enough about equations to solve the ones my father, a college physics teacher, taught every week – but then I forgot it all, so I found myself back at Square One when I took up data mining a few years back.

On a side note, it would be wise for anyone who works with equations regularly to consciously remind themselves that they are merely symbols representing ideas, rather than the other way around; a common pitfall among physicists and other scientists who work with equations regularly seems to be the Pythagorean heresy, i.e. the quasi-religious belief that reality actually consists of mathematical equations. It doesn’t. If we add two apples to two apples, we end up with four apples; the equation 2 + 2 = 4 expresses the nature and reality of several apples, rather than the apples merely being a stand-in for the equation. Reality is not a phantom that obscures some deep, dark equation underlying all we know; math is simply a shortcut to expressing certain truths about the external world. This danger is magnified when we pile abstraction on top of abstraction, which may lead to the construction of ivory towers that eventually fall, often spectacularly. This is a common hazard in the field of finance, where our economists often forget that money is just an abstraction based on agreements among large numbers of people to assign certain meanings to it that correspond to tangible, physical goods; all of the periodic financial crashes that have plagued Western civilization since Tulipmania have been accompanied by a distinct forgetfulness of this fact, which automatically produces the scourge of speculation. I’ve often wondered if this subtle mistake has also contributed to the rash of severe mental illness among mathematicians and physicists, with John Nash (of the film A Beautiful Mind), Nicolai Tesla and Georg Cantor being among the most recognized names in a long list of victims. It may also be linked to the uncanny ineptitude of our most brilliant physicists and mathematicians when it comes to philosophy, such as Rene Descartes, Albert Einstein, Stephen Hawking and Alan Turing. In his most famous work, Orthodoxy, 20th Century British journalist G.K. Chesterton noticed the same pattern, which he summed up thus: “Poets do not go mad; but chess-players do. Mathematicians go mad, and cashiers; but creative artists very seldom. I am not, as will be seen, in any sense attacking logic: I only say that this danger does lie in logic, not in imagination.”[1] At a deeper level, some of the risk to mental health from excessive math may pertain to seeking patterns that aren’t really there, which may be closely linked to the madness underlying ancient “arts” of divination like haruspicy and alectromancy.

November 22, 2012

Teiid (8.2 Final Released!) [Component for TM System]

Filed under: Data Integration,Federation,Information Integration,JDBC,SQL,Teiid,XQuery — Patrick Durusau @ 11:16 am

Teiid

From the homepage:

Teiid is a data virtualization system that allows applications to use data from multiple, heterogenous data stores.

Teiid is comprised of tools, components and services for creating and executing bi-directional data services. Through abstraction and federation, data is accessed and integrated in real-time across distributed data sources without copying or otherwise moving data from its system of record.

Teiid Parts

  • Query Engine: The heart of Teiid is a high-performance query engine that processes relational, XML, XQuery and procedural queries from federated datasources.  Features include support for homogeneous schemas, heterogeneous schemas, transactions, and user defined functions.
  • Embedded: An easy-to-use JDBC Driver that can embed the Query Engine in any Java application. (as of 7.0 this is not supported, but on the roadmap for future releases)
  • Server: An enterprise ready, scalable, manageable runtime for the Query Engine that runs inside JBoss AS that provides additional security, fault-tolerance, and administrative features.
  • Connectors: Teiid includes a rich set of Translators and Resource Adapters that enable access to a variety of sources, including most relational databases, web services, text files, and LDAP.  Need data from a different source? Custom translators and resource adapters can easily be developed.
  • Tools:

Teiid 8.2 final was released on November 20, 2012.

Like most integration services, not strong on integration between integration services.

Would make one helluva component for a topic map system.

A system with an inter-integration solution mapping layer in addition to the capabilities of Teiid.
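As a sketch of what abstraction and federation mean in query terms: once two sources are registered in a Teiid virtual database, a JDBC client can join them in a single SQL statement. The schema names below are assumptions for illustration, not anything Teiid defines for you.

  -- One logical query; Teiid pushes what it can down to each source
  -- and federates the rest in its own query engine.
  SELECT c.customer_id, c.name, sum(o.total) AS lifetime_value
  FROM   crm_oracle.customers  AS c     -- first relational source
  JOIN   billing_mysql.orders  AS o     -- second relational source
         ON o.customer_id = c.customer_id
  GROUP  BY c.customer_id, c.name
  ORDER  BY lifetime_value DESC;

A topic map layer on top would capture the fact that customer_id in the CRM and customer_id in billing identify the same subject, something the SQL join only asserts implicitly.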

November 17, 2012

10 things never to do with a relational database

Filed under: NoSQL,SQL,SQL-NoSQL — Patrick Durusau @ 3:51 pm

10 things never to do with a relational database (The data explosion demands new solutions, yet the hoary old RDBMS still rules. Here’s where you really shouldn’t use it) by Andrew C. Oliver.

From the post:

I am a NoSQLer and a big data guy. That’s a nice coincidence, because as you may have heard, data growth is out of control.

Old habits die hard. The relational DBMS still reigns supreme. But even if you’re a dyed-in-the-wool, Oracle-loving, PL/SQL-slinging glutton for the medieval RAC, think twice, think many times, before using your beloved technology for the following tasks.

If you guessed this post is from InfoWorld and that it’s rather ranty, you are right on both counts.

Andrew’s 10 things:

  1. Search
  2. Recommendations
  3. High-frequency trading
  4. Product cataloguing
  5. Users/groups and ACLs
  6. Log analysis
  7. Media repository
  8. Email
  9. Classified ads
  10. Time-series/forecasting

Andrew ducks and covers in his conclusion with:

Can you use the RDBMS for some or many of these? Sure — I have and people continue to. However, is it a good fit? Not really. I expect the cranky old men to disagree, but tradition alone is not a good reason to stick with the old way of doing things.

If you disagree with his assessment, you are by definition a “cranky old man,” and no one wants to be seen as a cranky old man.

Being a “cranky old man,” the label doesn’t sting so I feel free to disagree. 😉

Andrew is right that tradition alone isn’t “a good reason to stick with the old way of doing things.”

On the other hand, the fact that something is new, or that venture capitalists have parted with cash for it, isn’t a reason to find a new way of doing things.

Your requirements aren’t only technical questions but questions of IT competence to deploy a new solution, training of staff to use a new solution, costs of retraining and construction, and others.

Ignoring the non-technical side of requirements is a step toward acquiring a white elephant to sleep in the middle of your office, interfering with day to day operations.

November 1, 2012

SQL-99 Complete, Really

Filed under: Database,SQL — Patrick Durusau @ 5:39 pm

SQL-99 Complete, Really by Peter Gulutzan & Trudy Pelzer.

From the preface:

If you’ve ever used a relational database product, chances are that you’re already familiar with SQL — the internationally-accepted, standard programming language for databases which is supported by the vast majority of relational database management system (DBMS) products available today. You may also have noticed that, despite the large number of “reference” works that claim to describe standard SQL, not a single one provides a complete, accurate and example-filled description of the entire SQL Standard. This book was written to fill that void.

True, this is the SQL-99 standard.

I collect old IT standards and books about old IT standards. The standards we draft today address issues that have been seen before, just not dressed in current fashion.

By attempting to understand what worked and what perhaps didn’t in older standards, we can make new mistakes instead of repeating old ones.

October 22, 2012

Spanner – …SQL Semantics at NoSQL Scale

Filed under: NoSQL,Spanner,SQL — Patrick Durusau @ 2:18 pm

Spanner – It’s About Programmers Building Apps Using SQL Semantics at NoSQL Scale by Todd Hoff.

From the post:

A lot of people seem to passionately dislike the term NewSQL, or pretty much any newly coined term for that matter, but after watching Alex Lloyd, Senior Staff Software Engineer Google, give a great talk on Building Spanner, that’s the term that fits Spanner best.

Spanner wraps the SQL + transaction model of OldSQL around the reworked bones of a globally distributed NoSQL system. That seems NewSQL to me.

As Spanner is a not so distant cousin of BigTable, the NoSQL component should be no surprise. Spanner is charged with spanning millions of machines inside any number of geographically distributed datacenters. What is surprising is how OldSQL has been embraced. In an earlier 2011 talk given by Alex at the HotStorage conference, the reason for embracing OldSQL was the desire to make it easier and faster for programmers to build applications. The main ideas will seem quite familiar:

  • There’s a false dichotomy between little complicated databases and huge, scalable, simple ones. We can have features and scale them too.
  • Complexity is conserved, it goes somewhere, so if it’s not in the database it’s pushed to developers.
  • Push complexity down the stack so developers can concentrate on building features, not databases, not infrastructure.
  • Keys for creating a fast-moving app team: ACID transactions; global Serializability; code a 1-step transaction, not 10-step workflows; write queries instead of code loops; joins; no user defined conflict resolution functions; standardized sync; pay as you go, get what you pay for predictable performance.

Spanner did not start out with the goal of becoming a NewSQL star. Spanner started as a BigTable clone, with a distributed file system metaphor. Then Spanner evolved into a global ProtocolBuf container. Eventually Spanner was pushed by internal Google customers to become more relational and application programmer friendly.

If you can’t stay for the full show, Todd provides a useful summary of the video. But if you have the time, take the time to enjoy the presentation!
