Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 1, 2012

SkyQuery: …Parallel Probabilistic Join Engine… [When Static Mapping Isn’t Enough]

Filed under: Astroinformatics,Bayesian Data Analysis,Dynamic Mapping,Identity,Merging,SQL — Patrick Durusau @ 4:41 pm

SkyQuery: An Implementation of a Parallel Probabilistic Join Engine for Cross-Identification of Multiple Astronomical Databases by László Dobos, Tamás Budavári, Nolan Li, Alexander S. Szalay, and István Csabai.

Abstract:

Multi-wavelength astronomical studies require cross-identification of detections of the same celestial objects in multiple catalogs based on spherical coordinates and other properties. Because of the large data volumes and spherical geometry, the symmetric N-way association of astronomical detections is a computationally intensive problem, even when sophisticated indexing schemes are used to exclude obviously false candidates. Legacy astronomical catalogs already contain detections of more than a hundred million objects while the ongoing and future surveys will produce catalogs of billions of objects with multiple detections of each at different times. The varying statistical error of position measurements, moving and extended objects, and other physical properties make it necessary to perform the cross-identification using a mathematically correct, proper Bayesian probabilistic algorithm, capable of including various priors. One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios. Consequently, a novel system is necessary that can cross-identify multiple catalogs on-demand, efficiently and reliably. In this paper, we present our solution based on a cluster of commodity servers and ordinary relational databases. The cross-identification problems are formulated in a language based on SQL, but extended with special clauses. These special queries are partitioned spatially by coordinate ranges and compiled into a complex workflow of ordinary SQL queries. Workflows are then executed in a parallel framework using a cluster of servers hosting identical mirrors of the same data sets.

Astronomy is a cool area to study and has data out the wazoo, but I was struck by:

One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios.

Is identity with sharp edges, susceptible to pair-wise mapping, the common case?

Or do we just see some identity issues that way?

I commend the paper to you as an example of dynamic merging in practice.
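
To see why one-time, pair-wise matching is the easy case, here is a minimal sketch of a naive two-catalog positional cross-match in plain SQL. This is not the paper's extended query language; the table and column names (cat_a, cat_b, obj_id, ra, dec) are hypothetical, and the fixed 1-arcsecond radius stands in for the per-detection error model and Bayesian weighting the authors argue for.

    -- Hypothetical catalogs cat_a(obj_id, ra, dec) and cat_b(obj_id, ra, dec),
    -- coordinates in degrees. Match detections separated by less than 1 arcsecond.
    -- A real engine would partition by coordinate ranges and use proper spatial indexing.
    SELECT a.obj_id AS a_id,
           b.obj_id AS b_id
    FROM   cat_a AS a
    JOIN   cat_b AS b
      ON   b.dec BETWEEN a.dec - 1.0/3600.0 AND a.dec + 1.0/3600.0   -- coarse prefilter
    WHERE  DEGREES(ACOS(LEAST(1.0,
             SIN(RADIANS(a.dec)) * SIN(RADIANS(b.dec)) +
             COS(RADIANS(a.dec)) * COS(RADIANS(b.dec)) * COS(RADIANS(a.ra - b.ra))
           ))) * 3600.0 < 1.0;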

June 27, 2012

Introducing new Fusion Tables API [Deprecation – SQL API]

Filed under: Database,Fusion Tables,SQL — Patrick Durusau @ 10:03 am

Introducing new Fusion Tables API by Warren Shen.

The post in its entirety:

We are very pleased to announce the public availability of the new Fusion Tables API. The new API includes all of the functionality of the existing SQL API, plus the ability to read and modify table and column metadata as well as the definitions of styles and templates for data visualization. This API is also integrated with the Google APIs console which lets developers manage all their Google APIs in one place and take advantage of built-in reporting and authentication features.

With this launch, we are also announcing a six month deprecation period for the existing SQL API. Since the new API includes all of the functionality of the existing SQL API, developers can easily migrate their applications using our migration guide.

For a detailed description of the features in the new API, please refer to the API documentation.

BTW, if you go to the Migration Guide, be aware that as of 27 June 2012, the following links aren’t working (404):

This Migration Guide documents how to convert existing code using the SQL API to code using the Fusion Tables API 1.0. This information is discussed in more detail in the Getting Started and Using the API developer guides.

I have discovered the error:

https://developers.google.com/fusiontables/docs/v1/v1/getting_started.html – Wrong – note the doubled “/v1”.

https://developers.google.com/fusiontables/docs/v1/getting_started – Correct – From the left side nav. bar.

https://developers.google.com/fusiontables/docs/v1/v1/using.html – Wrong – note the doubled “/v1”.

https://developers.google.com/fusiontables/docs/v1/using – Correct – From the left side nav. bar.

The summary material appears to be useful but you will need the more detailed information as well.

For example, under HTTP Methods (in the Migration Guide), the SQL API is listed as having:

GET for SHOW TABLES, DESCRIBE TABLE, SELECT

And the equivalent in the Fusion API:

GET for SELECT

No equivalent of SHOW TABLES, DESCRIBE TABLE using GET.

If you find and read Using the API you will find:

Retrieving a list of tables

Listing tables is useful because it provides the table ID and column names of tables that are necessary for other calls. You can retrieve the list of tables a user owns by sending an HTTP GET request to the URI with the following format:

https://www.googleapis.com/fusiontables/v1/tables

Tables are listed along with column ids, names and datatypes.

That may be too much for the migration document but implying that all you have with GET is SELECT is misleading.

Rather: GET for TABLES (SHOW + DESCRIBE), SELECT

Yes?

June 8, 2012

Jetpants: a toolkit for huge MySQL topologies [“action” on MySQL]

Filed under: MySQL,SQL — Patrick Durusau @ 8:59 pm

Jetpants: a toolkit for huge MySQL topologies

From the webpage:

Tumblr is one of the largest users of MySQL on the web. At present, our data set consists of over 60 billion relational rows, adding up to 21 terabytes of unique relational data. Managing over 200 dedicated database servers can be a bit of a handful, so naturally we engineered some creative solutions to help automate our common processes.

Today, we’re happy to announce the open source release of Jetpants, Tumblr’s in-house toolchain for managing huge MySQL database topologies. Jetpants offers a command suite for easily cloning replicas, rebalancing shards, and performing master promotions. It’s also a full Ruby library for use in developing custom billion-row migration scripts, automating database manipulations, and copying huge files quickly to multiple remote destinations.

Dynamically resizable range-based sharding allows you to scale MySQL horizontally in a robust manner, without any need for a central lookup service or massive pre-allocation of tiny shards. Jetpants supports this range-based model by providing a fast way to split shards that are approaching capacity or I/O limitations. On our hardware, we can split a 750GB, billion-row pool in half in under six hours.

Jetpants can be obtained via GitHub or RubyGems.

Interested in this type of work? We’re hiring!

I am reminded of the line from The Blues Brothers film when Ray of Ray’s Music Exchange (played by Ray Charles) tells Jake and Elwood, “E-excuse me, uh, I don’t think there’s anything wrong with the action on this piano.”

Doesn’t look like there is anything wrong with the “action” on MySQL. 😉

Certainly worth taking a look.

I first saw this in Alex Popescu’s myNoSQL.

May 22, 2012

SQL Azure Labs Posts

Filed under: Azure Marketplace,Microsoft,SQL,Windows Azure,Windows Azure Marketplace — Patrick Durusau @ 10:36 am

Roger Jennings writes in Recent Articles about SQL Azure Labs and Other Value-Added Windows Azure SaaS Previews: A Bibliography:

I’ve been concentrating my original articles for the past six months or so on SQL Azure Labs, Apache Hadoop on Windows Azure and SQL Azure Federations previews, which I call value-added offerings. I use the term value-added because Microsoft doesn’t charge for their use, other than Windows Azure compute, storage and bandwidth costs or SQL Azure monthly charges and bandwidth costs for some of the applications, such as Codename “Cloud Numerics” and SQL Azure Federations.

As of 22 May 2012, there are forty-four (44) posts in the following categories:

  • Windows Azure Marketplace DataMarket plus Codenames “Data Hub” and “Data Transfer” from SQL Azure Labs
  • Apache Hadoop on Windows Azure from the SQL Server Team
  • Codename “Cloud Numerics” from SQL Azure Labs
  • Codename “Social Analytics” from SQL Azure Labs
  • Codename “Data Explorer” from SQL Azure Labs
  • SQL Azure Federations from the SQL Azure Team

If you need quick guides and/or incentives to use Windows Azure, try these on for size.

May 17, 2012

Big Game Hunting in the Database Jungle

Filed under: Calvin,NoSQL,Oracle,SQL — Patrick Durusau @ 2:31 pm

If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.

Alexander Thomson and Daniel Abadi write:

In the last decade, database technology has arguably progressed furthest along the scalability dimension. There have been hundreds of research papers, dozens of open-source projects, and numerous startups attempting to improve the scalability of database technology. Many of these new technologies have been extremely influential—some papers have earned thousands of citations, and some new systems have been deployed by thousands of enterprises.

So let’s ask a simple question: If all these new technologies are so scalable, why on earth are Oracle and DB2 still on top of the TPC-C standings? Go to the TPC-C Website with the top 10 results in raw transactions per second. As of today (May 16th, 2012), Oracle 11g is used for 3 of the results (including the top result), 10g is used for 2 of the results, and the rest of the top 10 is filled with various versions of DB2. How is technology designed decades ago still dominating TPC-C? What happened to all these new technologies with all these scalability claims?

The surprising truth is that these new DBMS technologies are not listed in the TPC-C top ten results not because they do not care enough to enter, but rather because they would not win if they did.

Preview of a paper that Alex is presenting at SIGMOD next week. Introducing “Calvin,” a new approach to database processing.

So where does Calvin fall in the OldSQL/NewSQL/NoSQL trichotomy?

Actually, nowhere. Calvin is not a database system itself, but rather a transaction scheduling and replication coordination service. We designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access).

What I find exciting about this report (and the paper) is the re-thinking of current assumptions concerning data processing. May be successful or may not be. But the exciting part is the attempt to transcend decades of acceptance of the maxims of our forefathers.

BTW, Calvin is reported to support 500,000 transactions a second.

Big game hunting anyone?*


* I don’t mean that as an expression of preference for or against Oracle.

I suspect Calvin will be a wake-up call to R&D at Oracle to redouble their own efforts at ground-breaking innovation.

Breakthroughs in matching up multi-dimensional indexes would be attractive to users who need to match up disparate data sources.

Speed is great but a useful purpose attracts customers.

May 11, 2012

Picard and Dathon at El-Adrel

Filed under: bigdata®,Graphs,SQL — Patrick Durusau @ 4:50 pm

Orri Erling’s account of seeing Bryan Thompson reminded me of Picard and Dathon at El-Adrel, albeit with happier results.

See what you think:

I gave an invited talk (“Virtuoso 7 – Column Store and Adaptive Techniques for Graph” (Slides (ppt))) at the Graph Data Management Workshop at ICDE 2012.

Bryan Thompson of Systap (Bigdata® RDF store) was also invited, so we got to talk about our common interests. He told me about two cool things they have recently done, namely introducing tables to SPARQL, and adding a way of reifying statements that does not rely on extra columns. The table business is just about being able to store a multicolumn result set into a named persistent entity for subsequent processing. But this amounts to a SQL table, so the relational model has been re-arrived at, once more, from practical considerations. The reification just packs all the fields of a triple (or quad) into a single string and this string is then used as an RDF S or O (Subject or Object), less frequently a P or G (Predicate or Graph). This works because Bigdata® has variable length fields in all columns of the triple/quad table. The query notation then accepts a function-looking thing in a triple pattern to mark reification. Nice. Virtuoso has a variable length column in only the O but could of course have one in also S and even in P and G. The column store would still compress the same as long as reified values did not occur. These values on the other hand would be unlikely to compress very well but run length and dictionary would always work.

So, we could do it like Bigdata®, or we could add a “quad ID” column to one of the indices, to give a reification ID to quads. Again no penalty in a column store, if you do not access the column. Or we could make an extra table of PSOG->R.

Yet another variation would be to make the SPOG concatenation a literal that is interned in the RDF literal table, and then used as any literal would be in the O, and as an IRI in a special range when occurring as S. The relative merits depend on how often something will be reified and on whether one wishes to SELECT based on parts of reification. Whichever the case may be, the idea of a function-looking placeholder for a reification is a nice one and we should make a compatible syntax if we do special provenance/reification support. The model in the RDF reification vocabulary is a non-starter and a thing to discredit the sem web for anyone from database.
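
For readers who want the “quad ID” alternative spelled out, here is a hedged sketch of what such a table might look like in generic SQL. The schema is hypothetical, not Virtuoso’s or Bigdata®’s actual layout; the point is only that an extra, mostly NULL column costs little in a column store.

    -- Quads with an optional reification id. When r is NULL the quad is an
    -- ordinary statement; when r is set, r can be used as the S (or O) of
    -- further statements about this statement.
    CREATE TABLE quads (
      g VARCHAR(512)  NOT NULL,   -- graph
      s VARCHAR(512)  NOT NULL,   -- subject
      p VARCHAR(512)  NOT NULL,   -- predicate
      o VARCHAR(2048) NOT NULL,   -- object (variable length)
      r BIGINT                    -- optional reification id for this quad
    );
    CREATE INDEX quads_sp ON quads (s, p);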

Pushing past the metaphors it sounds like both Orri and Bryan are working on interesting projects. 😉

April 10, 2012

Metablogging MADlib

Filed under: Data Analysis,SQL — Patrick Durusau @ 6:44 pm

Metablogging MADlib

Joseph M. Hellerstein writes:

When the folks at ACM SIGMOD asked me to be a guest blogger this month, I figured I should highlight the most community-facing work I’m involved with. So I wrote up a discussion of MADlib, and the fact that this open-source in-database analytics library is now open to community contributions. (A bunch of us recently wrote a paper on the design and use of MADlib, which made my writing job a bit easier.) I’m optimistic about MADlib closing a gap between algorithm researchers and working data scientists, using familiar SQL as a vector for adoption on both fronts.

I kicked off MADlib as a part-time consulting project for Greenplum during my sabbatical in 2010-2011. As I built out the first two methods (FM and CountMin sketches) and an installer, Greenplum started assembling a team of their own engineers and data scientists to overlap with and eventually replace me when I returned to campus. They also developed a roadmap of additional methods that their customers wanted in the field. Eighteen months later, Greenplum now contributes the bulk of the labor, management and expertise for the project, and has built bridges to leading academics as well.

Like they said at Woodstock, “if you don’t think SQL is all that weird….” you might want to stop by the MADlib project. (I will have to go listen to the soundtrack. That may not be an exact quote.)

This is an important project for database analytics in an SQL context.
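
If you are curious what “SQL as a vector for adoption” looks like in practice, here is a hedged sketch of calling one of the sketch-based methods mentioned above from plain SQL. The function name follows the MADlib documentation for the FM sketch module, but check it against your installed version; the table and column are hypothetical.

    -- Approximate distinct count of user_id using MADlib's FM sketch,
    -- computed inside the database rather than in an external tool.
    SELECT madlib.fmsketch_dcount(user_id) AS approx_distinct_users
    FROM   page_views;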

March 7, 2012

Batch Importer – Neo4j

Filed under: CSV,Neo4j,SQL — Patrick Durusau @ 5:43 pm

By Max De Marzi.

From part 1:

Data is everywhere… all around us, but sometimes the medium it is stored in can be a problem when analyzing it. Chances are you have a ton of data sitting around in a relational database in your current application… or you have begged, borrowed or scraped to get the data from somewhere and now you want to use Neo4j to find how this data is related.

Batch Importer – Part 1: CSV files.

Batch Importer – Part 2: Use of SQL to prepare files for import.
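
As a rough illustration of Part 2, here is a hedged MySQL sketch of flattening relational data into the node and relationship CSV files a batch importer consumes. Table names, columns and file layout are hypothetical; check the importer’s documentation for the exact header format it expects.

    -- Nodes: one row per user.
    SELECT id AS node_id, name, 'Person' AS label
    INTO OUTFILE '/tmp/nodes.csv'
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    FROM users;

    -- Relationships: one row per friendship edge.
    SELECT user_id AS start_id, friend_id AS end_id, 'FRIEND' AS rel_type
    INTO OUTFILE '/tmp/rels.csv'
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    FROM friendships;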

What other importers would you need for Neo4j? Or would you use CSV as a target format for loading into Neo4j?

March 5, 2012

Trees in the Database: Advanced Data Structures

Filed under: Data Structures,Database,PostgreSQL,RDBMS,SQL,Trees — Patrick Durusau @ 7:52 pm

Trees in the Database: Advanced Data Structures

Lorenzo Alberton writes:

Despite the NoSQL movement trying to flag traditional databases as a dying breed, the RDBMS keeps evolving and adding new powerful weapons to its arsenal. In this talk we’ll explore Common Table Expressions (SQL-99) and how SQL handles recursion, breaking the bi-dimensional barriers and paving the way to more complex data structures like trees and graphs, and how we can replicate features from social networks and recommendation systems. We’ll also have a look at window functions (SQL:2003) and the advanced reporting features they make finally possible. The first part of this talk will cover several different techniques to model a tree data structure into a relational database: parent-child (adjacency list) model, materialized path, nested sets, nested intervals, hybrid models, Common Table Expressions. Then we’ll move one step forward and see how we can model a more complex data structure, i.e. a graph, with concrete examples from today’s websites. Starting from real-world examples of social networks’ and recommendation systems’ features, and with the help of some graph theory, this talk will explain how to represent and traverse a graph in the database. Finally, we will take a look at Window Functions and how they can be useful for data analytics and simple inline aggregations, among other things. All the examples have been tested on PostgreSQL >= 8.4.

Very impressive presentation!

Definitely makes me want to dust off my SQL installations and manuals for a closer look!
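
As a taste of the CTE material, here is a minimal recursive query over a hypothetical adjacency-list table categories(id, parent_id, name), in the SQL-99 style the talk covers (runs on PostgreSQL 8.4 and later).

    -- Walk the subtree under category 1, tracking depth.
    WITH RECURSIVE subtree AS (
        SELECT id, parent_id, name, 1 AS depth
        FROM   categories
        WHERE  id = 1                              -- hypothetical root
      UNION ALL
        SELECT c.id, c.parent_id, c.name, s.depth + 1
        FROM   categories AS c
        JOIN   subtree    AS s ON c.parent_id = s.id
    )
    SELECT id, name, depth
    FROM   subtree
    ORDER  BY depth, name;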

March 4, 2012

Graphs in the database: SQL meets social networks

Filed under: Database,Graphs,Social Networks,SQL — Patrick Durusau @ 7:17 pm

Graphs in the database: SQL meets social networks by Lorenzo Alberton.

If you are interested in graphs, SQL databases, Common Table Expressions (CTEs), together or in any combination, this is the article for you!

Lorenzo walks the reader through the basics of graphs with an emphasis on understanding how SQL techniques can be successfully used, depending upon your requirements.

From the post:

Graphs are ubiquitous. Social or P2P networks, thesauri, route planning systems, recommendation systems, collaborative filtering, even the World Wide Web itself is ultimately a graph! Given their importance, it’s surely worth spending some time in studying some algorithms and models to represent and work with them effectively. In this short article, we’re going to see how we can store a graph in a DBMS. Given how much attention my talk about storing a tree data structure in the db received, it’s probably going to be interesting to many. Unfortunately, the Tree models/techniques do not apply to generic graphs, so let’s discover how we can deal with them.
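
For a flavor of the technique, here is a hedged sketch of a bounded friend-of-a-friend traversal with a recursive CTE over a hypothetical edge table follows(src, dst); the path array guards against cycles (PostgreSQL syntax).

    WITH RECURSIVE reachable AS (
        SELECT dst, ARRAY[src, dst] AS path, 1 AS hops
        FROM   follows
        WHERE  src = 42                            -- hypothetical starting node
      UNION ALL
        SELECT f.dst, r.path || f.dst, r.hops + 1
        FROM   follows   AS f
        JOIN   reachable AS r ON f.src = r.dst
        WHERE  f.dst <> ALL (r.path)               -- do not revisit nodes
          AND  r.hops < 3                          -- bound the traversal
    )
    SELECT dst AS node, MIN(hops) AS hops
    FROM   reachable
    GROUP  BY dst;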

February 29, 2012

AlchemyDB – The world’s first integrated GraphDB + RDBMS + KV Store + Document Store

Filed under: Alchemy Database,Graphs,SQL — Patrick Durusau @ 7:20 pm

AlchemyDB – The world’s first integrated GraphDB + RDBMS + KV Store + Document Store by Russell Sullivan.

From the post:

I recently added a fairly feature rich Graph Database to AlchemyDB (called it LuaGraphDB) and it took roughly 10 days to prototype. I implemented the graph traversal logic in Lua (embedded in AlchemyDB) and used AlchemyDB’s RDBMS to index the data. The API for the GraphDB is modeled after the very advanced GraphDB Neo4j. Another recently added functionality in AlchemyDB, a column type that stores a Lua Table (called it LuaTable), led me to mix Lua-function-call-syntax into every part of SQL I could fit it into (effectively tacking on Document-Store functionality to AlchemyDB). Being able to call lua functions from any place in SQL and being able to call lua functions (that can call into the data-store) directly from the client, made building a GraphDB on top of AlchemyDB possible as a library, i.e. it didn’t require any new core functionality. This level of extensibility is unique and I am gonna refer to AlchemyDB as a “Data Platform”. This is the best term I can come up with, I am great at writing cache invalidation algorithms, but I suck at naming things :)

Another graph contender! It’s a fairly long post so get a cup of coffee before you start!

An observation that interests me:

It is worth noting (at some point, why not here:) that as long as you keep requests relatively simple, meaning they don’t look at 1000 table-rows or traverse 1000 graph-nodes, your performance will range between 10K-100K TPS on a single core w/ single millisecond latencies, these are the types of numbers people should demand for OLTP.

Are we moving in the direction of databases that present “good enough” performance for particular use cases?

A related question: How would you optimize a graph database for particular recursive graphs?

February 7, 2012

Hybrid SQL-NoSQL Databases Are Gaining Ground

Filed under: NoSQL,SQL,SQL-NoSQL — Patrick Durusau @ 4:29 pm

Hybrid SQL-NoSQL Databases Are Gaining Ground

From the post:

Hybrid SQL-NoSQL database solutions combine the advantage of being compatible with many SQL applications and providing the scalability of NoSQL ones. Xeround offers such a solution as a service in the cloud, including a free edition. Other solutions: Database.com with ODBC/JDBC drivers, NuoDB, Clustrix, and VoltDB.

Xeround provides a DB-as-a-Service based on a SQL-NoSQL hybrid. The front-end is a MySQL query engine, appealing to the already existing large number of MySQL applications, but its storage API works with an in-memory distributed NoSQL object store up to 50 GB in size. Razi Sharir, Xeround CEO, detailed for InfoQ:

Read the post to find offers of smallish development space for free.

Do you get the sense that terminology is being invented at a rapid pace in this area? Which is going to make comparing SQL, NoSQL, SQL-NoSQL, etc., offerings more and more difficult? Not to mention differences due to platforms (including the cloud).

Doesn’t that make it difficult for both private as well as government CIOs to:

  1. Formulate specifications for RFPs
  2. Evaluate responses to RFPs
  3. Measure performance or meeting of other requirements across responses
  4. Same as #3 but under actual testing conditions?

Semantic impedance, it will be with us always.

January 31, 2012

Accelerating SQL Database Operations on a GPU with CUDA (merging spreadsheet data?)

Filed under: CUDA,GPU,Spreadsheets,SQL,SQLite — Patrick Durusau @ 4:33 pm

Accelerating SQL Database Operations on a GPU with CUDA by Peter Bakkum and Kevin Skadron.

Abstract:

Prior work has shown dramatic acceleration for various database operations on GPUs, but only using primitives that are not part of conventional database languages such as SQL. This paper implements a subset of the SQLite command processor directly on the GPU. This dramatically reduces the effort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries.

This paper focuses on accelerating SELECT queries and describes the considerations in an efficient GPU implementation of the SQLite command processor. Results on an NVIDIA Tesla C1060 achieve speedups of 20-70X depending on the size of the result set.

Important lessons to be learned from this paper:

  • Don’t invent new languages for the average user to learn.
  • Avoid the need to modify existing programs
  • Write against common software

Remember that 75% of the BI market is still using spreadsheets. For all sorts of data but numeric data in particular.

I don’t have any experience with importing files into Excel but I assume there is a macro language that can be used to create import processes.

Curious if there has been any work on creating import macros for Excel that incorporate merging as part of those imports?

That would:

  • Not be a new language for users to learn.
  • Avoid modification of existing programs (or data)
  • Be written against common software

I am not sure about the requirements for merging numeric data but that should make the exploration process all the more enjoyable.

January 26, 2012

Tenzing: A SQL Implementation On The MapReduce Framework

Filed under: MapReduce,SQL,Tenzing — Patrick Durusau @ 6:42 pm

Tenzing: A SQL Implementation On The MapReduce Framework by Biswapesh Chattopadhyay, Liang Lin, Weiran Liu, Sagar Mittal, Prathyusha Aragonda, Vera Lychagina, Younghee Kwon and Michael Wong.

Abstract:

Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility. Tenzing is currently used internally at Google by 1000+ employees and serves 10000+ queries per day over 1.5 petabytes of compressed data. In this paper, we describe the architecture and implementation of Tenzing, and present benchmarks of typical analytical queries.

The authors’ conclusions:

  • It is possible to create a fully functional SQL engine on top of the MapReduce framework, with extensions that go beyond SQL into deep analytics.
  • With relatively minor enhancements to the MapReduce framework, it is possible to implement a large number of optimizations currently available in commercial database systems, and create a system which can compete with commercial MPP DBMS in terms of throughput and latency.
  • The MapReduce framework provides a combination of high performance, high reliability and high scalability on cheap unreliable hardware, which makes it an excellent platform to build distributed applications that involve doing simple to medium complexity operations on large data volumes.
  • By designing the engine and the optimizer to be aware of the characteristics of heterogeneous data sources, it is possible to create a smart system which can fully utilize the characteristics of the underlying data sources.

The last one is of the most interest to me. Which one interests you the most?

    BTW, the authors mention:

    We are working on various other enhancements and believe we can cut this time down to less than 5 seconds end-to-end, which is fairly acceptable to the analyst community.

    I think the analyst community needs to use 2400 baud modems for a month or two. 😉

    Sub-5 second performance is sometimes useful, even necessary. But as a general requirement?

    January 9, 2012

    Triggers in MySQL

    Filed under: MySQL,SQL,Triggers — Patrick Durusau @ 1:44 pm

    Triggers in MySQL

    From the post:

Almost all developers have heard about triggers, and most know that MySQL supports triggers and that triggers add advantages to MySQL. Triggers are SQL statements that are stored in the database.

Triggers are SQL statements which add functionality to your tables so that they perform a certain series of actions when certain queries are executed. Put simply, triggers are actions performed when INSERT, UPDATE or DELETE events occur on the table, without using two separate queries.

Sometimes developers prefer to use stored procedures rather than triggers, but triggers are one kind of stored procedure, with procedural code in the body. The difference between a trigger and a stored procedure is that a trigger is called when an event occurs in a table, whereas a stored procedure must be called explicitly.

    Short overview of triggers in MySQL.
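
As a minimal, hedged illustration (table and column names are hypothetical), here is a trigger that writes an audit row whenever a topic row is updated:

    DELIMITER //
    CREATE TRIGGER topics_after_update
    AFTER UPDATE ON topics
    FOR EACH ROW
    BEGIN
      -- Record what changed and when, without a second query from the application.
      INSERT INTO topic_audit (topic_id, old_name, new_name, changed_at)
      VALUES (OLD.id, OLD.name, NEW.name, NOW());
    END//
    DELIMITER ;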

    Possibly useful if you are using a relational backend for your topic map engine.

    Should topic map engines support the equivalent of triggers?

    As declared by a topic map?

    Now that would be clever, to have a topic map carry its triggers around with it.

    Admittedly, interactive data structures aren’t the norm, yet, but they are certainly worth thinking about.

    December 26, 2011

    Beyond Relational

    Filed under: Database,MySQL,Oracle,PostgreSQL,SQL,SQL Server — Patrick Durusau @ 8:19 pm

    Beyond Relational

    I originally arrived at this site because of a blog hosted there with lessons on Oracle 10g. Exploring a bit I decided to post about it.

    Seems to have fairly broad coverage, from Oracle and PostgreSQL to TSQL and XQuery.

    Likely to be a good site for learning cross-overs between systems that you can map for later use.

    Suggestions of similar sites?

    December 24, 2011

    Topic Maps & Oracle: A Smoking Gun

    Filed under: Database,Oracle,SQL — Patrick Durusau @ 4:42 pm

    Using Similarity-based Operations for Resolving Data-Level Conflicts (2003)

    Abstract:

    Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both during eliminating duplicates from semantic overlapping sources as well as during combining complementary data from different sources. Though using SQL operations like grouping and join seems to be a viable way, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present in this paper similarity-based variants of grouping and join operators. The extended grouping operator produces groups of similar tuples, the extended join combines tuples satisfying a given similarity condition. We describe the semantics of these operators, discuss efficient implementations for the edit distance similarity and present evaluation results. Finally, we give examples how the operators can be used in given application scenarios.

    No, the title of the post is not a mistake.

    The authors of this paper, in 2003, conclude:

    In this paper we presented database operators for finding related data and identifying duplicates based on user-specific similarity criteria. The main application area of our work is the integration of heterogeneous data where the likelihood of occurrence of data objects representing related or the same real-world objects though containing discrepant values is rather high. Intended as an extended grouping operation and by combining it with aggregation functions for merging/reconciling groups of conflicting values our grouping operator fits well into the relational algebra framework and the SQL query processing model. In a similar way, an extended join operator takes similarity predicates used for both operators into consideration. These operators can be utilized in ad-hoc queries as part of more complex data integration and cleaning tasks.

In addition to a theoretical background, the authors illustrate an implementation of their techniques, using Oracle 8i. (Oracle 11g is the current version.)
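
The operators themselves are Oracle-specific, but the flavor can be approximated elsewhere. A hedged sketch using PostgreSQL’s fuzzystrmatch extension (not the paper’s implementation; table and column names are hypothetical):

    CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

    -- "Similarity join": pair rows whose names are within edit distance 2.
    SELECT a.id AS a_id, b.id AS b_id, a.name AS name_a, b.name AS name_b
    FROM   customers_a AS a
    JOIN   customers_b AS b
      ON   levenshtein(lower(a.name), lower(b.name)) <= 2;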

    Don’t despair! 😉

    Leaves a lot to be done, including:

    • Interchange between relational database stores
    • Semantic integration in non-relational database stores
    • Interchange in mixed relational/non-relational environments
    • Identifying bases for semantic integration in particular data sets (the tough nut)
    • Others? (your comments can extend this list)

    The good news for topic maps is that Oracle has some name recognition in IT contexts. 😉

    There is a world of difference between a CIO saying to the CEO:

    “That was a great presentation about how we can use our data more effectively with topic maps and some software, what did he say the name was?”

    and,

    “That was a great presentation about using our Oracle database more effectively!”

    Yes?

    Big iron for your practice of topic maps. A present for your holiday tradition.


    Aside to Matt O’Donnell. Yes, I am going to be covering actual examples of using these operators for topic map purposes.

    Right now I am sifting through a 400 document collection on “multi-dimensional indexing” where I discovered this article. Remind me to look at other databases/indexers with similar operators.

    December 8, 2011

    QL.IO

    Filed under: Aggregation,DSL,JSON,SQL — Patrick Durusau @ 7:59 pm

    QL.IO – A declarative, data-retrieval and aggregation gateway for quickly consuming HTTP APIs.

    From the about page:

    A SQL and JSON inspired DSL

    SQL is quite a powerful DSL to retrieve, filter, project, and join data — see efforts like A co-Relational Model of Data for Large Shared Data Banks, LINQ, YQL, or unQL for examples.

    ql.io combines SQL, JSON, and a few procedural style constructs into a compact language. Scripts written in this language can make HTTP requests to retrieve data, perform joins between API responses, project responses, or even make requests in a loop. But note that ql.io’s scripting language is not SQL – it is SQL inspired.

    Orchestration

    Most real-world client apps need to mashup data from multiple APIs in one go. Data mashup is often complicated as client apps need to worry about order of requests, inter-dependencies, error handling, and parallelization to reduce overall latency.

    ql.io’s scripts are procedural in appearance but are executed out of order based on dependencies. Some statements may be scheduled in parallel and some in series based on a dependency analysis done at script compile time. The compilation is an on-the-fly process.

    Consumer Centric Interfaces

    APIs are designed for reuse, and hence they cater to the common denominator. Getting new fields added, optimizing responses, or combining multiple requests into one involve drawn out negotiations between API producing teams and API consuming teams.

    ql.io lets API consuming teams move fast by creating consumer-centric interfaces that are optimized for the client – such optimized interfaces can reduce bandwidth usage and number of HTTP requests.

I can believe the “SQL inspired” part since it looks like keys/column headers are opaque. That is, you can specify a key/column header but you can’t specify the identity of the subject it represents.

    So, if you don’t know the correct term, you are SOL. Which isn’t the state of being inspired.

    Still, it looks like an interesting effort that could develop to be non-opaque with regard to keys and possibly values. (The next stage is how do you learn what properties a subject representative has for the purpose of subject recognition.)

    November 30, 2011

    Ad Hoc Normalization II

    Filed under: Marketing,Normalization,SQL,Topic Maps — Patrick Durusau @ 8:09 pm

    After writing Ad Hoc Normalization it occurred to me that topic maps offer another form of “ad hoc” normalization.

    I don’t know what else you would call merging two topic maps together?

    Try that with two relational databases.

    So, not only can topic maps maintain “internal” ad hoc normalization but also “external” ad hoc normalization with data sources that were not present at the time of the creation of a topic map.

    But there are other forms of normalization.

    Recall that Lars Marius talks about the reduction of information items that represent the same subjects. That can only occur when there is a set of information items that obey the same data model and usually the same syntax. I would call that information model normalization. That is whatever is supported by a particular information model can be normalized.

    For relational databases that is normalization by design and for topic maps that is ad hoc normalization (although some of it could be planned in advance as well).

But there is another form of normalization, a theoretical construct: subject-based normalization. I say it is theoretical because in order to instantiate a particular case you have to cross over into the land of information model normalization.

    I find subject-based normalization quite useful, mostly because as human designers/authors, we are not constrained by the limits of our machines. We can hold contradictory ideas at the same time without requiring a cold or hot reboot. Subject-based normalization allows us to communicate with other users what we have seen in data and how we need to process it for particular needs.

    November 29, 2011

    Ad Hoc Normalization

    Filed under: Marketing,Normalization,SQL,Topic Maps — Patrick Durusau @ 8:46 pm

    I really should not start reading Date over the weekend. It puts me in a relational frame of mind and I start thinking of explanations of topic maps in terms of the relational model.

    For example, take his definition of:

    First normal form: A relvar is in 1NF if and only if, in every legal value of that relvar, every tuple contains exactly one value for each attribute. (page 358)

    Second normal form: (definition assuming only one candidate key, which we assume is the primary key): a relvar is in 2NF if and only if it is in 1NF and every nonkey attribute is irreducibly dependent on the primary key. (page 361)

    Third normal form: (definition assuming only one candidate key, which we assume is the primary key): A relvar is in 3NF if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key. (page 363)

    Third normal form (even more informal definition): A relvar is in third normal form (3NF) if and only if, for all time, each tuple consists of a primary key value that identifies some entity, together with a set of zero or more mutually independent values that describe that entity in some way.
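
To ground the 3NF definition with a small hypothetical example, here is the kind of transitive dependency it removes:

    -- Not in 3NF: dept_location depends on dept_id, not on the key emp_id.
    CREATE TABLE employees_unnormalized (
      emp_id        INT PRIMARY KEY,
      emp_name      VARCHAR(100),
      dept_id       INT,
      dept_location VARCHAR(100)
    );

    -- 3NF decomposition: the transitively dependent attribute moves out.
    CREATE TABLE employees (
      emp_id   INT PRIMARY KEY,
      emp_name VARCHAR(100),
      dept_id  INT
    );

    CREATE TABLE departments (
      dept_id       INT PRIMARY KEY,
      dept_location VARCHAR(100)
    );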

Does that mean that topic maps support ad hoc normalization? That is, we don’t have to design in normalization before we start writing the topic map but can decide which subjects need to be “normalized,” that is, represented by topics that resolve to a single representative, after we have started writing the topic map.

    Try that with a relational database and tables of any complexity. If you don’t get it right at the design stage, fixing it becomes more expensive as time goes by.

    Not a “dig” at relational databases. If your domain is that slow changing and other criteria point to a relational solution, by all means, use one. Performance numbers are hard to beat.

On the other hand, if you need “normalization” and yet you have a rapidly changing environment that is subject to exploration and mappings across domains, you should give topic maps a hard look. Ask for “Ad Hoc Normalization” by name. 😉

    PS: I suspect this is what Lars Marius meant by Topic Maps Data Model (TMDM) 6. Merging, 6.1 General:

    A central operation in Topic Maps is that of merging, a process applied to a topic map in order to eliminate redundant topic map constructs in that topic map. This clause specifies in which situations merging shall occur, but the rules given here are insufficient to ensure that all redundant information is removed from a topic map.

    Any change to a topic map that causes any set to contain two information items equal to each other shall be followed by the merging of those two information items according to the rules given below for the type of information item to which the two equal information items belong.

But I wasn’t “hearing” “…eliminate redundant topic map constructs…” as “normalization.”

    November 19, 2011

    Percona Server 5.5.17-22.1 released

    Filed under: MySQL,Percona Server,SQL — Patrick Durusau @ 10:23 pm

    Percona Server 5.5.17-22.1 released

    From the webpage:

    Percona is glad to announce the release of Percona Server 5.5.17-22.1 on November 19th, 2011 (Downloads are available here and from the Percona Software Repositories).

Based on MySQL 5.5.17, including all the bug fixes in it, Percona Server 5.5.17-22.1 is now the current stable release in the 5.5 series. All of Percona’s software is open-source and free, all the details of the release can be found in the 5.5.17-22.1 milestone at Launchpad or in the Release Notes for 5.5.17-22.1 in the documentation.

    I haven’t installed or run a Percona Server, but the reported performance numbers are good enough to merit a closer look.

    If you have run a Percona server, please comment.

    November 11, 2011

    Postgres Plus Connector for Hadoop

    Filed under: Hadoop,MapReduce,Pig,PostgreSQL,SQL — Patrick Durusau @ 7:39 pm

    Postgres Plus Connector for Hadoop

    From the webpage:

    The Postgres Plus Connector for Hadoop provides developers easy access to massive amounts of SQL data for integration with or analysis in Hadoop processing clusters. Now large amounts of data managed by PostgreSQL or Postgres Plus Advanced Server can be accessed by Hadoop for analysis and manipulation using Map-Reduce constructs.

EnterpriseDB recognized early on that Hadoop, a framework allowing distributed processing of large data sets across computer clusters using a simple programming model, was a valuable and complementary data processing model to traditional SQL systems. Map-Reduce processing serves important needs for basic processing of extremely large amounts of data and SQL based systems will continue to fulfill their mission critical needs for complex processing of data well into the future. What was missing was an easy way for developers to access and move data between the two environments.

    EnterpriseDB has created the Postgres Plus Connector for Hadoop by extending the Pig platform (an engine for executing data flows in parallel on Hadoop) and using an EnterpriseDB JDBC driver to allow users the ability to load the results of a SQL query into Hadoop where developers can operate on that data using familiar Map-Reduce programming. In addition, data from Hadoop can also be moved back into PostgreSQL or Postgres Plus Advanced Server tables.

    A private beta is in progress, see the webpage for details and to register.

    Plus, there is a webinar, Tuesday, November 29, 2011 11:00 am Eastern Standard Time (New York, GMT-05:00), Extending SQL Analysis with the Postgres Plus Connector for Hadoop. Registration at the webpage as well.

    A step towards seamless data environments. Much like word processing now without the “.” commands. Same commands for the most part but unseen. Data is going in that direction. You will specify desired results and environments will take care of access, processor(s), operations and the like. Tables will appear as tables because you have chosen to view them as tables, etc.

    October 26, 2011

    SQL -> Pig Translation

    Filed under: Pig,SQL — Patrick Durusau @ 6:57 pm

    hadoop pig documentation

    From the post:

    It is sometimes difficult for SQL users to learn Pig because their mind is used to working in SQL. In this tutorial, examples of various SQL statements are shown, and then translated into Pig statements. For more detailed documentation, please see the official Pig manual.

    This could be an effective technique for teaching Pig to SQL programmers. What do you think?
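
As a hedged example of the kind of translation the tutorial walks through (table and column names are hypothetical), here is a GROUP BY in SQL with its rough Pig Latin equivalent in the comments:

    -- SQL:
    SELECT user_id, COUNT(*) AS visits
    FROM   page_views
    GROUP  BY user_id;

    -- Rough Pig Latin equivalent:
    --   views   = LOAD 'page_views' USING PigStorage(',') AS (user_id:chararray, url:chararray);
    --   grouped = GROUP views BY user_id;
    --   counts  = FOREACH grouped GENERATE group AS user_id, COUNT(views) AS visits;
    --   DUMP counts;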

    October 15, 2011

    Neo4j SQL Importer

    Filed under: Neo4j,SQL — Patrick Durusau @ 4:28 pm

    Neo4j SQL Importer

From Peter Neubauer, from whom so many Neo4j goodies come!

    In a discussion thread on importing, Rick Otten mentions http://symmetricds.codehaus.org/ having the potential:

    to feed data from your Oracle (or other JDBC accessible Relational database) into Neo4j – live, as the data changes.

    As Peter replies:

    Very interesting – I like!

    Would love to play around with it. Thanks for the tip Rick!

    October 5, 2011

    Oracle, NoSQL and Topic Maps

    Filed under: RDBMS,SQL — Patrick Durusau @ 3:27 am

There have been more tweets about Oracle’s recent NoSQL offering than about Lucene turning 10 years old. The information content has been about the same.

The Oracle tweets, “if you can’t beat them, join them,” “we have been waiting for you,” etc., show little appreciation of a software vendor’s view of the world.

Software vendors, as opposed to software cultists, offer products customers are likely to lease or purchase. A software vendor would port vi to the iPhone 5 if there were enough customer demand.

Which, in an embarrassing way, explains why Oracle doesn’t support topic maps: lack of customer demand.

    Topic maps do have customer demand, at least enough to keep any number of topic map service/software vendors afloat. But, those customers don’t make up enough appeal for Oracle to move into the topic map field.

    The NoSQL people may have a model we can follow (perhaps even using NoSQL as backends).

    They isolated use cases of interest to customers, then demonstrated impressive performance numbers on those use cases.

    Question: So how do I learn what use cases are of interest to others? That could be impacted by topic maps?*


    *I know what use cases are of interest to me but a comparative Semitic linguistics topic map isn’t likely to have high demand as an iPhone app, for example. Quite doable with topic maps but not commercially compelling.

    October 3, 2011

    Our big data/total data survey is now live [the 451 Group]

    Filed under: BigData,Data Warehouse,Hadoop,NoSQL,SQL — Patrick Durusau @ 7:05 pm

    Our big data/total data survey is now live [the 451 Group]

    The post reads in part:

    The 451 Group is conducting a survey into end user attitudes towards the potential benefits of ‘big data’ and new and emerging data management technologies.

    In return for your participation, you will receive a copy of a forthcoming long-format report covering introducing Total Data, The 451 Group’s concept for explaining the changing data management landscape, which will include the results. Respondents will also have the opportunity to become members of TheInfoPro’s peer network.

    Just a word about the survey.

    Question 10 reads:

    What is the primary platform used for storing and querying from each of the following types of data?

Good question, but you have to choose one of three answers or put “other” (and say what “other” means); you are not allowed to skip any type of data.

    Data types are:

    • Customer Data
    • Transactional Data
    • Online Transaction Data
    • Domain-specific Application Data (e.g., Trade Data in Financial Services, and Call Data in Telecoms)
    • Application Log Data
    • Web Log Data
    • Network Log Data
    • Other Log Files
    • Social Media/Online Data
    • Search Log
    • Audio/Video/Graphics
    • Other Documents/Content

    Same thing happens for Question 11:

    What is the primary platform used for each of the following analytics workloads?

    Eleven required answers that I won’t bother to repeat here.

As a consultant I really don’t have serious iron/data on the premises, but that doesn’t seem to have occurred to the survey designers. Nor does it seem to have occurred to them that even a major IT installation might not have all forms of data or analytics.

    My solution? I just marked Hadoop on Questions 10 and 11 so I could get to the rest of the survey.

    Q12. Which are the top three benefits associated with each of the following data management technologies?

    Q13. Which are the top three challenges associated with each of the following data management technologies?

Q14. To what extent do you agree with the following statements? (which includes: “The enterprise data warehouse is the single version of the truth for business intelligence”)

    Questions 12 – 14 all require answers to all options.

    Note the clever first agree/disagree statement for Q.14.

    Someone will conduct a useful survey of business opinions about big data and likely responses to it.

    Hopefully with a technical survey of the various options and their advantages/disadvantages.

    Please let me know when you see it, I would like to point people to it.

    (I completed this form on Sunday, October 2, 2011, around 11 AM Eastern time.)

    September 10, 2011

    Generic Multiset Programming for Language-Integrated Querying

    Filed under: Multisets,SQL — Patrick Durusau @ 2:53 pm

    More than one resource on generic multiset programming by Fritz Henglein and Ken Friis Larsen:

    Paper: Generic Multiset Programming for Language-Integrated Querying

    Video: Generic Multiset Programming for Language-Integrated Querying

    From the introduction to the paper:

We introduce a library for generic multiset programming. It supports algebraic programming based on Codd’s original relational algebra with select (filter), project (map), cross product, (multi)set union and (multi)set difference as primitive operators and extends it with SQL-style functions corresponding to GROUP BY, SORT BY, HAVING, DISTINCT and aggregation functions.

It generalizes the querying core of SQL as follows: Multisets may contain elements of arbitrary first-order data types, including references (pointers), recursive data types and nested multisets. It contains an expressive embedded domain specific language for user-definable equivalence and ordering relations. And it allows in principle arbitrary user-defined functions for selection, projection and aggregation.

    Under Contributions:

    In this paper we provide a library for SQL-style programming with multisets that

    • supports all the classic features of the data query sublanguage of SQL;
    • admits multisets of any element type, including nested multisets and trees;
    • admits user-definable equivalences (equijoin conditions), predicates, and functions;
    • admits naïve programming with cross-products, but avoids spending quadratic time on computing them;
    • offers an object-oriented (object-method-parameters) syntax using infix binary operators;
    • and is easy to implement.

To demonstrate this we include the complete source code (without basic routines for generic discrimination reported elsewhere) and a couple of paradigmatic examples.

    Looks interesting to me!

    September 6, 2011

    First Look – Oracle Data Mining Update

    Filed under: Data Mining,Database,Information Retrieval,SQL — Patrick Durusau @ 7:18 pm

    First Look – Oracle Data Mining Update by James Taylor.

    From the post:

    I got an update from Oracle on Oracle Data Mining (ODM) recently. ODM is an in-database data mining and predictive analytics engine that allows you to build and use advanced predictive analytic models on data that can be accessed through your Oracle data infrastructure. I blogged about ODM extensively last year in this First Look – Oracle Data Mining and since then they have released ODM 11.2.

    The fundamental architecture has not changed, of course. ODM remains a “database-out” solution surfaced through SQL and PL-SQL APIs and executing in the database. It has the 12 algorithms and 50+ statistical functions I discussed before and model building and scoring are both done in-database. Oracle Text functions are integrated to allow text mining algorithms to take advantage of them. Additionally, because ODM mines star schema data it can handle an unlimited number of input attributes, transactional data and unstructured data such as CLOBs, tables or views.

    This release takes the preview GUI I discussed last time and officially releases it. This new GUI is an extension to SQL Developer 3.0 (which is available for free and downloaded by millions of SQL/database people). The “Classic” interface (wizard-based access to the APIs) is still available but the new interface is much more in line with the state of the art as far as analytic tools go.

    BTW, the correct link to: First Look – Oracle Data Mining. (Taylor’s post last year on Oracle Data Mining.)

    For all the buzz about NoSQL, topic map mavens should be aware of the near universal footprint of SQL and prepare accordingly.

    September 1, 2011

    Greenplum Community

    Filed under: Algorithms,Analytics,Machine Learning,SQL — Patrick Durusau @ 6:00 pm

A post by Alex Popescu, Data Scientist Summit Videos, led me to discover the Greenplum Community.

    Hosted by Greenplum:

    Greenplum is the pioneer of Enterprise Data Cloud™ solutions for large-scale data warehousing and analytics, providing customers with flexible access to all their data for business intelligence and advanced analytics. Greenplum offers industry-leading performance at a low cost for companies managing terabytes to petabytes of data. Data-driven businesses around the world, including NASDAQ, NYSE Euronext, Silver Spring Networks and Zions Bancorporation, have adopted Greenplum Database-based products to support their mission-critical business functions.

Registration (free) brings access to the videos from the Data Scientist Summit.

    The “community” is focused on Greenplum software (there is a “community” edition). Do be aware that Greenplum Database CE is a 1.7 GB download. Just so you know.

    August 21, 2011

    YeSQL?

    Filed under: NoSQL,SQL — Patrick Durusau @ 7:07 pm

    Perspectives on NoSQL by Gavin M. Roy.

    I don’t remember how I found this presentation but it is quite interesting.

    Starts with a review of NoSQL database options, one slide summaries.

    Compares them to PostgreSQL 9.0b1 using KVPBench, http://github.com/gmr/kvpbench.

Concludes that SQL databases perform as well as, if not better than, NoSQL databases.

    Really depends on the benchmark or more importantly, what use case is at hand. Use the most appropriate technology, SQL or not.

    Still, I like the slide with database administrators running with scissors. I have always wondered what that would look like. Now I know. It isn’t pretty.
