Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 20, 2011

May the Index be with you!

Filed under: MySQL,Query Language,SQL — Patrick Durusau @ 8:06 pm

May the Index be with you! by Lawrence Schwartz.

From the post:

The summer’s end is rapidly approaching — in the next two weeks or so, most people will be settling back into work. Time to change your mindset, re-evaluate your skills and see if you are ready to go back from the picnic table to the database table.

With this in mind, let’s see how much folks can remember from the recent indexing talks my colleague Zardosht Kasheff gave (O’Reilly Conference, Boston, and SF MySQL Meetups). Markus Winand’s site “Use the Index, Luke!” (not to be confused with my favorite Star Wars parody, “Use the Schwartz, Lone Starr!”), has a nice, quick 5 question indexing quiz that can help with this.
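If you want a warm-up before the quiz, the bread-and-butter case is a query that filters on one column and sorts on another; a composite index can serve both. A minimal MySQL sketch (table and column names invented for illustration):

  -- hypothetical table
  CREATE TABLE orders (
    id          INT PRIMARY KEY,
    customer_id INT NOT NULL,
    ordered_at  DATETIME NOT NULL,
    total       DECIMAL(10,2)
  );

  -- equality column first, then the sort column
  CREATE INDEX idx_orders_customer_date ON orders (customer_id, ordered_at);

  -- the index can satisfy both the WHERE clause and the ORDER BY,
  -- so no separate filesort is needed
  EXPLAIN
  SELECT id, ordered_at, total
  FROM   orders
  WHERE  customer_id = 42
  ORDER  BY ordered_at DESC
  LIMIT  10;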

Interesting enough to request an account so I could download TokuDB v5.0. It uses fractal trees for indexing speed. Could be interesting. More on that later.

August 18, 2011

Introduction to Databases

Filed under: CS Lectures,Database,SQL — Patrick Durusau @ 6:50 pm

Introduction to Databases by Jennifer Widom.

Course Description:

This course covers database design and the use of database management systems for applications. It includes extensive coverage of the relational model, relational algebra, and SQL. It also covers XML data including DTDs and XML Schema for validation, and the query and transformation languages XPath, XQuery, and XSLT. The course includes database design in UML, and relational design principles based on dependencies and normal forms. Many additional key database topics from the design and application-building perspective are also covered: indexes, views, transactions, authorization, integrity constraints, triggers, on-line analytical processing (OLAP), and emerging “NoSQL” systems.

The third free Stanford course being offered this Fall.

The others are: Introduction to Artificial Intelligence and Introduction to Machine Learning.

As of today, the AI course has a registration of 84,000 students from 175 countries. I am sure the machine learning course with Ng and the database class will post similar numbers.

My only problem is that I lack the time to take all three while working full time. Best hope is for an annual repeat of these offerings.

August 17, 2011

What’s New in MySQL 5.6 – Part 1: Overview – Webinar 18 August 2011

Filed under: MySQL,NoSQL,SQL — Patrick Durusau @ 6:54 pm

What’s New in MySQL 5.6 – Part 1: Overview

From the webpage:

MySQL 5.6 builds on Oracle’s investment in MySQL by adding improvements to Performance, InnoDB, Replication, Instrumentation and flexibility with NoSQL (Not Only SQL) access. In the first session of this 5-part Webinar series, we’ll cover the highlights of those enhancements to help you begin the development and testing efforts around the new features and improvements that are now available in the latest MySQL 5.6 Development Milestone and MySQL Labs releases.

OK, I’ll ‘fess up: I haven’t kept up with MySQL like I did when I was a sysadmin running it every day in a production environment. So maybe it’s time to do some catching up.

Besides, when you read:

We will also explore how you can now use MySQL 5.6 as a “Not Only SQL” data source for high performance key-value operations by leveraging the new Memcached Plug-in to InnoDB, running simultaneously with SQL for more complex queries, all across the same data set.

“…SQL for more complex queries,…” you almost have to look. 😉
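To make the key-value pitch concrete, here is a minimal sketch, assuming a simple session store. The Memcached plug-in reads and writes the same InnoDB rows through get/set while SQL remains available for the richer queries (plug-in configuration details are in the MySQL 5.6 documentation; the table below is mine):

  -- hypothetical key-value style table
  CREATE TABLE sessions (
    sess_key   VARCHAR(64) PRIMARY KEY,
    sess_value BLOB,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  ) ENGINE=InnoDB;

  -- key-value access in SQL terms: a primary-key lookup
  SELECT sess_value FROM sessions WHERE sess_key = 'user:42';

  -- ...and the "more complex queries" over the same data
  SELECT COUNT(*) FROM sessions
  WHERE  updated_at > NOW() - INTERVAL 1 HOUR;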

So, get up early tomorrow and throw a recent copy of MySQL on a box.

Mental Shortcuts and Relational Databases

Filed under: SQL — Patrick Durusau @ 6:51 pm

Mental Shortcuts and Relational Databases by Robert Pickering.

The premise is that relational databases evolved to solve a particular set of hardware constraints and problems. Not all that surprising if you think about it: how would software attempt to solve problems not yet known? That doesn’t make SQL any less valuable for the problems it solves well.

July 27, 2011

NoSQL @ Netflix, Part 2

Filed under: Cassandra,NoSQL,SQL — Patrick Durusau @ 2:17 pm

NoSQL @ Netflix, Part 2 by Sid Anand.

OSCON 2011 presentation.

I think the RDBMS Concepts to Key-Value Store Concepts mapping was the best part of the slide deck.

What do you think?

July 25, 2011

Performance of Graph vs. Relational Databases

Filed under: Database,Graphs,SQL — Patrick Durusau @ 6:41 pm

Performance of Graph vs. Relational Databases by Josh Adell.

Short but interesting exploration of performance differences between relational and graph databases.

July 21, 2011

Oracle, Sun Burned, and Solr Exposure

Filed under: Data Mining,Database,Facets,Lucene,SQL,Subject Identity — Patrick Durusau @ 6:27 pm

Oracle, Sun Burned, and Solr Exposure

From the post:

Frankly we wondered when Oracle would move off the dime in faceted search. “Faceted search”, in my lingo, is showing users categories. You can fancy up the explanation, but a person looking for a subject may hit a dead end. The “facet” angle displays links to possibly related content. If you want to educate me, use the comments section for this blog, please.

We are always looking for a solution to our clients’ Oracle “findability” woes. It’s not just relevance. Think performance. Query and snack is the operative mode for at least one of our technical baby geese. Well, Oracle is a bit of a red herring. The company is not looking for a solution to SES11g functionality. Lucid Imagination, a company offering enterprise grade enterprise search solutions, is.
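To “fancy up the explanation” only slightly: in SQL terms a facet is just a category count over the current result set, along the lines of (hypothetical schema):

  -- how many hits fall into each category, shown to the user as facet links
  SELECT category, COUNT(*) AS hits
  FROM   products
  WHERE  description LIKE '%solar charger%'
  GROUP  BY category
  ORDER  BY hits DESC;

Solr computes those counts against an inverted index rather than a table scan, which is a big part of its appeal for “findability” retrofits.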

If “findability” is an issue at Oracle, I would be willing to bet that subject identity is as well. Rumor has it that they have paying customers.

July 17, 2011

Building blocks of a scalable web crawler

Filed under: Indexing,NoSQL,Search Engines,Searching,SQL — Patrick Durusau @ 7:29 pm

Building blocks of a scalable web crawler Thesis by Marc Seeger. (2010)

Abstract:

The purpose of this thesis was the investigation and implementation of a good architecture for collecting, analysing and managing website data on a scale of millions of domains. The final project is able to automatically collect data about websites and analyse the content management system they are using.

To be able to do this efficiently, different possible storage back-ends were examined and a system was implemented that is able to gather and store data at a fast pace while still keeping it searchable.

This thesis is a collection of the lessons learned while working on the project combined with the necessary knowledge that went into architectural decisions. It presents an overview of the different infrastructure possibilities and general approaches, as well as explaining the choices that have been made for the implemented system.

From the conclusion:

The implemented architecture has been recorded processing up to 100 domains per second on a single server. At the end of the project the system gathered information about approximately 100 million domains. The collected data can be searched instantly and the automated generation of statistics is visualized in the internal web interface.

Most of your clients have more modest information demands, but the lessons here will stand you in good stead with their systems too.

July 12, 2011

MADlib goes beta!

Filed under: Data Analysis,SQL,Statistics — Patrick Durusau @ 7:08 pm

MADlib goes beta! Serious in-database analytics

From the post:

MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha, to the first beta release version, 0.20beta. Hats off to the MADlib team!

Forget your previous associations with low-tech SQL analytics, including so-called “business intelligence”, “olap”, “data cubes” and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of the code is written in SQL (a language that doesn’t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional Python glue code. The suite of methods in the beta includes:

  • standard statistical methods like multi-variate linear and logistic regressions,
  • supervised learning methods including support-vector machines, naive Bayes, and decision trees
  • unsupervised methods including k-means clustering, association rules and Latent Dirichlet Allocation
  • descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification
  • statistical support routines including an efficient sparse vector library and array operations, and conjugate gradient optimization.

Kudos to EMC:

And hats off to EMC-Greenplum for putting significant development resources behind this open-source effort. I started this project at Greenplum before they were acquired, and have been happy to see EMC embrace it and push it further.

Not every acquisition has that happy result.
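If you have not seen in-database analytics before, the flavor is that model fitting is just another SQL call over a table already in the database. A hypothetical sketch in the MADlib style (function names and output columns vary by release, so treat this as illustration rather than documentation):

  -- hypothetical training data: houses(price, bedrooms, bathrooms, sqft)

  -- fit a linear regression inside the database, in parallel,
  -- close to the data: no export to R or to a separate cluster
  SELECT madlib.linregr_train(
           'houses',                              -- source table
           'houses_linregr',                      -- output model table
           'price',                               -- dependent variable
           'ARRAY[1, bedrooms, bathrooms, sqft]'  -- independent variables
         );

  -- inspect the fitted coefficients
  SELECT coef, r2 FROM houses_linregr;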

July 10, 2011

YouTube on Oracle’s Exadata?

Filed under: BigData,Open Source,SQL — Patrick Durusau @ 3:40 pm

Big data vs. traditional databases: Can you reproduce YouTube on Oracle’s Exadata?

Review of a report by Cowen & Co. analyst Peter Goldmacher on Big Data and traditional relational database vendors. Goldmacher is quoted as saying:

We believe the vast majority of data growth is coming in the form of data sets that are not well suited for traditional relational database vendors like Oracle. Not only is the data too unstructured and/or too voluminous for a traditional RDBMS, the software and hardware costs required to crunch through these new data sets using traditional RDBMS technology are prohibitive. To capitalize on the Big Data trend, a new breed of Big Data companies has emerged, leveraging commodity hardware, open source and proprietary technology to capture and analyze these new data sets. We believe the incumbent vendors are unlikely to be a major force in the Big Data trend primarily due to pricing issues and not a lack of technical know-how.

I doubt traditional relational database vendors like Oracle are going to be sitting targets for “…a new breed of Big Data companies….”

True, the “new breed” companies come without some of the licensing costs of traditional vendors, but licensing costs are only one factor in choosing a vendor.

The administrative and auditing requirements for large government contracts, for example, are likely only to be met by large traditional vendors.

And it is the skill with which Big Data is analyzed that makes it of interest to a customer. Those are skills that traditional vendors have in depth and can bring to bear on commodity hardware and open source technology.

Oracle, for example, could slowly replace its licensing revenue stream with a data analysis revenue stream that “new breed” vendors would find hard to match.

Or to paraphrase Shakespeare:

NewBreed:

“I can analyze Big Data.”

Oracle:

“Why, so can I, or so can any man; But will it be meaningful?”

(Henry IV, Part 1, Act III, Scene 1)


BTW, ZDNet forgot to mention in its coverage of this story that Peter Goldmacher worked for Oracle Corporation early in his career. His research coverage entry reads in part:

He started his career at Oracle, working for six years in variety of departments including sales ops, consulting, marketing, and finance, and he has also worked at BMC Software as Director, Corporate Planning and Strategy. (Accessed 10 June 2011, 11:00 AM, East Coast Time)


In the interest of fairness, I should point out that after Oracle’s acquisition of Sun Microsystems, they have sponsored my work as the OpenDocument Format (ODF) editor. I don’t speak on behalf of Oracle with regard to ODF, much less its other ventures. Their sponsorship simply enables me to devote time to the ODF project.

July 4, 2011

Translating SPARQL queries into SQL using R2RML

Filed under: R2RML,SPARQL,SQL,TMQL — Patrick Durusau @ 6:04 pm

Translating SPARQL queries into SQL using R2RML

From the post:

The efficient translation of SPARQL into SQL is an active field of research in the academy and in the industry. In fact, a number of triple stores are built as a layer on top of a relational solution. Support for SPARQL in these RDF stores supposes the translation of the SPARQL query to a SQL query that can be executed in a certain relational schema.

Some foundational papers in the field include “A Relational Algebra for SPARQL” by Richard Cyganiak, which translates the semantics of SPARQL as they were finally defined by the W3C to the Relational Algebra semantics, or “Semantics preserving SPARQL-to-SQL translation” by Chebotko, Lu and Fotouhi, which introduces an algorithm to translate SPARQL queries to SQL queries.

This latter paper is especially interesting because the translation mechanism is parametric on the underlying relational schema. This makes it possible to adapt their translation mechanism to any relational database using a couple of mapping functions, alpha and beta, that map a triple pattern of the SPARQL query to a table, and a triple pattern plus a position in the triple to a column in the database.

Provided that R2RML offers a generic mechanism for the description of relational databases, in order to support SPARQL queries over any R2RML-mapped RDF graph we just need to find an algorithm that receives the R2RML mapping as input and builds the mapping functions required by the Chebotko et al. algorithm.

The most straightforward way to accomplish that is to use the R2RML mapping to generate a virtual table with a single relation containing only subject, predicate and object. The mapping for this table is trivial. A possible implementation of this algorithm can be found in the following Clojure code. (I added links to the Cyganiak and Chebotko papers.)
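To make the “single relation with only subject, predicate and object” concrete, a basic graph pattern becomes one self-join of the triples table per shared variable. A rough sketch (the real translation also handles OPTIONAL, FILTER and the rest):

  -- SPARQL:
  --   SELECT ?person ?name WHERE {
  --     ?person rdf:type  foaf:Person .
  --     ?person foaf:name ?name .
  --   }

  -- over a virtual table triples(subject, predicate, object):
  SELECT t1.subject AS person, t2.object AS name
  FROM   triples t1
  JOIN   triples t2 ON t2.subject = t1.subject
  WHERE  t1.predicate = 'rdf:type'
    AND  t1.object    = 'foaf:Person'
    AND  t2.predicate = 'foaf:name';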

I recommend this post, as well as the Cyganiak and Chebotko papers, as background reading to anyone interested in TMQL. Other suggestions?

June 20, 2011

MAD Skills: New Analysis Practices for Big Data

Filed under: Analytics,BigData,Data Integration,SQL — Patrick Durusau @ 3:33 pm

MAD Skills: New Analysis Practices for Big Data by Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton.

Abstract:

As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world’s largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.

I found this passage very telling:

These desires for speed and breadth of data raise tensions with Data Warehousing orthodoxy. Inmon describes the traditional view:

There is no point in bringing data … into the data warehouse environment without integrating it. If the data arrives at the data warehouse in an unintegrated state, it cannot be used to support a corporate view of data. And a corporate view of data is one of the essences of the architected environment [13]

Unfortunately, the challenge of perfectly integrating a new data source into an “architected” warehouse is often substantial, and can hold up access to data for months – or in many cases, forever. The architectural view introduces friction into analytics, repels data sources from the warehouse, and as a result produces shallow incomplete warehouses. It is the opposite of the MAD ideal.

Marketing question for topic maps: Do you want a shallow, incomplete data warehouse?

Admittedly there is more to it: topic maps enable the integration of both data structures and the data itself. Both are subjects in the topic maps view. Not to mention capturing the reasons why certain structures or data were mapped to other structures or data. I think the name for that is an audit trail.

Perhaps we should ask: Does your data integration methodology offer an audit trail?

(See MADlib for the source code growing out of this effort.)

June 18, 2011

VoltDB

Filed under: SQL,VoltDB — Patrick Durusau @ 5:39 pm

VoltDB

From the website:

VoltDB is a blazingly fast relational database system. It is specifically designed for modern software applications that are pushed beyond their limits by high velocity data sources. This new generation of systems – real-time feeds, machine-generated data, micro-transactions, high performance content serving – requires database throughput that can reach millions of operations per second. What’s more, the applications that use this data must be able to scale on demand, provide flawless fault tolerance and give real-time visibility into the data that drives business value.

Note that the “community” version is only for development, testing, and tuning. If you want to deploy to production, commercial licensing kicks in.

It’s encouraging to see all the innovation and development in SQL, NoSQL (mis-named, but the name has stuck), graph databases and the like. Only practical experience will decide which ones survive, but in any event data will be more accessible than ever before. Data analysis skills, not data access skills, will come to the fore.

April 16, 2011

MERGE Ahead

Filed under: Marketing,Merging,SQL — Patrick Durusau @ 2:44 pm

Merge Ahead: Introducing the DB2 for i SQL MERGE statement

Karl Hanson of IBM writes:

As any shade tree mechanic or home improvement handyman knows, you can never have too many tools. Sure, you can sometimes get by with inadequate tools on hand, but the right tools can help complete a job in a simpler, safer, and quicker way. The same is true in programming. New in DB2 for i 7.1, the MERGE statement is a handy tool to synchronize data in two tables. But as you will learn later, it can also do more. You might think of MERGE as doing the same thing you could do by writing a program, but with less work and with simpler notation.
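For anyone who has not met the statement, a generic SQL:2003-style MERGE that synchronizes a target table from a staging table looks roughly like this (DB2 for i adds its own wrinkles; see Hanson’s article for those):

  MERGE INTO customers AS tgt
  USING staging_customers AS src
     ON tgt.customer_id = src.customer_id
  WHEN MATCHED THEN
    UPDATE SET name  = src.name,
               email = src.email
  WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email)
    VALUES (src.customer_id, src.name, src.email);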

Don’t panic: it isn’t merge in the topic map sense, but it does show there are market opportunities for what is a trivial task for a topic map.

That implies to me there are also opportunities for more complex tasks, suitable only for topic maps.

March 29, 2011

Contrary to popular belief, SQL and noSQL are really just two sides of the same coin

Filed under: NoSQL,SQL — Patrick Durusau @ 12:48 pm

Contrary to popular belief, SQL and noSQL are really just two sides of the same coin

From the article:

In this article we present a mathematical data model for the most common noSQL databases—namely, key/value relationships—and demonstrate that this data model is the mathematical dual of SQL’s relational data model of foreign-/primary-key relationships. Following established mathematical nomenclature, we refer to the dual of SQL as coSQL. We also show how a single generalization of the relational algebra over sets—namely, monads and monad comprehensions—forms the basis of a common query language for both SQL and noSQL. Despite common wisdom, SQL and coSQL are not diabolically opposed, but instead deeply connected via beautiful mathematical theory.

Just as Codd’s discovery of relational algebra as a formal basis for SQL shifted the database industry from a monopolistically competitive market to an oligopoly and thus propelled a billion-dollar industry around SQL and foreign-/primary-key stores, we believe that our categorical data-model formalization model and monadic query language will allow the same economic growth to occur for coSQL key-value stores.

Considering the authors’ claim that the current SQL oligopoly is worth $32 billion and still growing in double digits, color me interested!

😉

Since they are talking about query languages, maybe the TMQL editors should take a look as well.

March 17, 2011

MySQL 5.5 Released

Filed under: MySQL,SQL — Patrick Durusau @ 6:49 pm

MySQL 5.5 Released

Performance gains for MySQL 5.5, from the release:

In recent benchmarks, the MySQL 5.5 release candidate delivered significant performance improvements compared to MySQL 5.1. Results included:

  • On Windows: Up to 1,500 percent performance gains for Read/Write operations and up to 500 percent gain for Read Only.(1)
  • On Linux: Up to 360 percent performance gain in Read/Write operations and up to 200 percent improvement in Read Only.(2)

If you are using MySQL as a backend for your topic map application, these and other improvements will be welcome news.

March 1, 2011

NoSQL Databases: Why, what and when

NoSQL Databases: Why, what and when by Lorenzo Alberton.

When I posted RDBMS in the Social Networks Age I did not anticipate returning the very next day with another slide deck from Lorenzo. But, after viewing this slide deck, I just had to post it.

It is a very good overview of NoSQL databases and their underlying principles, with useful graphics as well (as opposed to the other kind).

I am going to have to study his graphic technique in hopes of applying it to the semantic issues that are at the core of topic maps.

February 15, 2011

TSearch Primer

Filed under: PostgreSQL,SQL,TSearch — Patrick Durusau @ 2:05 pm

TSearch Primer

From the website:

TSearch is a Full-Text Search engine that is packaged with PostgreSQL. The key developers of TSearch are Oleg Bartunov and Teodor Sigaev who have also done extensive work with GiST and GIN indexes used by PostGIS, PgSphere and other projects. For more about how TSearch and OpenFTS got started check out A Brief History of FTS in PostgreSQL. Check out the TSearch Official Site if you are interested in related TSearch tips or interested in donating to this very worthy project.

Tsearch is different from regular string searching in PostgreSQL in a couple of key ways.

  1. It is well-suited for searching large blobs of text since each word is indexed using a Generalized Inverted Index (GIN) or Generalized Search Tree (GiST) and searched using text search vectors. GIN is generally used for indexing. Search vectors are at word and phrase boundaries.
  2. TSearch has a concept of Linguistic significance using various language dictionaries, ISpell, thesaurus, stop words, etc. therefore it can ignore common words and equate like meaning terms and phrases.
  3. TSearch is for the most part case insensitive.
  4. While various dictionaries and configs are available out of the box with TSearch, one can create new ones and customize existing further to cater to specific niches within industries – e.g. medicine, pharmaceuticals, physics, chemistry, biology, legal matters.

Short introduction to TSearch, which is part of PostgreSQL.

Should be of interest to topic mappers using PostgreSQL.
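As a quick taste, a minimal TSearch setup looks roughly like this (PostgreSQL, English configuration; table and column names are mine):

  -- hypothetical documents table
  CREATE TABLE docs (
    id   SERIAL PRIMARY KEY,
    body TEXT
  );

  -- GIN index over the text search vector
  CREATE INDEX docs_body_fts ON docs
    USING gin (to_tsvector('english', body));

  -- 'databases' matches 'database' thanks to stemming,
  -- and stop words like 'the' are ignored
  SELECT id
  FROM   docs
  WHERE  to_tsvector('english', body) @@ to_tsquery('english', 'databases & merging');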

January 26, 2011

SQLShell. A Cross-Database SQL Tool With NoSQL Potential

Filed under: NoSQL,SQL — Patrick Durusau @ 7:20 am

SQLShell. A Cross-Database SQL Tool With NoSQL Potential

From the website:

In this blog post I will introduce SQLShell and demonstrate, step-by-step, how to install it and start using it with MySQL. I will also reflect on the possibilities of using this with NoSQL technologies, such as HBase, MongoDB, Hive, CouchDB, Redis and Google BigQuery.

SQLShell is a cross-platform, cross-database command-line tool for SQL, much like psql for PostgreSQL or the mysql command-line tool for MySQL.

The author discovers that JDBC drivers have not yet developed to the point where a common interface can be demonstrated.

It is only a matter of time until they improve, and tools such as SQLShell will be important for data exploration and harvesting.

Enterprise NoSQL: Silver Bullet or Poison Pill? – (Unique Questions?)

Filed under: NoSQL,SQL — Patrick Durusau @ 7:03 am

Enterprise NoSQL: Silver Bullet or Poison Pill? a presentation by Billy Newport (IBM).

Very informative comparison between SQL and NoSQL mindsets and what considerations lead to one or the other.

The “ah-ha” point in the presentation was Newport saying that for NoSQL, one has to ask: what question do you want answered?

I am not entirely convinced by Newport’s argument that SQL supports arbitrary queries and that NoSQL design of necessity supports only a single query robustly.

Granted, there are design choices that can paint a NoSQL designer into a corner, but I don’t think it is fair to assume all NoSQL designers will make the same mistakes.

Or even that all NoSQL solutions suffer from such limitations.

I don’t know of anything inherently query limiting about a graph database or even a hypergraph database architecture.

If you quickly point out sharding, and how it drives design toward answering a particular question, my response is: And your question is?

How many arbitrary questions do you think there are for any given data set?

That would be an interesting research question.

How many unique questions (not queries) are asked of the average data set?

That is: unique queries != unique questions.

Application designers can design queries to match their application logic but that isn’t the same thing as a unique question.

Is that Newport’s concern (or at least part of it)? That NoSQL may put limits on the design of application logic? That could be good or bad.

January 25, 2011

Translate SQL to MongoDB MapReduce

Filed under: MapReduce,MongoDB,SQL — Patrick Durusau @ 10:18 am

Translate SQL to MongoDB MapReduce

There is a growing sense that SQL vs. MapReduce or NoSQL is a question of fitness of the tool for the purpose at hand.

If your problem is best solved by commodity hardware working in parallel, then NoSQL solutions may be the path to take.

I have seen that expressed in a number of ways but not with a lot of detail on what factors drive the choice one way or the other.

With enough detail, that could make both a very good guide and topic map for those faced with this sort of issue.
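The canonical worked example in these translations is aggregation: a SQL GROUP BY splits into a map function that emits the grouping key and a reduce function that folds the emitted values. Staying on the SQL side (the MongoDB half is JavaScript; see the linked post):

  -- the relational formulation
  SELECT customer_id, SUM(total) AS spent
  FROM   orders
  GROUP  BY customer_id;

  -- the MapReduce shape of the same question, in outline:
  --   map:    for each order row, emit(customer_id, total)
  --   reduce: for each customer_id, sum the emitted totals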

First noticed on Alex Popescu’s myNoSQL blog.

October 16, 2010

Proceedings of the Very Large Database Endowment Inc.

Filed under: Data Mining,Searching,SQL — Patrick Durusau @ 7:11 am

Proceedings of the Very Large Database Endowment Inc.

A resource made available by the Very Large Database Endowment Inc., which also publishes The VLDB Journal.

With titles like “Scalable multi-query optimization for exploratory queries over federated scientific databases” (http://www.vldb.org/pvldb/1/1453864.pdf, if you are interested), the interest factor for topic mappers is obvious.

Questions:

  1. What library journals do you scan every week/month? What subject areas?
  2. What CS journals do you scan every week/month? What subject areas?
  3. Pick two different subject areas to follow for the next two months.
  4. What reading strategies did you use for the additional materials?
  5. What did you see/learn that you would have otherwise missed?

PS: Turnabout is fair play. The class can decide on two subject areas with up to 5 journals (total) that I should be following.

October 12, 2010

A Framework for SQL-Based Mining of Large Graphs on Relational Databases

Filed under: Graphs,SQL — Patrick Durusau @ 6:23 am

A Framework for SQL-Based Mining of Large Graphs on Relational Databases by Sriganesh Srihari, Shruti Chandrashekar, and Srinivasan Parthasarathy.

Keywords: Graph mining, SQL-based approach, Relational databases

Abstract:

We design and develop an SQL-based approach for querying and mining large graphs within a relational database management system (RDBMS). We propose a simple lightweight framework to integrate graph applications with the RDBMS through a tightly-coupled network layer, thereby leveraging efficient features of modern databases. Comparisons with straight-up main memory implementations of two kernels – breadth-first search and quasi clique detection – reveal that SQL implementations offer an attractive option in terms of productivity and performance.

Something for those with SQL backends for topic maps.

Implemented using PL/SQL, so it isn’t clear how much work it would take to port this framework to MySQL or Postgres.

If your topic map won’t fit into memory, might be worth a look.
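For a taste of what “graphs in SQL” looks like, breadth-first reachability over an edge table can be sketched with a recursive common table expression (my sketch, not the paper’s PL/SQL framework; PostgreSQL syntax):

  -- edges(src, dst): one row per directed edge
  WITH RECURSIVE reachable (node, depth) AS (
    SELECT 1, 0                     -- start the search at node 1
    UNION
    SELECT e.dst, r.depth + 1
    FROM   reachable r
    JOIN   edges e ON e.src = r.node
    WHERE  r.depth < 5              -- cap the depth so cycles terminate
  )
  SELECT node, MIN(depth) AS hops
  FROM   reachable
  GROUP  BY node;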

September 5, 2010

Experience in Extending Query Engine for Continuous Analytics

Filed under: Data Integration,Data Mining,SQL,TMQL,Uncategorized — Patrick Durusau @ 4:37 pm

Experience in Extending Query Engine for Continuous Analytics by Qiming Chen and Meichun Hsu has this problem statement:

Streaming analytics is a data-intensive computation chain from event streams to analysis results. In response to the rapidly growing data volume and the increasing need for lower latency, Data Stream Management Systems (DSMSs) provide a paradigm shift from the load-first analyze-later mode of data warehousing….

Moving from load-first analyze-later has implications for topic maps over data warehouses. Particularly when events that are subjects may only have a transient existence in a data stream.

This is on my reading list to prepare to discuss TMQL in Leipzig.

PS: Only five days left to register for TMRA 2010. It is a don’t miss event.

July 11, 2010

UCI ISG Lecture Series on Scalable Data Management

Filed under: Information Retrieval,MapReduce,Searching,Semantics,SQL — Patrick Durusau @ 5:39 am

UCI ISG Lecture Series on Scalable Data Management is simply awesome! Among the slides and videos you will find:

  • Teradata Past, Present and Future Todd Walter, CTO, R&D, Teradata
  • Hadoop: Origins and Applications Chris Smith, Xavier Stevens and John Carnahan, FOX Audience Network
  • Pig: Building High-Level Dataflows over Map-Reduce Utkarsh Srivastava, Senior Research Scientist, Yahoo!
  • Database Scalability and Indexes Goetz Graefe, HP Fellow, Hewlett-Packard Laboratories
  • Cloud Data Serving: Key-Value Stores to DBMSs Raghu Ramakrishnan, Chief Scientist for Audience & Cloud Computing, Yahoo!
  • Scalable Data Management at Facebook Srinivas Narayanan, Software Engineer, Facebook
  • SCOPE: Parallel Data Processing of Massive Data Sets Jingren Zhou, Researcher, Microsoft
  • What We Got Right, What We Got Wrong: The Lessons I Learned Building a Large-Scale DBMS for XML. Mary Holstege, Principal Engineer, Mark Logic
  • Scalable Data Management with DB2 Matthias Nicola, DB2 pureXML Architect, IBM
  • SQL Server: A Data Platform for Large-Scale Applications José Blakeley, Partner Architect, Microsoft
  • Data in the Cloud: New Challenges or More of the Same? Divy Agrawal, Professor of Computer Science, UC Santa Barbara

Subject identity is as important in the realm of big data/table/etc. as it is anywhere.

It is our choice whether topic maps are going to step up to the challenge.

That is going to require reaching out and across communities and becoming pro-active with regard to new opportunities and possibilities.

This resource was brought to my notice by Jack Park. Jack delights in sending these highly relevant and often quite large resource listings my way (and to be honest, I return the favor).

April 20, 2010

Lossy Mapping/Modeling

Filed under: Mapping,SQL — Patrick Durusau @ 6:42 pm

As I mentioned in Maps and Territories, relational database theory excludes SQL schemas from the items that can be modeled/mapped by a relational database.

All maps are lossy, but I think we can distinguish between types of loss.

Some losses are voluntary, in the sense that we choose, due to lack of interest, funding, fitness for use, or some other reason, to exclude some things from a map.

We could, in a library catalog, which is a map of the library’s holdings, add the number of words on each page of each item to that map. Or not. But that would be a voluntary choice on our part.

The exclusion of SQL schemas from the mappings possible within the relational database paradigm strikes me as a different type of loss. That is an involuntary loss, one mandated by the paradigm.

It simply isn’t possible to model an SQL schema in the relational paradigm. Those subjects, the subjects of the schema, are simply off limits to everyone writing an SQL schema.

I mention that because with topic maps, all the losses are voluntary. At least in the sense that the paradigm does not mandate the exclusion of any subjects, although particular legends may.

I think it would be helpful to have a table listing model/mapping systems and what, if anything, they exclude from modeling/mapping.

Suggestions?

April 18, 2010

An SQL Example for Michael

Filed under: SQL,TMDM,Topic Maps — Patrick Durusau @ 6:24 pm

Marijane White pointed out the following comment from Michael Sperberg-McQueen asking how topic maps differ from SQL:

The biggest set of open questions remains: how does modeling a collection of information with Topic Maps differ from modeling it using some other approach? Are there things we can do with Topic Maps that we can’t do, or cannot do as easily, with a SQL database? With a fact base in Prolog? With colloquial XML? It might be enlightening to see what the Italian Opera topic map might look like, if we designed a bespoke XML vocabulary for it, or if we poured it into SQL. (I have friends who tell me that SQL is really not suited for the kinds of things you can do with Topic Maps, but so far I haven’t understood what they mean; perhaps a concrete example will make it easier to compare the two.)

From http://cmsmcq.com/mib/?p=810

An SQL example:

firstName | lastName
----------+---------
Patrick   | Durusau

And elsewhere:

givenName | surName
----------+--------
Patrick   | Durusau

An interface could issue separate queries and return a consolidated result.
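Spelled out, the interface might do something like this (hypothetical table names):

  -- source one
  SELECT firstName, lastName FROM contacts;

  -- source two
  SELECT givenName, surName FROM people;

  -- the "consolidated" result, resting on nothing more than the
  -- interface author's hunch that the columns mean the same thing
  SELECT firstName AS name1, lastName AS name2 FROM contacts
  UNION
  SELECT givenName, surName FROM people;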

Does that equal a topic map? My answer is NO!

The questions that SQL doesn’t answer (topic maps do):

  • On what basis do we map? There are no explicit properties of those subjects on which to base a mapping.
  • What rules should we follow? There are no explicit rules, even assuming there were properties for these subjects.

Contrast that with (topics in CTM syntax):


http://en.wikipedia.org/wiki/First_name
- "firstName" .


http://en.wikipedia.org/wiki/First_name
- "givenName" .

The Topic Maps Data Model (TMDM) defines the subject identifier property (the URL string you see) and specifies that when subject identifier properties are equal, the topics merge.

Different situation from the SQL example.

First, we have a defined property that anyone can look at, both to judge the merging (are these really the same two subjects?) and to decide whether they want to merge their own subject representatives with these.

Second, we have a rule by which the mapping/merging occurs. We are no longer relying on a blind mapping between the two subject representatives.

Topic maps are a threefold trick: 1) no second-class subjects, 2) explicit properties for identification, 3) explicit rules for when subject representatives are considered to represent the same subject.

Apologies for the length of this post! But, Michael wanted an example.

Questions?

(I will answer Michael’s questions about XML and Prolog separately.)

