Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 8, 2012

Induction

Filed under: Database,Query Language,Visualization — Patrick Durusau @ 8:49 pm

Induction: A Polyglot Database Client for Mac OS X

From the post:

Explore, Query, Visualize

Focus on the data, not the database. Induction is a new kind of tool designed for understanding and communicating relationships in data. Explore rows and columns, query to get exactly what you want, and visualize that data in powerful ways.

SQL? NoSQL? It Don’t Matter

Data is just data, after all. Induction supports PostgreSQL, MySQL, SQLite, Redis, and MongoDB out-of-the-box, and has an extensible architecture that makes it easy to write adapters for anything else you can think of. CouchDB? Oracle? Facebook Graph? Excel? Make it so!

Some commercial advice for the Induction crew:

Sounds great!

Be aware that Excel controls 75% of the BI market. I don’t know the numbers for Oracle products generally, but suspect “enterprise” and “Oracle” are most often heard together. I would make those “out of the box” even before 1.0.

If this is visualization, integration can’t be far behind.

March 7, 2012

Datomic

Filed under: Data,Database,Datomic — Patrick Durusau @ 5:43 pm

Alex Popescu (myNoSQL) has a couple of posts on resources for Datomic.

Intro Videos to Datomic and Datomic Datalog

and,

Datomic: Distributed Database Designed to Enable Scalable, Flexible and Intelligent Applications, Running on Next-Generation Cloud Architectures

I commend all the materials you will find there, the white paper in particular, which has the following section:

ATOMIC DATA – THE DATOM

Once you are storing facts, it becomes imperative to choose an appropriate granularity for facts. If you want to record the fact that Sally likes pizza, how best to do so? Most databases require you to update either the Sally record or document, or the set of foods liked by Sally, or the set of likers of pizza. These kind of representational issues complicate and rigidify applications using relational and document models. This can be avoided by recording facts as independent atoms of information. Datomic calls such atomic facts ‘datoms‘. A datom consists of an entity, attribute, value and transaction (time). In this way, any of those sets can be discovered via query, without embedding them into a structural storage model that must be known by applications.

In some views of granularity, the datom “atom” looks like a four-atom molecule to me. 😉 Not to mention that entities/attributes and values can have relationships that don’t involve each other.
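
To see the datom idea in miniature, here is a sketch in Python (mine, not Datomic's API or data structures): facts as entity/attribute/value/transaction tuples, with any of the "sets" from the quoted section recovered by query rather than baked into the storage model.

```python
from collections import namedtuple

# One atomic fact, with the transaction (time) component included.
Datom = namedtuple("Datom", ["entity", "attribute", "value", "tx"])

facts = [
    Datom("sally", "likes", "pizza", tx=1001),
    Datom("sally", "likes", "sushi", tx=1002),
    Datom("bob",   "likes", "pizza", tx=1003),
]

# The "set of foods liked by Sally" and the "set of likers of pizza"
# are both discovered by query, not maintained as stored structures.
likes_of_sally = {d.value for d in facts if d.entity == "sally" and d.attribute == "likes"}
likers_of_pizza = {d.entity for d in facts if d.attribute == "likes" and d.value == "pizza"}

print(likes_of_sally)   # {'pizza', 'sushi'}
print(likers_of_pizza)  # {'sally', 'bob'}
```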

March 5, 2012

Trees in the Database: Advanced Data Structures

Filed under: Data Structures,Database,PostgreSQL,RDBMS,SQL,Trees — Patrick Durusau @ 7:52 pm

Trees in the Database: Advanced Data Structures

Lorenzo Alberton writes:

Despite the NoSQL movement trying to flag traditional databases as a dying breed, the RDBMS keeps evolving and adding new powerful weapons to its arsenal. In this talk we’ll explore Common Table Expressions (SQL-99) and how SQL handles recursion, breaking the bi-dimensional barriers and paving the way to more complex data structures like trees and graphs, and how we can replicate features from social networks and recommendation systems. We’ll also have a look at window functions (SQL:2003) and the advanced reporting features they make finally possible. The first part of this talk will cover several different techniques to model a tree data structure into a relational database: parent-child (adjacency list) model, materialized path, nested sets, nested intervals, hybrid models, Common Table Expressions. Then we’ll move one step forward and see how we can model a more complex data structure, i.e. a graph, with concrete examples from today’s websites. Starting from real-world examples of social networks’ and recommendation systems’ features, and with the help of some graph theory, this talk will explain how to represent and traverse a graph in the database. Finally, we will take a look at Window Functions and how they can be useful for data analytics and simple inline aggregations, among other things. All the examples have been tested on PostgreSQL >= 8.4.

Very impressive presentation!

Definitely makes me want to dust off my SQL installations and manuals for a closer look!
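
If you want to try the adjacency-list plus recursive CTE combination without standing up PostgreSQL, here is a minimal sketch using Python's built-in sqlite3 module (SQLite also supports WITH RECURSIVE; the table, columns and data are mine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO category VALUES
        (1, NULL, 'root'),
        (2, 1,    'databases'),
        (3, 2,    'relational'),
        (4, 2,    'graph'),
        (5, 3,    'postgresql');
""")

# Adjacency-list model: each row points at its parent.
# The recursive CTE walks the tree from the root, carrying the depth along.
rows = conn.execute("""
    WITH RECURSIVE subtree(id, name, depth) AS (
        SELECT id, name, 0 FROM category WHERE parent_id IS NULL
        UNION ALL
        SELECT c.id, c.name, s.depth + 1
        FROM category c JOIN subtree s ON c.parent_id = s.id
    )
    SELECT depth, name FROM subtree ORDER BY depth
""").fetchall()

for depth, name in rows:
    print("  " * depth + name)
```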

March 4, 2012

Graphs in the database: SQL meets social networks

Filed under: Database,Graphs,Social Networks,SQL — Patrick Durusau @ 7:17 pm

Graphs in the database: SQL meets social networks by Lorenzo Alberton.

If you are interested in graphs, SQL databases, Common Table Expressions (CTEs), together or in any combination, this is the article for you!

Lorenzo walks the reader through the basics of graphs with an emphasis on understanding how SQL techniques can be successfully used, depending upon your requirements.

From the post:

Graphs are ubiquitous. Social or P2P networks, thesauri, route planning systems, recommendation systems, collaborative filtering, even the World Wide Web itself is ultimately a graph! Given their importance, it’s surely worth spending some time in studying some algorithms and models to represent and work with them effectively. In this short article, we’re going to see how we can store a graph in a DBMS. Given how much attention my talk about storing a tree data structure in the db received, it’s probably going to be interesting to many. Unfortunately, the Tree models/techniques do not apply to generic graphs, so let’s discover how we can deal with them.
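
As a small taste of the approach, here is a sketch (mine, not Lorenzo's code) of an edge table and a friend-of-friend query done with plain self-joins, using SQLite from Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friendship (user_id TEXT, friend_id TEXT);
    INSERT INTO friendship VALUES
        ('alice', 'bob'),   ('bob', 'alice'),
        ('bob', 'carol'),   ('carol', 'bob'),
        ('carol', 'dave'),  ('dave', 'carol');
""")

# Friend-of-friend suggestions for alice: two hops out, excluding alice herself
# and anyone she is already directly connected to.
suggestions = conn.execute("""
    SELECT DISTINCT f2.friend_id
    FROM friendship f1
    JOIN friendship f2 ON f2.user_id = f1.friend_id
    WHERE f1.user_id = 'alice'
      AND f2.friend_id <> 'alice'
      AND f2.friend_id NOT IN (
          SELECT friend_id FROM friendship WHERE user_id = 'alice')
""").fetchall()

print(suggestions)  # [('carol',)]
```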

February 26, 2012

NuoDB

Filed under: Database,NuoDB — Patrick Durusau @ 8:30 pm

NuoDB

NuoDB is in private beta but the homepage describes it as:

NuoDB is a NewSQL database. It looks and behaves like a traditional SQL database from the outside but under the covers it’s a revolutionary database solution. It is a new class of database for a new class of datacenter.

A technical webinar dated 14 December 2011 had a couple of points, at slide 5, that puzzled me.

I need to check some references on some of them, but the claim:

Zero DBA: No backups, minimal performance tuning, automated everything

seems a bit over the top.

Would anyone involved in the private beta care to comment on that claim?

February 8, 2012

Extending Data Beyond the Database – The Notion of “State”

Filed under: Data,Database — Patrick Durusau @ 5:12 pm

Extending Data Beyond the Database – The Notion of “State” by David Loshin

From the post:

In my last post, I essentially suggested that there is a difference between merging two static data sets and merging static data sets with dynamic ones. It is worth providing a more concrete example to demonstrate what I really mean by this idea: let’s say you had a single analytical database containing customer profile information (we’ll call this data set “Profiles”), but at the same time had access to a stream of web page transactions performed by individuals identified as customers (we can refer to this one as “WebStream”).

The challenge is that the WebStream data set may contain information with different degrees of believability. If an event can be verified as the result of a sequence of web transactions within a limited time frame, the resulting data should lead to an update of the Profiles data set. On the other hand, if the sequence does not take place, or takes place over an extended time frame, there is not enough “support” for the update and therefore the potential modification is dropped. For example, if a visitor places a set of items into a shopping cart and completes a purchase, the customer’s preferences are updated based on the items selected and purchased. But if the cart is abandoned and not picked up within 2 hours, the customer’s preferences may not be updated.

Because the update is conditional on a number of different variables, the system must hold onto some data until it can be determined whether the preferences are updated or not. We can refer to this as maintaining some temporary state that either resolves into a modification to the Profiles data set or is thrown out after 2 hours.

Are your data sets static or dynamic? And if dynamic, how do you delay merging until some other criterion is met?
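
Here is a minimal sketch of the "hold temporary state until it resolves" pattern David describes. The shopping cart and the two-hour window come from his example; the code and names are mine.

```python
import time

PROFILES = {"cust-42": {"preferences": set()}}
PENDING_CARTS = {}              # customer_id -> (items, started_at)
EXPIRY_SECONDS = 2 * 60 * 60    # the two-hour window from the example

def add_to_cart(customer_id, items):
    # Temporary state: not yet believable enough to touch the Profiles data set.
    PENDING_CARTS[customer_id] = (set(items), time.time())

def complete_purchase(customer_id):
    # The supporting event arrived: fold the pending items into the profile.
    items, _ = PENDING_CARTS.pop(customer_id, (set(), None))
    PROFILES[customer_id]["preferences"] |= items

def expire_abandoned_carts(now=None):
    # No supporting event within the window: drop the potential modification.
    now = now or time.time()
    expired = [c for c, (_, started) in PENDING_CARTS.items()
               if now - started > EXPIRY_SECONDS]
    for customer_id in expired:
        del PENDING_CARTS[customer_id]

add_to_cart("cust-42", ["espresso machine", "grinder"])
complete_purchase("cust-42")
print(PROFILES["cust-42"]["preferences"])
```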

The first article David refers to is: Data Quality and State.

Interesting that as soon as we step away from static files and data, the world explodes in complexity. Add to that dynamic notions of identity and recognition and complexity seems like an inadequate term for what we face.

Be mindful those are just slices of what people automatically process all day long. Fix your requirements and build to spec. Leave the “real world” to wetware.

January 31, 2012

SAP MaxDB Downloads

Filed under: Database,SAP MaxDB — Patrick Durusau @ 4:33 pm

SAP MaxDB Downloads

A post I was reading mentioned that the download area for this database wasn’t easy to find. Indeed not!

But that, plus a page that says:

SAP MaxDB is the database management system developed and supported by SAP AG. It can be used as an alternative to databases from other vendors for your own or third-party applications.

Note: These installation packages can be used free of charge according to the SAP Community License Agreement for SAP MaxDB. These packages are not for use with SAP applications. For that purpose, refer to the Download Area in SAP Service Marketplace (login required). (emphasis added)

made it worthy of mention.

I suspect that use with “SAP applications” brings additional features or tighter integration. An interesting approach to the community versus commercial edition practice.

Interested in hearing your experiences with using the SAP MaxDB with topic maps.

January 30, 2012

Ålenkå

Filed under: Benchmarks,Database,GPU — Patrick Durusau @ 8:02 pm

Ålenkå

If you don’t mind alpha code, ålenkå was pointed out in the bitmap posting I cited earlier today.

From its homepage:

Alenka is a modern analytical database engine written to take advantage of vector based processing and high bandwidth of modern GPUs.

Features include:

Vector-based processing
CUDA programming model allows a single operation to be applied to an entire set of data at once.

Self optimizing compression
Ultra fast compression and decompression performed directly inside GPU

Column-based storage
Minimize disk I/O by only accessing the relevant data

Fast database loads
Data load times measured in minutes, not in hours.

Open source and free
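
The column-store ideas in that feature list are easy to demonstrate even without a GPU. A rough CPU-side sketch (mine, not Alenka code) of why column-based storage plus vector-style processing touches less data for an analytical query, using NumPy:

```python
import numpy as np

n = 1_000_000

# Column-based layout: one contiguous array per column.
columns = {
    "price":    np.random.rand(n).astype(np.float32),
    "quantity": np.random.randint(1, 10, n).astype(np.int32),
    "region":   np.random.randint(0, 5, n).astype(np.int8),
}

# "Total revenue in region 3" only touches the three relevant columns,
# and each operation applies to a whole vector at once -- the same
# vector-processing style a GPU engine exploits at larger scale.
mask = columns["region"] == 3
revenue = float((columns["price"][mask] * columns["quantity"][mask]).sum())
print(revenue)
```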

Apologies for the name spelling differences, Ålenkå versus Alenka. I suspect it has something to do with character support in whatever produced the readme file, but can’t say for sure.

The benchmarks (there is that term again) are impressive.

Would semantic benchmarks be different from the ones used in IR currently? Different from precision and recall? What about range (same subject but identified differently) or accuracy (different identifications but same subject, how many false positives)?

January 23, 2012

Percona Live DC 2012 Slides (MySQL)

Filed under: Database,MySQL — Patrick Durusau @ 7:44 pm

Percona Live DC 2012 Slides

I put the (MySQL) header for the benefit of hard core TM fans who can’t be bothered with MySQL posts. 😉

I won’t say what database system I originally learned databases on but I must admit that I became enchanted with MySQL years later.

For a large number of applications, including TM backends, MySQL is entirely appropriate.

Sure, when your company goes interplanetary you are going to need a bigger solution.

But in the mean time, get a solution that isn’t larger than the problem you are trying to solve.

BTW, MySQL installations have the same mapping for BI issues I noted in an earlier post today.

Thoughts on how you would fashion a generic solution that does not require conversion of data?

January 22, 2012

Combining Heterogeneous Classifiers for Relational Databases (Of Relational Prisons and such)

Filed under: Business Intelligence,Classifier,Database,Schema — Patrick Durusau @ 7:39 pm

Combining Heterogeneous Classifiers for Relational Databases by Geetha Manjunatha, M Narasimha Murty and Dinkar Sitaram.

Abstract:

Most enterprise data is distributed in multiple relational databases with expert-designed schema. Using traditional single-table machine learning techniques over such data not only incur a computational penalty for converting to a ‘flat’ form (mega-join), even the human-specified semantic information present in the relations is lost. In this paper, we present a practical, two-phase hierarchical meta-classification algorithm for relational databases with a semantic divide and conquer approach. We propose a recursive, prediction aggregation technique over heterogeneous classifiers applied on individual database tables. The proposed algorithm was evaluated on three diverse datasets, namely TPCH, PKDD and UCI benchmarks and showed considerable reduction in classification time without any loss of prediction accuracy.

When I read:

So, a typical enterprise dataset resides in such expert-designed multiple relational database tables. On the other hand, as known, most traditional classification algorithms still assume that the input dataset is available in a single table – a flat representation of data attributes. So, for applying these state-of-art single-table data mining techniques to enterprise data, one needs to convert the distributed relational data into a flat form.

a couple of things dropped into place.

First, the problem being described, the production of a flat form for analysis, reminds me of the problem of record linkage in the late 1950s (predating relational databases). There, records were regularized to enable very similar analysis.

Second, as the authors state in a paragraph or so, conversion to such a format is not possible in most cases. Interesting that the choice of relational database table design has the impact of limiting the type of analysis that can be performed on the data.

Therefore, knowledge mining over real enterprise data using machine learning techniques is very valuable for what is called an intelligent enterprise. However, application of state-of-art pattern recognition techniques in the mainstream BI has not yet taken off [Gartner report] due to lack of in-memory analytics among others. The key hurdle to make this possible is the incompatibility between the input data formats used by most machine learning techniques and the formats used by real enterprises.

If freeing data from its relational prison is a key aspect to empowering business intelligence (BI), what would you suggest as a solution?
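
To make the per-table idea concrete, here is a toy sketch of training one classifier per table and averaging their predictions. It is a crude aggregation of my own devising on synthetic data, not the authors' two-phase hierarchical meta-classification algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)   # shared class label per entity

# Two "tables" describing the same entities, each with its own attributes --
# no mega-join into a single flat table.
table_a = np.column_stack([y + rng.normal(0, 0.8, n), rng.normal(0, 1, n)])
table_b = np.column_stack([y + rng.normal(0, 1.2, n)])

# Heterogeneous classifiers, one per table.
models = [DecisionTreeClassifier(max_depth=3).fit(table_a, y),
          LogisticRegression().fit(table_b, y)]
tables = [table_a, table_b]

# Crude aggregation: average the per-table class probabilities.
probs = np.mean([m.predict_proba(t) for m, t in zip(models, tables)], axis=0)
prediction = probs.argmax(axis=1)
print("training accuracy:", (prediction == y).mean())
```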

January 16, 2012

Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)

Filed under: Database,Knowledge Discovery,Machine Learning — Patrick Durusau @ 2:43 pm

The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) will take place in Bristol, UK from September 24th to 28th, 2012.

Dates:

Abstract submission deadline: Thu 19 April 2012
Paper submission deadline: Mon 23 April 2012
Early author notification: Mon 28 May 2012
Author notification: Fri 15 June 2012
Camera ready submission: Fri 29 June 2012
Conference: Mon – Fri, 24-28 September, 2012.

From the call for papers:

The European Conference on “Machine Learning” and “Principles and Practice of Knowledge Discovery in Databases” (ECML-PKDD) provides an international forum for the discussion of the latest high-quality research results in all areas related to machine learning and knowledge discovery in databases and other innovative application domains.

Submissions are invited on all aspects of machine learning, knowledge discovery and data mining, including real-world applications.

The overriding criteria for acceptance will be a paper’s:

  • potential to inspire the research community by introducing new and relevant problems, concepts, solution strategies, and ideas;
  • contribution to solving a problem widely recognized as both challenging and important;
  • capability to address a novel area of impact of machine learning and data mining.

Other criteria are scientific rigour and correctness, challenges overcome, quality and reproducibility of the experiments, and presentation.

I rather like that: quality and reproducibility of the experiments.

As opposed to the “just believe in the power of ….” and you will get all manner of benefits. But no one can produce data to prove those claims.

Reminds me of the astronomer in Samuel Johnson’s Rasselas who claimed:

I have possessed for five years the regulation of the weather and the distribution of the seasons. The sun has listened to my dictates, and passed from tropic to tropic by my direction; the clouds at my call have poured their waters, and the Nile has overflowed at my command. I have restrained the rage of the dog-star, and mitigated the fervours of the crab. The winds alone, of all the elemental powers, have hitherto refused my authority, and multitudes have perished by equinoctial tempests which I found myself unable to prohibit or restrain. I have administered this great office with exact justice, and made to the different nations of the earth an impartial dividend of rain and sunshine. What must have been the misery of half the globe if I had limited the clouds to particular regions, or confined the sun to either side of the equator?’”

And when asked how he knew this to be true, replied:

“‘Because,’ said he, ‘I cannot prove it by any external evidence; and I know too well the laws of demonstration to think that my conviction ought to influence another, who cannot, like me, be conscious of its force. I therefore shall not attempt to gain credit by disputation. It is sufficient that I feel this power that I have long possessed, and every day exerted it. But the life of man is short; the infirmities of age increase upon me, and the time will soon come when the regulator of the year must mingle with the dust. The care of appointing a successor has long disturbed me; the night and the day have been spent in comparisons of all the characters which have come to my knowledge, and I have yet found none so worthy as thyself.’” (emphasis added)

Project Gutenberg has a copy online: Rasselas, Prince of Abyssinia, by Samuel Johnson.

For my part, I think semantic integration has been, is and will be hard, not to mention expensive.

Determining your ROI is just as necessary for a semantic integration project, whatever technology you choose, as for any other project.

December 26, 2011

Beyond Relational

Filed under: Database,MySQL,Oracle,PostgreSQL,SQL,SQL Server — Patrick Durusau @ 8:19 pm

Beyond Relational

I originally arrived at this site because of a blog hosted there with lessons on Oracle 10g. Exploring a bit I decided to post about it.

Seems to have fairly broad coverage, from Oracle and PostgreSQL to TSQL and XQuery.

Likely to be a good site for learning cross-overs between systems that you can map for later use.

Suggestions of similar sites?

December 24, 2011

IndexTank is now open source!

Filed under: Database,IndexTank,NoSQL — Patrick Durusau @ 4:43 pm

IndexTank is now open source! by Diego Basch, Director of Engineering, LinkedIn.

From the post:

We are proud to announce that the technology behind IndexTank has just been released as open-source software under the Apache 2.0 License! We promised to do this when LinkedIn acquired IndexTank, so here we go:

indextank-engine: Indexing engine

indextank-service: API, BackOffice, Storefront, and Nebulizer

We know that many of our users and other interested parties have been patiently waiting for this release. We want to thank you for your patience, for your kind emails, and for your continued support. We are looking forward to seeing IndexTank thrive as an open-source project. Of course we’ll do our part; our team is hard at work building search infrastructure at LinkedIn. We are part of a larger team that has built and released search technologies such as Zoie, Bobo, and just this past Monday, Cleo. We are excited to add IndexTank to this array of powerful open source tools.

From the indextank.com homepage:

PROVEN FULL-TEXT SEARCH API

  • Truly real-time: instant updates without reindexing
  • Geo & Social aware: use location, votes, ratings or comments
  • Works with Ruby, Rails, Python, Java, PHP, .NET & more!

CUSTOM SEARCH THAT YOU CONTROL

  • You control how to sort and score results
  • “Fuzzy”, Autocomplete, Facets for how users really search
  • Highlights & Snippets quickly shows search results relevance

EASY, FAST & HOSTED

  • Scalable from a personal blog to hundreds of millions of documents! (try Reddit)
  • Free up to 100K documents
  • Easier than SQL, SOLR/Lucene & Sphinx.

If you are looking for documentation, rather than the GitHub repositories, you had best look here.

So far, I haven’t seen anything out of the ordinary for a search engine. I mention it in case some people prefer it over others.

Do you see anything out of the ordinary?

Topic Maps & Oracle: A Smoking Gun

Filed under: Database,Oracle,SQL — Patrick Durusau @ 4:42 pm

Using Similarity-based Operations for Resolving Data-Level Conflicts (2003)

Abstract:

Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both during eliminating duplicates from semantic overlapping sources as well as during combining complementary data from different sources. Though using SQL operations like grouping and join seems to be a viable way, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present in this paper similarity-based variants of grouping and join operators. The extended grouping operator produces groups of similar tuples, the extended join combines tuples satisfying a given similarity condition. We describe the semantics of these operators, discuss efficient implementations for the edit distance similarity and present evaluation results. Finally, we give examples how the operators can be used in given application scenarios.

No, the title of the post is not a mistake.

The authors of this paper, in 2003, conclude:

In this paper we presented database operators for finding related data and identifying duplicates based on user-specific similarity criteria. The main application area of our work is the integration of heterogeneous data where the likelihood of occurrence of data objects representing related or the same real-world objects though containing discrepant values is rather high. Intended as an extended grouping operation and by combining it with aggregation functions for merging/reconciling groups of conflicting values our grouping operator fits well into the relational algebra framework and the SQL query processing model. In a similar way, an extended join operator takes similarity predicates used for both operators into consideration. These operators can be utilized in ad-hoc queries as part of more complex data integration and cleaning tasks.
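
To get a feel for what a similarity-based grouping operator does, here is a rough Python sketch (mine, not the authors' Oracle implementation) that groups tuples whose values fall within an edit-distance-style similarity threshold of one another:

```python
from difflib import SequenceMatcher

rows = ["Jon Smith", "John Smith", "Jon Smyth", "Patricia Jones", "Patricia Janes"]

def similar(a, b, threshold=0.85):
    # difflib's ratio stands in for a normalized edit-distance predicate.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

groups = []   # each inner list is one "similarity group"
for row in rows:
    for group in groups:
        if any(similar(row, member) for member in group):
            group.append(row)
            break
    else:
        groups.append([row])

for group in groups:
    print(group)
```

A real operator would then apply an aggregation or reconciliation function to each group, which is where the paper's merging of conflicting values comes in.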

In addition to a theoretical background, the authors illustrate an implementation of their techniques, using Oracle 8i. (Oracle 11g is the current version.)

Don’t despair! 😉

Leaves a lot to be done, including:

  • Interchange between relational database stores
  • Semantic integration in non-relational database stores
  • Interchange in mixed relational/non-relational environments
  • Identifying bases for semantic integration in particular data sets (the tough nut)
  • Others? (your comments can extend this list)

The good news for topic maps is that Oracle has some name recognition in IT contexts. 😉

There is a world of difference between a CIO saying to the CEO:

“That was a great presentation about how we can use our data more effectively with topic maps and some software, what did he say the name was?”

and,

“That was a great presentation about using our Oracle database more effectively!”

Yes?

Big iron for your practice of topic maps. A present for your holiday tradition.


Aside to Matt O’Donnell. Yes, I am going to be covering actual examples of using these operators for topic map purposes.

Right now I am sifting through a 400 document collection on “multi-dimensional indexing” where I discovered this article. Remind me to look at other databases/indexers with similar operators.

December 9, 2011

Database Indexes for the Inquisitive Mind

Filed under: Database,Indexing — Patrick Durusau @ 8:14 pm

Database Indexes for The Inquisitive Mind by Nuno Job

From the post:

I used to be a developer advocate for an awesome database product called MarkLogic, a NoSQL Document Database for the Enterprise. Now it’s pretty frequent that people ask me about database stuff.

In here I’m going to try to explain some fun stuff you can do with indexes. Not going to talk about implementing them but just about what they solve.

The point here is to help you reason about the choices you have when you are implementing stuff to speed up your applications. I’m sure if you think an idea is smart and fun you’ll research what’s the best algorithm to implement it.

If you are curious about MarkLogic you can always check the Inside MarkLogic white-paper.

Very nice introduction to database indexes. There is more to learn, as much if not more than you would care to. 😉

December 3, 2011

Meet Sheldon, Our Custom Database Server

Filed under: Database,Graphs — Patrick Durusau @ 8:20 pm

Meet Sheldon, Our Custom Database Server

From the post:

Building a recommendations engine is a challenging thing, but one thing that makes a difference is saving your data to make the whole process more efficient. Recommendation engines are fed by user interactions, so the first thought might be to use a graph processing system as a model that lets you abstract the whole system in a natural way. Since Moviepilot Labs is working with a graph database system to store and query all our data we built Sheldon, our custom graph database server.

I suppose that, being a movie recommendation service, it is only appropriate that it starts with a teaser about its graph database server.

The post ends:

This article is Part 1 of a two-part series on graph databases being used at Moviepilot. Read about how Moviepilot walks the graph in Part 2. Learn more about graph databases here.

The pointer is a good source of information but still no detail on “Sheldon.”

November 28, 2011

Vectorwise

Filed under: Database — Patrick Durusau @ 7:10 pm

Vectorwise

I ran across this review, Vectorwise – Worth a Look? by Steve Miller and was wondering if anyone else had seen it or reviewed Vectorwise?

His incomplete evaluation resulted in:

The results I’ve gathered so far for my admittedly non-strenuous tests are nonetheless encouraging. My first experiment was loading a 600,000 row fact table with nine small star lookups from csv files. The big table load takes two seconds and all the queries I’ve attempted, even ones with a six table join, complete in a second or two.

The second test involved loading a 10 million+ row, four attribute, csv stock performance data set along with a 3000 record lookup. The big table imports in 7 seconds. Group-by queries that join to the lookup complete in under three seconds.

There is a 30-day free trial version for Windows.

BTW, can anyone forward me the technical white paper on Vectorwise? The last white paper I signed up for was a marketing document and not even a very good one of those.

November 27, 2011

H2

Filed under: Database,Java — Patrick Durusau @ 8:51 pm

H2

From the webpage:

Welcome to H2, the Java SQL database. The main features of H2 are:

  • Very fast, open source, JDBC API
  • Embedded and server modes; in-memory databases
  • Browser based Console application
  • Small footprint: around 1 MB jar file size

I ran across this the other day and it looked interesting.

Particularly since I want to start exploring the topic maps tool chain. And what parts can be best done by what software?

October 30, 2011

How to beat the CAP theorem

Filed under: CAP,Data Structures,Database — Patrick Durusau @ 7:05 pm

How to beat the CAP theorem by Nathan Marz.

After the Storm video, I ran across this post by Nathan and just had to add it as well!

From the post:

The CAP theorem states a database cannot guarantee consistency, availability, and partition-tolerance at the same time. But you can’t sacrifice partition-tolerance (see here and here), so you must make a tradeoff between availability and consistency. Managing this tradeoff is a central focus of the NoSQL movement.

Consistency means that after you do a successful write, future reads will always take that write into account. Availability means that you can always read and write to the system. During a partition, you can only have one of these properties.

Systems that choose consistency over availability have to deal with some awkward issues. What do you do when the database isn’t available? You can try buffering writes for later, but you risk losing those writes if you lose the machine with the buffer. Also, buffering writes can be a form of inconsistency because a client thinks a write has succeeded but the write isn’t in the database yet. Alternatively, you can return errors back to the client when the database is unavailable. But if you’ve ever used a product that told you to “try again later”, you know how aggravating this can be.

The other option is choosing availability over consistency. The best consistency guarantee these systems can provide is “eventual consistency”. If you use an eventually consistent database, then sometimes you’ll read a different result than you just wrote. Sometimes multiple readers reading the same key at the same time will get different results. Updates may not propagate to all replicas of a value, so you end up with some replicas getting some updates and other replicas getting different updates. It is up to you to repair the value once you detect that the values have diverged. This requires tracing back the history using vector clocks and merging the updates together (called “read repair”).

I believe that maintaining eventual consistency in the application layer is too heavy of a burden for developers. Read-repair code is extremely susceptible to developer error; if and when you make a mistake, faulty read-repairs will introduce irreversible corruption into the database.

So sacrificing availability is problematic and eventual consistency is too complex to reasonably build applications. Yet these are the only two options, so it seems like I’m saying that you’re damned if you do and damned if you don’t. The CAP theorem is a fact of nature, so what alternative can there possibly be?
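
To make the read-repair burden concrete, here is a toy sketch (mine, not from Nathan's post) of using vector clocks to decide whether replica values have actually diverged and need an application-level merge:

```python
def descends_from(vc_a, vc_b):
    """True if vector clock vc_a has seen everything vc_b has (vc_a >= vc_b)."""
    return all(vc_a.get(node, 0) >= count for node, count in vc_b.items())

def surviving_values(replicas):
    """replicas: list of (vector_clock, value) pairs read from different nodes.
    Keep every value whose clock is not dominated by another replica's clock."""
    keep = []
    for vc, value in replicas:
        dominated = any(descends_from(other_vc, vc) and other_vc != vc
                        for other_vc, _ in replicas)
        if not dominated:
            keep.append((vc, value))
    return keep

reads = [({"A": 2, "B": 1}, {"cart": ["book"]}),
         ({"A": 1, "B": 2}, {"cart": ["pen"]})]

diverged = surviving_values(reads)
if len(diverged) > 1:
    # Concurrent writes: the application has to merge and write back ("read repair").
    print("conflict, application-level merge needed:", diverged)
```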

Nathan finds a way and it is as clever as his coding for Storm.

Take your time and read slowly. See what you think. Comments welcome!

October 21, 2011

Using MongoDB in Anger

Filed under: Database,Indexing,MongoDB,NoSQL — Patrick Durusau @ 7:26 pm

Using MongoDB in Anger

Tips on building high performance applications with MongoDB.

Covers four topics:

  • Schema design
  • Indexing
  • Concurrency
  • Durability

Excellent presentation!

One of the first presentations I have seen that recommends a book about another product. Well, High Performance MySQL and MongoDB in Action.
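
On the indexing and schema-design points, here is a minimal PyMongo sketch (collection and field names are mine, and it assumes a local mongod) of adding a compound index that matches a query's shape and asking the server whether the query used it:

```python
from pymongo import ASCENDING, DESCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client.appdb.events

# Compound index matching the query shape: filter on user_id, sort by ts descending.
events.create_index([("user_id", ASCENDING), ("ts", DESCENDING)])

recent = list(events.find({"user_id": 42}).sort("ts", DESCENDING).limit(20))

# explain() shows whether the query walked the index or scanned the collection.
plan = events.find({"user_id": 42}).sort("ts", DESCENDING).explain()
print(plan)
```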

October 13, 2011

SymmetricDS

Filed under: Data Replication,Database — Patrick Durusau @ 6:58 pm

SymmetricDS

From the website:

SymmetricDS is an asynchronous data replication software package that supports multiple subscribers and bi-directional synchronization. It uses web and database technologies to replicate tables between relational databases, in near real time if desired. The software was designed to scale for a large number of databases, work across low-bandwidth connections, and withstand periods of network outage.

By using database triggers, SymmetricDS guarantees that data changes are captured and atomicity is preserved. Support for database vendors is provided through a Database Dialect layer, with implementations for MySQL, Oracle, SQL Server, PostgreSQL, DB2, Firebird, HSQLDB, H2, and Apache Derby included.

This is very cool!

(Spotted by Marko Rodriguez)
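
The trigger-based capture idea is easy to see in miniature. Here is a generic sketch (SQLite from Python, not SymmetricDS itself) of triggers writing every change into a capture table that a replication process could later ship to subscribers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT);

    -- Every captured change lands here for a replication process to pick up.
    CREATE TABLE change_log (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        table_name TEXT, event TEXT, row_id INTEGER,
        captured_at TEXT DEFAULT CURRENT_TIMESTAMP);

    CREATE TRIGGER customer_insert AFTER INSERT ON customer
    BEGIN
        INSERT INTO change_log (table_name, event, row_id)
        VALUES ('customer', 'I', NEW.id);
    END;

    CREATE TRIGGER customer_update AFTER UPDATE ON customer
    BEGIN
        INSERT INTO change_log (table_name, event, row_id)
        VALUES ('customer', 'U', NEW.id);
    END;
""")

conn.execute("INSERT INTO customer (email) VALUES ('sally@example.com')")
conn.execute("UPDATE customer SET email = 'sally@example.org' WHERE id = 1")

print(conn.execute("SELECT table_name, event, row_id FROM change_log").fetchall())
# [('customer', 'I', 1), ('customer', 'U', 1)]
```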

September 15, 2011

5 Steps to Scaling MongoDB

Filed under: Database,MongoDB — Patrick Durusau @ 7:52 pm

5 Steps to Scaling MongoDB (Or Any DB) in 8 Minutes

From the post:

Jared Rosoff concisely, effectively, entertainingly, and convincingly gives an 8 minute MongoDB tutorial on scaling MongoDB at Scale Out Camp. The ideas aren’t just limited to MongoDB, they work for most any database: Optimize your queries; Know your working set size; Tune your file system; Choose the right disks; Shard. Here’s an explanation of all 5 strategies:

Note: The Scale Out Camp link isn’t working as of 9/14/2011. Web domain is there but no content.

September 10, 2011

A Uniform Fixpoint Approach to the Implementation of Inference Methods for Deductive Databases

Filed under: Database,Deductive Databases,Inference — Patrick Durusau @ 6:00 pm

A Uniform Fixpoint Approach to the Implementation of Inference Methods for Deductive Databases by Andreas Behrend.

Abstract:

Within the research area of deductive databases three different database tasks have been deeply investigated: query evaluation, update propagation and view updating. Over the last thirty years various inference mechanisms have been proposed for realizing these main functionalities of a rule-based system. However, these inference mechanisms have been rarely used in commercial DB systems until now. One important reason for this is the lack of a uniform approach well-suited for implementation in an SQL-based system. In this paper, we present such a uniform approach in form of a new version of the soft consequence operator. Additionally, we present improved transformation-based approaches to query optimization and update propagation and view updating which are all using this operator as underlying evaluation mechanism.

This one will take a while, along with discussions with people more familiar with deductive databases than I am.

But, having said that, it looks important. The approach has been validated for stock market data streams and management of airspace. Not to mention:

EU Project INFOMIX (IST-2001-33570)

Information system of University “La Sapienza” in Rome.

  • 14 global relations,
  • 29 integrity constraints,
  • 29 relations (in 3 legacy databases) and 12 web wrappers,

More than 24MB of data regarding students, professors and exams of the University.
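
For readers (like me) who are rusty on deductive databases, the core fixpoint idea is easy to sketch: keep applying the rules until no new facts are derived. A toy version (naive evaluation of an ancestor rule, not the paper's soft consequence operator):

```python
def naive_fixpoint(base_facts, rule):
    """Apply `rule` to the current set of facts until nothing new is derived."""
    facts = set(base_facts)
    while True:
        new = rule(facts) - facts
        if not new:          # fixpoint reached
            return facts
        facts |= new

# EDB (base facts): parent(x, y) means x is a parent of y.
parent = {("ann", "bob"), ("bob", "carol"), ("carol", "dan")}

def ancestor_rule(facts):
    # ancestor(x, y) :- parent(x, y).
    # ancestor(x, z) :- ancestor(x, y), parent(y, z).
    derived = set(facts)
    derived |= {(x, z) for (x, y) in facts for (y2, z) in parent if y == y2}
    return derived

print(sorted(naive_fixpoint(parent, ancestor_rule)))
```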

September 6, 2011

First Look – Oracle Data Mining Update

Filed under: Data Mining,Database,Information Retrieval,SQL — Patrick Durusau @ 7:18 pm

First Look – Oracle Data Mining Update by James Taylor.

From the post:

I got an update from Oracle on Oracle Data Mining (ODM) recently. ODM is an in-database data mining and predictive analytics engine that allows you to build and use advanced predictive analytic models on data that can be accessed through your Oracle data infrastructure. I blogged about ODM extensively last year in this First Look – Oracle Data Mining and since then they have released ODM 11.2.

The fundamental architecture has not changed, of course. ODM remains a “database-out” solution surfaced through SQL and PL-SQL APIs and executing in the database. It has the 12 algorithms and 50+ statistical functions I discussed before and model building and scoring are both done in-database. Oracle Text functions are integrated to allow text mining algorithms to take advantage of them. Additionally, because ODM mines star schema data it can handle an unlimited number of input attributes, transactional data and unstructured data such as CLOBs, tables or views.

This release takes the preview GUI I discussed last time and officially releases it. This new GUI is an extension to SQL Developer 3.0 (which is available for free and downloaded by millions of SQL/database people). The “Classic” interface (wizard-based access to the APIs) is still available but the new interface is much more in line with the state of the art as far as analytic tools go.

BTW, the correct link to: First Look – Oracle Data Mining. (Taylor’s post last year on Oracle Data Mining.)

For all the buzz about NoSQL, topic map mavens should be aware of the near universal footprint of SQL and prepare accordingly.

August 30, 2011

Databases – Humanities and Social Sciences

Filed under: Database — Patrick Durusau @ 7:12 pm

Applications of Databases to Humanities and Social Sciences

NoSQL, Semantic Web and topic maps will mean little if you don’t understand the technical backdrop for those developments.

Library students take note: SQL databases are very common in libraries and academia in general so what you learn here will be doubly useful.

August 23, 2011

Query Execution in Column-Oriented Database Systems

Filed under: Column-Oriented,Database,Query Language — Patrick Durusau @ 6:38 pm

Query Execution in Column-Oriented Database Systems by Daniel Abadi (Ph.D. thesis).

Apologies for the length of the quote, but this is an early dissertation on column-oriented data systems and I want to entice you into reading it. Not so much for the techniques, which are now common, as for the analysis.

Abstract:

There are two obvious ways to map a two-dimension relational database table onto a one-dimensional storage interface: store the table row-by-row, or store the table column-by-column. Historically, database system implementations and research have focused on the row-by row data layout, since it performs best on the most common application for database systems: business transactional data processing. However, there are a set of emerging applications for database systems for which the row-by-row layout performs poorly. These applications are more analytical in nature, whose goal is to read through the data to gain new insight and use it to drive decision making and planning.

In this dissertation, we study the problem of poor performance of row-by-row data layout for these emerging applications, and evaluate the column-by-column data layout opportunity as a solution to this problem. There have been a variety of proposals in the literature for how to build a database system on top of column-by-column layout. These proposals have different levels of implementation effort, and have different performance characteristics. If one wanted to build a new database system that utilizes the column-by-column data layout, it is unclear which proposal to follow. This dissertation provides (to the best of our knowledge) the only detailed study of multiple implementation approaches of such systems, categorizing the different approaches into three broad categories, and evaluating the tradeoffs between approaches. We conclude that building a query executer specifically designed for the column-by-column query layout is essential to achieve good performance.

Consequently, we describe the implementation of C-Store, a new database system with a storage layer and query executer built for column-by-column data layout. We introduce three new query execution techniques that significantly improve performance. First, we look at the problem of integrating compression and execution so that the query executer is capable of directly operating on compressed data. This improves performance by improving I/O (less data needs to be read off disk), and CPU (the data need not be decompressed). We describe our solution to the problem of executer extensibility – how can new compression techniques be added to the system without having to rewrite the operator code? Second, we analyze the problem of tuple construction (stitching together attributes from multiple columns into a row-oriented ”tuple”). Tuple construction is required when operators need to access multiple attributes from the same tuple; however, if done at the wrong point in a query plan, a significant performance penalty is paid. We introduce an analytical model and some heuristics to use that help decide when in a query plan tuple construction should occur. Third, we introduce a new join technique, the “invisible join” that improves performance of a specific type of join that is common in the applications for which column-by-column data layout is a good idea.

Finally, we benchmark performance of the complete C-Store database system against other column-oriented database system implementation approaches, and against row-oriented databases. We benchmark two applications. The first application is a typical analytical application for which column-by-column data layout is known to outperform row-by-row data layout. The second application is another emerging application, the Semantic Web, for which column-oriented database systems are not currently used. We find that on the first application, the complete C-Store system performed 10 to 18 times faster than alternative column-store implementation approaches, and 6 to 12 times faster than a commercial database system that uses a row-by-row data layout. On the Semantic Web application, we find that C-Store outperforms other state-of-the-art data management techniques by an order of magnitude, and outperforms other common data management techniques by almost two orders of magnitude. Benchmark queries, which used to take multiple minutes to execute, can now be answered in several seconds.
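
On the "operate directly on compressed data" point, here is a tiny sketch (mine, not C-Store code) of a run-length-encoded column answering a count query with one comparison per run instead of one per row, and with no decompression:

```python
from itertools import groupby

# A sorted column compresses well with run-length encoding: (value, run_length).
region_column = ["east"] * 4 + ["north"] * 3 + ["south"] * 5 + ["west"] * 2
rle = [(value, sum(1 for _ in run)) for value, run in groupby(region_column)]
print(rle)  # [('east', 4), ('north', 3), ('south', 5), ('west', 2)]

# "SELECT count(*) WHERE region = 'south'" runs on the compressed form directly.
count_south = sum(length for value, length in rle if value == "south")
print(count_south)  # 5
```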

August 18, 2011

Introduction to Databases

Filed under: CS Lectures,Database,SQL — Patrick Durusau @ 6:50 pm

Introduction to Databases by Jennifer Widom.

Course Description:

This course covers database design and the use of database management systems for applications. It includes extensive coverage of the relational model, relational algebra, and SQL. It also covers XML data including DTDs and XML Schema for validation, and the query and transformation languages XPath, XQuery, and XSLT. The course includes database design in UML, and relational design principles based on dependencies and normal forms. Many additional key database topics from the design and application-building perspective are also covered: indexes, views, transactions, authorization, integrity constraints, triggers, on-line analytical processing (OLAP), and emerging “NoSQL” systems.

The third free Stanford course being offered this Fall.

The others are: Introduction to Artificial Intelligence and Introduction to Machine Learning.

As of today, the AI course has a registration of 84,000 from 175 countries. I am sure the machine learning class with Ng and the database class will post similar numbers.

My only problem is that I lack the time to take all three while working full time. Best hope is for an annual repeat of these offerings.

July 25, 2011

Performance of Graph vs. Relational Databases

Filed under: Database,Graphs,SQL — Patrick Durusau @ 6:41 pm

Performance of Graph vs. Relational Databases by Josh Adell.

Short but interesting exploration of performance differences between relational and graph databases.

July 21, 2011

Oracle, Sun Burned, and Solr Exposure

Filed under: Data Mining,Database,Facets,Lucene,SQL,Subject Identity — Patrick Durusau @ 6:27 pm

Oracle, Sun Burned, and Solr Exposure

From the post:

Frankly we wondered when Oracle would move off the dime in faceted search. “Faceted search”, in my lingo, is showing users categories. You can fancy up the explanation, but a person looking for a subject may hit a dead end. The “facet” angle displays links to possibly related content. If you want to educate me, use the comments section for this blog, please.

We are always looking for a solution to our clients’ Oracle “findability” woes. It’s not just relevance. Think performance. Query and snack is the operative mode for at least one of our technical baby geese. Well, Oracle is a bit of a red herring. The company is not looking for a solution to SES11g functionality. Lucid Imagination, a company offering enterprise grade enterprise search solutions, is.

If “findability” is an issue at Oracle, I would be willing to bet that subject identity is as well. Rumor has it that they have paying customers.

July 4, 2011

RavenDB

Filed under: Database,NoSQL,RavenDB — Patrick Durusau @ 6:03 pm

RavenDB

Raven is an Open Source (with a commercial option) document database for the .NET/Windows platform. Raven offers a flexible data model design to fit the needs of real world systems. Raven stores schema-less JSON documents, allows you to define indexes using Linq queries, and focuses on low latency and high performance.

  • Scalable infrastructure: Raven builds on top of existing, proven and scalable infrastructure
  • Simple Windows configuration: Raven is simple to setup and run on windows as either a service or IIS7 website
  • Transactional: Raven supports System.Transactions with ACID transactions. If you put data in it, that data is going to stay there
  • Map/Reduce: Easily define map/reduce indexes with Linq queries
  • .NET Client API: Raven comes with a fully functional .NET client API which implements Unit of Work and much more
  • RESTful: Raven is built around a RESTful API

Haven’t meant to neglect the .Net world, just don’t visit there very often. 😉 Will try to do better in the future.

