Archive for the ‘Probabilistic Database’ Category

What’s Wrong with Probabilistic Databases? (multi-part)

Saturday, October 27th, 2012

Oliver Kennedy has started a series of posts on probabilistic databases:

What’s Wrong with Probabilistic Databases? (part 1): General introduction to probabilistic databases.

What’s Wrong With Probabilistic Databases? (part 2): Using probabilistic databases with noisy data.

I will be following this series here.

Are there other good introductory materials on probabilistic databases?

Update: What’s Wrong with Probabilistic Databases? (Part 3): Using probabilistic databases with modeled data.

Dr. Watson?

Wednesday, December 7th, 2011

I got up thinking that there needs to be a project for automated authoring of topic maps, and the name Dr. Watson suddenly occurred to me. After all, Dr. Watson was Sherlock Holmes’ sidekick, so the name would not claim the system could stand on its own. Plus there would be some name recognition and/or confusion with the real, or rather imaginary, Dr. Watson of Sherlock Holmes’ fame.

And there would be confusion with the Dr. Watson that is the internal debugger for Windows (MS, I never can remember if the ™ goes on Windows or MS. Not that anyone else would want to call themselves MS. 😉 ) Plus the Watson research center at IBM.

Well, I suspect being an automated, probabilistic topic map authoring system will be enough to distinguish it from the foregoing uses.

Any others that I should be aware of?

I say probabilistic because even with the TMDM’s string matching on URIs, it is only probable that two or more topics actually represent the same subject. It is always possible that a URI has been incorrectly used to identify the subject that a topic represents. And in such cases, the error perpetuates itself across a topic map.

So we start off with the realization that even string matching results in a probability of less than 1.0 (where 1.0 is absolute certainty) that two or more topics represent the same subject.

Since we are already being probabilistic, why not be openly so?
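To make the point concrete, here is a minimal sketch of what an openly probabilistic merge test might look like. This is not TMDM or any existing topic map engine; the function name and the per-URI reliability figure are hypothetical, and the independence assumption is a modeling choice for illustration only.

```python
# Hypothetical sketch: merge confidence for two topics that share
# subject-identifier URIs. Assumes each shared URI was correctly
# assigned with probability p_correct_uri, independently of the
# others -- an illustrative assumption, not a TMDM rule.

def merge_confidence(shared_uris, p_correct_uri=0.98):
    """Probability that two topics represent the same subject,
    given how many subject-identifier URIs they share."""
    if shared_uris == 0:
        return 0.0
    # The match is spurious only if every shared URI was misassigned.
    return 1.0 - (1.0 - p_correct_uri) ** shared_uris

print(merge_confidence(1))  # a single shared URI: high, but below 1.0
print(merge_confidence(3))  # more shared evidence raises confidence
```

Even with exact string matching, the result is a confidence strictly below 1.0, which is exactly the point: the certainty was never there to begin with.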

But, before we get into the weeds and details, the project has to have a cool name. (As in not an acronym that is cool and we make up a long name to fit the acronym.)

All those in favor of Dr. Watson, please signify by raising your hands (or the beer you are holding).

More to follow.

MayBMS – A Probabilistic Database Management System

Thursday, June 30th, 2011

MayBMS – A Probabilistic Database Management System

From the homepage:

MayBMS is a state-of-the-art probabilistic database management system developed as an extension of the Postgres server backend (download).

The MayBMS project is founded on the thesis that a principled effort to use and extend mature relational database technology will be essential for creating robust and scalable systems for managing and querying large uncertain datasets.

MayBMS stands alone as a complete probabilistic database management system that supports a very powerful, compositional query language (examples) for which nevertheless worst-case efficiency and result quality guarantees can be made. Central to this is our choice of essentially using probabilistic versions of conditional tables as the representation system, but in a form engineered for admitting the efficient evaluation and automatic optimization of most operations of our language using robust and mature relational database technology.

Another probabilistic system.

I wonder about the consistency leg of CAP as a database principle. Is it a database principle only because we have had such small, locally stored data sets that consistency was possible?

Think about any of the sensor arrays and memory banks located light-seconds or even light-minutes away from data stores on Earth. As a practical matter they are always inconsistent with Earth-bound data stores. Physical remoteness is the cause of inconsistency in that case. But what of something as simple as not all data having first priority for processing? Or varying priorities for processing depending upon system load? Or even analysis or processing of data that causes a lag between the states of data at different locations?

I’m not suggesting the usual cop-out of eventual consistency, because the data may never be consistent, at least not in the sense that we use the term for a database located on a single machine or local cluster. We may have to ask, “How consistent do you want the data to be upon delivery?”, knowing the data may be inconsistent on delivery with other data already in existence.
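One way to make “how consistent do you want the data to be upon delivery?” concrete is to let the reader state a staleness bound instead of demanding consistency. The sketch below is purely hypothetical; the class and method names are invented for illustration.

```python
import time

# Hypothetical sketch: reads that carry an explicit staleness budget
# rather than a promise of consistency.
class TimestampedStore:
    def __init__(self):
        self._data = {}  # key -> (value, write_time)

    def write(self, key, value):
        self._data[key] = (value, time.time())

    def read(self, key, max_staleness_s):
        """Return (value, age) if the stored copy is no older than
        max_staleness_s seconds; otherwise signal that only a
        possibly-inconsistent answer is available."""
        value, written = self._data[key]
        age = time.time() - written
        if age > max_staleness_s:
            raise TimeoutError(f"freshest local copy is {age:.1f}s old")
        return value, age

store = TimestampedStore()
store.write("sensor-7", 42)
value, age = store.read("sensor-7", max_staleness_s=60.0)
```

The caller decides what “consistent enough” means per query, which fits the case where remote data can never be brought fully into sync.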

Scalable Query Processing in Probabilistic Databases

Monday, June 27th, 2011

Scalable Query Processing in Probabilistic Databases

From the webpage:

Today, uncertainty is commonplace in data management scenarios dealing with data integration, sensor readings, information extraction from unstructured sources, and whenever information is manually entered and therefore prone to inaccuracy or partiality. Key challenges in probabilistic data management are to design probabilistic database formalisms that can compactly represent large sets of possible interpretations of uncertain data together with their probability distributions, and to efficiently evaluate queries on very large probabilistic data. Such queries could ask for confidences in data patterns possibly in the presence of additional evidence. The problem of query evaluation in probabilistic databases is still in its infancy. Little is known about which queries can be evaluated in polynomial time, and the few existing evaluation methods employ expensive main-memory algorithms.

The aim of this project is to develop techniques for scalable query processing in probabilistic databases and use them to build a robust query engine called SPROUT ( Scalable PROcessing on Tables). We are currently exploring three main research directions.

  • We are investigating open problems in efficient query evaluation. In particular, we aim at discovering classes of tractable (i.e., computable in polynomial time wrt data complexity) queries on probabilistic databases. The query language under investigation is SQL (and its formal core, relational algebra) extended with uncertainty-aware query constructs to create probabilistic data under various probabilistic data models (such as tuple-independent databases, block-independent disjoint databases, or U-relations of MayBMS).
  • For the case of intractable queries, we investigate approximate query evaluation. In contrast to exact evaluation, which computes query answers together with their exact confidences, approximate evaluation computes the query answers with approximate confidences. We are working on new techniques for approximate query evaluation that are aware of the query and the input probabilistic database model (tuple-independent, block-independent disjoint, etc).
  • Our open-source query engine for probabilistic data management systems uses the insights gained from the first two directions. This engine is based on efficient secondary-storage exact and approximate evaluation algorithms for arbitrary queries.
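The contrast between exact and approximate confidence computation in the second bullet can be illustrated on the simplest model mentioned above, the tuple-independent database, where each tuple is present independently with its own probability. This is a toy sketch, not SPROUT code; the tuple probabilities are made up.

```python
import random

# Illustrative sketch on a tuple-independent table: the confidence
# that a query answer holds is the probability that at least one
# supporting tuple is present.

def exact_confidence(probs):
    """P(at least one tuple present) = 1 - prod(1 - p) over tuples."""
    q = 1.0
    for p in probs:
        q *= (1.0 - p)
    return 1.0 - q

def approx_confidence(probs, trials=100_000, seed=0):
    """Monte Carlo estimate of the same confidence: sample possible
    worlds and count those containing a supporting tuple."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < p for p in probs):
            hits += 1
    return hits / trials

probs = [0.5, 0.3]               # hypothetical tuple probabilities
print(exact_confidence(probs))   # 1 - 0.5 * 0.7 = 0.65
print(approx_confidence(probs))  # close to 0.65
```

Exact evaluation is easy here only because the tuples are independent; for general queries the supporting tuples are correlated, which is where the tractability questions in the first bullet and the approximation techniques in the second come in.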

As of June 2, 2011, you can order Probabilistic Databases by Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch from Amazon.

Exciting work!

It occurs to me that semantics are always “probabilistic.”

What does that say about the origin of the semantics of a term?

If semantics are probabilistic, is it ever possible to fix the semantic of a term?

If so, how?