Archive for the ‘Reverse Data Management’ Category

Indexing Reverse Top-k Queries

Monday, May 7th, 2012

Indexing Reverse Top-k Queries by Sean Chester, Alex Thomo, S. Venkatesh, and Sue Whitesides.

Abstract:

We consider the recently introduced monochromatic reverse top-k queries which ask for, given a new tuple q and a dataset D, all possible top-k queries on D union {q} for which q is in the result. Towards this problem, we focus on designing indexes in two dimensions for repeated (or batch) querying, a novel but practical consideration. We present the insight that by representing the dataset as an arrangement of lines, a critical k-polygon can be identified and used exclusively to respond to reverse top-k queries. We construct an index based on this observation which has guaranteed worst-case query cost that is logarithmic in the size of the k-polygon.

We implement our work and compare it to related approaches, demonstrating that our index is fast in practice. Furthermore, we demonstrate through our experiments that a k-polygon is comprised of a small proportion of the original data, so our index structure consumes little disk space.
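As a rough illustration of what a monochromatic reverse top-k query asks, here is a brute-force sketch. This is not the paper's k-polygon index; it simply samples linear scoring weights (a, 1 − a) over fabricated 2-D data and reports the weightings for which q makes the top-k:

```python
# Brute-force sketch of a monochromatic reverse top-k query in 2-D.
# NOT the paper's k-polygon index: we sample weightings a in [0, 1],
# score every tuple as a*x + (1-a)*y, and report the weightings for
# which the new tuple q lands in the top-k of D union {q}.
# All data below is fabricated.

def score(p, a):
    return a * p[0] + (1 - a) * p[1]

def reverse_topk(D, q, k, samples=1000):
    hits = []
    for i in range(samples + 1):
        a = i / samples
        # q is in the top-k when fewer than k tuples strictly outscore it
        better = sum(1 for p in D if score(p, a) > score(q, a))
        if better < k:
            hits.append(a)
    return hits

D = [(9, 1), (1, 9), (5, 5)]
q = (6, 6)
hits = reverse_topk(D, q, k=1)
# q wins only for "balanced" weightings, roughly a in [0.375, 0.625]
```

The paper's contribution is an index that answers such queries in time logarithmic in the size of the k-polygon, rather than by exhaustive sampling like this sketch.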

This was the article that made me start looking for resources on “reverse data management.”

Interesting in its own right, but I commend it to your attention in part because it recognizes that challenging problems lie ahead in higher-dimensional indexing.

If you think about it, “indexing” in the sense of looking for a simple string token is indexing in one dimension.

If you index a simple string token, but with the further requirement that it appear in JAMA (Journal of the American Medical Association), that is indexing in two dimensions.

If you index a simple string token, appearing in JAMA, but it must also be in an article with the “tag” cancer, now you are indexing in three dimensions.
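The progression above can be sketched as composite-key lookups. A toy index over a fabricated corpus (the articles, tags, and tokens are hypothetical, chosen only to mirror the three steps):

```python
from collections import defaultdict

# Fabricated toy corpus for illustration.
articles = [
    {"id": 1, "journal": "JAMA",   "tags": {"cancer"},     "tokens": {"tumor", "trial"}},
    {"id": 2, "journal": "JAMA",   "tags": {"cardiology"}, "tokens": {"tumor"}},
    {"id": 3, "journal": "Lancet", "tags": {"cancer"},     "tokens": {"tumor"}},
]

index = defaultdict(set)
for art in articles:
    for tok in art["tokens"]:
        index[(tok,)].add(art["id"])                  # one dimension: token
        index[(tok, art["journal"])].add(art["id"])   # two: token + journal
        for tag in art["tags"]:                       # three: token + journal + tag
            index[(tok, art["journal"], tag)].add(art["id"])

assert index[("tumor",)] == {1, 2, 3}
assert index[("tumor", "JAMA")] == {1, 2}
assert index[("tumor", "JAMA", "cancer")] == {1}
```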

A human indexer, creating an annual index of cancer publications, would move along those dimensions with ease.

Topic maps are an attempt to capture the insight that allows automatic replication of the human indexer’s insight.

Reverse Data Management

Monday, May 7th, 2012

Reverse Data Management

I encountered “reverse data management” today.

Being unfamiliar I searched for more information and encountered this page, which reads in part as follows:

Forward and Reverse Data Transformations

Database research mainly focuses on forward-moving data flows: source data is subjected to transformations and evolves through queries, aggregations, and view definitions to form a new target instance, possibly with a different schema. This forward paradigm underpins most data management tasks today, such as querying, data integration, data mining, etc. We contrast this forward processing with Reverse Data Management (RDM), where the action needs to be performed on the input data, on behalf of desired outcomes in the output data. Some data management tasks already fall under this paradigm, for example updates through views, data generation, data cleaning and repair. RDM is, by necessity, conceptually more difficult to define, and computationally harder to achieve. Today, however, as increasingly more of the available data is derived from other data, there is an increased need to be able to modify the input in order to achieve a desired effect on the output, motivating a systematic study of RDM.

[graphic omitted]

In general, RDM problems are harder to formulate and implement, because of the simple fact that the inverse of a function is not always a function. Given a desired output (or change to the output), there are multiple inputs (or none at all) that can produce it. This is a prevailing difficulty in all RDM problems. This project aims to develop a unified framework for Reverse Data Management problems, which will bring together several subfields of database research. RDM problems can be classified along two dimensions, as shown in the table below. On the “target” dimension, problems are divided into those that have explicit and those that have implicit specifications. The former means that the desired target effect is given as a tuple-level data instance; this is the case in causality and view updates. The latter means that the target effect is described indirectly, through statistics and constraints; examples include how-to queries and data generation. On the “source” dimension, problems are divided into those that use a reference source, and those that do not. For example, view updates and how-to queries fall under the former category, while data generation falls under the latter.
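The point that “the inverse of a function is not always a function” can be made concrete with a toy example (entirely hypothetical, not from the project page): a forward query maps inputs to one output, but working backward from a desired output may yield many candidate inputs, or none at all:

```python
from itertools import combinations

def forward(data):
    # Forward query: a simple aggregate over the input tuples.
    return sum(data)

def reverse(universe, target, size):
    # Reverse query: which size-`size` subsets of the input would
    # produce the desired output?  May return many answers, or none.
    return [c for c in combinations(universe, size) if forward(c) == target]

universe = [1, 2, 3, 4]
assert reverse(universe, 5, 2) == [(1, 4), (2, 3)]   # multiple inputs work
assert reverse(universe, 100, 2) == []               # no input works
```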

I am encouraged by the view that changes in inputs can cause changes in outputs.

It sounds trivial to say.

It is one step down the slippery path of admitting that outputs aren’t determinate, somehow “out there” and inter-subjective.

Outputs depend upon what we make of the inputs.

If I don’t like the outcome, I just need a new set of inputs. Or a new reading of the old ones.

And I need to be mindful that this is always the case, whatever I think of the current outputs.

If the edges of database research are exploring RDM issues, that should be a warning to the intelligence community, which appears to think value can be derived from crunching enough data.

Perhaps, but be mindful that data crunching produces outcomes based on inputs. If the inputs change, so may the outputs. Something to think about.

Particularly if your integration solution is “lite” on enabling you to probe (alter?) the subject identities as inputs that are shaping your outputs.

Make no mistake: whether we acknowledge it or not, every datum, every data structure, every operation, every input and every output represents choices. Choices that can be explicit and accounted for, or not.

RDM looks like a coming “hot” field of research that addresses some of those issues.