Archive for the ‘Top-k Query Processing’ Category

Crossfilter

Friday, March 8th, 2013

Crossfilter: Fast Multidimensional Filtering for Coordinated Views

From the webpage:

Crossfilter is a JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Crossfilter uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the perfor­mance of live histograms and top-K lists. For more details on how Crossfilter works, see the API reference.

See the webpage for an impressive demonstration with a 5.3 MB dataset.

Is there a trend towards “big data” manipulation on clusters and “less big data” in browsers?

Will be interesting to see how the benchmarks for “big” and “less big” move over time.

I first saw this in Nat Torkington’s Four Short links: 4 March 2013.

Indexing Reverse Top-k Queries

Monday, May 7th, 2012

Indexing Reverse Top-k Queries by Sean Chester, Alex Thomo, S. Venkatesh, and Sue Whitesides.

Abstract:

We consider the recently introduced monochromatic reverse top-k queries which ask for, given a new tuple q and a dataset D, all possible top-k queries on D union {q} for which q is in the result. Towards this problem, we focus on designing indexes in two dimensions for repeated (or batch) querying, a novel but practical consideration. We present the insight that by representing the dataset as an arrangement of lines, a critical k-polygon can be identified and used exclusively to respond to reverse top-k queries. We construct an index based on this observation which has guaranteed worst-case query cost that is logarithmic in the size of the k-polygon.

We implement our work and compare it to related approaches, demonstrating that our index is fast in practice. Furthermore, we demonstrate through our experiments that a k-polygon is comprised of a small proportion of the original data, so our index structure consumes little disk space.

This was the article that made me start looking for resources on “reverse data management.”

Interesting in its own right but I commend it to your attention in part because of the recognition that interesting challenges lie ahead with higher-dimensional indexing.

If you think about it, “indexing” in the sense of looking for a simple string token, is indexing in one dimension.

If you index a simple string token, but with the further requirement that it appear in JAMA (Journal of the American Medical Association), that is indexing in two dimensions.

If you index a simple string token, appearing in JAMA, but it must also be in an article with the “tag” cancer, now you are indexing in three dimensions.

A human indexer, creating an annual index of cancer publications would move along those dimensions with ease.

Topic maps are an attempt to capture the insight that allows automatic replication of the human indexer’s insight.

RANK: Top-k Query Processing

Monday, May 7th, 2012

RANK: Top-k Query Processing

Although a bit dated, the project offers useful illustrations of top-k query processing in a variety of areas.