Archive for the ‘Query Rewriting’ Category

Fast and Flexible Query Analysis at MapD with Apache Calcite [Merging Data?]

Thursday, February 9th, 2017

Fast and Flexible Query Analysis at MapD with Apache Calcite by Alex Şuhan.

From the post:

After evaluating a few other options, we decided for Apache Calcite, an incubation stage project at the time. It takes SQL queries and generates extended relational algebra, using a highly configurable cost-based optimizer. Several projects use Calcite already for SQL parsing and query optimization.

One of the main strengths of Calcite is its highly modular structure, which allows for multiple integration points and creative uses. It offers a relational algebra builder, which makes moving to a different SQL parser (or adding a non-SQL frontend) feasible.

In our product, we need runtime functions which are not recognized by Calcite by default. For example, trigonometric functions are necessary for on-the-fly geo projections used for point map rendering. Fortunately, Calcite allows specifying such functions and they become first-class citizens, with proper type checking in place.

Calcite also includes a highly capable and flexible cost-based optimizer, which can apply high-level transformations to the relational algebra based on query patterns and statistics. For example, it can push part of a filter through a join in order to reduce the size of the input, like the following figure shows:


You can find this example and more about the cost-based optimizer in Calcite in this presentation on using it in the Apache Phoenix project. Such optimizations complement the low-level optimizations we do ourselves to achieve great speed improvements.

Relational algebra example
Let’s take a simple query: SELECT A.x, COUNT(*) FROM test JOIN B ON A.x = B.x WHERE A.y > 41 GROUP BY A.x; and analyze the relational algebra generated for it.

In Calcite relational algebra, there are a few main node types, corresponding to the theoretical extended relational algebra model: Scan, Filter, Project, Aggregate and Join. Each type of node, except Scan, has one or more (in the case of Join) inputs and its output can become the input of another node. The graph of nodes connected by data flow relationships is a
directed acyclic graph (abbreviated as “DAG”). For our query, Calcite outputs the following DAG:


The Scan nodes have no inputs and output all the rows and the columns in tables A and B, respectively. The Join node specifies the join condition (in our case A.x = B.x) and its output contains the columns in A and B concatenated. The Filter node only allows the rows which pass the specified condition and its output preserves all columns of input. The Project node only preserves the specified expressions as columns in the output. Finally, the Aggregate specifies the group by expressions and aggregates.

The physical implementation of the nodes is up to the system using Calcite as a frontend. Nothing in the Join node mandates a certain implementation of the join operation (equijoin in our case). Indeed, using a condition which can’t be implemented as a hash join, like A.x < B.x, would only be reflected by the condition in the Filter node.

You’re not MapD today but that’s no excuse for poor query performance.

Besides, learning Apache Calcite will increase your attractiveness as data and queries on it become more complex.

I haven’t read all the documentation but the “metadata” in Apache Calcite is as flat as any you will find.

Which means integration of different data sources is either luck of the draw or you asked someone the “meaning” of the metadata.

The tutorial has this example:


The column header “GENDER” for example appears to presume the common male/female distinction. But without further exploration of the data set, there could be other genders encoded in that field as well.

If “GENDER” seems too easy, what would you say about “NAME,” bearing in mind that Japanese family names are written first and given names written second. How would those appear under “NAME?”

Apologies! My screen shot missed field “S.”

I have utterly no idea what “S” may or may not represent as a field header. Do you?

If the obviousness of field headers fails with “GENDER” and “NAME,” what do you suspect will happen with less “obvious” field headers?

How successful will merging of data be?

Where would you add subject identity information and how would you associate it with data processed by Apache Calcite?

QRU-1: A Public Dataset…

Saturday, September 8th, 2012

QRU-1: A Public Dataset for Promoting Query Representation and Understanding Research by Hang Li, Gu Xu, W. Bruce Croft, Michael Bendersky, Ziqi Wang and Evelyne Viegas.


A new public dataset for promoting query representation and understanding research, referred to as QRU-1, was recently released by Microsoft Research. The QRU-1 dataset contains reformulations of Web TREC topics that are automatically generated using a large-scale proprietary web search log, without compromising user privacy. In this paper, we describe the content of this dataset and the process of its creation. We also discuss the potential uses of the dataset, including a detailed description of a query reformulation experiment.

And the data set:

Query Representation and Understanding Set

The Query Representation and Understanding (QRU) data set contains a set of similar queries that can be used in web research such as query transformation and relevance ranking. QRU contains similar queries that are related to existing benchmark data sets, such as TREC query sets. The QRU data set was created by extracting 100 TREC queries, training a query-generation model and a commercial search engine, generating similar queries from TREC queries with the model, and removal of mistakenly generated queries.

Are query reformulations in essence different identifications of the subject of a search?

But the issue isn’t “more” search results but rather higher quality search results.

Why search engines bother (other than bragging rights) to report “hits” beyond the ones displayed isn’t clear. Just have a “next N hits” button.

You could consider the number of “hits” you don’t look at as a measure of your search engine’s quality. The higher the number…., well, you know. Could be gold in those “hits” but you will never know. And your current search engine will never say.


Wednesday, April 25th, 2012


From the website:

LAILAPS combines a keyword driven search engine for an integrative access to life science databases, machine learning for a content driven relevance ranking, recommender systems for suggestion of related data records and query refinements with a user feedback tracking system for an self learning relevance training.


  • ultra fast keyword based search
  • non-static relevance ranking
  • user specific relevance profiles
  • suggestion of related entries
  • suggestion of related query terms
  • self learning by user tracking
  • deployable at standard desktop PC
  • 100% JAVA
  • installer for in-house deployment

I like the idea of a recommender system that “suggests” related data records and query refinements. It could be wrong.

I am as guilty as anyone of thinking in terms of “correct” recommendations that always lead to relevant data.

That is applying “crisp” set thinking to what is obviously a “rough” set situation. We as readers have to sort out the items in the “rough” set and construct for ourselves, a temporary and fleeting “crisp” set for some particular purpose.

If you are using LAILAPS, I would appreciate a note about your experiences and impressions.

Constraint-Based XML Query Rewriting for Data Integration

Monday, April 16th, 2012

Constraint-Based XML Query Rewriting for Data Integration by Cong Yu and Lucian Popa.


We study the problem of answering queries through a target schema, given a set of mappings between one or more source schemas and this target schema, and given that the data is at the sources. The schemas can be any combination of relational or XML schemas, and can be independently designed. In addition to the source-to-target mappings, we consider as part of the mapping scenario a set of target constraints specifying additional properties on the target schema. This becomes particularly important when integrating data from multiple data sources with overlapping data and when such constraints can express data merging rules at the target. We define the semantics of query answering in such an integration scenario, and design two novel algorithms, basic query rewrite and query resolution, to implement the semantics. The basic query rewrite algorithm reformulates target queries in terms of the source schemas, based on the mappings. The query resolution algorithm generates additional rewritings that merge related information from multiple sources and assemble a coherent view of the data, by incorporating target constraints. The algorithms are implemented and then evaluated using a comprehensive set of experiments based on both synthetic and real-life data integration scenarios.

Who does this sound like?:

Data merging is notoriously hard for data integration and often not dealt with. Integration of scientific data, however, offers many complex scenarios where data merging is required. For example, proteins (each with a unique protein id) are often stored in multiple biological databases, each of which independently maintains different aspects of the protein data (e.g., structures, biological functions, etc.). When querying on a given protein through a target schema, it is important to merge all its relevant data (e.g., structures from one source, functions from another) given the constraint that protein id identifies all components of the protein.

When target constraints are present, it is not enough to consider only the mappings for query answering. The target instance that a query should “observe” must be defined by the interaction between all the mappings from the sources and all the target constraints. This interaction can be quite complex when schemas and mappings are nested and when the data merging rules can enable each other, possibly, in a recursive way. Hence, one of the first problems that we study in this paper is what it means, in a precise sense, to answer the target queries in the “best” way, given that the target instance is specified, indirectly, via the mappings and the target constraints. The rest of the paper will then address how to compute the correct answers without materializing the full target instance, via two novel algorithms that rewrite the target query into a set of corresponding source queries.

Wrong! 😉

The ACM reports sixty-seven (67) citations of this paper as of today. (Paper published in 2004.) Summaries of any of the citing literature welcome!

The question of data integration persists to this day. I take that to indicate that whatever the merits of this approach, data integration issues remain unsolved.

What are the merits/demerits of this approach?

German Compound Words

Monday, April 16th, 2012

German Compound Words by Brian Johnson.

From the post:

Mark Twain is quoted as having said, “Some German words are so long that they have a perspective.”

Although eBay users are unlikely to search using fearsome beasts like “rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz”, which stands for the “beef labeling supervision duties delegation law”, we do frequently see compound words in our users’ queries. While some might look for “damenlederhose”, others might be searching for the same thing (women’s leather pants) using the decompounded forms “damen lederhose” or “damen leder hose”. And even though a German teacher would tell you only “damenlederhose” or “damen lederhose” are correct, the users’ expectation is to see the same results regardless of which form is used.

This scenario exists on the seller side as well. That is, people selling on eBay might describe their item using one or more of these forms. In such cases, what should a search engine do? While the problem might seem simple at first, German word-breaking – or decompounding, as it is also known – is not so simple.

And you thought all this worrying about subject identifiers was just another intellectual pose! 😉

There are serious people who spend serious money in the quest to make even more money who worry about subject identification. They don’t discuss it in topic map terms, but it is subject identity none the less.

This post should get your started on some issues with German.

What other languages/scripts have the same or similar issues? Are the solutions here extensible or are new solutions needed?

Pointers to similar resources most welcome!

Query Rewriting in Search Engines

Monday, April 16th, 2012

Query Rewriting in Search Engines by Hugh Williams was mentioned in Amazon CloudSearch, Elastic Search as a Service by Jeff Dalton. (You need to read Jeff’s comments on Amazon Cloudsearch but onto query rewriting.)

From the post:

There’s countless information on search ranking – creating ranking functions, and their factors such as PageRank and text. Query rewriting is less conspicuous but equally important. Our experience at eBay is that query rewriting has the potential to deliver as much improvement to search as core ranking, and that’s what I’ve seen and heard at other companies.

What is query rewriting?

Let’s start with an example. Suppose a user queries for Gucci handbags at eBay. If we take this literally, the results will be those that have the words Gucci and handbags somewhere in the matching documents. Unfortunately, many great answers aren’t returned. Why?

Consider a document that contains Gucci and handbag, but never uses the plural handbags. It won’t match the query, and won’t be returned. Same story if the document contains Gucci and purse (rather than handbag). And again for a document that contains Gucci but doesn’t contain handbags or a synonym – instead it’s tagged in the “handbags” category on eBay; the user implicitly assumed it’d be returned when a buyer types Gucci handbags as their query.

To solve this problem, we need to do one of two things: add words to the documents so that they match other queries, or add words to the queries so that they match other documents. Query rewriting is the latter approach, and that’s the topic of this post. What I will say about expanding documents is there are tradeoffs: it’s always smart to compute something once in search and store it, rather than compute it for every query, and so there’s a certain attraction to modifying documents once. On the other hand, there are vastly more words in documents than there are words in queries, and doing too much to documents gets expensive and leads to imprecise matching (or returning too many irrelevant documents). I’ve also observed over the years that what works for queries doesn’t always work for documents.

You really need to read the post by Hugh a couple of times.

Query rewriting is approaching the problem of subject identity from the other side of topic maps.

Topic maps collect different identifiers for a subject as a basis for “merging”.

Query rewriting changes a query so it specifies different identifiers for a subject.

Let me try to draw a graphic for you (my graphic skills are crude at best):

Topic Maps versus Query Rewrite

I used “/” as an alternative marker for topic maps to illustrate that matching any identifier returns all of them. For query rewrite, the “+” sign indicates that each identifier is searched for in addition to the others.

The result is the same set of identifiers and results from using them on a query set.

From a different point of view.