Archive for the ‘Query Language’ Category

An Empirical Investigation into Programming Language Syntax

Monday, July 14th, 2014

An Empirical Investigation into Programming Language Syntax by Greg Wilson.

A great synopsis of Andreas Stefik and Susanna Siebert’s “An Empirical Investigation into Programming Language Syntax.” ACM Transactions on Computing Education, 13(4), Nov. 2013.

A sample to interest you in the post:

  1. Programming language designers needlessly make programming languages harder to learn by not doing basic usability testing. For example, “…the three most common words for looping in computer science, for, while, and foreach, were rated as the three most unintuitive choices by non-programmers.”
  2. C-style syntax, as used in Java and Perl, is just as hard for novices to learn as a randomly-designed syntax. Again, this pain is needless, because the syntax of other languages (such as Python and Ruby) is significantly easier.

Let me repeat part of that:

C-style syntax, as used in Java and Perl, is just as hard for novices to learn as a randomly-designed syntax.

Randomly-designed syntax?

Now, think about the latest semantic syntax or semantic query syntax you have read about.

Was it designed for users? Was there any user testing at all?

Is there a lesson here for designers of semantic syntaxes and query languages?


I first saw this in Greg Wilson’s Software Carpentry: Lessons Learned video.

The Reemergence of Datalog

Tuesday, June 24th, 2014

The Reemergence of Datalog by Michael Fogus.

Not all “old” ideas in CS are outdated by any means. Michael gives a brief history of Datalog and then describes current use of Datalog in Cascalog and the Bacwn Clojure library.
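If you have never seen Datalog's declarative style, its essence (recursive rules evaluated to a fixed point) can be sketched in a few lines of Python. The `parent`/`ancestor` rules below are my own toy illustration, not an example from Fogus's talk:

```python
# A naive-evaluation sketch of the classic Datalog program:
#   ancestor(X, Y) :- parent(X, Y).
#   ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

def ancestors(parent_facts):
    """Compute the ancestor relation to a fixed point (naive evaluation)."""
    ancestor = set(parent_facts)  # base rule
    while True:
        derived = {(x, y)
                   for (x, z) in parent_facts
                   for (z2, y) in ancestor
                   if z == z2}
        if derived <= ancestor:   # fixed point reached, nothing new derived
            return ancestor
        ancestor |= derived

facts = {("alice", "bob"), ("bob", "carol")}
print(sorted(ancestors(facts)))
# -> [('alice', 'bob'), ('alice', 'carol'), ('bob', 'carol')]
```

The point of systems like Cascalog is that you state the rules and the engine worries about evaluation order, exactly as the fixed-point loop does here.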

Notes from the presentation.

I first saw this in Nat Torkington’s Four short links: 24 June 2014.

FQL: A Functorial Query Language

Thursday, March 6th, 2014

FQL: A Functorial Query Language

From the webpage:

The FQL IDE is a visual schema mapping tool for developing FQL programs. It can run FQL programs, generate SQL from FQL, generate FQL from SQL, and generate FQL from schema correspondences. Using JDBC, it can run transparently using an external SQL engine and on external database instances. It can output RDF/OWL/XML and comes with many built-in examples. David Spivak and Ryan Wisnesky are the primary contributors. Requires Java 7.

As if FQL and the IDE weren’t enough, papers, slides, source code await you.

I first saw this in a tweet by Computer Science.

Querying rich text with Lux

Thursday, November 14th, 2013

Querying rich text with Lux – XQuery for Lucene by Michael Sokolov.

Slide deck that highlights features of Lux, which is billed on its homepage as:

The XML Search Engine Lux is an open source XML search engine formed by fusing two excellent technologies: the Apache Lucene/Solr search index and the Saxon XQuery/XSLT processor.

Not surprisingly, I am in favor of using XML to provide context for data.

You can get a better feel for Lux by:

Reading Indexing Queries in Lux by Michael Sokolov (Balisage 2013)

Visiting the Lux homepage:

Downloading Lux Source:

BTW, Michael does have experience with XML-based content.

PS: Remember any comments on XQuery 3.0 are due by November 19, 2013.

3 Myths about graph query languages…

Tuesday, October 15th, 2013

3 Myths about graph query languages. Busted by Pixy. by Sridhar Ramachandran.

A very short slide deck that leaves you wanting more information.

It got my attention because I didn’t know there were any myths about graph query languages. 😉

I think the sense of “myth” here is more “misunderstanding” or simply “incorrect information.”

Some references that may be helpful while reading these slides:

I must confess that “Myth #3: GQLs [Graph Query Languages] can’t be relational” has always puzzled me.

In part because hypergraphs have been used to model databases for quite some time.

For example:

Making use of arguments from information theory it is shown that a boolean function can represent multivalued dependencies. A method is described by which a hypergraph can be constructed to represent dependencies in a relation. A new normal form called generalized Boyce-Codd normal form is defined. An explicit formula is derived for representing dependencies that would remain in a projection of a relation. A definition of join is given which makes the derivation of many theoretical results easy. Another definition given is that of information in a relation. The information gets conserved whenever lossless decompositions are involved. It is shown that the use of null elements is important in handling data.

Would you believe: Some analytic tools for the design of relational database systems by K. K. Nambiar in 1980?

So far as I know, hypergraphs are a form of graph so it isn’t true that “graphs can only express binary relations/predicates.”

One difference (there are others) is that a hypergraph database doesn’t require derivation of relationships because those relationships are already captured by a hyperedge.

Moreover, a vertex can (whether it “may” or not in a particular hypergraph is another issue) be a member of more than one hyperedge.

Determination of common members becomes a straightforward query, as opposed to two or more derivations of associations followed by calculation of any intersection.
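To make that concrete, here is a toy sketch (my own illustration, not from the slides) of hyperedges as named vertex sets. Finding common members of two hyperedges is a single set intersection, with no join derivation step:

```python
# Hyperedges modelled as named vertex sets; note a vertex such as
# "warehouse-9" can belong to several hyperedges at once.
hyperedges = {
    "supplies": {"acme", "widget", "warehouse-9"},   # a ternary relation
    "ships":    {"acme", "widget", "route-66"},
    "audits":   {"warehouse-9", "inspector-5"},
}

def common_members(db, edge_a, edge_b):
    """Vertices shared by two hyperedges: one intersection, no joins."""
    return db[edge_a] & db[edge_b]

print(sorted(common_members(hyperedges, "supplies", "ships")))
# -> ['acme', 'widget']
```

The n-ary relationship is captured directly by the hyperedge, rather than being reconstructed from binary edges at query time.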

For all of that, the deck remains important as notice of a new declarative graph query language (GQL).

imMens: Real-time Visual Querying of Big Data

Sunday, April 7th, 2013

imMens: Real-time Visual Querying of Big Data by Zhicheng Liu, Biye Jiang and Jeffrey Heer.


Data analysts must make sense of increasingly large data sets, sometimes with billions or more records. We present methods for interactive visualization of big data, following the principle that perceptual and interactive scalability should be limited by the chosen resolution of the visualized data, not the number of records. We first describe a design space of scalable visual summaries that use data reduction methods (such as binned aggregation or sampling) to visualize a variety of data types. We then contribute methods for interactive querying (e.g., brushing & linking) among binned plots through a combination of multivariate data tiles and parallel query processing. We implement our techniques in imMens, a browser-based visual analysis system that uses WebGL for data processing and rendering on the GPU. In benchmarks imMens sustains 50 frames-per-second brushing & linking among dozens of visualizations, with invariant performance on data sizes ranging from thousands to billions of records.
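The central reduction in the abstract, binned aggregation, can be sketched in plain Python (my own toy illustration, not the imMens WebGL code): the output a plot consumes depends on the number of bins, not the number of records.

```python
def binned_counts(values, lo, hi, nbins):
    """Aggregate raw values into fixed-width bins; output size is nbins
    regardless of len(values)."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if lo <= v < hi:
            # min() guards against float rounding at the upper edge
            i = min(int((v - lo) / width), nbins - 1)
            counts[i] += 1
    return counts

# A million records reduce to the same 4-number summary a bar chart needs.
data = [i % 10 for i in range(1_000_000)]
print(binned_counts(data, 0, 10, 4))  # -> [300000, 200000, 300000, 200000]
```

Brushing and linking then operates on precomputed multivariate tiles of such bins, which is why performance stays flat as record counts grow.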

Code is available at:

The emphasis on “real-time” with “big data” continues.

Impressive work but I wonder if there is a continuum of “big data” for “real-time” access, analysis and/or visualization?

Some types of big data are simple enough for real-time analysis, but other types are less so and there are types of big data where real-time analysis is inappropriate.

What I don’t know is what factors you would evaluate to place one big data set at one point on that continuum and another data set at another, closer to one end or the other.

Are you aware of research on the appropriateness of “real-time” analysis of big data?

I first saw this in This Week in Research by Isaac Lopez.

Semantic Queries by Example [Identity by Example (IBE)?]

Sunday, March 17th, 2013

Semantic Queries by Example by Lipyeow Lim, Haixun Wang, Min Wang.


With the ever increasing quantities of electronic data, there is a growing need to make sense out of the data. Many advanced database applications are beginning to support this need by integrating domain knowledge encoded as ontologies into queries over relational data. However, it is extremely difficult to express queries against graph structured ontology in the relational SQL query language or its extensions. Moreover, semantic queries are usually not precise, especially when data and its related ontology are complicated. Users often only have a vague notion of their information needs and are not able to specify queries precisely. In this paper, we address these challenges by introducing a novel method to support semantic queries in relational databases with ease. Instead of casting ontology into relational form and creating new language constructs to express such queries, we ask the user to provide a small number of examples that satisfy the query she has in mind. Using those examples as seeds, the system infers the exact query automatically, and the user is therefore shielded from the complexity of interfacing with the ontology. Our approach consists of three steps. In the first step, the user provides several examples that satisfy the query. In the second step, we use machine learning techniques to mine the semantics of the query from the given examples and related ontologies. Finally, we apply the query semantics on the data to generate the full query result. We also implement an optional active learning mechanism to find the query semantics accurately and quickly. Our experiments validate the effectiveness of our approach.

Potentially deeply important work for both a topic map query language and topic map authoring.

The authors conclude:

In this paper, we introduce a machine learning approach to support semantic queries in relational database. In semantic query processing, the biggest hurdle is to represent ontological data in relational form so that the relational database engine can manipulate the ontology in a way consistent with manipulating the data. Previous approaches include transforming the graph ontological data into tabular form, or representing ontological data in XML and leveraging database extenders on XML such as DB2’s Viper. These approaches, however, are either expensive (materializing a transitive relationship represented by a graph may increase the data size exponentially) or requiring changes in the database engine and new extensions to SQL. Our approach shields the user from the necessity of dealing with the ontology directly. Indeed, as our user study indicates, the difficulty of expressing ontology-based query semantics in a query language is the major hurdle of promoting semantic query processing. With our approach, the users do not even need to know ontology representation. All that is required is that the user gives some examples that satisfy the query he has in mind. The system then automatically finds the answer to the query. In this process, semantics, which is a concept usually hard to express, remains as a concept in the mind of user, without having to be expressed explicitly in a query language. Our experiments and user study results show that the approach is efficient, effective, and general in supporting semantic queries in terms of both accuracy and usability. (emphasis added)

I rather like: “In this process, semantics, which is a concept usually hard to express, remains as a concept in the mind of user, without having to be expressed explicitly in a query language.”

To take it a step further, it should apply to the authoring of topic maps as well.

A user selects from a set of examples the subjects they want to talk about. Quite different from any topic map authoring interface I have seen to date.
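The idea is easy to caricature in code. Below is my own toy sketch of the approach, not the authors' machine-learning pipeline: given a few example instances, infer the most specific ontology concept covering all of them, then return every instance of that concept. All names are hypothetical.

```python
# Toy "query by example" over an ontology: infer the most specific
# concept covering the user's examples, then run that as the query.
parents = {  # child concept -> parent concept
    "cat": "mammal", "dog": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal",
}
instances = {"felix": "cat", "rex": "dog", "tweety": "sparrow"}

def lineage(concept):
    chain = [concept]
    while concept in parents:
        concept = parents[concept]
        chain.append(concept)
    return chain

def infer_query(examples):
    """Most specific concept whose subtree contains every example."""
    chains = [lineage(instances[e]) for e in examples]
    common = set(chains[0]).intersection(*chains[1:])
    # earliest common entry in the first chain is the most specific
    return next(c for c in chains[0] if c in common)

def run_query(concept):
    return sorted(i for i, cls in instances.items()
                  if concept in lineage(cls))

q = infer_query(["felix", "rex"])
print(q, run_query(q))  # mammal ['felix', 'rex']
```

The user never sees the ontology: two examples in, a generalized result set out, and the semantics stay “in the mind of the user.”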

The “details” of capturing and querying semantics have stymied RDF:

F-16 cockpit

(From: The Semantic Web Is Failing — But Why? (Part 4))

And topic map authoring as well.

Is your next authoring/querying interface going to be by example?

I first saw this in a tweet by Stefano Bertolo.

Google Guide [Improve Google Searching]

Thursday, January 31st, 2013

Google Guide by Nancy Blachman.

Non-official documentation for Google searching but very nice non-official documentation.

If you want to improve your Google searching, this is a good place to start!

Available in English, Dutch, German, Hebrew and Italian.

Aggregate Skyline Join Queries: Skylines with Aggregate Operations over Multiple Relations

Tuesday, January 29th, 2013

Aggregate Skyline Join Queries: Skylines with Aggregate Operations over Multiple Relations by Arnab Bhattacharya and B. Palvali Teja.
(Submitted on 28 Jun 2012)


The multi-criteria decision making, which is possible with the advent of skyline queries, has been applied in many areas. Though most of the existing research is concerned with only a single relation, several real world applications require finding the skyline set of records over multiple relations. Consequently, the join operation over skylines where the preferences are local to each relation, has been proposed. In many of those cases, however, the join often involves performing aggregate operations among some of the attributes from the different relations. In this paper, we introduce such queries as “aggregate skyline join queries”. Since the naive algorithm is impractical, we propose three algorithms to efficiently process such queries. The algorithms utilize certain properties of skyline sets, and processes the skylines as much as possible locally before computing the join. Experiments with real and synthetic datasets exhibit the practicality and scalability of the algorithms with respect to the cardinality and dimensionality of the relations.

The authors illustrate a “skyline” query with a search for a hotel that has a good price and is close to the beach. A “skyline” set of hotels excludes hotels that are not as good on those points as hotels in the set. They then observe:

In real applications, however, there often exists a scenario when a single relation is not sufficient for the application, and the skyline needs to be computed over multiple relations [16]. For example, consider a flight database. A person traveling from city A to city B may use stopovers, but may still be interested in flights that are cheaper, have a less overall journey time, better ratings and more amenities. In this case, a single relation specifying all direct flights from A to B may not suffice or may not even exist. The join of multiple relations consisting of flights starting from A and those ending at B needs to be processed before computing the preferences.

The above problem becomes even more complex if the person is interested in the travel plan that optimizes both on the total cost as well as the total journey time for the two flights (other than the ratings and amenities of each airline). In essence, the skyline now needs to be computed on attributes that have been aggregated from multiple relations in addition to attributes whose preferences are local within each relation. The common aggregate operations are sum, average, minimum, maximum, etc.
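For readers new to skylines, the dominance test behind the hotel example is short enough to sketch directly (hypothetical data, lower is better on both dimensions):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly
    better in at least one (lower values preferred here)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def skyline(points):
    """Points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# (price, distance-to-beach) for some hypothetical hotels
hotels = [(50, 8), (60, 2), (80, 1), (90, 5)]
print(skyline(hotels))  # -> [(50, 8), (60, 2), (80, 1)]
```

The (90, 5) hotel drops out because (60, 2) beats it on both criteria; the rest trade price against distance, so no one answer dominates.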

No doubt the travel industry thinks it has conquered semantic diversity in travel arrangements. If it has, that happened since I stopped traveling several years ago.

Even simple tasks such as coordination of air and train schedules was unnecessarily difficult.

I suspect that is still the case and so mention “skyline” queries as a topic to be aware of and if necessary, to include in a topic map application that brings sanity to travel arrangements.

True, you can get a travel service that handles all the details, but only for a price and only if you are that trusting.

Optimizing TM Queries?

Wednesday, January 16th, 2013

A recent paper by V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyễn, Optimizing XML querying using type-based document projection, suggests some interesting avenues for optimizing topic map queries.


XML data projection (or pruning) is a natural optimization for main memory query engines: given a query Q over a document D, the subtrees of D that are not necessary to evaluate Q are pruned, thus producing a smaller document D′; the query Q is then executed on D′, hence avoiding to allocate and process nodes that will never be reached by Q.

In this article, we propose a new approach, based on types, that greatly improves current solutions. Besides providing comparable or greater precision and far lesser pruning overhead, our solution (unlike current approaches) takes into account backward axes, predicates, and can be applied to multiple queries rather than just to single ones. A side contribution is a new type system for XPath able to handle backward axes. The soundness of our approach is formally proved. Furthermore, we prove that the approach is also complete (i.e., yields the best possible type-driven pruning) for a relevant class of queries and Schemas. We further validate our approach using the XMark and XPathMark benchmarks and show that pruning not only improves the main memory query engine’s performances (as expected) but also those of state of the art native XML databases.

Phrased in traditional XML terms, but imagine pruning a topic map by topic or association types, for example, before execution of a query.
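The effect of projection (though not the paper's type-based analysis, which computes what to keep from the query and schema) can be sketched with the standard library's ElementTree. The document and tag names are hypothetical:

```python
import xml.etree.ElementTree as ET

def prune(elem, keep_tags):
    """Recursively drop child subtrees whose tag can never contribute to
    the query -- a crude stand-in for type-based projection."""
    for child in list(elem):          # copy: we mutate while iterating
        if child.tag in keep_tags:
            prune(child, keep_tags)
        else:
            elem.remove(child)
    return elem

doc = ET.fromstring(
    "<map><topic><name>Datalog</name><log>x</log></topic>"
    "<assoc><member>t1</member></assoc></map>")

# A query touching only topic/name never needs assoc or log subtrees.
pruned = prune(doc, {"topic", "name"})
print(ET.tostring(pruned, encoding="unicode"))
# -> <map><topic><name>Datalog</name></topic></map>
```

The query then runs against the smaller tree, which is exactly the memory saving the abstract describes.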

While it is true enough that a query could include topic type, there remains the matter of examining all the instances of that topic type before proceeding to the rest of the query.

For common query sub-maps, as it were, I suspect that pruning once and storing the results could be a viable alternative.

Despite the bragging rights that come with processing millions or billions of nodes, processing the right set of nodes and producing a useful answer has its supporters.

MongoDB Puzzlers #1

Sunday, December 30th, 2012

MongoDB Puzzlers #1 by Kristina Chodorow.

If you are not too deeply invested in the fiscal cliff debate, ;-), you may enjoy the distraction of a puzzler based on the MongoDB query language.

Collecting puzzlers for MongoDB and other query languages would be a good idea.
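In that spirit, here is one of my own (a pure-Python simulation of MongoDB's matching semantics, not a query against a live server): when a field holds an array, each operator in a criteria document may be satisfied by a different element.

```python
def matches(doc_value, criteria):
    """Simulate MongoDB operator matching on one field: for an array,
    each operator matches if ANY element satisfies it -- the operators
    need not agree on the same element."""
    ops = {"$gt": lambda v, b: v > b, "$lt": lambda v, b: v < b}
    values = doc_value if isinstance(doc_value, list) else [doc_value]
    return all(any(ops[op](v, bound) for v in values)
               for op, bound in criteria.items())

# Puzzler: does {"x": [0, 10]} match {"x": {"$gt": 2, "$lt": 5}}?
print(matches([0, 10], {"$gt": 2, "$lt": 5}))  # True: 10 > 2 and 0 < 5
print(matches(3, {"$gt": 2, "$lt": 5}))        # True: one scalar does both
print(matches([0, 1], {"$gt": 2, "$lt": 5}))   # False: nothing exceeds 2
```

The surprising first answer is why MongoDB offers `$elemMatch` to force all conditions onto a single array element.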

Something to be enjoyed in times of national “crisis,” aka, collective hand wringing by the media.

Static and Dynamic Semantics of NoSQL Languages […Combining Operators…]

Tuesday, December 25th, 2012

Static and Dynamic Semantics of NoSQL Languages (PDF) by Véronique Benzaken, Giuseppe Castagna, Kim Nguyễn and Jérôme Siméon.


We present a calculus for processing semistructured data that spans differences of application area among several novel query languages, broadly categorized as “NoSQL”. This calculus lets users define their own operators, capturing a wider range of data processing capabilities, whilst providing a typing precision so far typical only of primitive hard-coded operators. The type inference algorithm is based on semantic type checking, resulting in type information that is both precise, and flexible enough to handle structured and semistructured data. We illustrate the use of this calculus by encoding a large fragment of Jaql, including operations and iterators over JSON, embedded SQL expressions, and co-grouping, and show how the encoding directly yields a typing discipline for Jaql as it is, namely without the addition of any type definition or type annotation in the code.

From the conclusion:

On the structural side, the claim is that combining recursive records and pairs by unions, intersections, and negations suffices to capture all possible structuring of data, covering a palette ranging from comprehensions, to heterogeneous lists mixing typed and untyped data, through regular expressions types and XML schemas. Therefore, our calculus not only provides a simple way to give a formal semantics to, reciprocally compare, and combine operators of different NoSQL languages, but also offers a means to equip these languages, in their current definition (ie, without any type definition or annotation), with precise type inference.

With lots of work in between the abstract and conclusion.

The capacity to combine operators of different NoSQL languages sounds relevant to a topic maps query language.


I first saw this in a tweet by Computer Science.

A thrift to CQL3 upgrade guide [Cassandra Query Language]

Saturday, November 24th, 2012

A thrift to CQL3 upgrade guide by Sylvain Lebresne.

From the post:

CQL3 (the Cassandra Query Language) provides a new API to work with Cassandra. Where the legacy thrift API exposes the internal storage structure of Cassandra pretty much directly, CQL3 provides a thin abstraction layer over this internal structure. This is A Good Thing as it allows hiding from the API a number of distracting and useless implementation details (such as range ghosts) and allows to provide native syntaxes for common encodings/idioms (like the CQL3 collections as we’ll discuss below), instead of letting each client or client library reimplement them in their own, different and thus incompatible, way. However, the fact that CQL3 provides a thin layer of abstraction means that thrift users will have to understand the basics of this abstraction if they want to move existing application to CQL3. This is what this post tries to address. It explains how to translate thrift to CQL3. In doing so, this post also explains the basics of the implementation of CQL3 and can thus be of interest for those that want to understand that.

But before getting to the crux of the matter, let us have a word about when one should use CQL3. As described above, we believe that CQL3 is a simpler and overall better API for Cassandra than the thrift API is. Therefore, new projects/applications are encouraged to use CQL3 (though remember that CQL3 is not final yet, and so this statement will only be fully valid with Cassandra 1.2). But the thrift API is not going anywhere. Existing applications do not have to upgrade to CQL3. Internally, both CQL3 and thrift use the same storage engine, so all future improvements to this engine will impact both of them equally. Thus, this guide is for those that 1) have an existing application using thrift and 2) want to port it to CQL3.

Finally, let us remark that CQL3 does not claim to fundamentally change how to model applications for Cassandra. The main modeling principles are the same than they always have been: efficient modeling is still based on collocating that data that are accessed together through denormalization and the use of the ordering the storage engine provides, and is thus largely driven by the queries. If anything, CQL3 claims to make it easier to model along those principles by providing a simpler and more intuitive syntax to implement a number of idioms that this kind of modeling requires.
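The heart of the thrift-to-CQL3 mapping the post walks through is that one wide thrift row becomes many CQL3 rows, with the composite column name split into clustering columns. A pure-Python illustration of that regrouping (all names and data hypothetical):

```python
# One thrift-level wide row: a key plus ordered (composite column, value)
# cells, where the column name is a (clustering part, column part) pair.
wide_row = ("user42", [
    (("2012-11-24", "title"), "CQL3 guide"),
    (("2012-11-24", "body"),  "..."),
    (("2012-11-25", "title"), "More CQL3"),
])

def to_cql_rows(key, cells):
    """Group cells by the first composite component (the clustering key);
    the remaining component becomes an ordinary CQL3 column name."""
    rows = {}
    for (clustering, column), value in cells:
        rows.setdefault(clustering, {"user_id": key, "posted_at": clustering})
        rows[clustering][column] = value
    return list(rows.values())

for row in to_cql_rows(*wide_row):
    print(row)
```

The storage engine underneath is unchanged; CQL3 just presents the internal cells as rows of a table with a compound primary key, roughly `PRIMARY KEY (user_id, posted_at)` in this sketch.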

If you are using Cassandra, definitely the time to sit up and take notice. CQL3 is coming.

I first saw this at Alex Popescu’s myNoSQL.

Proximity Operators [LucidWorks]

Thursday, August 16th, 2012

Proximity Operators

From the webpage:

A proximity query searches for terms that are either near each other or occur in a specified order in a document rather than simply whether they occur in a document or not.

You will use some of these operators more than others but having a bookmark to the documentation will prove to be useful.
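What a proximity operator actually tests is easy to state in code. This is my own sketch of the concept (real Lucene slop counting is subtly different, being based on positional edits), using a hypothetical document:

```python
def within_proximity(tokens, term_a, term_b, slop):
    """True if term_a and term_b occur within `slop` token positions of
    each other (order-insensitive), roughly what "apache lucene"~5 asks."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(i - j) <= slop for i in pos_a for j in pos_b)

doc = "the apache software foundation maintains the lucene library".split()
print(within_proximity(doc, "apache", "lucene", 5))   # True  (positions 1, 6)
print(within_proximity(doc, "apache", "library", 5))  # False (positions 1, 7)
```

Contrast this with a plain conjunctive query, which would match the document in both cases regardless of where the terms sit.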

Replacing dtSearch

Wednesday, April 25th, 2012

An open source replacement for the dtSearch closed source search engine

From the webpage:

We’ve been working on a client project where we needed to replace the dtSearch closed source search engine, which doesn’t perform that well at scale in this case. As the client has significant investment in stored queries (it’s for a monitoring application) they were keen that the new engine spoke exactly the same query language as the old – so we’ve built a version of Apache Lucene to replace dtSearch. There are a few other modifications we had to do as well, to return such things as positional information from deep within the Lucene code (this is particularly important in monitoring as you want to show clients where the keywords they were interested in appeared in an article – they may be checking their media coverage in detail, and position on the page is important).

The preservation/reuse of stored queries is a testimony to the configurable nature of Lucene software.

How far can the query preservation/reuse capabilities of Lucene be extended?

Dates, date boosting, and NOW

Thursday, April 5th, 2012

Dates, date boosting, and NOW by Erick Erickson

From the post:

More NOW evil

Prompted by a subtle issue a client raised, I was thinking about date boosting. According to the Wiki, a good way to boost by date is by something like the following:


(see: date boosting link). And this works well, no question.

However, there’s a subtle issue when paging. NOW evaluates to the current time, and every subsequent request will have a different value for NOW. This blog post about the effects of this on filter queries provides some useful background.
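One common mitigation (standard Solr date math, though whether it fits a given application is a judgment call) is to round `NOW`, e.g. to `NOW/DAY`, so every page of a paged request boosts against the same reference instant. A minimal sketch of building such request parameters; the `pub_date` field name is hypothetical:

```python
# Rounding NOW with Solr date math keeps the boost value stable across
# paged requests (and lets filter caches hit), at the cost of the boost
# being up to a day stale.
def date_boost_params(query, page, rows=10, granularity="DAY"):
    return {
        "q": query,
        "start": page * rows,
        "rows": rows,
        # recip(ms(...)) is the recency-boost function from the Solr wiki;
        # NOW is rounded so page 2 sees the same reference time as page 1.
        "bf": f"recip(ms(NOW/{granularity},pub_date),3.16e-11,1,1)",
    }

p1 = date_boost_params("solr", page=0)
p2 = date_boost_params("solr", page=1)
assert p1["bf"] == p2["bf"]  # identical boost expression across pages
print(p2["bf"])
```

With bare `NOW`, each page request would evaluate the boost at a different instant, which is exactly the paging subtlety Erick describes.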

If you like really subtle Solr issues then you will love this post. It doesn’t really have a happy ending per se but you will pick up some experience at deep analysis.

Erick’s concluding advice that users rarely go to the second page of search results, making the problem here an edge case, makes me uneasy.

I am sure Erick is right about the numbers, but I remain uneasy. I would have to monitor users for occurrences of the “edge case” before I would be confident enough to simply ignore it.


Induction: A Polyglot Database Client for Mac OS X

Thursday, March 8th, 2012

Induction: A Polyglot Database Client for Mac OS X

From the post:

Explore, Query, Visualize

Focus on the data, not the database. Induction is a new kind of tool designed for understanding and communicating relationships in data. Explore rows and columns, query to get exactly what you want, and visualize that data in powerful ways.

SQL? NoSQL? It Don’t Matter

Data is just data, after all. Induction supports PostgreSQL, MySQL, SQLite, Redis, and MongoDB out-of-the-box, and has an extensible architecture that makes it easy to write adapters for anything else you can think of. CouchDB? Oracle? Facebook Graph? Excel? Make it so!

Some commercial advice for the Induction crew:

Sounds great!

Be aware that Excel controls 75% of the BI market. I don’t know the numbers for Oracle products generally, but suspect “enterprise” and “Oracle” are most often heard together. I would make those “out of the box” even before 1.0.

If this is visualization, integration can’t be far behind.

Bird’s Eye View of the ElasticSearch Query DSL

Saturday, February 18th, 2012

Bird’s Eye View of the ElasticSearch Query DSL by Peter Karich.

From the post:

I’ve copied the whole post into a gist so that you can simply clone, copy and paste the important stuff and even could contribute easily.

Several times per month there are questions regarding the query structure on the ElasticSearch user group.

Although there are good docs explaining this in depth, I think a bird’s eye view of the Query DSL is necessary to understand what is written there. There is even some good external documentation available. And there were attempts to define a schema but nevertheless I’ll add my 2 cents here. I assume you set up your ElasticSearch instance correctly and on the local machine filled with exactly those 3 articles.
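For orientation, the recurring shape of the Query DSL is a tree of nested objects: compound queries such as `bool` combine leaf queries. A minimal hypothetical example (field names invented, and the exact leaf-query names have shifted between ElasticSearch versions):

```python
import json

# The Query DSL as nested objects: a "bool" compound query combining
# leaf queries. Field names here are hypothetical.
query = {
    "query": {
        "bool": {
            "must":     [{"match": {"title": "elasticsearch"}}],
            "must_not": [{"match": {"status": "draft"}}],
            "should":   [{"range": {"year": {"gte": 2012}}}],
        }
    },
    "size": 10,
}
print(json.dumps(query, indent=2))
```

Once you see that every query is an object whose single key names the query type, the in-depth docs become much easier to navigate.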

Do you have a feel for what a “bird’s eye view” would say about the differences in NoSQL query languages?

SQL has been relatively uniform, enabling users to learn the basics and then fill in the particulars as necessary. How far are we from a query DSL that obscures most of the differences from the average user?

Using Bitmap Indexes in Query Processing

Monday, January 30th, 2012

Why are column oriented databases so much faster than row oriented databases?

Be sure to read all the comments. Some of the techniques described are covered by patents (according to the comments) but there are open source implementations of alternatives. There is also a good discussion of the trade-offs in using this technique.

Search terms: Hybrid Word Aligned Bitmaps, HWAB, EWAB, FastBit.

Follow the links in the post and comments for more resources.
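The core trick is compact enough to show in plain Python (a toy, uncompressed version; word-aligned schemes like HWAB add run-length compression on top): one bitmap per column value, with predicates combined by bitwise operations over whole machine words.

```python
# One bitmap per (column, value): bit i is set iff row i has that value.
rows = [("red", "S"), ("blue", "M"), ("red", "M"), ("red", "L")]

def build_bitmaps(rows, col):
    maps = {}
    for i, row in enumerate(rows):
        maps.setdefault(row[col], 0)
        maps[row[col]] |= 1 << i
    return maps

color = build_bitmaps(rows, 0)
size = build_bitmaps(rows, 1)

# WHERE color = 'red' AND size = 'M'  ->  a single AND of two integers
hits = color["red"] & size["M"]
print([i for i in range(len(rows)) if hits >> i & 1])  # -> [2]
```

Because the AND touches only the two columns involved, this also illustrates why column-oriented layouts pay off: rows never enter the picture until result materialization.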


From a topic map perspective, how would you structure a set of relational tables to represent the information items defined by the Topic Map Data Model? (Yes, it has been done before, but no peeking! Your result will likely be very similar, but I am interested in how you would structure the data.) (If you want to think ahead, the same question applies to the various NoSQL options.)

For the relational database, how would you structure a chain of selects to choose all the information items that should merge for any particular item? In other words, start off with the values of an item that should merge and construct a select that gathers up the other items with which it should merge.

Enumerate the operations you would need to perform post-select to present a “merged” information item to the final user.

Observation: Isn’t indexing the first step towards merging? That is, we have to gather up all the relevant representatives of a subject before we can consider the mechanics of merging.

First seen at myNoSQL.

Querying Semi-Structured Data

Friday, January 6th, 2012

Querying Semi-Structured Data

The Semi-structured data and P2P graph databases post I point to has a broken reference to Serge Abiteboul’s “Querying Semi-Structured Data.” Since I could not correct it there and the topic is of interest for topic maps, I created this entry for it here.

From the Introduction:

The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases.

As will seen later when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data-formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research.

The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data. The “lightweight” data models they use (based on labelled graphs) are very similar.

As we shall see, the topic of semi-structured data has no precise boundary. Furthermore, a theory of semi-structured data is still missing. We will try to highlight some important issues in this context.

The paper is organized as follows. In Section 2, we discuss the particularities of semi-structured data. In Section 3, we consider the issue of the data structure and in Section 4, the issue of the query language.
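The "lightweight" labelled-graph models the paper surveys are easy to caricature: data as edge-labelled nested structure, queried by label paths, with no schema enforced. A toy sketch (my own, with hypothetical data) of why path queries must tolerate missing and irregular fields:

```python
# Semi-structured data as an edge-labelled tree: no schema, and fields
# may be missing or vary in shape from one entry to the next.
db = {"person": [
    {"name": "Abiteboul", "affiliation": "Stanford InfoLab"},
    {"name": {"first": "Serge", "last": "A."}},   # irregular structure
]}

def path_query(node, path):
    """Return every value reachable by following the label path,
    silently skipping entries that lack a label."""
    if not path:
        return [node]
    label, rest = path[0], path[1:]
    results = []
    children = node if isinstance(node, list) else [node]
    for child in children:
        if isinstance(child, dict) and label in child:
            results.extend(path_query(child[label], rest))
    return results

print(path_query(db, ["person", "name"]))
print(path_query(db, ["person", "affiliation"]))  # -> ['Stanford InfoLab']
```

A relational query would reject this data outright; the semi-structured query simply returns whatever the labels reach, which is the flexibility the introduction argues for.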

A bit dated (1996) but still worth reading. Updating the paper would make a nice semester-size project.

BTW, note the download graphics. Makes me think that archives should have an “anonymous notice” feature that allows anyone downloading a paper to send an email to anyone who has downloaded the paper in the past, without disclosing the emails of the prior downloaders.

I would really like to know what the people in Jan/Feb of 2011 were looking for. Perhaps they are working on an update of the paper, or would like to collaborate on updating it.

Seems like a small “feature” that would allow researchers to contact others without disclosure of email addresses (other than for the sender of course).

Formal publication data:

Abiteboul, S. (1996) Querying Semi-Structured Data. Technical Report. Stanford InfoLab. (Publication Note: Database Theory – ICDT ’97, 6th International Conference, Delphi, Greece, January 8-10, 1997)
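To make the "lightweight" labelled-graph model concrete, here is a minimal sketch of my own (not Abiteboul's notation): nodes map edge labels to children, and a query is just a traversal over a path of labels.

```python
# The graph: each node maps edge labels to lists of child nodes (or atomic values).
db = {
    "bib": [
        {"book": [
            {"title": ["Foundations of Databases"],
             "author": ["Abiteboul", "Hull", "Vianu"]},
        ]},
        {"article": [
            {"title": ["Querying Semi-Structured Data"],
             "author": ["Abiteboul"]},
        ]},
    ]
}

def select(node, path):
    """Return all values reachable from `node` by following the label path."""
    if not path:
        return [node]
    label, rest = path[0], path[1:]
    results = []
    for child in node.get(label, []):
        if isinstance(child, dict):
            results.extend(select(child, rest))
        elif not rest:           # atomic value at the end of the path
            results.append(child)
    return results

# Entries under "bib" are one-edge nodes ("book" or "article"), so flatten one level:
entries = [e for d in db["bib"] for e in select(d, ["book"]) + select(d, ["article"])]
titles = [t for e in entries for t in select(e, ["title"])]
print(titles)
```

Note that the query never consults a schema: a `book` and an `article` can carry different labels, and the traversal simply follows whatever edges are present, which is the point of the model.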

Why Not AND, OR, And NOT?

Friday, December 30th, 2011

Why Not AND, OR, And NOT?

From the post:

The following is written with Solr users in mind, but the principles apply to Lucene users as well.

I really dislike the so called “Boolean Operators” (“AND”, “OR”, and “NOT”) and generally discourage people from using them. It’s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it’s a good idea to try to “set aside childish things” and start thinking (and encouraging your users to think) in terms of the superior “Prefix Operators” (“+”, “-”).

Required reading if you want to understand how the “Boolean Operators” work in Lucene/Solr, and a superior alternative.
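A toy matcher makes the prefix-operator semantics concrete. This is my own sketch, not Solr's actual query parser: "+" marks a required term, "-" a prohibited term, and a bare term is optional (it only affects ranking, not whether a document matches).

```python
def parse(query):
    """Split a query into (required, prohibited, optional) term sets."""
    required, prohibited, optional = set(), set(), set()
    for tok in query.split():
        if tok.startswith("+"):
            required.add(tok[1:])
        elif tok.startswith("-"):
            prohibited.add(tok[1:])
        else:
            optional.add(tok)
    return required, prohibited, optional

def matches(doc_terms, query):
    required, prohibited, optional = parse(query)
    terms = set(doc_terms)
    if not required.issubset(terms):     # every "+" term must be present
        return False
    if terms & prohibited:               # no "-" term may be present
        return False
    # With no required terms, at least one optional term must match.
    return bool(required) or bool(terms & optional)

doc = ["solr", "lucene", "search"]
print(matches(doc, "+solr -elasticsearch lucene"))   # True
print(matches(doc, "+solr +nutch"))                  # False: "nutch" missing
```

Each term's role is independent of its neighbors, which is exactly what the infix AND/OR/NOT notation obscures.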

Efficient Similarity Query Processing (Previously Efficient Exact Similarity Join)

Saturday, December 24th, 2011

Efficient Similarity Query Processing (Previously Efficient Exact Similarity Join)

From the webpage:

Given a similarity function and two sets of objects, a similarity join returns all pairs of objects (from each set respectively) such that their similarity value satisfies a given criterion. A typical example is to find pairs of documents such that their cosine similarity is above a constant threshold (say, 0.95), as they are likely to be near duplicate documents.

In this project, we focus on efficient algorithms to perform the similarity join on (multi-) sets or strings both exactly and approximately. Commonly used similarity functions for (multi-) sets or strings are more complex than Euclidean distance functions. As a result, many previous approaches have to compute the approximate results instead.

We have developed several fast algorithms that address the above performance issue. They work for Jaccard, Dice, Cosine similarities and Edit distance constraints. We have also devised algorithms to compute top-k similar pairs progressively so that a user does not need to specify a similarity threshold for some unknown dataset. Recently, we have obtained several interesting new results on edit similarity search and joins by exploiting asymmetric signature schemes. The resulting IndexChunkTurbo algorithm can process most of the similarity queries efficiently while occupying almost the least amount of index space.

We also investigated the problem of approximate entity matching. Our SIGMOD09 work can extract approximate mentions of entities in a document given an entity dictionary.
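For readers new to the problem, here is a naive all-pairs similarity join over token sets, a baseline of my own for illustration only; the group's PPJoin-style algorithms get their speed precisely by pruning candidate pairs (with prefix and positional filtering) instead of comparing every pair like this.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def similarity_join(left, right, threshold):
    """Return all (i, j, sim) with jaccard(left[i], right[j]) >= threshold."""
    return [
        (i, j, jaccard(x, y))
        for i, x in enumerate(left)
        for j, y in enumerate(right)
        if jaccard(x, y) >= threshold
    ]

docs1 = [{"near", "duplicate", "detection"}, {"graph", "edit", "distance"}]
docs2 = [{"near", "duplicate", "documents"}, {"query", "processing"}]
pairs = similarity_join(docs1, docs2, 0.4)
print(pairs)   # [(0, 0, 0.5)]
```

The quadratic pair enumeration is what makes the exact problem hard at scale, and why so much of the group's work goes into candidate filtering.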

The first five (5) papers:

  • Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang. Efficient Graph Similarity Joins with Edit Distance Constraints. ICDE 2012.

    Summary: An efficient algorithm to compute graphs within certain (graph) edit distance away from a query graph.

  • Sunanda Patro, Wei Wang. Learning Top-k Transformation Rules. DEXA 2011.

    Summary: A new algorithm to learn transformation rules (e.g., VLDB = Very Large Data Base) from unannotated, potentially noisy data. Compared with previous approaches, our method can find much more valid rules in the top-k output.

  • Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Guoren Wang. Efficient Similarity Joins for Near Duplicate Detection. ACM Transactions on Database Systems (TODS).

    Summary: This is the journal version of our WWW 2008 paper, with extension to implementing the PPJoin family of algorithms on top of relational database systems.

  • Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin. Efficient Exact Edit Similarity Query Processing with Asymmetric Signature Schemes. SIGMOD 2011.

    Summary: Two simple yet highly efficient algorithms are proposed in this paper that work very well for both edit similarity search and joins. Perhaps equally interesting is a comprehensive experiment involving Flamingo (ver 3), PartEnum, Ed-Join, Bed-tree, Trie-Join, NGPP, VGRAM, NS09.

  • Chaokun Wang, Jianmin Wang, Xuemin Lin, Wei Wang, Haixun Wang, Hongsong Li, Wanpeng Tian, Jun Xu, Rui Li. MapDupReducer: Detecting Near Duplicates over Massive Datasets. SIGMOD 2010. PPT

    Summary: This work essentially ports the ppjoin+ algorithm to the Map-Reduce framework in order to deal with huge volume of data.

The first paper is saying exactly what you suspect: similarity of arbitrary sub-graphs. Nothing like picking a hard problem, is there? Not a fully general solution, but see what you think of theirs.

Did I mention there is working code at this site as well? Definitely a group to watch in the future.

Hadapt is moving forward

Friday, November 25th, 2011

Hadapt is moving forward

A bullet-point type review, mostly a summary of information from the vendor. Not a bad thing, can be useful. But you would think that when reviewing a vendor or their product, there would be a link to the vendor/product. Yes? None that I can find in that post.

Let me make it easy for you: Hadapt. How hard was that? Maybe 10 seconds of my time, and that is because I have gotten slow. The point of the WWW, at least as I understand it, is to make information more accessible to users. But it doesn’t happen by itself. Put in hyperlinks where appropriate.

There is a datasheet on the Adaptive Analytic Platform™.

You can follow the link for the technical report and register, but it is little more than a sales brochure.

More informative is: Efficient Processing of Data Warehousing Queries in a Split Execution Environment.

I don’t have a local setup that would exercise Hadapt. If you do or if you are using it in the cloud, would appreciate any comments or pointers you have.

Last Call Working Draft of SPARQL 1.1 Federated Query

Thursday, November 17th, 2011

Last Call Working Draft of SPARQL 1.1 Federated Query

From the W3C:

A Last Call Working Draft of SPARQL 1.1 Federated Query, which offers data consumers an opportunity to merge data distributed across the Web from multiple SPARQL query services. Comments on this working draft are welcome before 31 December 2011.

Some “lite” holiday reading. 😉
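The heart of the draft is the `SERVICE` keyword, which sends part of a query pattern to a remote endpoint. A sketch of what a federated query looks like (the endpoint URL here is a placeholder, not a real service):

```python
# A SPARQL 1.1 federated query as a string: the SERVICE block is evaluated
# against the named remote endpoint and its bindings are joined with the rest.
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?interest WHERE {
  ?person foaf:name ?name .
  SERVICE <http://example.org/sparql> {   # placeholder endpoint
    ?person foaf:interest ?interest .
  }
}
"""
print(query)
```

The merge happens on the shared `?person` variable, which is how data distributed across multiple SPARQL services gets stitched together.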

Neo4j’s Cypher internals – Part 2: All clauses, more Scala’s Parser Combinators and query entry point

Thursday, November 3rd, 2011

Neo4j’s Cypher internals – Part 2: All clauses, more Scala’s Parser Combinators and query entry point

From the post:

During the previous post, I explained what Neo4j is and then how graph traversal can be done on Neo4j using the Java API. Next, I introduced Cypher and how it helps write queries in order to retrieve data from the graph. After introducing Cypher’s syntax, we dissected the Start clause, which is the start point (duh) for any query written in Cypher. If you haven’t read it, go there, and then come back to read this one.

In this second part, I’ll show the other clauses that exist in Cypher: Match, Where, Return, Skip and Limit, and OrderBy. Some will be simple, some not, and I’ll go into more detail on those clauses that aren’t so trivial. After that, we will take a look at the Cypher query entry point, and how the query parsing is unleashed.

Nuff said, let’s get down to business.

This and part 1 are starting points for understanding Cypher. A key to evaluation of Neo4j as a topic map storage/application platform.

True enough, at present (1.4) Neo4j only supports 32 billion nodes, 32 billion relationships and 64 billion properties per database, but on the other hand, I have far fewer than 32 billion books, so at a certain level of coarseness it should be fine. 😉

BTW, I do collect CS texts, old as well as new. Mostly algorithm, parsing, graph, IR, database sort of stuff but occasionally other stuff too. Just in case you have an author’s copy or need to clear out space for more books. Drop me a line if you would like to make a donation to my collection.

Data Structures for Range-Sum Queries (slides)

Saturday, October 29th, 2011

Data Structures for Range-Sum Queries (slides) by Paul Butler.

From the post:

This week I attended the Canadian Undergraduate Mathematics Conference. I enjoyed talks from a number of branches of mathematics, and gave a talk of my own on range-sum queries. Essentially, range-aggregate queries are a class of database queries which involve taking an aggregate (in SQL terms, SUM, AVG, COUNT, MIN, etc.) over a set of data where the elements are filtered by simple inequality operators (in SQL terms, WHERE colname {<, <=, =, >=, >} value AND …). Range-sum queries are the subset of those queries where SUM is the aggregation function.

Due to the nature of the conference, I did my best to make things as accessible to someone with a general mathematics background rather than assuming familiarity with databases or order notation.

I’ve put the slides (pdf link, embedded below also) online. They may be hard to follow as slides, but I hope they pique your interest enough to check out the papers referenced at the end if that’s the sort of thing that interests you. I may turn them into a blog post at some point. The presentation begins with tabular data and shows some of the insights that led to the Dynamic Data Cube, which is a clever data structure for answering range-sum queries.

I will run down the links and see what other materials I can find on the “Dynamic Data Cube” (this post is from 2010). Data structures for range-sum queries look quite interesting.
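As a one-dimensional warm-up for the ideas in the slides (my own sketch, not Butler's): after an O(n) precomputation of prefix sums, any range-sum query answers in O(1). The Dynamic Data Cube generalizes this precompute/query trade-off to d dimensions while also keeping updates efficient.

```python
def prefix_sums(values):
    """sums[i] holds the total of values[0:i]; one extra leading zero."""
    sums = [0]
    for v in values:
        sums.append(sums[-1] + v)
    return sums

def range_sum(sums, lo, hi):
    """Sum of values[lo:hi] (half-open), like SQL's WHERE lo <= i AND i < hi."""
    return sums[hi] - sums[lo]

data = [3, 1, 4, 1, 5, 9, 2, 6]
sums = prefix_sums(data)
print(range_sum(sums, 2, 6))   # 4 + 1 + 5 + 9 = 19
```

The catch, and the reason the data-cube literature exists, is updates: changing one value in the plain prefix-sum array forces O(n) of the sums to be rebuilt.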

Extending tolog

Friday, September 30th, 2011

Extending tolog by Lars Marius Garshol.


This paper describes a number of extensions that might be made to the tolog query language for topic maps in order to make it fulfill all the requirements for the ISO-standardized TMQL (Topic Map Query Language). First, the lessons to be learned from the considerable body of research into the Datalog query language are considered. Finally, a number of different extensions to the existing tolog query language are considered and evaluated.

This paper extends and improves on earlier work on tolog, first described in [Garshol01].

As you can see from some recent posts here, Datalog research continues!
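To see the semantics tolog inherits from that Datalog literature, here is a naive bottom-up evaluation of a recursive rule, iterated to a fixpoint (a sketch of the classic technique, not of tolog's engine). The rules encode `ancestor(X,Y) :- parent(X,Y).` and `ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z).`

```python
parent = {("alice", "bob"), ("bob", "carol"), ("carol", "dan")}

def ancestors(parent):
    facts = set(parent)                      # first rule: every parent edge
    while True:
        # second rule: join parent(X,Y) with ancestor(Y,Z) to get ancestor(X,Z)
        new = {(x, z) for (x, y) in parent for (y2, z) in facts if y == y2}
        if new <= facts:
            return facts                     # fixpoint reached, nothing new
        facts |= new

print(sorted(ancestors(parent)))
```

Real engines use semi-naive evaluation (only joining against facts derived in the previous round), but the fixpoint idea is the same.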

XQuery Survey

Wednesday, September 21st, 2011

XQuery Survey

From the webpage:

I am looking for feedback on the XQuery programming language. Please answer all questions as completely as possible. This poll and other information is forming the basis of a talk I am giving at GOTO 2011 Aarhus, Denmark. I will share all results at the end of October 2011.

Please help Jim Fuller out with his survey on XQuery!

How Neo4j uses Scala’s Parser Combinator: Cypher’s internals – Part 1

Thursday, September 8th, 2011

How Neo4j uses Scala’s Parser Combinator: Cypher’s internals – Part 1

From the post:

I think that most of us software developers, as kids, always wanted to know how things were made inside. Since I was a child, I always wanted to understand how my toys worked. And then, what did I do? Opened ’em, sure. And of course, later, I wasn’t able to re-join the pieces properly, but that is not this post’s subject 😉 . Well, understanding how things work behind the scenes can teach us several things, and in software this is no different: we can study how a specific piece of code was created and mixed together with other code.

In this series of posts I’ll share what I’ve found inside Neo4J implementation, specifically, at Cypher’s code (its query language).

In this first part, I’ll briefly introduce Neo4j and Cypher and then I’ll start to explain the internals of its parser and how it works. Since it is a long (very, very long, in fact) subject, part 2 and subsequent parts are coming very soon.

If you want to understand the internals of a graph query language, this looks like a good place to start.
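The core idea behind parser combinators is that parsers are ordinary values you combine. Cypher's real parser uses Scala's library; this is only a minimal illustration of the idea in Python: a parser takes a string and returns `(result, rest)` on success or `None` on failure.

```python
def token(word):
    """Parser for a literal word, skipping leading whitespace."""
    def parse(s):
        s = s.lstrip()
        return (word, s[len(word):]) if s.startswith(word) else None
    return parse

def seq(*parsers):                 # like Scala's `~`: run parsers in order
    def parse(s):
        results = []
        for p in parsers:
            out = p(s)
            if out is None:
                return None
            r, s = out
            results.append(r)
        return results, s
    return parse

def alt(*parsers):                 # like Scala's `|`: first parser that fits
    def parse(s):
        for p in parsers:
            out = p(s)
            if out is not None:
                return out
        return None
    return parse

ident = alt(token("n"), token("m"))
clause = seq(token("RETURN"), ident)
print(clause("RETURN n"))          # (['RETURN', 'n'], '')
```

A grammar built this way reads almost like its BNF, which is why the technique is such a natural fit for a query language's front end.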

Update: Neo4j’s Cypher internals – Part 2: All clauses, more Scala’s Parser Combinators and query entry point

Query Execution in Column-Oriented Database Systems

Tuesday, August 23rd, 2011

Query Execution in Column-Oriented Database Systems by Daniel Abadi (Ph.D. thesis).

Apologies for the length of the quote, but this is an early dissertation on column-oriented data systems and I want to entice you into reading it. Not so much for the techniques, which are now common, but for the analysis.


There are two obvious ways to map a two-dimension relational database table onto a one-dimensional storage interface: store the table row-by-row, or store the table column-by-column. Historically, database system implementations and research have focused on the row-by-row data layout, since it performs best on the most common application for database systems: business transactional data processing. However, there are a set of emerging applications for database systems for which the row-by-row layout performs poorly. These applications are more analytical in nature, whose goal is to read through the data to gain new insight and use it to drive decision making and planning.

In this dissertation, we study the problem of poor performance of row-by-row data layout for these emerging applications, and evaluate the column-by-column data layout opportunity as a solution to this problem. There have been a variety of proposals in the literature for how to build a database system on top of column-by-column layout. These proposals have different levels of implementation effort, and have different performance characteristics. If one wanted to build a new database system that utilizes the column-by-column data layout, it is unclear which proposal to follow. This dissertation provides (to the best of our knowledge) the only detailed study of multiple implementation approaches of such systems, categorizing the different approaches into three broad categories, and evaluating the tradeoffs between approaches. We conclude that building a query executer specifically designed for the column-by-column query layout is essential to achieve good performance.

Consequently, we describe the implementation of C-Store, a new database system with a storage layer and query executer built for column-by-column data layout. We introduce three new query execution techniques that significantly improve performance. First, we look at the problem of integrating compression and execution so that the query executer is capable of directly operating on compressed data. This improves performance by improving I/O (less data needs to be read off disk), and CPU (the data need not be decompressed). We describe our solution to the problem of executer extensibility – how can new compression techniques be added to the system without having to rewrite the operator code? Second, we analyze the problem of tuple construction (stitching together attributes from multiple columns into a row-oriented ”tuple”). Tuple construction is required when operators need to access multiple attributes from the same tuple; however, if done at the wrong point in a query plan, a significant performance penalty is paid. We introduce an analytical model and some heuristics to use that help decide when in a query plan tuple construction should occur. Third, we introduce a new join technique, the “invisible join” that improves performance of a specific type of join that is common in the applications for which column-by-column data layout is a good idea.
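The core trade-off is easy to see in miniature. This is a sketch of the idea only, not of C-Store: summing one attribute touches every tuple's storage in a row layout, but only one contiguous array in a column layout, and sorted columns compress extremely well.

```python
rows = [                                    # row-by-row layout
    {"id": 1, "price": 10, "qty": 2},
    {"id": 2, "price": 20, "qty": 1},
    {"id": 3, "price": 30, "qty": 5},
]

columns = {                                 # column-by-column layout
    "id":    [1, 2, 3],
    "price": [10, 20, 30],
    "qty":   [2, 1, 5],
}

row_total = sum(r["price"] for r in rows)   # scans all attributes' storage
col_total = sum(columns["price"])           # scans only the price column
assert row_total == col_total == 60

# Run-length encoding shows why compression composes well with columns:
def rle(values):
    """Collapse runs of equal values into [value, count] pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

print(rle([5, 5, 5, 7, 7]))                 # [[5, 3], [7, 2]]
```

Note that a sum over the RLE form is just `sum(v * c for v, c in runs)`, no decompression needed, which is the "operate directly on compressed data" point the thesis makes for its query executer.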

Finally, we benchmark performance of the complete C-Store database system against other column-oriented database system implementation approaches, and against row-oriented databases. We benchmark two applications. The first application is a typical analytical application for which column-by-column data layout is known to outperform row-by-row data layout. The second application is another emerging application, the Semantic Web, for which column-oriented database systems are not currently used. We find that on the first application, the complete C-Store system performed 10 to 18 times faster than alternative column-store implementation approaches, and 6 to 12 times faster than a commercial database system that uses a row-by-row data layout. On the Semantic Web application, we find that C-Store outperforms other state-of-the-art data management techniques by an order of magnitude, and outperforms other common data management techniques by almost two orders of magnitude. Benchmark queries, which used to take multiple minutes to execute, can now be answered in several seconds.