Archive for the ‘Query Expansion’ Category

Massive Query Expansion by Exploiting Graph Knowledge Bases

Friday, October 25th, 2013

Massive Query Expansion by Exploiting Graph Knowledge Bases by Joan Guisado-Gámez, David Dominguez-Sal, Josep-LLuis Larriba-Pey.


Keyword based search engines have problems with term ambiguity and vocabulary mismatch. In this paper, we propose a query expansion technique that enriches queries expressed as keywords and short natural language descriptions. We present a new massive query expansion strategy that enriches queries using a knowledge base by identifying the query concepts, and adding relevant synonyms and semantically related terms. We propose two approaches: (i) lexical expansion that locates the relevant concepts in the knowledge base; and, (ii) topological expansion that analyzes the network of relations among the concepts, and suggests semantically related terms by path and community analysis of the knowledge graph. We perform our expansions by using two versions of the Wikipedia as knowledge base, concluding that the combination of both lexical and topological expansion provides improvements of the system’s precision up to more than 27%.

Heavy reading for the weekend but this paragraph caught my eye:

In this paper, we propose a novel query expansion which uses a hybrid input in the form of a query phrase and a context, which are a set of keywords and a short natural language description of the query phrase, respectively. Our method is based on the combination of both a lexical and topological analysis of the concepts related to the input in the knowledge base. We differ from previous works because we are not considering the links of each article individually, but we are mining the global link structure of the knowledge base to find related terms using graph mining techniques. With our technique we are able to identify: (i) the most relevant concepts and their synonyms, and (ii) a set of semantically related concepts. Most relevant concepts provide equivalent reformulations of the query that reduce the vocabulary mismatch. Semantically related concepts introduce many different terms that are likely to appear in a relevant document, which is useful to solve the lack of topic expertise and also disambiguate the keywords.

Wondering that since it works with Wikipedia, should the same be true for the references but not hyperlinks of traditional publishing?

Say documents in Citeseer for example?

Nothing against Wikipedia but general knowledge doesn’t have a very high retail value.

A Data Driven Approach to Query Expansion in Question Answering

Wednesday, January 30th, 2013

A Data Driven Approach to Query Expansion in Question Answering by Leon Derczynski, Jun Wang, Robert Gaizauskas, and Mark A. Greenwood.


Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer bearing documents are retrieved at low ranks for almost 40% of questions.

In this paper, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method.

Data driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and an possible explanation is provided for its low value in IR for QA.

Work on query expansion in natural language answering systems. Closely related to synonymy.

Query expansion tools could be useful prompts for topic map authors seeking terms for mapping.

QRU-1: A Public Dataset…

Saturday, September 8th, 2012

QRU-1: A Public Dataset for Promoting Query Representation and Understanding Research by Hang Li, Gu Xu, W. Bruce Croft, Michael Bendersky, Ziqi Wang and Evelyne Viegas.


A new public dataset for promoting query representation and understanding research, referred to as QRU-1, was recently released by Microsoft Research. The QRU-1 dataset contains reformulations of Web TREC topics that are automatically generated using a large-scale proprietary web search log, without compromising user privacy. In this paper, we describe the content of this dataset and the process of its creation. We also discuss the potential uses of the dataset, including a detailed description of a query reformulation experiment.

And the data set:

Query Representation and Understanding Set

The Query Representation and Understanding (QRU) data set contains a set of similar queries that can be used in web research such as query transformation and relevance ranking. QRU contains similar queries that are related to existing benchmark data sets, such as TREC query sets. The QRU data set was created by extracting 100 TREC queries, training a query-generation model and a commercial search engine, generating similar queries from TREC queries with the model, and removal of mistakenly generated queries.

Are query reformulations in essence different identifications of the subject of a search?

But the issue isn’t “more” search results but rather higher quality search results.

Why search engines bother (other than bragging rights) to report “hits” beyond the ones displayed isn’t clear. Just have a “next N hits” button.

You could consider the number of “hits” you don’t look at as a measure of your search engine’s quality. The higher the number…., well, you know. Could be gold in those “hits” but you will never know. And your current search engine will never say.

Web query disambiguation using PageRank

Sunday, July 1st, 2012

Web query disambiguation using PageRank by Christos Makris, Yannis Plegas, and Sofia Stamou. (Makris, C., Plegas, Y. and Stamou, S. (2012), Web query disambiguation using PageRank. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22685)


In this article, we propose new word sense disambiguation strategies for resolving the senses of polysemous query terms issued to Web search engines, and we explore the application of those strategies when used in a query expansion framework. The novelty of our approach lies in the exploitation of the Web page PageRank values as indicators of the significance the different senses of a term carry when employed in search queries. We also aim at scalable query sense resolution techniques that can be applied without loss of efficiency to large data sets such as those on the Web. Our experimental findings validate that the proposed techniques perform more accurately than do the traditional disambiguation strategies and improve the quality of the search results, when involved in query expansion.

A better summary of the author’s approach lies within the article:

The intuition behind our method is that we could improve the Web users’ search experience if we could correlate the importance that the sense of a term has when employed in a query (i.e., the importance of the sense as perceived by the information seeker) with the importance the same sense has when contained in a Web page (i.e., the importance of the sense as perceived by the information provider). We rely on the exploitation of PageRank because of its effectiveness in capturing the importance of every page on the Web graph based on their links’ connectivity, and from which we may infer the importance of every page in the “collective mind” of the Web content providers/creators. To account for that, we explore whether the PageRank value of a page may serve as an indicator of how significant the dominant senses of a query-matching term in the page are and, based on that, disambiguate the query.

Which reminds me of statistical machine translation, which replaced syntax based methods years ago.

Perhaps pagerank is summing our linguistic preferences from some word senses.

If that is the case, how would you incorporate that in ranking results to be delivered to a user from a topic map? There are different possible search outcomes, how do we establish the one a user prefers?

German Compound Words

Monday, April 16th, 2012

German Compound Words by Brian Johnson.

From the post:

Mark Twain is quoted as having said, “Some German words are so long that they have a perspective.”

Although eBay users are unlikely to search using fearsome beasts like “rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz”, which stands for the “beef labeling supervision duties delegation law”, we do frequently see compound words in our users’ queries. While some might look for “damenlederhose”, others might be searching for the same thing (women’s leather pants) using the decompounded forms “damen lederhose” or “damen leder hose”. And even though a German teacher would tell you only “damenlederhose” or “damen lederhose” are correct, the users’ expectation is to see the same results regardless of which form is used.

This scenario exists on the seller side as well. That is, people selling on eBay might describe their item using one or more of these forms. In such cases, what should a search engine do? While the problem might seem simple at first, German word-breaking – or decompounding, as it is also known – is not so simple.

And you thought all this worrying about subject identifiers was just another intellectual pose! 😉

There are serious people who spend serious money in the quest to make even more money who worry about subject identification. They don’t discuss it in topic map terms, but it is subject identity none the less.

This post should get your started on some issues with German.

What other languages/scripts have the same or similar issues? Are the solutions here extensible or are new solutions needed?

Pointers to similar resources most welcome!

Query Rewriting in Search Engines

Monday, April 16th, 2012

Query Rewriting in Search Engines by Hugh Williams was mentioned in Amazon CloudSearch, Elastic Search as a Service by Jeff Dalton. (You need to read Jeff’s comments on Amazon Cloudsearch but onto query rewriting.)

From the post:

There’s countless information on search ranking – creating ranking functions, and their factors such as PageRank and text. Query rewriting is less conspicuous but equally important. Our experience at eBay is that query rewriting has the potential to deliver as much improvement to search as core ranking, and that’s what I’ve seen and heard at other companies.

What is query rewriting?

Let’s start with an example. Suppose a user queries for Gucci handbags at eBay. If we take this literally, the results will be those that have the words Gucci and handbags somewhere in the matching documents. Unfortunately, many great answers aren’t returned. Why?

Consider a document that contains Gucci and handbag, but never uses the plural handbags. It won’t match the query, and won’t be returned. Same story if the document contains Gucci and purse (rather than handbag). And again for a document that contains Gucci but doesn’t contain handbags or a synonym – instead it’s tagged in the “handbags” category on eBay; the user implicitly assumed it’d be returned when a buyer types Gucci handbags as their query.

To solve this problem, we need to do one of two things: add words to the documents so that they match other queries, or add words to the queries so that they match other documents. Query rewriting is the latter approach, and that’s the topic of this post. What I will say about expanding documents is there are tradeoffs: it’s always smart to compute something once in search and store it, rather than compute it for every query, and so there’s a certain attraction to modifying documents once. On the other hand, there are vastly more words in documents than there are words in queries, and doing too much to documents gets expensive and leads to imprecise matching (or returning too many irrelevant documents). I’ve also observed over the years that what works for queries doesn’t always work for documents.

You really need to read the post by Hugh a couple of times.

Query rewriting is approaching the problem of subject identity from the other side of topic maps.

Topic maps collect different identifiers for a subject as a basis for “merging”.

Query rewriting changes a query so it specifies different identifiers for a subject.

Let me try to draw a graphic for you (my graphic skills are crude at best):

Topic Maps versus Query Rewrite

I used “/” as an alternative marker for topic maps to illustrate that matching any identifier returns all of them. For query rewrite, the “+” sign indicates that each identifier is searched for in addition to the others.

The result is the same set of identifiers and results from using them on a query set.

From a different point of view.

Flexible Searching with Solr and Sunspot

Sunday, April 1st, 2012

Flexible Searching with Solr and Sunspot.

Mike Pack writes:

Just about every type of datastore has some form of indexing. A typical relational database, such as MySQL or PostreSQL, can index fields for efficient querying. Most document databases, like MongoDB, contain indexing as well. Indexing in a relational database is almost always done for one reason: speed. However, sometimes you need more than just speed, you need flexibility. That’s where Solr comes in.

In this article, I want to outline how Solr can benefit your project’s indexing capabilities. I’ll start by introducing indexing and expand to show how Solr can be used within a Rails application.

If you are a Ruby fan (or not), this post is a nice introduction to some of the power of Solr for indexing.

At the same time, it is a poster child for what is inflexible about Solr query expansion.

Mike uses the following example for synonyms/query expansion:

# citi is the stem of cities
citi => city

# copi is the stem of copies
copi => copy

Well, that works no doubt, if those expansions are uniform across a body of texts. Depending on the size of the collection, that may or may not be the case. That is the uniformity of the expansion of strings.

We could say:

#cop is a synonym for the police
cop => police

Meanwhile, elsewhere in the collection we need:

#cop is the stem of copulate
cop => copulate

Without more properties to distinguish the two (or more) cases, we are going to get false positives in one case or the other.

Document Frequency Limited MultiTermQuerys

Monday, March 19th, 2012

Document Frequency Limited MultiTermQuerys

From the post:

If you’ve ever looked at user generated data such as tweets, forum comments or even SMS text messages, you’ll have noticed there there are many variations in the spelling of words.  In some cases they are intentional such as omissions of vowels to reduce message length, in other cases they are unintentional typos and spelling mistakes.

Querying this kind of data since only matching the traditional spelling of a word can lead to many valid results being missed.  One way to includes matches on variations of a word is to use Lucene’s MultiTermQuerys such as FuzzyQuery or WildcardQuery.  For example, to find matches for the word “hotel” and all its variations, you might use the queries “hotel~” and “h*t*l”.  Unfortunately, depending on how many variations there are, the queries could end up matching 10s or even 100s of terms, which will impact your performance.

You might be willing to accept this performance degradation to capture all the variations, or you might want to only query those terms which are common in your index, dropping the infrequent variations and giving your users maximum results with little impact on performance.

Lets explore how you can focus your MultiTermQuerys on the most common terms in your index.

Not to give too much away, but you will learn how to tune a fuzzy match of terms. (To account for misspellings, for example.)

This is a very good site and blog for search issues.

A Survey of Automatic Query Expansion in Information Retrieval

Saturday, February 25th, 2012

A Survey of Automatic Query Expansion in Information Retrieval by Claudio Carpineto, Giovanni Romano.


The relative ineffectiveness of information retrieval systems is largely caused by the inaccuracy with which a query formed by a few keywords models the actual user information need. One well known method to overcome this limitation is automatic query expansion (AQE), whereby the user’s original query is augmented by new features with a similar meaning. AQE has a long history in the information retrieval community but it is only in the last years that it has reached a level of scientific and experimental maturity, especially in laboratory settings such as TREC. This survey presents a unified view of a large number of recent approaches to AQE that leverage various data sources and employ very different principles and techniques. The following questions are addressed. Why is query expansion so important to improve search effectiveness? What are the main steps involved in the design and implementation of an AQE component? What approaches to AQE are available and how do they compare? Which issues must still be resolved before AQE becomes a standard component of large operational information retrieval systems (e.g., search engines)?

Have you heard topic maps described as being the solution to the following problem?

The most critical language issue for retrieval effectiveness is the term mismatch problem: the indexers and the users do often not use the same words. This is known as the vocabulary problem Furnas et al. [1987], compounded by synonymy (same word with different meanings, such as “java”) and polysemy (different words with the same or similar meanings, such as “tv” and “television”). Synonymy, together with word inflections (such as with plural forms, “television” versus “televisions”), may result in a failure to retrieve relevant documents, with a decrease in recall (the ability of the system to retrieve all relevant documents). Polysemy may cause retrieval of erroneous or irrelevant documents, thus implying a decrease in precision (the ability of the system to retrieve only relevant documents).

That sounds like the XWindows index merging problem doesn’t it? (Different terms being used by *nix vendors who wanted to use a common set of XWindows documentation.)

The authors describe the amount of data on the web searched with only one, two or three terms:

In this situation, the vocabulary problem has become even more serious because the paucity of query terms reduces the possibility of handling synonymy while the heterogeneity and size of data make the effects of polysemy more severe.

But the size of the data isn’t a given. What if a topic map with scoped names were used to delimit the sites searched using a particular identifier.

For example, a topic could have the name: “TRIM19” and a scope of: “” If you try a search with “TRIM19” at the scoping site, you get a very different result than if you use “TRIM19” with say “”

Try it, I’ll wait.

Now, imagine that your scoping topic on “TRIM19” isn’t just that one site but a topic that represents all the gene database sites known to you. I don’t know the number but it can’t be very large, at least when compared to the WWW.

That simple act of delimiting the range of your searches, makes them far less subject to polysemy.

Not to mention that a topic map could be used to supply terms for use in automated query expansion.

BTW, the survey is quite interesting and deserves a slow read with follow up on the cited references.

Different ways to make auto suggestions with Solr

Saturday, February 18th, 2012

Different ways to make auto suggestions with Solr

From the post:

Nowadays almost every website has a full text search box as well as the auto suggestion feature in order to help users to find what they are looking for, by typing the least possible number of characters possible. The example below shows what this feature looks like in Google. It progressively suggests how to complete the current word and/or phrase, and corrects typo errors. That’s a meaningful example which contains multi-term suggestions depending on the most popular queries, combined with spelling correction.

(graphic omitted)

There are different ways to make auto complete suggestions with Solr. You can find many articles and examples on the internet, but making the right choice is not always easy. The goal of this post is compare the available options in order to identify the best solution tailored to your needs, rather than describe any one specific approach in depth.

It’s common practice to make auto-suggestions based on the indexed data. In fact a user is usually looking for something that can be found within the index, that’s why we’d like to show the words that are similar to the current query and at the same time relevant within the index. On the other hand, it is recommended to provide query suggestions; we can for example capture and index on a specific solr core all the user queries which return more than zero results, so we can use those information to make auto-suggestions as well. What actually matters is that we are going to make suggestions based on what’s inside the index; for this purpose it’s not relevant if the index contains user queries or “normal data”, the solutions we are going to consider can be applied in both cases.

The Suggester module is the method that looks the most promising:

This solution has its own separate index which you can automatically build on every commit. Using collation you can have multi-term suggestions. Furthermore, it is possible to use a custom dictionary instead of the index content, which makes the current solution even more flexible.

I like to think of multi-term suggestions as tuneable query expansions that return materials on a subject more precisely than the original query.

The custom dictionary has even more potential:

When a file-based dictionary is used (non-empty sourceLocation parameter above) then it’s expected to be a plain text file in UTF-8 encoding. Blank lines and lines that start with a ‘#’ are ignored. The remaining lines must consist of either a string without literal TAB (\u0007) character, or a string and a TAB separated floating-point weight. (

The custom dictionary can contain single terms or phrases.

Hmmm, a custom dictionary:

  1. Is easy to author
  2. Contains words and phrases
  3. Is an editorial artifact
  4. Not limited to a single Solr installation
  5. Could be domain specific
  6. Assists in returning more, not less precise results

The handling of the more precise results is up to your imagination.

Lucene-3759: Support joining in a distributed environment

Thursday, February 9th, 2012

Support joining in a distributed environment.

From the description:

Add two more methods in JoinUtil to support joining in a distributed manner.

  • Method to retrieve all from values.
  • Method to create a TermsQuery based on a set of from terms.

With these two methods distributed joining can be supported following these steps:

  1. Retrieve from values from each shard
  2. Merge the retrieved from values.
  3. Create a TermsQuery based on the merged from terms and send this query to all shards.

Topic maps that have been split into shards could have values that would trigger merging if present in a single shard.

This appears to be a way to address that issue.

Time spent with Lucene is time well spent.

Query processing in distributed, taxonomy-based information sources

Tuesday, September 13th, 2011

Query processing in distributed, taxonomy-based information sources by Carlo Meghini, Yannis Tzitzikas, Veronica Coltella, and Anastasia Analyti.


We address the problem of answering queries over a distributed information system, storing objects indexed by terms organized in a taxonomy. The taxonomy consists of subsumption relationships between negation-free DNF formulas on terms and negation-free conjunctions of terms. In the first part of the paper, we consider the centralized case, deriving a hypergraph-based algorithm that is efficient in data complexity. In the second part of the paper, we consider the distributed case, presenting alternative ways implementing the centralized algorithm. These ways descend from two basic criteria: direct vs. query re-writing evaluation, and centralized vs. distributed data or taxonomy allocation. Combinations of these criteria allow to cover a wide spectrum of architectures, ranging from client-server to peer-to-peer. We evaluate the performance of the various architectures by simulation on a network with O(10^4) nodes, and derive final results. An extensive review of the relevant literature is finally included.

Two quick comments:

While simulations are informative, I am curious how the five architectures would fare against actual taxonomies? Thinking that the complexity at any particular level varies greatly from taxonomy to taxonomy, assuming they are taxonomies that record natural phenomena.

Second, I think there is a growing recognition that while some data can be successfully gathered to a single location for processing, there is an increasing amount of data that may be partially accessible but that cannot be transfered for privacy, security or other concerns. And such diverse systems are likely to have their own means of identifying subjects.

Reverted Indexing

Tuesday, March 29th, 2011

Reverted Indexing

From the website:

Traditional interactive information retrieval systems function by creating inverted lists, or term indexes. For every term in the vocabulary, a list is created that contains the documents in which that term occurs and its frequency within each document. Retrieval algorithms then use these term frequencies alongside other collection statistics to identify matching documents for a query.

Term-based search, however, is just one example of interactive information seeking. Other examples include offering suggestions of documents similar to ones already found, or identifying effective query expansion terms that the user might wish to use. More generally, these fall into several categories: query term suggestion, relevance feedback, and pseudo-relevance feedback.

We can combine the inverted index with the notion of retrievability to create an efficient query expansion algorithm that is useful for a number of applications, such as query expansion and relevance (and pseudo-relevance) feedback. We call this kind of index a reverted index because rather than mapping terms onto documents, it maps document ids onto queries that retrieved the associated documents.

As to its performance:

….the short answer is that our query expansion technique outperforms PL2 and Bose-Einstein algorithms (as implemented in Terrier) by 15-20% on several TREC collections. This is just a first stab at implementing and evaluating this indexing, but we are quite excited by the results.

An interesting example of innovative thinking about indexing.

With a useful result.