Archive for the ‘Click Graph’ Category

Clicks in Search

Monday, April 16th, 2012

Clicks in Search by Hugh E. Williams.

From the post:

Have you heard of the Pareto principle? The idea that 80% of sales come from 20% of customers, or that the 20% of the richest people control 80% of the world’s wealth.

How about George K. Zipf? The author of the “Human behavior and the principle of least effort” and “The Psycho-Biology of Language” is best-known for “Zipf’s Law”, the observation that the frequency of a word is inversely proportional to the rank of its frequency. Over simplifying a little, the word “the” is about twice as frequent as the word “of”, and then comes “and”, and so on. This also applies to the populations of cities, corporation sizes, and many more natural occurrences.

I’ve spent time understanding and publishing work how Zipf’s work applies in search engines. And the punchline in search is that the Pareto principle and Zipf’s Law are hard at work: the first item in a list gets about twice as many clicks as the second, and so on. There are inverse power law distributions everywhere.
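To make the "twice as many clicks" claim concrete, here is a minimal sketch of an idealized inverse power law over result positions. The function name `zipf_shares` and the `exponent` parameter are my own illustration, not anything from the post, and real click logs will only approximate this shape.

```python
def zipf_shares(n_items, exponent=1.0):
    """Return the expected share of clicks for ranks 1..n_items,
    assuming clicks are proportional to 1 / rank**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Ten result positions, classic Zipf exponent of 1 (an assumption).
shares = zipf_shares(10)

# Ratio of each position's share to the next position's share.
ratios = [shares[i] / shares[i + 1] for i in range(len(shares) - 1)]

for rank, share in enumerate(shares, start=1):
    print(f"rank {rank:2d}: {share:.3f}")
```

With an exponent of 1, rank 1 gets exactly twice the share of rank 2, and the ratio between neighbors shrinks toward 1 further down the list. Fitting the exponent to an observed curve would be one way to compare how quickly different curves decay.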

Interesting conclusion: If curves don’t decay rapidly, worry.

How do subject identification curves decay? Same, different? Domain specific?

Now that could be interesting, viewed as a feature of a domain. It could lead to an empirical measure of which identification works “best” in a particular domain.

Random Walks on the Click Graph

Monday, April 16th, 2012

Random Walks on the Click Graph by Nick Craswell and Martin Szummer.

Abstract:

Search engines can record which documents were clicked for which query, and use these query-document pairs as ‘soft’ relevance judgments. However, compared to the true judgments, click logs give noisy and sparse relevance information. We apply a Markov random walk model to a large click log, producing a probabilistic ranking of documents for a given query. A key advantage of the model is its ability to retrieve relevant documents that have not yet been clicked for that query and rank those effectively. We conduct experiments on click logs from image search, comparing our (‘backward’) random walk model to a different (‘forward’) random walk, varying parameters such as walk length and self-transition probability. The most effective combination is a long backward walk with high self-transition probability.
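A rough sketch of the idea in the abstract, as I read it: treat queries and documents as nodes in a bipartite graph weighted by click counts, add a self-transition at every node, and score documents by how likely a walk that ends at the query was to have started at each of them. This is my own simplification and not necessarily the authors' exact formulation. The uniform prior over starting documents, the function names, and the toy click counts are all assumptions.

```python
from collections import defaultdict


def build_transitions(clicks, self_prob):
    """clicks maps (query, doc) -> click count.
    Returns {node: {neighbor: probability}} with a self-loop of self_prob."""
    counts = defaultdict(dict)
    for (query, doc), c in clicks.items():
        q, d = ("q", query), ("d", doc)
        counts[q][d] = counts[q].get(d, 0) + c
        counts[d][q] = counts[d].get(q, 0) + c
    trans = {}
    for node, nbrs in counts.items():
        total = sum(nbrs.values())
        row = {n: (1.0 - self_prob) * c / total for n, c in nbrs.items()}
        row[node] = self_prob  # stay put with probability self_prob
        trans[node] = row
    return trans


def backward_walk(trans, query, steps):
    """Rank documents by P(walk started at doc | walk ended at query),
    assuming a uniform prior over starting documents."""
    target = ("q", query)
    # reach[node] = probability that a walk from node is at target after k steps
    reach = {node: 1.0 if node == target else 0.0 for node in trans}
    for _ in range(steps):
        reach = {
            node: sum(p * reach[nbr] for nbr, p in row.items())
            for node, row in trans.items()
        }
    docs = {n[1]: p for n, p in reach.items() if n[0] == "d" and p > 0}
    total = sum(docs.values())
    return sorted(((d, p / total) for d, p in docs.items()),
                  key=lambda item: -item[1])


# Toy click log (made up): img3 was never clicked for "jaguar".
clicks = {
    ("jaguar", "img1"): 5,
    ("jaguar", "img2"): 1,
    ("big cat", "img2"): 3,
    ("big cat", "img3"): 4,
}
ranked = backward_walk(build_transitions(clicks, self_prob=0.5), "jaguar", steps=3)
for doc, score in ranked:
    print(doc, round(score, 3))
```

In this toy log, img3 still receives a nonzero score for "jaguar" because it is reachable through img2 and the "big cat" query, which is the "retrieve documents that have not yet been clicked" effect the abstract describes. Note that nothing here looks at query or document content.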

Two points that may capture your interest:

  • The model does not consider query or document content. “Just the clicks, Ma’am.”
  • Image data is said to have “less noise” since users can see thumbnails before they follow a link. (True?)

I saw this cited quite recently but it is about five years old now (2007). Any recent literature on click graphs that you would point out?