## Archive for the ‘Random Walks’ Category

### 7 Traps to Avoid Being Fooled by Statistical Randomness

Monday, February 16th, 2015

7 Traps to Avoid Being Fooled by Statistical Randomness by Kirk Borne.

From the post:

Randomness is all around us. Its existence sends fear into the hearts of predictive analytics specialists everywhere — if a process is truly random, then it is not predictable, in the analytic sense of that term. Randomness refers to the absence of patterns, order, coherence, and predictability in a system.

Unfortunately, we are often fooled by random events whenever apparent order emerges in the system. In moments of statistical weakness, some folks even develop theories to explain such “ordered” patterns. However, if the events are truly random, then any correlation is purely coincidental and not causal. I remember learning in graduate school a simple joke about erroneous scientific data analysis related to this concept: “Two points in a monotonic sequence display a tendency. Three points in a monotonic sequence display a trend. Four points in a monotonic sequence define a theory.” The message was clear — beware of apparent order in a random process, and don’t be tricked into developing a theory to explain random data.

Suppose I have a fair coin (with a head or a tail being equally likely to appear when I toss the coin). Of the following 3 sequences (each representing 12 sequential tosses of the fair coin), which sequence corresponds to a bogus sequence (i.e., a sequence that I manually typed on the computer)?

(a) HTHTHTHTHTHH

(b) TTTTTTTTTTTT

(c) HHHHHHHHHHHT

(d) None of the above.

In each case, a coin toss of head is listed as “H”, and a coin toss of tail is listed as “T”.

The answer is “(d) None of the above.”

None of the above sequences was generated manually. They were all actual subsequences extracted from a larger sequence of random coin tosses. I admit that I selected these 3 subsequences non-randomly (which induces a statistical bias known as a selection effect) in order to try to fool you. The small-numbers phenomenon is evident here — it corresponds to the fact that when only 12 coin tosses are considered, the occurrence of any “improbable result” may lead us (incorrectly) to believe that it is statistically significant. Conversely, if we saw answer (b) continuing for dozens more coin tosses (nothing but Tails, all the way down), then that would be truly significant.
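The small-numbers point is easy to check with a quick simulation: every specific 12-toss sequence is equally likely (probability (1/2)^12, about 0.00024), and long runs of one side turn up routinely inside larger random sequences. A minimal sketch in Python — the run length, sequence length, and trial count below are arbitrary choices for illustration:

```python
import random

def toss_sequence(n, rng):
    """Simulate n fair coin tosses as a string of 'H' and 'T'."""
    return "".join(rng.choice("HT") for _ in range(n))

def contains_run(seq, length, symbol):
    """Check whether seq contains `length` identical symbols in a row."""
    return symbol * length in seq

rng = random.Random(42)

# Every specific 12-toss sequence has the same probability, (1/2)**12.
p_any_specific = 0.5 ** 12

# Estimate how often a longer random sequence (1000 tosses) contains
# a run of 11 tails -- "ordered" subsequences arise by chance alone.
trials = 2000
hits = sum(contains_run(toss_sequence(1000, rng), 11, "T")
           for _ in range(trials))

print(f"P(any specific 12-toss sequence) = {p_any_specific:.6f}")
print(f"Fraction of 1000-toss runs with 11 tails in a row: {hits / trials:.3f}")
```

So a selection effect is cheap to manufacture: scan a long enough random sequence and you will find "striking" subsequences like (a), (b), and (c) waiting to be cherry-picked.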

A great post on randomness, in which Kirk references a fun example using Nobel Prize winners, with various statistical “facts” for your amusement.

Kirk suggests a reading list to help you (at least partially) avoid this trap in your work:

1. Fooled By Randomness, by Nassim Nicholas Taleb.
2. The Flaw of Averages, by Sam L. Savage.
3. The Drunkard’s Walk – How Randomness Rules Our Lives, by Leonard Mlodinow.

I wonder if you could get Amazon to create an “also bought with” package of those three books? Something you could buy for your friends in big data and intelligence work. 😉

Interesting that I saw this just after posting Structuredness coefficient to find patterns and associations. The call on “likely” or “unlikely” comes down to human agency. Yes?

### Random Walks on the Click Graph

Monday, April 16th, 2012

Random Walks on the Click Graph by Nick Craswell and Martin Szummer.

Abstract:

Search engines can record which documents were clicked for which query, and use these query-document pairs as ‘soft’ relevance judgments. However, compared to the true judgments, click logs give noisy and sparse relevance information. We apply a Markov random walk model to a large click log, producing a probabilistic ranking of documents for a given query. A key advantage of the model is its ability to retrieve relevant documents that have not yet been clicked for that query and rank those effectively. We conduct experiments on click logs from image search, comparing our (‘backward’) random walk model to a different (‘forward’) random walk, varying parameters such as walk length and self-transition probability. The most effective combination is a long backward walk with high self-transition probability.

Two points that may capture your interest:

• The model does not consider query or document content. “Just the clicks, Ma’am.”
• Image data is said to have “less noise” since users can see thumbnails before they follow a link. (True?)
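The walk in the abstract can be sketched on a toy click graph. Everything below is invented for illustration (the click counts, the self-transition value, the walk length), and for simplicity this runs a plain forward walk from a single query; the paper's 'backward' variant is a different construction:

```python
import numpy as np

# Hypothetical click counts: rows = queries, cols = documents.
# clicks[q, d] = times document d was clicked for query q.
clicks = np.array([
    [5, 2, 0],
    [0, 3, 4],
], dtype=float)

s = 0.9      # self-transition probability (the paper favors high values)
steps = 20   # walk length

n_q, n_d = clicks.shape
n = n_q + n_d

# Transition matrix over the bipartite node set [queries | documents].
# Off-diagonal mass (1 - s) is spread in proportion to click counts.
P = np.zeros((n, n))
for q in range(n_q):
    P[q, n_q:] = (1 - s) * clicks[q] / clicks[q].sum()
for d in range(n_d):
    P[n_q + d, :n_q] = (1 - s) * clicks[:, d] / clicks[:, d].sum()
P += s * np.eye(n)

# Start the walk at query 0 and propagate the distribution.
dist = np.zeros(n)
dist[0] = 1.0
for _ in range(steps):
    dist = dist @ P

doc_probs = dist[n_q:]
print("Document probabilities after the walk:", doc_probs)
```

Note that document 2 ends up with nonzero probability for query 0 even though it was never clicked for that query — mass flows query 0 → document 1 → query 1 → document 2. That is the key advantage the abstract claims: ranking relevant documents that have no clicks yet for the query.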

I saw this cited quite recently, but it is about five years old now (2007). Any recent literature on click graphs that you would point out?