Archive for the ‘Probabilistic Data Structures’ Category

None/Some/All … Are Suicide Bombers & Probabilistic Programming Languages

Tuesday, November 8th, 2016

The Design and Implementation of Probabilistic Programming Languages by Noah D. Goodman and Andreas Stuhlmüller.


Probabilistic programming languages (PPLs) unify techniques for the formal description of computation and for the representation and use of uncertain knowledge. PPLs have seen recent interest from the artificial intelligence, programming languages, cognitive science, and natural languages communities. This book explains how to implement PPLs by lightweight embedding into a host language. We illustrate this by designing and implementing WebPPL, a small PPL embedded in Javascript. We show how to implement several algorithms for universal probabilistic inference, including priority-based enumeration with caching, particle filtering, and Markov chain Monte Carlo. We use program transformations to expose the information required by these algorithms, including continuations and stack addresses. We illustrate these ideas with examples drawn from semantic parsing, natural language pragmatics, and procedural graphics.

If you want to sharpen the discussion of probabilistic programming languages, substitute in the pragmatics example:

‘none/some/all of the children are suicide bombers’,

The substitution raises the issue of how “certainty” can/should vary depending upon the gravity of results.

Who is a nice person?, has low stakes.

Who is a suicide bomber?, has high stakes.

Probabilistic Data Structures for Web Analytics and Data Mining

Friday, July 27th, 2012

Probabilistic Data Structures for Web Analytics and Data Mining by Ilya Katsov.

Speaking of scalability, consider:

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in the areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight high-latency analytical processes and poor applicability to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, it is obvious that raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computation of more advanced metrics like a number of unique visitor or most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and trade precision of the estimations for the memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more important, as a compact – sometimes astonishingly compact – replacement of raw data in stream-based computing.

For some subjects, we have probabilistic identifications, based upon data that is too voluminous or rapid to allow for a “definitive” identification.

The techniques introduced here will give you a grounding in data structures to deal with those situations. Interesting reading.

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.