## Archive for the ‘Bayesian Models’ Category

### Kalman and Bayesian Filters in Python

Monday, March 9th, 2015

Apologies for the lengthy quote but Roger makes a great case for interactive textbooks, IPython notebooks, writing for the reader as opposed to making the author feel clever, and finally, making content freely available.

It is a quote that I am going to make a point to read on a regular basis.

And all of that before turning to the subject at hand!

Enjoy!

From the preface:

This is a book for programmers that have a need or interest in Kalman filtering. The motivation for this book came out of my desire for a gentle introduction to Kalman filtering. I’m a software engineer that spent almost two decades in the avionics field, and so I have always been ‘bumping elbows’ with the Kalman filter, but never implemented one myself. They always has a fearsome reputation for difficulty, and I did not have the requisite education. Everyone I met that did implement them had multiple graduate courses on the topic and extensive industrial experience with them. As I moved into solving tracking problems with computer vision the need to implement them myself became urgent. There are classic textbooks in the field, such as Grewal and Andrew’s excellent Kalman Filtering. But sitting down and trying to read many of these books is a dismal and trying experience if you do not have the background. Typically the first few chapters fly through several years of undergraduate math, blithely referring you to textbooks on, for example, Itō calculus, and presenting an entire semester’s worth of statistics in a few brief paragraphs. These books are good textbooks for an upper undergraduate course, and an invaluable reference to researchers and professionals, but the going is truly difficult for the more casual reader. Symbology is introduced without explanation, different texts use different words and variables names for the same concept, and the books are almost devoid of examples or worked problems. I often found myself able to parse the words and comprehend the mathematics of a definition, but had no idea as to what real world phenomena these words and math were attempting to describe. “But what does that mean?” was my repeated thought.

However, as I began to finally understand the Kalman filter I realized the underlying concepts are quite straightforward. A few simple probability rules, some intuition about how we integrate disparate knowledge to explain events in our everyday life and the core concepts of the Kalman filter are accessible. Kalman filters have a reputation for difficulty, but shorn of much of the formal terminology the beauty of the subject and of their math became clear to me, and I fell in love with the topic.

As I began to understand the math and theory more difficulties itself. A book or paper’s author makes some statement of fact and presents a graph as proof. Unfortunately, why the statement is true is not clear to me, nor is the method by which you might make that plot obvious. Or maybe I wonder “is this true if R=0?” Or the author provides pseudocode – at such a high level that the implementation is not obvious. Some books offer Matlab code, but I do not have a license to that expensive package. Finally, many books end each chapter with many useful exercises. Exercises which you need to understand if you want to implement Kalman filters for yourself, but exercises with no answers. If you are using the book in a classroom, perhaps this is okay, but it is terrible for the independent reader. I loathe that an author withholds information from me, presumably to avoid ‘cheating’ by the student in the classroom.

None of this necessary, from my point of view. Certainly if you are designing a Kalman filter for a aircraft or missile you must thoroughly master of all of the mathematics and topics in a typical Kalman filter textbook. I just want to track an image on a screen, or write some code for my Arduino project. I want to know how the plots in the book are made, and chose different parameters than the author chose. I want to run simulations. I want to inject more noise in the signal and see how a filter performs. There are thousands of opportunities for using Kalman filters in everyday code, and yet this fairly straightforward topic is the provenance of rocket scientists and academics.

I wrote this book to address all of those needs. This is not the book for you if you program avionics for Boeing or design radars for Raytheon. Go get a degree at Georgia Tech, UW, or the like, because you’ll need it. This book is for the hobbyist, the curious, and the working engineer that needs to filter or smooth data.

This book is interactive. While you can read it online as static content, I urge you to use it as intended. It is written using IPython Notebook, which allows me to combine text, python, and python output in one place. Every plot, every piece of data in this book is generated from Python that is available to you right inside the notebook. Want to double the value of a parameter? Click on the Python cell, change the parameter’s value, and click ‘Run’. A new plot or printed output will appear in the book.

This book has exercises, but it also has the answers. I trust you. If you just need an answer, go ahead and read the answer. If you want to internalize this knowledge, try to implement the exercise before you read the answer.

This book has supporting libraries for computing statistics, plotting various things related to filters, and for the various filters that we cover. This does require a strong caveat; most of the code is written for didactic purposes. It is rare that I chose the most efficient solution (which often obscures the intent of the code), and in the first parts of the book I did not concern myself with numerical stability. This is important to understand – Kalman filters in aircraft are carefully designed and implemented to be numerically stable; the naive implementation is not stable in many cases. If you are serious about Kalman filters this book will not be the last book you need. My intention is to introduce you to the concepts and mathematics, and to get you to the point where the textbooks are approachable.

Finally, this book is free. The cost for the books required to learn Kalman filtering is somewhat prohibitive even for a Silicon Valley engineer like myself; I cannot believe the are within the reach of someone in a depressed economy, or a financially struggling student. I have gained so much from free software like Python, and free books like those from Allen B. Downey here [1]. It’s time to repay that. So, the book is free, it is hosted on free servers, and it uses only free and open software such as IPython and mathjax to create the book.

I first saw this in a tweet by nixCraft.

### Augur:…

Tuesday, December 31st, 2013

Augur: a Modeling Language for Data-Parallel Probabilistic Inference by Jean-Baptiste Tristan, et.al.

Abstract:

It is time-consuming and error-prone to implement inference procedures for each new probabilistic model. Probabilistic programming addresses this problem by allowing a user to specify the model and having a compiler automatically generate an inference procedure for it. For this approach to be practical, it is important to generate inference code that has reasonable performance. In this paper, we present a probabilistic programming language and compiler for Bayesian networks designed to make effective use of data-parallel architectures such as GPUs. Our language is fully integrated within the Scala programming language and benefits from tools such as IDE support, type-checking, and code completion. We show that the compiler can generate data-parallel inference code scalable to thousands of GPU cores by making use of the conditional independence relationships in the Bayesian network.

A very good paper but the authors should highlight the caveat in the introduction:

We claim that many MCMC inference algorithms are highly data-parallel (Hillis & Steele, 1986; Blelloch, 1996) if we take advantage of the conditional independence relationships of the input model (e.g. the assumption of i.i.d. data makes the likelihood independent across data points).

(Where i.i.d. = Independent and identically distributed random variables.)

That assumption does allow for parallel processing, but users should be cautious about accepting assumptions about data.

The algorithms will still work, even if your assumptions about the data are incorrect.

But the answer you get may not be as useful as you would like.

I first saw this in a tweet by Stefano Bertolo.

### Introduction to Bayesian Networks & BayesiaLab

Thursday, October 17th, 2013

Introduction to Bayesian Networks & BayesiaLab by Stefan Conrady and Dr. Lionel Jouffe.

From the webpage:

With Professor Judea Pearl receiving the prestigious 2011 A.M. Turing Award, Bayesian networks have presumably received more public recognition than ever before. Judea Pearl’s achievement of establishing Bayesian networks as a new paradigm is fittingly summarized by Stuart Russell:

“[Judea Pearl] is credited with the invention of Bayesian networks, a mathematical formalism for defining complex probability models, as well as the principal algorithms used for inference in these models. This work not only revolutionized the field of artificial intelligence but also became an important tool for many other branches of engineering and the natural sciences. He later created a mathematical framework for causal inference that has had significant impact in the social sciences.”

While their theoretical properties made Bayesian networks immediately attractive for academic research, especially with regard to the study of causality, the arrival of practically feasible machine learning algorithms has allowed Bayesian networks to grow beyond its origin in the field of computer science. Since the first release of the BayesiaLab software package in 2001, Bayesian networks have finally become accessible to a wide range of scientists and analysts for use in many other disciplines.

In this introductory paper, we present Bayesian networks (the paradigm) and BayesiaLab (the software tool), from the perspective of the applied researcher.

The webpage gives an overview of the white paper. Or you can jump directly to the paper (PDF).

With the emphasis on machine processing, there will be people going through the motions of data processing with a black box and data dumps going into it.

And there will be people who understand the box but not the data flowing into it.

Finally there will be people using cutting edge techniques who understand the box and the data flowing into it.

Which group do you think will have the better results?

### Diagrams for hierarchical models – we need your opinion

Monday, October 14th, 2013

Diagrams for hierarchical models – we need your opinion by John K. Kruschke.

If you haven’t done any good deeds lately, here is a chance to contribute to the common good.

From the post:

When trying to understand a hierarchical model, I find it helpful to make a diagram of the dependencies between variables. But I have found the traditional directed acyclic graphs (DAGs) to be incomplete at best and downright confusing at worst. Therefore I created differently styled diagrams for Doing Bayesian Data Analysis (DBDA). I have found them to be very useful for explaining models, inventing models, and programming models. But my idiosyncratic impression might be only that, and I would like your insights about the pros and cons of the two styles of diagrams. (emphasis in original)

John’s post has the details of the different diagram styles.

Which do you like better?

John is also the author of: Doing Bayesian data analysis : a tutorial with R and BUGS. My library system doesn’t have a copy but I can report that it has gotten really good reviews.

### Bayesian Methods for Hackers

Thursday, July 11th, 2013

Bayesian Methods for Hackers by a community of authors!

From the readme:

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight-days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

If Bayesian inference is the destination, then mathematical analysis is a particular path to towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming. The latter path is much more useful, as it denies the necessity of mathematical intervention at each step, that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, where as the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place.

Bayesian Methods for Hackers is designed as a introduction to Bayesian inference from a computational/understanding-first, and mathematics-second, point of view. Of course as an introductory book, we can only leave it at that: an introductory book. For the mathematically trained, they may cure the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical-background, or one who is not interested in the mathematics but simply the practice of Bayesian methods, this text should be sufficient and entertaining.

(…)

Useful in case all the knowledge you want to put in a topic map is far from certain. 😉

### Probabilistic Programming and Bayesian Methods for Hackers

Thursday, May 23rd, 2013

Probabilistic Programming and Bayesian Methods for Hackers by Cam Davidson-Pilon and others.

From the website:

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight-days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

If Bayesian inference is the destination, then mathematical analysis is a particular path to towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming. The latter path is much more useful, as it denies the necessity of mathematical intervention at each step, that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, where as the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place.

Bayesian Methods for Hackers is designed as a introduction to Bayesian inference from a computational/understanding-first, and mathematics-second, point of view. Of course as an introductory book, we can only leave it at that: an introductory book. For the mathematically trained, they may cure the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical-background, or one who is not interested in the mathematics but simply the practice of Bayesian methods, this text should be sufficient and entertaining.

Not yet complete but what is there you will find very useful.

### Probabilistic Programming and Bayesian Methods for Hackers

Saturday, March 30th, 2013

Probabilistic Programming and Bayesian Methods for Hackers

From the webpage:

Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight-days of reading examples and trying to put the pieces together to understand the methods. There was simplely not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

DARPA (Logic and Probabilistic Programming) should be glad that someone else is working on probabilistic programming.

I first saw this at Nat Torkington’s Four short links: 29 March 2103.

### Using Bayesian networks to discover relations…

Saturday, March 23rd, 2013

Using Bayesian networks to discover relations between genes, environment, and disease by Chengwei Su, Angeline Andrew, Margaret R Karagas and Mark E Borsuk. (BioData Mining 2013, 6:6 doi:10.1186/1756-0381-6-6)

Abstract:

We review the applicability of Bayesian networks (BNs) for discovering relations between genes, environment, and disease. By translating probabilistic dependencies among variables into graphical models and vice versa, BNs provide a comprehensible and modular framework for representing complex systems. We first describe the Bayesian network approach and its applicability to understanding the genetic and environmental basis of disease. We then describe a variety of algorithms for learning the structure of a network from observational data. Because of their relevance to real-world applications, the topics of missing data and causal interpretation are emphasized. The BN approach is then exemplified through application to data from a population-based study of bladder cancer in New Hampshire, USA. For didactical purposes, we intentionally keep this example simple. When applied to complete data records, we find only minor differences in the performance and results of different algorithms. Subsequent incorporation of partial records through application of the EM algorithm gives us greater power to detect relations. Allowing for network structures that depart from a strict causal interpretation also enhances our ability to discover complex associations including gene-gene (epistasis) and gene-environment interactions. While BNs are already powerful tools for the genetic dissection of disease and generation of prognostic models, there remain some conceptual and computational challenges. These include the proper handling of continuous variables and unmeasured factors, the explicit incorporation of prior knowledge, and the evaluation and communication of the robustness of substantive conclusions to alternative assumptions and data manifestations.

From the introduction:

BNs have been applied in a variety of settings for the purposes of causal study and probabilistic prediction, including medical diagnosis, crime and terrorism risk, forensic science, and ecological conservation (see [7]). In bioinformatics, they have been used to analyze gene expression data [8,9], derive protein signaling networks [10-12], predict protein-protein interactions [13], perform pedigree analysis [14], conduct genetic epidemiological studies [5], and assess the performance of microsatellite markers on cancer recurrence [15].

Not to mention criminal investigations: Bayesian Network – [Crime Investigation] (Youtube). 😉

Once relations are discovered, you are free to decorate them with roles, properties, etc., in other words, associations.

### Bayesian Reasoning and Machine Learning (update)

Sunday, March 10th, 2013

Bayesian Reasoning and Machine Learning by David Barber.

I first posted about this work at: Bayesian Reasoning and Machine Learning in 2011.

The current draft (that corresponds to the Cambridge University Press hard copy is dated January 9, 2013.

If you use the online version and have the funds, please order a hard copy to encourage the publisher to continue to make published texts available online.

### Accelerating Inference: towards a full Language, Compiler and Hardware stack

Friday, December 14th, 2012

Accelerating Inference: towards a full Language, Compiler and Hardware stack by Shawn Hershey, Jeff Bernstein, Bill Bradley, Andrew Schweitzer, Noah Stein, Theo Weber, Ben Vigoda.

Abstract:

We introduce Dimple, a fully open-source API for probabilistic modeling. Dimple allows the user to specify probabilistic models in the form of graphical models, Bayesian networks, or factor graphs, and performs inference (by automatically deriving an inference engine from a variety of algorithms) on the model. Dimple also serves as a compiler for GP5, a hardware accelerator for inference.

From the introduction:

Graphical models alleviate the complexity inherent to large dimensional statistical models (the so-called curse of dimensionality) by dividing the problem into a series of logically (and statistically) independent components. By factoring the problem into subproblems with known and simple interdependencies, and by adopting a common language to describe each subproblem, one can considerably simplify the task of creating complex Bayesian models. Modularity can be taken advantage of further by leveraging this modeling hierarchy over several levels (e.g. a submodel can also be decomposed into a family of sub-submodels). Finally, by providing a framework which abstracts the key concepts underlying classes of models, graphical models allow the design of general algorithms which can be efﬁciently applied across completely different ﬁelds, and systematically derived from a model description.

Suggestive of sub-models of merging?

I first saw this in a tweet from Stefano Bertolo.

### Think Bayes: Bayesian Statistics Made Simple

Thursday, October 11th, 2012

Think Bayes: Bayesian Statistics Made Simple by Allen B. Downey.

Think Bayes is an introduction to Bayesian statistics using computational methods. This version of the book is a rough draft. I am making this draft available for comments, but it comes with the warning that it is probably full of errors.

Allen has written free books on Python, statistics, complexity and now Bayesian statistics.

If you don’t know his books, good opportunity to give them a try.

### Stan (Bayesian Inference) [update]

Sunday, October 7th, 2012

Stan

From the webpage:

Stan is an open-source package for obtaining Bayesian inference using the No-U-Turn sampler, a variant of Hamiltonian Monte Carlo.

I first reported on a presentation: Stan: A (Bayesian) Directed Graphical Model Compiler last January when Stan was unreleased.

Following a link from Christophe Lalanne’s A bag of tweets / September 2012, I find the released version of the software!

Very cool!

### Flexibility to Discover…

Sunday, April 22nd, 2012

David W. Hogg writes:

If you want to have the flexibility to discover correct structure in your data, you may have to adopt methods that permit variable model complexity.

Context to follow but think about that for a minute.

Do you want to discover structures or confirm what you already believe to be present?

In context:

On day zero of AISTATS, I gave a workshop on machine learning in astronomy, concentrating on the ideas of (a) trusting unreliable data and (b) the necessity of having a likelihood, or probability of the data given the model, making use of a good noise model. Before me, Zoubin Ghahramani gave a very valuable overview of Bayesian non-parametric methods. He emphasized something that was implied to me by Brendon Brewer’s success on my MCMC High Society challenge and mentioned by Rob Fergus when we last talked about image modeling, but which has rarely been explored in astronomy: If you want to have the flexibility to discover correct structure in your data, you may have to adopt methods that permit variable model complexity. The issues are two-fold: For one, a sampler or an optimizer can easily get stuck in a bad local spot if it doesn’t have the freedom to branch more model complexity somewhere else and then later delete the structure that is getting it stuck. For another, if you try to model an image that really does have five stars in it with a model containing only four stars, you are requiring that you will do a bad job! Bayesian non-parametrics is this kind of argument on speed, with all sorts of processes named after different kinds of restaurants. But just working with the simple dictionary of stars and galaxies, we could benefit from the sampling ideas at least. (emphasis added)

Isn’t that awesome? With all the astronomy data that is coming online? (With lots of it already online.)

Not to mention finding structures in other data as well. Maybe even in “big data.”

### DUALIST: Utility for Active Learning with Instances and Semantic Terms

Wednesday, April 18th, 2012

DUALIST: Utility for Active Learning with Instances and Semantic Terms

From the webpage:

DUALIST is an interactive machine learning system for quickly building classifiers for text processing tasks. It does so by asking “questions” of a human “teacher” in the form of both data instances (e.g., text documents) and features (e.g., words or phrases). It uses active learning and semi-supervised learning to build text-based classifiers at interactive speed.

(video demo omitted)

The goals of this project are threefold:

1. A practical tool to facilitate annotation/learning in text analysis projects.
2. A framework to facilitate research in interactive and multi-modal active learning. This includes enabling actual user experiments with the GUI (as opposed to simulated experiments, which are pervasive in the literature but sometimes inconclusive for use in practice) and exploring HCI issues, as well as supporting new dual supervision algorithms which are fast enough to be interactive, accurate enough to be useful, and might make more appropriate modeling assumptions than multinomial naive Bayes (the current underlying model).
3. A starting point for more sophisticated interactive learning scenarios that combine multiple “beyond supervised learning” strategies. See the proceedings of the recent ICML 2011 workshop on this topic.

This could be quite useful for authoring a topic map across a corpus of materials. With interactive recognition of occurrences of subjects, etc.

Sponsored in part by the folks at DARPA. Unlike Al Gore, they did build the Internet.

### Information Field Theory

Wednesday, December 7th, 2011

Information Field Theory

May be something, may be nothing.

I saw a news flash about the use of this technique to combine 41,000 observations to create a magnetic map of the Milky Way. Subject to a lot of noise and smoothing of the data.

Which made me think that perhaps, just perhaps this technique could be used across a semantic field?

From the webpage:

Information field theory (IFT) is information theory, the logic of reasoning under uncertainty, applied to fields. A field can be any quantity defined over some space, e.g. the air temperature over Europe, the magnetic field strength in the Milky Way, or the matter density in the Universe. IFT describes how data and knowledge can be used to infer field properties. Mathematically it is a statistical field theory and exploits many of the tools developed for such. Practically, it is a framework for signal processing and image reconstruction.

All the examples I found were in the physical sciences but I would check closely before claiming to be the first to use the technique in a social science context.

### Probabilistic Graphical Models (class)

Monday, November 21st, 2011

Probabilistic Graphical Models (class) by Daphne Koller. (Stanford University)

From the web page:

What are Probabilistic Graphical Models?

Uncertainty is unavoidable in real-world applications: we can almost never predict with certainty what will happen in the future, and even in the present and the past, many important aspects of the world are not observed with certainty. Probability theory gives us the basic foundation to model our beliefs about the different possible states of the world, and to update these beliefs as new evidence is obtained. These beliefs can be combined with individual preferences to help guide our actions, and even in selecting which observations to make. While probability theory has existed since the 17th century, our ability to use it effectively on large problems involving many inter-related variables is fairly recent, and is due largely to the development of a framework known as Probabilistic Graphical Models (PGMs). This framework, which spans methods such as Bayesian networks and Markov random fields, uses ideas from discrete data structures in computer science to efficiently encode and manipulate probability distributions over high-dimensional spaces, often involving hundreds or even many thousands of variables. These methods have been used in an enormous range of application domains, which include: web search, medical and fault diagnosis, image understanding, reconstruction of biological networks, speech recognition, natural language processing, decoding of messages sent over a noisy communication channel, robot navigation, and many more. The PGM framework provides an essential tool for anyone who wants to learn how to reason coherently from limited and noisy observations.

About The Course

In this class, you will learn the basics of the PGM representation and how to construct them, using both human knowledge and machine learning techniques; you will also learn algorithms for using a PGM to reach conclusions about the world from limited and noisy evidence, and for making good decisions under uncertainty. The class covers both the theoretical underpinnings of the PGM framework and practical skills needed to apply these techniques to new problems. Topics include: (i) The Bayesian network and Markov network representation, including extensions for reasoning over domains that change over time and over domains with a variable number of entities; (ii) reasoning and inference methods, including exact inference (variable elimination, clique trees) and approximate inference (belief propagation message passing, Markov chain Monte Carlo methods); (iii) learning methods for both parameters and structure in a PGM; (iv) using a PGM for decision making under uncertainty. The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply PGM methods to computer vision, text understanding, medical decision making, speech recognition, and many other areas.

Another very strong resource from Stanford.

Serious (or aspiring) data miners will be lining up for this course!

### Bayesian variable selection [off again]

Wednesday, November 16th, 2011

Bayesian variable selection [off again]

From the post:

As indicated a few weeks ago, we have received very encouraging reviews from Bayesian Analysis about our [Gilles Celeux, Mohammed El Anbari, Jean-Michel Marin and myself] our comparative study of Bayesian and non-Bayesian variable selections procedures (“Regularization in regression: comparing Bayesian and frequentist methods in a poorly informative situation“) to Bayesian Analysis. We have just rearXived and resubmitted it with additional material and hope this is the last round. (I must acknowledge a limited involvement at this final stage of the paper. Had I had more time available, I would have liked to remove the numerous tables and turn them into graphs…)

If you are not conversant in Bayesian thinking and recent work, this paper is going to be … difficult. Despite just having gotten past the introduction and looking references to help with part 2, I think it will be a good intellectual exercise and important for your use of Bayesian models in the future. Two very good reasons to spend the time to understand this paper.

Or to put it another way, the world is non-probabilistic only when viewed with a certain degree of coarseness. How useful a coarse view is, varies from circumstance to circumstance. If you don’t have the capability to use a probabilistic view, you will be limited to a coarse one. (Neither better than the other, but having both seems advantageous to me.)

### Domain Adaptation with Hierarchical Logistic Regression

Thursday, October 6th, 2011

Domain Adaptation with Hierarchical Logistic Regression

Bob Carpenter continues his series on domain adaptation:

Last post, I explained how to build hierarchical naive Bayes models for domain adaptation. That post covered the basic problem setup and motivation for hierarchical models.

Hierarchical Logistic Regression

Today, we’ll look at the so-called (in NLP) “discriminative” version of the domain adaptation problem. Specifically, using logistic regression. For simplicity, we’ll stick to the binary case, though this could all be generalized to K-way classifiers.

Logistic regression is more flexible than naive Bayes in allowing other features (aka predictors) to be brought in along with the words themselves. We’ll start with just the words, so the basic setup look more like naive Bayes.

### Bayesian Statistical Reasoning

Saturday, October 1st, 2011

DM SIG “Bayesian Statistical Reasoning ” 5/23/2011 by Prof. David Draper, PhD.

I think you will be surprised at how interesting and even compelling this presentation becomes at points. Particularly his comments early in the presentation about needing an analogy machine, to find things not expressed in the way you usually look for them. And he has concrete examples of where that has been needed.

Title: Bayesian Statistical Reasoning: an inferential, predictive and decision-making paradigm for the 21st century

Professor Draper gives examples of Bayesian inference, prediction and decision-making in the context of several case studies from medicine and health policy. There will be points of potential technical interest for applied mathematicians, statisticians, and computer scientists. Broadly speaking, statistics is the study of uncertainty: how to measure it well, and how to make good choices in the face of it. Statistical activities are of four main types: description of a data set, inference about the underlying process generating the data, prediction of future data, and decision-making under uncertainty. The last three of these activities are probability based. Two main probability paradigms are in current use: the frequentist (or relative-frequency) approach, in which you restrict attention to phenomena that are inherently repeatable under “identical” conditions and define P(A) to be the limiting relative frequency with which A would occur in hypothetical repetitions, as n goes to infinity; and the Bayesian approach, in which the arguments A and B of the probability operator P(A|B) are true-false propositions (with the truth status of A unknown to you and B assumed by you to be true), and P(A|B) represents the weight of evidence in favor of the truth of A, given the information in B. The Bayesian approach includes the frequentest paradigm as a special case,so you might think it would be the only version of probability used in statistical work today, but (a) in quantifying your uncertainty about something unknown to you, the Bayesian paradigm requires you to bring all relevant information to bear on the calculation; this involves combining information both internal and external to the data you’ve gathered, and (somewhat strangely) the external-information part of this approach was controversial in the 20th century, and (b) Bayesian calculations require approximating high-dimensional integrals (whereas the frequentist approach mainly relies on maximization rather than integration), and this was a severe limitation to the Bayesian paradigm for a long time (from the 1750s to the 1980s). The external-information problem has been solved by developing methods that separately handle the two main cases: (1) substantial external information, which is addressed by elicitation techniques, and (2) relatively little external information, which is covered by any of several methods for (in the jargon) specifying diffuse prior distributions. Good Bayesian work also involves sensitivity analysis: varying the manner in which you quantify the internal and external information across reasonable alternatives, and examining the stability of your conclusions. Around 1990 two things happened roughly simultaneously that completely changed the Bayesian computational picture: * Bayesian statisticians belatedly discovered that applied mathematicians (led by Metropolis), working at the intersection between chemistry and physics in the 1940s, had used Markov chains to develop a clever algorithm for approximating integrals arising in thermodynamics that are similar to the kinds of integrals that come up in Bayesian statistics, and * desk-top computers finally became fast enough to implement the Metropolis algorithm in a feasibly short amount of time. As a result of these developments, the Bayesian computational problem has been solved in a wide range of interesting application areas with small-to-moderate amounts of data; with large data sets, variational methods are available that offer a different approach to useful approximate solutions. The Bayesian paradigm for uncertainty quantification does appear to have one remaining weakness, which coincides with a strength of the frequentest paradigm: nothing in the Bayesian approach to inference and prediction requires you to pay attention to how often you get the right answer (thisis a form of calibration of your uncertainty assessments), which is an activity that’s (i) central to good science and decision-making and (ii) natural to emphasize from the frequentist point of view. However, it has recently been shown that calibration can readily be brought into the Bayesian story by means of decision theory, turning the Bayesian paradigm into an approach that is (in principle) both logically internally consistent and well-calibrated. In this talk I’ll (a) offer some historical notes about how we have arrived at the present situation and (b) give examples of Bayesian inference, prediction and decision-making in the context of several case studies from medicine and health policy. There will be points of potential technical interest for applied mathematicians, statisticians and computer scientists.

### Bayes’ Rule and App Search

Thursday, September 29th, 2011

Bayes’ Rule and App Search by Paul Christiano.

From the post:

In order to provide relevant search results, Quixey needs to integrate many different sources of evidence — not only each app’s name and description, but content from all over the web that refers to specific apps. Aggregating huge quantities of information into a single judgment is a notoriously difficult problem, and modern machine learning offers many approaches.

When you need to incorporate just a few pieces of information, there’s a mathematical version of “brute force” that works quite well, based on Bayes’ Rule:

Very smooth explanation of Bayes’ Rule if you need to get your feet wet!

### Domain Adaptation with Hierarchical Naive Bayes Classifiers

Sunday, September 25th, 2011

Domain Adaptation with Hierarchical Naive Bayes Classifiers by Bob Carpenter.

From the post:

This will be the first of two posts exploring hierarchical and multilevel classifiers. In this post, I’ll describe a hierarchical generalization of naive Bayes (what the NLP world calls a “generative” model). The next post will explore hierarchical logistic regression (called a “discriminative” or “log linear” or “max ent” model in NLP land).

Very entertaining and useful if you use NLP at all in your pre-topic map phase.

### Learning Topic Models by Belief Propagation

Friday, September 16th, 2011

Learning Topic Models by Belief Propagation by Jia Zeng, William K. Cheung, and Jiming Liu.

Abstract:

Latent Dirichlet allocation (LDA) is an important class of hierarchical Bayesian models for probabilistic topic modeling, which attracts worldwide interests and touches many important applications in text mining, computer vision and computational biology. This paper proposes a novel tree-structured factor graph representation for LDA within the Markov random field (MRF) framework, which enables the classic belief propagation (BP) algorithm for exact inference and parameter estimation. Although two commonly-used approximation inference methods, such as variational Bayes (VB) and collapsed Gibbs sampling (GS), have gained great successes in learning LDA, the proposed BP is competitive in both speed and accuracy validated by encouraging experimental results on four large-scale document data sets. Furthermore, the BP algorithm has the potential to become a generic learning scheme for variants of LDA-based topic models. To this end, we show how to learn two typical variants of LDA-based topic models, such as author-topic models (ATM) and relational topic models (RTM), using belief propagation based on the factor graph representation.

I have just started reading this paper but wanted to bring it to your attention. I peeked at the results and it looks quite promising.

This work was tested against the following data sets:

1) CORA [30] contains abstracts from the CORA research paper search engine in machine learning area, where the documents can be classified into 7 major categories.

2) MEDL [31] contains abstracts from the MEDLINE biomedical paper search engine, where the documents fall broadly into 4 categories.

3) NIPS [32] includes papers from the conference “Neural Information Processing Systems”, where all papers are grouped into 13 categories. NIPS has no citation link information.

4) BLOG [33] contains a collection of political blogs on the subject of American politics in the year 2008. where all blogs can be broadly classified into 6 categories. BLOG has no author information.

with positive results.

### Naive Bayes Classifiers – Python

Thursday, September 15th, 2011

Naive Bayes Classifiers – Python

From the post:

In naive Bayes classifiers, every feature gets a say in determining which label should be assigned to a given input value. To choose a label for an input value, the naive Bayes classifier begins by calculating the prior probability of each label, which is determined by checking the frequency of each label in the training set. The contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate for each label. The label whose likelihood estimate is the highest is then assigned to the input value.

Just one recent post from Python Language Processing. There are a number of others, some of which I will call out in future posts.

### Visualizing Bayes’ Theorem

Sunday, September 4th, 2011

Visualizing Bayes’ Theorem by Oscar Bonilla.

Uses Venn diagrams to construct a visual derivation of Bayes’ theorem.

### Public Policy by Bayesian Model?

Sunday, August 21st, 2011

Abstract:

Fienberg convincingly demonstrates that Bayesian models and methods represent a powerful approach to squeezing illumination from data in public policy settings. However, no school of inference is without its weaknesses, and, in the face of the ambiguities, uncertainties, and poorly posed questions of the real world, perhaps we should not expect to find a formally correct inferential strategy which can be universally applied, whatever the nature of the question: we should not expect to be able to identify a “norm” approach. An analogy is made between George Box’s “no models are right, but some are useful,” and inferential systems.

A cautionary tale that reaches beyond Bayesian models. It is very often (always?) the case that models find the object of investigation.

### Visitor Conversion with Bayesian Discriminant and Hadoop

Monday, August 15th, 2011

Visitor Conversion with Bayesian Discriminant and Hadoop

From the post:

You have lots of visitors on your eCommerce web site and obviously you would like most of them to convert. By conversion, I mean buying your product or service. It could also mean the visitor taking an action, which potentially could financially benefit the business e.g., opening an account or signing up for email new letter. In this post, I will cover some predictive data mining techniques that may facilitate higher conversion rate.

Wouldn’t it be nice if for any ongoing session, you could predict the odds of the visitor converting during the session, based on the visitor’s behavior during the session.

Armed with such information, you could take different kinds of actions to enhance the chances of conversion. You could entice the visitor with a discount offer. Or you could engage the visitor in a live chat to answer any product related questions.

There are simple predictive analytic techniques to predict the probability of of a visitor converting. When the predicted probability crosses a predefined threshold, the visitor could be considered to have high potential of converting.

I would ask the question of “conversion” more broadly.

That is how can we dynamically change the model of subject identity in a topic map to match a user’s expectations? What user behavior and how would we track it to reach such an end?

Reasoning that users are more interested in and more likely to support topic maps that reinforce their world views. And selling someone topic map output that they find agreeable is easier than output they find disagreeable.

### Bayesian Reasoning and Machine Learning

Monday, August 8th, 2011

Bayesian Reasoning and Machine Learning by David Barber.

Whom this book is for

The book is designed to appeal to students with only a modest mathematical background in undergraduate calculus and linear algebra. No formal computer science or statistical background is required to follow the book, although a basic familiarity with probability, calculus and linear algebra would be useful. The book should appeal to students from a variety of backgrounds, including Computer Science, Engineering, applied Statistics, Physics, and Bioinformatics that wish to gain an entry to probabilistic approaches in Machine Learning. In order to engage with students, the book introduces fundamental concepts in inference using only minimal reference to algebra and calculus. More mathematical techniques are postponed until as and when required, always with the concept as primary and the mathematics secondary.

The concepts and algorithms are described with the aid of many worked examples. The exercises and demonstrations, together with an accompanying MATLAB toolbox, enable the reader to experiment and more deeply understand the material. The ultimate aim of the book is to enable the reader to construct novel algorithms. The book therefore places an emphasis on skill learning, rather than being a collection of recipes. This is a key aspect since modern applications are often so specialised as to require novel methods. The approach taken throughout is to fi rst describe the problem as a graphical model, which is then translated in to a mathematical framework, ultimately leading to an algorithmic implementation in the BRMLtoolbox.

The book is primarily aimed at fi nal year undergraduates and graduates without signifi cant experience in mathematics. On completion, the reader should have a good understanding of the techniques, practicalities and philosophies of probabilistic aspects of Machine Learning and be well equipped to understand more advanced research level material.

The main page for the book and link to software.

David Barber’s homepage.

The book is due to be published by Cambridge University Press in the summer of 2011.

### What are the Differences between Bayesian Classifiers and Mutual-Information Classifiers?

Tuesday, May 3rd, 2011

What are the Differences between Bayesian Classifiers and Mutual-Information Classifiers?

I am sure we have all laid awake at night worrying about this question at some point. 😉

Seriously, the paper shows that Bayesian and mutual information classifiers compliment each other in classification roles and merits your attention.

Abstract:

In this study, both Bayesian classifiers and mutual information classifiers are examined for binary classifications with or without a reject option. The general decision rules in terms of distinctions on error types and reject types are derived for Bayesian classifiers. A formal analysis is conducted to reveal the parameter redundancy of cost terms when abstaining classifications are enforced. The redundancy implies an intrinsic problem of “non-consistency” for interpreting cost terms. If no data is given to the cost terms, we demonstrate the weakness of Bayesian classifiers in class-imbalanced classifications. On the contrary, mutual-information classifiers are able to provide an objective solution from the given data, which shows a reasonable balance among error types and reject types. Numerical examples of using two types of classifiers are given for confirming the theoretical differences, including the extremely-class-imbalanced cases. Finally, we briefly summarize the Bayesian classifiers and mutual-information classifiers in terms of their application advantages, respectively.

After detailed analysis, which will be helpful in choosing appropriate situations for the use of Bayesian or mutual information classifiers, the paper concludes:

Bayesian and mutual-information classifiers are different essentially from their applied learning targets. From application viewpoints, Bayesian classifiers are more suitable to the cases when cost terms are exactly known for trade-off of error types and reject types. Mutual-information classifiers are capable of objectively balancing error types and reject types automatically without employing cost terms, even in the cases of extremely class-imbalanced datasets, which may describe a theoretical interpretation why humans are more concerned about the accuracy of rare classes in classifications.

### How to apply Naive Bayes Classifiers to document classification problems.

Wednesday, March 23rd, 2011

How to apply Naive Bayes Classifiers to document classification problems.

Nils Haldenwang does a good job of illustrating the actual application of a naive Bayes classifier to document classification.

A good introduction to an important topic for the construction of topic maps.

### Combining Pattern Classifiers: Methods and Algorithms

Saturday, March 12th, 2011

Combining Pattern Classifiers: Methods and Algorithms, Ludmila I. Kuncheva (2004)

WorldCat entry: Combining Pattern Classifiers: Methods and Algorithms

From the preface:

Everyday life throws at us an endless number of pattern recognition problems: smells, images, voices, faces, situations, and so on. Most of these problems we solve at a sensory level or intuitively, without an explicit method or algorithm. As soon as we are able to provide an algorithm the problem becomes trivial and we happily delegate it to the computer. Indeed, machines have confidently replaced humans in many formerly difficult or impossible, now just tedious pattern recognition tasks such as mail sorting, medical test reading, military target recognition, signature verification, meteorological forecasting, DNA matching, fingerprint recognition, and so on.

In the past, pattern recognition focused on designing single classifiers. This book is about combining the “opinions” of an ensemble of pattern classifiers in the hope that the new opinion will be better than the individual ones. “Vox populi, vox Dei.”

The field of combining classifiers is like a teenager: full of energy, enthusiasm, spontaneity, and confusion; undergoing quick changes and obstructing the attempts to bring some order to its cluttered box of accessories. When I started writing this book, the field was small and tidy, but it has grown so rapidly that I am faced with the Herculean task of cutting out a (hopefully) useful piece of this rich, dynamic, and loosely structured discipline. This will explain why some methods and algorithms are only sketched, mentioned, or even left out and why there is a chapter called “Miscellanea” containing a collection of important topics that I could not fit anywhere else.

Appreciate the author’s suggesting of older material to see how the pattern recognition developed.

Suggestions/comments on this or later literature on pattern recognition?