Archive for the ‘Bayesian Data Analysis’ Category

Probabilistic Programming and Bayesian Methods for Hackers

Saturday, March 30th, 2013

Probabilistic Programming and Bayesian Methods for Hackers

From the webpage:

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight-days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

DARPA (Logic and Probabilistic Programming) should be glad that someone else is working on probabilistic programming.

I first saw this at Nat Torkington’s Four short links: 29 March 2013.

Using Bayesian networks to discover relations…

Saturday, March 23rd, 2013

Using Bayesian networks to discover relations between genes, environment, and disease by Chengwei Su, Angeline Andrew, Margaret R Karagas and Mark E Borsuk. (BioData Mining 2013, 6:6 doi:10.1186/1756-0381-6-6)

Abstract:

We review the applicability of Bayesian networks (BNs) for discovering relations between genes, environment, and disease. By translating probabilistic dependencies among variables into graphical models and vice versa, BNs provide a comprehensible and modular framework for representing complex systems. We first describe the Bayesian network approach and its applicability to understanding the genetic and environmental basis of disease. We then describe a variety of algorithms for learning the structure of a network from observational data. Because of their relevance to real-world applications, the topics of missing data and causal interpretation are emphasized. The BN approach is then exemplified through application to data from a population-based study of bladder cancer in New Hampshire, USA. For didactical purposes, we intentionally keep this example simple. When applied to complete data records, we find only minor differences in the performance and results of different algorithms. Subsequent incorporation of partial records through application of the EM algorithm gives us greater power to detect relations. Allowing for network structures that depart from a strict causal interpretation also enhances our ability to discover complex associations including gene-gene (epistasis) and gene-environment interactions. While BNs are already powerful tools for the genetic dissection of disease and generation of prognostic models, there remain some conceptual and computational challenges. These include the proper handling of continuous variables and unmeasured factors, the explicit incorporation of prior knowledge, and the evaluation and communication of the robustness of substantive conclusions to alternative assumptions and data manifestations.

From the introduction:

BNs have been applied in a variety of settings for the purposes of causal study and probabilistic prediction, including medical diagnosis, crime and terrorism risk, forensic science, and ecological conservation (see [7]). In bioinformatics, they have been used to analyze gene expression data [8,9], derive protein signaling networks [10-12], predict protein-protein interactions [13], perform pedigree analysis [14], conduct genetic epidemiological studies [5], and assess the performance of microsatellite markers on cancer recurrence [15].

Not to mention criminal investigations: Bayesian Network – [Crime Investigation] (YouTube). ;-)

Once relations are discovered, you are free to decorate them with roles, properties, etc., in other words, associations.
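For readers who have not met Bayesian networks before, here is a minimal sketch of the "probabilistic dependencies among variables" that the abstract mentions. It is not the authors' model: the three-node network (gene and exposure as parents of disease), the variable names, and every probability are hypothetical, chosen only to show how a query is answered by summing over the joint distribution.

```python
from itertools import product

# Hypothetical numbers for illustration only; none come from the paper.
P_GENE = {True: 0.3, False: 0.7}        # P(G): carries a risk variant
P_EXPOSURE = {True: 0.4, False: 0.6}    # P(E): environmental exposure
# P(D = True | G, E): disease risk given both parents, keyed by (G, E)
P_DISEASE = {
    (True, True): 0.20,
    (True, False): 0.05,
    (False, True): 0.04,
    (False, False): 0.01,
}


def joint(gene, exposure, disease):
    """Joint probability P(G, E, D), factorized along the network edges."""
    p_disease = P_DISEASE[(gene, exposure)]
    p_d = p_disease if disease else 1.0 - p_disease
    return P_GENE[gene] * P_EXPOSURE[exposure] * p_d


def posterior_gene_given_disease():
    """P(G = True | D = True), computed by enumeration over E (and G)."""
    numerator = sum(joint(True, e, True) for e in (True, False))
    evidence = sum(
        joint(g, e, True) for g, e in product((True, False), repeat=2)
    )
    return numerator / evidence
```

Enumeration like this only scales to a handful of variables. Learning the structure from data, which is the subject of the paper, is a separate and much harder problem.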

Missing-Data Imputation

Saturday, December 29th, 2012

New book by Stef van Buuren on missing-data imputation looks really good! by Andrew Gelman.

From the post:

Ben points us to a new book, Flexible Imputation of Missing Data. It’s excellent and I highly recommend it. Definitely worth the $89.95. Van Buuren’s book is great even if you don’t end up using the algorithm described in the book (I actually like their approach but I do think there are some limitations with their particular implementation, which is one reason we’re developing our own package); he supplies lots of intuition, examples, and graphs.

Steve Newcomb makes the point that data is dirty. Always.

Stef van Buuren suggests that data may be missing and requires imputation.

Together that means dirty data may be missing and requires imputation.

;-)

Imputed or not, data is no more reliable than we are. Use with caution.

Think Bayes: Bayesian Statistics Made Simple

Thursday, October 11th, 2012

Think Bayes: Bayesian Statistics Made Simple by Allen B. Downey.

Think Bayes is an introduction to Bayesian statistics using computational methods. This version of the book is a rough draft. I am making this draft available for comments, but it comes with the warning that it is probably full of errors.

Allen has written free books on Python, statistics, complexity and now Bayesian statistics.

If you don’t know his books, this is a good opportunity to give them a try.

Stan (Bayesian Inference) [update]

Sunday, October 7th, 2012

Stan

From the webpage:

Stan is an open-source package for obtaining Bayesian inference using the No-U-Turn sampler, a variant of Hamiltonian Monte Carlo.

I first reported on a presentation, Stan: A (Bayesian) Directed Graphical Model Compiler, last January, when Stan was still unreleased.

Following a link from Christophe Lalanne’s A bag of tweets / September 2012, I find the released version of the software!

Very cool!
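To give a feel for what "Hamiltonian Monte Carlo" means in the description above, here is a rough sketch of plain HMC on a one-dimensional standard normal target. It is not Stan's code and it is not the No-U-Turn sampler: the step size and the number of leapfrog steps are fixed by hand here, whereas NUTS chooses the trajectory length adaptively. The function names and default values are my own assumptions.

```python
import math
import random


def neg_log_density(q):
    """Potential energy: negative log density of a standard normal, up to a constant."""
    return 0.5 * q * q


def grad_neg_log_density(q):
    """Gradient of the potential energy with respect to q."""
    return q


def hmc_step(q, step_size, n_leapfrog, rng):
    """One HMC transition: draw a momentum, run leapfrog, then accept or reject."""
    p = rng.gauss(0.0, 1.0)
    start_energy = neg_log_density(q) + 0.5 * p * p

    # Leapfrog integration: half momentum step, full position steps, half momentum step.
    q_new, p_new = q, p
    p_new -= 0.5 * step_size * grad_neg_log_density(q_new)
    for step in range(n_leapfrog):
        q_new += step_size * p_new
        if step < n_leapfrog - 1:
            p_new -= step_size * grad_neg_log_density(q_new)
    p_new -= 0.5 * step_size * grad_neg_log_density(q_new)

    # Metropolis correction for the integrator's discretization error.
    end_energy = neg_log_density(q_new) + 0.5 * p_new * p_new
    accept_prob = math.exp(min(0.0, start_energy - end_energy))
    return q_new if rng.random() < accept_prob else q


def sample(n_draws, step_size=0.2, n_leapfrog=8, seed=0):
    """Run the chain from q = 0 and return the list of draws."""
    rng = random.Random(seed)
    q = 0.0
    draws = []
    for _ in range(n_draws):
        q = hmc_step(q, step_size, n_leapfrog, rng)
        draws.append(q)
    return draws
```

With a few thousand draws, the sample mean and variance should land close to 0 and 1. In a real model the gradient would come from automatic differentiation rather than a hand-written function.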

SkyQuery: …Parallel Probabilistic Join Engine… [When Static Mapping Isn't Enough]

Sunday, July 1st, 2012

SkyQuery: An Implementation of a Parallel Probabilistic Join Engine for Cross-Identification of Multiple Astronomical Databases by László Dobos, Tamás Budavári, Nolan Li, Alexander S. Szalay, and István Csabai.

Abstract:

Multi-wavelength astronomical studies require cross-identification of detections of the same celestial objects in multiple catalogs based on spherical coordinates and other properties. Because of the large data volumes and spherical geometry, the symmetric N-way association of astronomical detections is a computationally intensive problem, even when sophisticated indexing schemes are used to exclude obviously false candidates. Legacy astronomical catalogs already contain detections of more than a hundred million objects while the ongoing and future surveys will produce catalogs of billions of objects with multiple detections of each at different times. The varying statistical error of position measurements, moving and extended objects, and other physical properties make it necessary to perform the cross-identification using a mathematically correct, proper Bayesian probabilistic algorithm, capable of including various priors. One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios. Consequently, a novel system is necessary that can cross-identify multiple catalogs on-demand, efficiently and reliably. In this paper, we present our solution based on a cluster of commodity servers and ordinary relational databases. The cross-identification problems are formulated in a language based on SQL, but extended with special clauses. These special queries are partitioned spatially by coordinate ranges and compiled into a complex workflow of ordinary SQL queries. Workflows are then executed in a parallel framework using a cluster of servers hosting identical mirrors of the same data sets.

Astronomy is a cool area to study and has data out the wazoo, but I was struck by:

One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios.

Is identity with sharp edges, susceptible to pair-wise mapping, the common case?

Or do we just see some identity issues that way?

I commend the paper to you as an example of dynamic merging practice.

Predictive Analytics: NeuralNet, Bayesian, SVM, KNN [part 4]

Monday, June 4th, 2012

Predictive Analytics: NeuralNet, Bayesian, SVM, KNN by Ricky Ho.

From the post:

Continuing from my previous blog in walking down the list of Machine Learning techniques. In this post, we’ll be covering Neural Network, Support Vector Machine, Naive Bayes and Nearest Neighbor. Again, we’ll be using the same iris data set that we prepared in the last blog.

Ricky continues his march through machine learning techniques. This post promises one more to go.
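Ricky's post works in R. As a language-neutral illustration of one technique on his list, here is a minimal Gaussian Naive Bayes sketch in plain Python. It is not his code, the function names are mine, and the six two-feature rows are made-up toy measurements rather than values from the iris data set.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a class prior plus per-feature mean and variance for each label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict(model, row):
    """Return the label with the highest log posterior, treating features as independent."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score

    return max(model, key=log_posterior)


# Made-up toy data: (length, width) pairs for two hypothetical classes.
ROWS = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (4.9, 1.6)]
LABELS = ["small", "small", "small", "large", "large", "large"]
MODEL = fit_gaussian_nb(ROWS, LABELS)
```

Calling predict(MODEL, (1.6, 0.25)) picks "small" and predict(MODEL, (4.4, 1.3)) picks "large". The "naive" part is the independence assumption: each feature contributes its own term to the score.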

Flexibility to Discover…

Sunday, April 22nd, 2012

David W. Hogg writes:

If you want to have the flexibility to discover correct structure in your data, you may have to adopt methods that permit variable model complexity.

Context to follow but think about that for a minute.

Do you want to discover structures or confirm what you already believe to be present?

In context:

On day zero of AISTATS, I gave a workshop on machine learning in astronomy, concentrating on the ideas of (a) trusting unreliable data and (b) the necessity of having a likelihood, or probability of the data given the model, making use of a good noise model. Before me, Zoubin Ghahramani gave a very valuable overview of Bayesian non-parametric methods. He emphasized something that was implied to me by Brendon Brewer’s success on my MCMC High Society challenge and mentioned by Rob Fergus when we last talked about image modeling, but which has rarely been explored in astronomy: If you want to have the flexibility to discover correct structure in your data, you may have to adopt methods that permit variable model complexity. The issues are two-fold: For one, a sampler or an optimizer can easily get stuck in a bad local spot if it doesn’t have the freedom to branch more model complexity somewhere else and then later delete the structure that is getting it stuck. For another, if you try to model an image that really does have five stars in it with a model containing only four stars, you are requiring that you will do a bad job! Bayesian non-parametrics is this kind of argument on speed, with all sorts of processes named after different kinds of restaurants. But just working with the simple dictionary of stars and galaxies, we could benefit from the sampling ideas at least. (emphasis added)

Isn’t that awesome? With all the astronomy data that is coming online? (With lots of it already online.)

Not to mention finding structures in other data as well. Maybe even in “big data.”

Big-data Naive Bayes and Classification Trees with R and Netezza

Monday, March 19th, 2012

Big-data Naive Bayes and Classification Trees with R and Netezza

From the post:

The IBM Netezza analytics appliances combine high-capacity storage for Big Data with a massively-parallel processing platform for high-performance computing. With the addition of Revolution R Enterprise for IBM Netezza, you can use the power of the R language to build predictive models on Big Data.

In the demonstration below, Revolution Analytics’ Derek Norton analyzes loan approval data stored on the IBM appliance. You’ll see the R code used to:

  • Explore the raw data (with summary statistics and charts)
  • Prepare the data for statistical analysis, and create training and test sets
  • Create predictive models using classification trees and Naïve Bayes
  • Predict using the models, and evaluate model performance using confusion matrices

[embedded presentation omitted]

Note that while R code is being run on Derek’s laptop, the raw data is never moved from the appliance, and the analytic computations take place “in-database” within the appliance itself (where the Revolution R Enterprise engine is also running on each parallel core).

Another incentive for you to be learning R.

Does it sound to you like “Derek’s computer” is a terminal entering instructions that are executed elsewhere? ;-) (If the computing fabric develops fast enough, we may lose the distinction of a “personal” computer. There will simply be computing.)

Meant to mention this the other day. Enjoy!
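If the last step in the list above, evaluating models with confusion matrices, is unfamiliar, here is a small sketch of the idea. It is plain Python rather than the R shown in the demonstration, the function names are mine, and the loan labels are invented for the example.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Share of cases where the predicted label matches the actual label."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total


# Invented labels for six hypothetical loan applications.
ACTUAL = ["approve", "approve", "deny", "deny", "approve", "deny"]
PREDICTED = ["approve", "deny", "deny", "deny", "approve", "approve"]
MATRIX = confusion_matrix(ACTUAL, PREDICTED)
```

Here MATRIX[("approve", "deny")] counts approvals that the model wrongly denied. The off-diagonal cells tell you which kind of mistake a model makes, which a single accuracy number hides.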

Skytree: Big Data Analytics

Saturday, March 3rd, 2012

Skytree: Big Data Analytics

Released this last week, Skytree offers both local and cloud-based data analytics.

From the website:

Skytree Server can accurately perform machine learning on massive datasets at high speed.

In the same way a relational database system (or database accelerator) is designed to perform SQL queries efficiently, Skytree Server is designed to efficiently perform machine learning on massive datasets.

Skytree Server’s scalable architecture performs state-of-the-art machine learning methods on data sets that were previously too big for machine learning algorithms to process. Leveraging advanced algorithms implemented on specialized systems and dedicated data representations tuned to machine learning, Skytree Server delivers up to 10,000 times performance improvement over existing approaches.

Currently supported machine learning methods:

  • Neighbors (Nearest, Farthest, Range, k, Classification)
  • Kernel Density Estimation and Non-parametric Bayes Classifier
  • K-Means
  • Linear Regression
  • Support Vector Machines (SVM)
  • Fast Singular Value Decomposition (SVD)
  • The Two-point Correlation

There is a “free” local version with a data limit (100,000 records) and of course the commercial local and cloud versions.

Comments?

Will the Circle Be Unbroken? Interactive Annotation!

Wednesday, February 29th, 2012

I have to agree with Bob Carpenter, the title is a bit much:

Closing the Loop: Fast, Interactive Semi-Supervised Annotation with Queries on Features and Instances

From the post:

Whew, that was a long title. Luckily, the paper’s worth it:

Settles, Burr. 2011. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. EMNLP.

It’s a paper that shows you how to use active learning to build reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.

Both the paper and Bob’s review merit close reading.

Stan: A (Bayesian) Directed Graphical Model Compiler

Sunday, January 22nd, 2012

Stan: A (Bayesian) Directed Graphical Model Compiler

Post with link to presentation to NYC machine learning meetup.

Stan, a C++ library for probability and sampling, has not (yet) been released (BSD license), but the Google Code page lists the following components:

  • Directed Graphical Model Compiler
  • (Adaptive) Hamiltonian Monte Carlo Sampling
  • Hamiltonian Monte Carlo Sampling
  • Gibbs Sampling for Discrete Parameters
  • Reverse Mode Algorithmic Differentiation
  • Probability Distributions
  • Special Functions
  • Matrices and Linear Algebra

Stan: A Bayesian Directed Graphical Model Compiler

Saturday, January 7th, 2012

Stan: A Bayesian Directed Graphical Model Compiler by Bob Carpenter.

I (Bob) am going to give a talk at the next NYC Machine Learning Meetup on 19 January 2012 at 7 PM:

There’s an abstract on the meetup site. The short story is that Stan’s a directed graphical model compiler (like BUGS) that uses adaptive Hamiltonian Monte Carlo sampling to estimate posterior distributions for Bayesian models.

The official version 1 release is coming up soon, but until then, you can check out our work in progress at:

  • Google Code: Stan.

If you are in New York on the 19th of this month, please attend and post a note about the meeting.

Otherwise, play with “Stan” while we await the next release.

Now in JAGS! Now in JAGS!

Wednesday, January 4th, 2012

Now in JAGS! Now in JAGS!

John K. Kruschke writes:

I have created JAGS versions of all the BUGS programs in Doing Bayesian Data Analysis. Unlike BUGS, JAGS runs on MacOS, Linux, and Windows. JAGS has other features that make it more robust and user-friendly than BUGS. I recommend that you use the JAGS versions of the programs. Please let me know if you encounter any errors or inaccuracies in the programs. (hyperlink to book added)

First spotted by Matthew O’Donnell (@mdbod).