Archive for the ‘Bayesian Data Analysis’ Category

…Whether My Wife is Pregnant or Not

Friday, November 6th, 2015

A Bayesian Model to Calculate Whether My Wife is Pregnant or Not by Rasmus Bååth.

From the post:

On the 21st of February, 2015, my wife had not had her period for 33 days, and as we were trying to conceive, this was good news! An average period is around a month, and if you are a couple trying to go triple, then a missing period is a good sign something is going on. But at 33 days, this was not yet a missing period, just a late one, so how good news was it? Pretty good, really good, or just meh?

To get at this I developed a simple Bayesian model that, given the number of days since your last period and your history of period onsets, calculates the probability that you are going to be pregnant this period cycle. In this post I will describe what data I used, the priors I used, the model assumptions, and how to fit it in R using importance sampling. And finally I show you why the result of the model really didn’t matter in the end. Also I’ll give you a handy script if you want to calculate this for yourself. 🙂

I first saw this post in a tweet by Neil Saunders who commented:

One of the clearest Bayesian methods articles I’ve read (I struggle with most of them)

I agree with Neil’s assessment and suspect you will as well.

Unlike most Bayesian methods articles, this one also has a happy ending.

You should start following Rasmus Bååth on Twitter.

Uncovering Community Structures with Initialized Bayesian Nonnegative Matrix Factorization

Wednesday, October 1st, 2014

Uncovering Community Structures with Initialized Bayesian Nonnegative Matrix Factorization by Xianchao Tang, Tao Xu, Xia Feng, and, Guoqing Yang.

Abstract:

Uncovering community structures is important for understanding networks. Currently, several nonnegative matrix factorization algorithms have been proposed for discovering community structure in complex networks. However, these algorithms exhibit some drawbacks, such as unstable results and inefficient running times. In view of the problems, a novel approach that utilizes an initialized Bayesian nonnegative matrix factorization model for determining community membership is proposed. First, based on singular value decomposition, we obtain simple initialized matrix factorizations from approximate decompositions of the complex network’s adjacency matrix. Then, within a few iterations, the final matrix factorizations are achieved by the Bayesian nonnegative matrix factorization method with the initialized matrix factorizations. Thus, the network’s community structure can be determined by judging the classification of nodes with a final matrix factor. Experimental results show that the proposed method is highly accurate and offers competitive performance to that of the state-of-the-art methods even though it is not designed for the purpose of modularity maximization.

Some titles grab you by the lapels and say, “READ ME!,” don’t they? 😉

I found the first paragraph a much friendlier summary of why you should read this paper (footnotes omitted):

Many complex systems in the real world have the form of networks whose edges are linked by nodes or vertices. Examples include social systems such as personal relationships, collaborative networks of scientists, and networks that model the spread of epidemics; ecosystems such as neuron networks, genetic regulatory networks, and protein-protein interactions; and technology systems such as telephone networks, the Internet and the World Wide Web [1]. In these networks, there are many sub-graphs, called communities or modules, which have a high density of internal links. In contrast, the links between these sub-graphs have a fairly lower density [2]. In community networks, sub-graphs have their own functions and social roles. Furthermore, a community can be thought of as a general description of the whole network to gain more facile visualization and a better understanding of the complex systems. In some cases, a community can reveal the real world network’s properties without releasing the group membership or compromising the members’ privacy. Therefore, community detection has become a fundamental and important research topic in complex networks.

If you think of “the real world network’s properties” as potential properties for identification of a network as a subject or as properties of the network as a subject, the importance of this article becomes clearer.

Being able to speak of sub-graphs as subjects with properties can only improve our ability to compare sub-graphs across complex networks.

BTW, all the data used in this article is available for downloading: http://dx.doi.org/10.6084/m9.figshare.1149965

I first saw this in a tweet by Brian Keegan.

Frequentism and Bayesianism: A Practical Introduction

Sunday, June 15th, 2014

Frequentism and Bayesianism: A Practical Introduction by Jake Vanderplas.

From the post:

One of the first things a scientist hears about statistics is that there is are two different approaches: frequentism and Bayesianism. Despite their importance, many scientific researchers never have opportunity to learn the distinctions between them and the different practical approaches that result. The purpose of this post is to synthesize the philosophical and pragmatic aspects of the frequentist and Bayesian approaches, so that scientists like myself might be better prepared to understand the types of data analysis people do.

I’ll start by addressing the philosophical distinctions between the views, and from there move to discussion of how these ideas are applied in practice, with some Python code snippets demonstrating the difference between the approaches.

This is the first of four posts that include Python code to demonstrate the impact of your starting position.

The other posts are:

Very well written and highly entertaining!

Jake leaves out another approach to statistics: Lying.

Lying avoids the need for a philosophical position or to have data for processing with Python or any other programming language. Even calculations can be lied about.

Most commonly found political campaigns, legislative hearings and the like. How you would characterize any particular political lie is left as an exercise for the reader. 😉

12 Free (as in beer) Data Mining Books

Sunday, May 18th, 2014

12 Free (as in beer) Data Mining Books by Chris Leonard.

While all of these volumes could be shelved under “data mining” in a bookstore, I would break them out into smaller categories:

  • Bayesian Analysis/Methods
  • Data Mining
  • Data Science
  • Machine Learning
  • R
  • Statistical Learning

Didn’t want you to skip over Chris’ post because it was “just about data mining.” 😉

Check your hard drive to see what you are missing.

I first saw this in a tweet by Carl Anderson.

Building fast Bayesian computing machines…

Friday, March 7th, 2014

Building fast Bayesian computing machines out of intentionally stochastic, digital parts by Vikash Mansinghka and Eric Jonas.

Abstract:

The brain interprets ambiguous sensory information faster and more reliably than modern computers, using neurons that are slower and less reliable than logic gates. But Bayesian inference, which underpins many computational models of perception and cognition, appears computationally challenging even given modern transistor speeds and energy budgets. The computational principles and structures needed to narrow this gap are unknown. Here we show how to build fast Bayesian computing machines using intentionally stochastic, digital parts, narrowing this efficiency gap by multiple orders of magnitude. We find that by connecting stochastic digital components according to simple mathematical rules, one can build massively parallel, low precision circuits that solve Bayesian inference problems and are compatible with the Poisson firing statistics of cortical neurons. We evaluate circuits for depth and motion perception, perceptual learning and causal reasoning, each performing inference over 10,000+ latent variables in real time – a 1,000x speed advantage over commodity microprocessors. These results suggest a new role for randomness in the engineering and reverse-engineering of intelligent computation.

Ironic that the greater precision and repeatability of our digital computers may be choices that are holding back advancements in Bayesian digital computing machines.

I have written before about the RDF ecosystem being over complex and precise for use by everyday users.

We should strive to capture semantics as understood by scientists, researchers, students, and others. Less precise than professional semantics but precise enough to make it usable?

I first saw this in a tweet by Stefano Bertolo.

BayesDB

Sunday, December 8th, 2013

BayesDB (Alpha 0.1.0 Release)

From the webpage:

BayesDB, a Bayesian database table, lets users query the probable implications of their data as easily as a SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.

BayesDB is suitable for analyzing complex, heterogeneous data tables with up to tens of thousands of rows and hundreds of variables. No preprocessing or parameter adjustment is required, though experts can override BayesDB’s default assumptions when appropriate.

BayesDB’s inferences are based in part on CrossCat, a new, nonparametric Bayesian machine learning method, that automatically estimates the full joint distribution behind arbitrary data tables.

Now there’s an interesting idea!

Not sure if it is a good idea but it certainly is an interesting one.

Cassandra and Naive Bayes

Saturday, November 16th, 2013

Using Cassandra to Build a Naive Bayes Classifier of Users Based Upon Behavior by John Berryman.

From the post:

In our last post, we found out how simple it is to use Cassandra to estimate ad conversion. It’s easy, because effectively all you have to do is accumulate counts – and Cassandra is quite good at counting. As we demonstrated in that post, Cassandra can be used as a giant, distributed, redundant, “infinitely” scalable counting framework. During this post will take the online ad company example just a bit further by creating a Cassandra-backed Naive Bayes Classifier. Again, we see that the “secret sauce” is simply keeping track of the appropriate counts.

In the previous post, we helped equip your online ad company with the ability to track ad conversion rates. But competition is steep and we’ll need to do a little better than ad conversion rates if your company is to stay on top. Recently, suspicions have arisen that ads are often being shown to unlikely customers. A quick look at the logs confirms this concern. For instance, there was a case of one internet user that clicked almost every single ad that he was shown – so long as it related to the camping gear. Several times, he went on to make purchases: a tent, a lantern, and a sleeping bag. But despite this users obvious interest in outdoor sporting goods, your logs indicated that fully 90% of the ads he was shown were for women’s apparel. Of these ads, this user clicked none of them.

Let’s attack this problem by creating a classifier. Fortunately for us, your company specializes in two main genres, fashion, and outdoors sporting goods. If we can determine which type of user we’re dealing with, then we can improve our conversion rates considerably by simply showing users the appropriate ads.

So long as you remember the unlikely assumption of feature independence of Naive Bayes, you should be ok.

That is whatever features you are measuring are independent of each other.

Has been “successfully” used in a number of contexts, but the descriptions I have read don’t specify what they meant by “successful.” 😉

Introduction to Bayesian Networks & BayesiaLab

Thursday, October 17th, 2013

Introduction to Bayesian Networks & BayesiaLab by Stefan Conrady and Dr. Lionel Jouffe.

From the webpage:

With Professor Judea Pearl receiving the prestigious 2011 A.M. Turing Award, Bayesian networks have presumably received more public recognition than ever before. Judea Pearl’s achievement of establishing Bayesian networks as a new paradigm is fittingly summarized by Stuart Russell:

“[Judea Pearl] is credited with the invention of Bayesian networks, a mathematical formalism for defining complex probability models, as well as the principal algorithms used for inference in these models. This work not only revolutionized the field of artificial intelligence but also became an important tool for many other branches of engineering and the natural sciences. He later created a mathematical framework for causal inference that has had significant impact in the social sciences.”

While their theoretical properties made Bayesian networks immediately attractive for academic research, especially with regard to the study of causality, the arrival of practically feasible machine learning algorithms has allowed Bayesian networks to grow beyond its origin in the field of computer science. Since the first release of the BayesiaLab software package in 2001, Bayesian networks have finally become accessible to a wide range of scientists and analysts for use in many other disciplines.

In this introductory paper, we present Bayesian networks (the paradigm) and BayesiaLab (the software tool), from the perspective of the applied researcher.

The webpage gives an overview of the white paper. Or you can jump directly to the paper (PDF).

With the emphasis on machine processing, there will be people going through the motions of data processing with a black box and data dumps going into it.

And there will be people who understand the box but not the data flowing into it.

Finally there will be people using cutting edge techniques who understand the box and the data flowing into it.

Which group do you think will have the better results?

Diagrams for hierarchical models – we need your opinion

Monday, October 14th, 2013

Diagrams for hierarchical models – we need your opinion by John K. Kruschke.

If you haven’t done any good deeds lately, here is a chance to contribute to the common good.

From the post:

When trying to understand a hierarchical model, I find it helpful to make a diagram of the dependencies between variables. But I have found the traditional directed acyclic graphs (DAGs) to be incomplete at best and downright confusing at worst. Therefore I created differently styled diagrams for Doing Bayesian Data Analysis (DBDA). I have found them to be very useful for explaining models, inventing models, and programming models. But my idiosyncratic impression might be only that, and I would like your insights about the pros and cons of the two styles of diagrams. (emphasis in original)

John’s post has the details of the different diagram styles.

Which do you like better?

John is also the author of: Doing Bayesian data analysis : a tutorial with R and BUGS. My library system doesn’t have a copy but I can report that it has gotten really good reviews.

Bayesian Methods for Hackers

Thursday, July 11th, 2013

Bayesian Methods for Hackers by a community of authors!

From the readme:

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight-days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

If Bayesian inference is the destination, then mathematical analysis is a particular path to towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming. The latter path is much more useful, as it denies the necessity of mathematical intervention at each step, that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, where as the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place.

Bayesian Methods for Hackers is designed as a introduction to Bayesian inference from a computational/understanding-first, and mathematics-second, point of view. Of course as an introductory book, we can only leave it at that: an introductory book. For the mathematically trained, they may cure the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical-background, or one who is not interested in the mathematics but simply the practice of Bayesian methods, this text should be sufficient and entertaining.

(…)

Useful in case all the knowledge you want to put in a topic map is far from certain. 😉

Probabilistic Programming and Bayesian Methods for Hackers

Thursday, May 23rd, 2013

Probabilistic Programming and Bayesian Methods for Hackers by Cam Davidson-Pilon and others.

From the website:

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight-days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

If Bayesian inference is the destination, then mathematical analysis is a particular path to towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming. The latter path is much more useful, as it denies the necessity of mathematical intervention at each step, that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, where as the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place.

Bayesian Methods for Hackers is designed as a introduction to Bayesian inference from a computational/understanding-first, and mathematics-second, point of view. Of course as an introductory book, we can only leave it at that: an introductory book. For the mathematically trained, they may cure the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical-background, or one who is not interested in the mathematics but simply the practice of Bayesian methods, this text should be sufficient and entertaining.

Not yet complete but what is there you will find very useful.

Probabilistic Programming and Bayesian Methods for Hackers

Saturday, March 30th, 2013

Probabilistic Programming and Bayesian Methods for Hackers

From the webpage:

Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight-days of reading examples and trying to put the pieces together to understand the methods. There was simplely not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

DARPA (Logic and Probabilistic Programming) should be glad that someone else is working on probabilistic programming.

I first saw this at Nat Torkington’s Four short links: 29 March 2103.

Using Bayesian networks to discover relations…

Saturday, March 23rd, 2013

Using Bayesian networks to discover relations between genes, environment, and disease by Chengwei Su, Angeline Andrew, Margaret R Karagas and Mark E Borsuk. (BioData Mining 2013, 6:6 doi:10.1186/1756-0381-6-6)

Abstract:

We review the applicability of Bayesian networks (BNs) for discovering relations between genes, environment, and disease. By translating probabilistic dependencies among variables into graphical models and vice versa, BNs provide a comprehensible and modular framework for representing complex systems. We first describe the Bayesian network approach and its applicability to understanding the genetic and environmental basis of disease. We then describe a variety of algorithms for learning the structure of a network from observational data. Because of their relevance to real-world applications, the topics of missing data and causal interpretation are emphasized. The BN approach is then exemplified through application to data from a population-based study of bladder cancer in New Hampshire, USA. For didactical purposes, we intentionally keep this example simple. When applied to complete data records, we find only minor differences in the performance and results of different algorithms. Subsequent incorporation of partial records through application of the EM algorithm gives us greater power to detect relations. Allowing for network structures that depart from a strict causal interpretation also enhances our ability to discover complex associations including gene-gene (epistasis) and gene-environment interactions. While BNs are already powerful tools for the genetic dissection of disease and generation of prognostic models, there remain some conceptual and computational challenges. These include the proper handling of continuous variables and unmeasured factors, the explicit incorporation of prior knowledge, and the evaluation and communication of the robustness of substantive conclusions to alternative assumptions and data manifestations.

From the introduction:

BNs have been applied in a variety of settings for the purposes of causal study and probabilistic prediction, including medical diagnosis, crime and terrorism risk, forensic science, and ecological conservation (see [7]). In bioinformatics, they have been used to analyze gene expression data [8,9], derive protein signaling networks [10-12], predict protein-protein interactions [13], perform pedigree analysis [14], conduct genetic epidemiological studies [5], and assess the performance of microsatellite markers on cancer recurrence [15].

Not to mention criminal investigations: Bayesian Network – [Crime Investigation] (Youtube). 😉

Once relations are discovered, you are free to decorate them with roles, properties, etc., in other words, associations.

Missing-Data Imputation

Saturday, December 29th, 2012

New book by Stef van Buuren on missing-data imputation looks really good! by Andrew Gelman.

From the post:

Ben points us to a new book, Flexible Imputation of Missing Data. It’s excellent and I highly recommend it. Definitely worth the $89.95. Van Buuren’s book is great even if you don’t end up using the algorithm described in the book (I actually like their approach but I do think there are some limitations with their particular implementation, which is one reason we’re developing our own package); he supplies lots of intuition, examples, and graphs.

Steve Newcomb makes the point that data is dirty. Always.

Stef van Buuren suggests that data may be missing and requires imputation.

Together that means dirty data may be missing and requires imputation.

😉

Imputed or not, data is no more reliable than we are. Use with caution.

Think Bayes: Bayesian Statistics Made Simple

Thursday, October 11th, 2012

Think Bayes: Bayesian Statistics Made Simple by Allen B. Downey.

Think Bayes is an introduction to Bayesian statistics using computational methods. This version of the book is a rough draft. I am making this draft available for comments, but it comes with the warning that it is probably full of errors.

Allen has written free books on Python, statistics, complexity and now Bayesian statistics.

If you don’t know his books, good opportunity to give them a try.

Stan (Bayesian Inference) [update]

Sunday, October 7th, 2012

Stan

From the webpage:

Stan is an open-source package for obtaining Bayesian inference using the No-U-Turn sampler, a variant of Hamiltonian Monte Carlo.

I first reported on a presentation: Stan: A (Bayesian) Directed Graphical Model Compiler last January when Stan was unreleased.

Following a link from Christophe Lalanne’s A bag of tweets / September 2012, I find the released version of the software!

Very cool!

SkyQuery: …Parallel Probabilistic Join Engine… [When Static Mapping Isn’t Enough]

Sunday, July 1st, 2012

SkyQuery: An Implementation of a Parallel Probabilistic Join Engine for Cross-Identification of Multiple Astronomical Databases by László Dobos, Tamás Budavári, Nolan Li, Alexander S. Szalay, and István Csabai.

Abstract:

Multi-wavelength astronomical studies require cross-identification of detections of the same celestial objects in multiple catalogs based on spherical coordinates and other properties. Because of the large data volumes and spherical geometry, the symmetric N-way association of astronomical detections is a computationally intensive problem, even when sophisticated indexing schemes are used to exclude obviously false candidates. Legacy astronomical catalogs already contain detections of more than a hundred million objects while the ongoing and future surveys will produce catalogs of billions of objects with multiple detections of each at different times. The varying statistical error of position measurements, moving and extended objects, and other physical properties make it necessary to perform the cross-identification using a mathematically correct, proper Bayesian probabilistic algorithm, capable of including various priors. One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios. Consequently, a novel system is necessary that can cross-identify multiple catalogs on-demand, efficiently and reliably. In this paper, we present our solution based on a cluster of commodity servers and ordinary relational databases. The cross-identification problems are formulated in a language based on SQL, but extended with special clauses. These special queries are partitioned spatially by coordinate ranges and compiled into a complex workflow of ordinary SQL queries. Workflows are then executed in a parallel framework using a cluster of servers hosting identical mirrors of the same data sets.

Astronomy is a cool area to study and has data out the wazoo, but I was struck by:

One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios.

Is identity with sharp edges, susceptible to pair-wise mapping, the common case?

Or do we just see some identity issues that way?

Commend the paper to you as an example of dynamic merging practice.

Predictive Analytics: NeuralNet, Bayesian, SVM, KNN [part 4]

Monday, June 4th, 2012

Predictive Analytics: NeuralNet, Bayesian, SVM, KNN by Ricky Ho.

From the post:

Continuing from my previous blog in walking down the list of Machine Learning techniques. In this post, we’ll be covering Neural Network, Support Vector Machine, Naive Bayes and Nearest Neighbor. Again, we’ll be using the same iris data set that we prepared in the last blog.

Ricky continues his march through machine learning techniques. This post promises one more to go.

Flexibility to Discover…

Sunday, April 22nd, 2012

David W. Hogg writes:

If you want to have the flexibility to discover correct structure in your data, you may have to adopt methods that permit variable model complexity.

Context to follow but think about that for a minute.

Do you want to discover structures or confirm what you already believe to be present?

In context:

On day zero of AISTATS, I gave a workshop on machine learning in astronomy, concentrating on the ideas of (a) trusting unreliable data and (b) the necessity of having a likelihood, or probability of the data given the model, making use of a good noise model. Before me, Zoubin Ghahramani gave a very valuable overview of Bayesian non-parametric methods. He emphasized something that was implied to me by Brendon Brewer’s success on my MCMC High Society challenge and mentioned by Rob Fergus when we last talked about image modeling, but which has rarely been explored in astronomy: If you want to have the flexibility to discover correct structure in your data, you may have to adopt methods that permit variable model complexity. The issues are two-fold: For one, a sampler or an optimizer can easily get stuck in a bad local spot if it doesn’t have the freedom to branch more model complexity somewhere else and then later delete the structure that is getting it stuck. For another, if you try to model an image that really does have five stars in it with a model containing only four stars, you are requiring that you will do a bad job! Bayesian non-parametrics is this kind of argument on speed, with all sorts of processes named after different kinds of restaurants. But just working with the simple dictionary of stars and galaxies, we could benefit from the sampling ideas at least. (emphasis added)

Isn’t that awesome? With all the astronomy data that is coming online? (With lots of it already online.)

Not to mention finding structures in other data as well. Maybe even in “big data.”

Big-data Naive Bayes and Classification Trees with R and Netezza

Monday, March 19th, 2012

Big-data Naive Bayes and Classification Trees with R and Netezza

From the post:

The IBM Netezza analytics appliances combine high-capacity storage for Big Data with a massively-parallel processing platform for high-performance computing. With the addition of Revolution R Enterprise for IBM Netezza, you can use the power of the R language to build predictive models on Big Data.

In the demonstration below, Revolution Analytics’ Derek Norton analyzes loan approval data stored on the IBM appliance. You’ll see the R code used to:

  • Explore the raw data (with summary statistics and charts)
  • Prepare the data for statistical analysis, and create training and test sets
  • Create predictive models using classificiation trees and Naïve Bayes
  • Predict using the models, and evaluate model performance using confusion matrices

[embedded presentation omitted]

Note that while R code is being run on Derek’s laptop, the raw data is never moved from the appliance, and the analytic computations take place “in-database” within the appliance itself (where the Revolution R Enterprise engine is also running on each parallel core).

Another incentive for you to be learning R.

Does it sound to you like “Derek’s computer” is a terminal entering instructions that are executed elsewhere? 😉 (If the computing fabric develops fast enough, we may lose the distinction of a “personal” computer. There will simply be computing.)

Meant to mention this the other day. Enjoy!

Skytree: Big Data Analytics

Saturday, March 3rd, 2012

Skytree: Big Data Analytics

Released this last week, Skytree offers both local as well as cloud-based data analytics.

From the website:

Skytree Server can accurately perform machine learning on massive datasets at high speed.

In the same way a relational database system (or database accelerator) is designed to perform SQL queries efficiently, Skytree Server is designed to efficiently perform machine learning on massive datasets.

Skytree Server’s scalable architecture performs state-of-the-art machine learning methods on data sets that were previously too big for machine learning algorithms to process. Leveraging advanced algorithms implemented on specialized systems and dedicated data representations tuned to machine learning, Skytree Server delivers up to 10,000 times performance improvement over existing approaches.

Currently supported machine learning methods:

  • Neighbors (Nearest, Farthest, Range, k, Classification)
  • Kernel Density Estimation and Non-parametric Bayes Classifier
  • K-Means
  • Linear Regression
  • Support Vector Machines (SVM)
  • Fast Singular Value Decomposition (SVD)
  • The Two-point Correlation

There is a “free” local version with a data limit (100,000 records) and of course the commercial local and cloud versions.

Comments?

Will the Circle Be Unbroken? Interactive Annotation!

Wednesday, February 29th, 2012

I have to agree with Bob Carpenter, the title is a bit much:

Closing the Loop: Fast, Interactive Semi-Supervised Annotation with Queries on Features and Instances

From the post:

Whew, that was a long title. Luckily, the paper’s worth it:

Settles, Burr. 2011. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. EMNLP.

It’s a paper that shows you how to use active learning to build reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.

Both the paper and Bob’s review merit close reading.

Stan: A (Bayesian) Directed Graphical Model Compiler

Sunday, January 22nd, 2012

Stan: A (Bayesian) Directed Graphical Model Compiler

Post with link to presentation to NYC machine learning meetup.

Stan: a C++ library for probability and sampling has not (yet) been released (BSD license) but has the following components:

From the Google Code page:

  • Directed Graphical Model Compiler
  • (Adaptive) Hamiltonian Monte Carlo Sampling
  • Hamiltonian Monte Carlo Sampling
  • Gibbs Sampling for Discrete Parameters
  • Reverse Mode Algorithmic Differentiation
  • Probability Distributions
  • Special Functions
  • Matrices and Linear Algebra

Stan: A Bayesian Directed Graphical Model Compiler

Saturday, January 7th, 2012

Stan: A Bayesian Directed Graphical Model Compiler by Bob Carpenter.

I (Bob) am going to give a talk at the next NYC Machine Learning Meetupon 19 January 2012 at 7 PM:

There’s an abstract on the meetup site. The short story is that Stan’s a directed graphical model compiler (like BUGS) that uses adaptive Hamiltonian Monte Carlo sampling to estimate posterior distributions for Bayesian models.

The official version 1 release is coming up soon, but until then, you can check out our work in progress at:

  • Google Code: Stan.

If you are in New York on the 19th of this month, please attend and post a note about the meeting.

Otherwise, play with “Stan” while we await the next release.

Now in JAGS! Now in JAGS!

Wednesday, January 4th, 2012

Now in JAGS! Now in JAGS!

John K. Kruschke writes:

I have created JAGS versions of all the BUGS programs in Doing Bayesian Data Analysis. Unlike BUGS, JAGS runs on MacOS, Linux, and Windows. JAGS has other features that make it more robust and user-friendly than BUGS. I recommend that you use the JAGS versions of the programs. Please let me know if you encounter any errors or inaccuracies in the programs. (hyperlink to book added)

First spotted by Matthew O’Donnell (@mdbod).