Archive for the ‘Random Forests’ Category

Random Forests…

Monday, July 7th, 2014

Random Forests of Very Fast Decision Trees on GPU for Mining Evolving Big Data Streams by Diego Marron, Albert Bifet, Gianmarco De Francisci Morales.


Random Forests is a classical ensemble method used to improve the performance of single tree classifiers. It is able to obtain superior performance by increasing the diversity of the single classifiers. However, in the more challenging context of evolving data streams, the classifier has also to be adaptive and work under very strict constraints of space and time. Furthermore, the computational load of using a large number of classifiers can make its application extremely expensive. In this work, we present a method for building Random Forests that use Very Fast Decision Trees for data streams on GPUs. We show how this method can benefit from the massive parallel architecture of GPUs, which are becoming an efficient hardware alternative to large clusters of computers. Moreover, our algorithm minimizes the communication between CPU and GPU by building the trees directly inside the GPU. We run an empirical evaluation and compare our method to two well know machine learning frameworks, VFML and MOA. Random Forests on the GPU are at least 300x faster while maintaining a similar accuracy.

The authors should get a special mention for honesty in research publishing. Figure 11 shows their GPU Random Forest algorithm seeming to scale almost constantly. The authors explain:

In this dataset MOA scales linearly while GPU Random Forests seems to scale almost constantly. This is an effect of the scale, as GPU Random Forests runs in milliseconds instead of minutes.

How fast/large are your data streams?

I first saw this in a tweet by Stefano Bertolo.

How Statistics lifts the fog of war in Syria

Monday, March 17th, 2014

How Statistics lifts the fog of war in Syria by David White.

From the post:

In a fascinating talk at Strata Santa Clara in February, HRDAG’s Director of Research Megan Price explained the statistical technique she used to make sense of the conflicting information. Each of the four agencies shown in the chart above published a list of identified victims. By painstakingly linking the records between the different agencies (no simple task, given incomplete information about each victim and variations in capturing names, ages etc.), HRDAG can get a more complete sense of the total number of casualties. But the real insight comes from recognizing that some victims were reported by no agency at all. By looking at the rates at which some known victims were not reported by all of the agencies, HRDAG can estimate the number of victims that were identified by nobody, and thereby get a more accurate count of total casualties. (The specific statistical technique used was Random Forests, using the R language. You can read more about the methodology here.)

Caution is always advisable with government issued data but especially so when it arises from an armed conflict.

A forerunner to topic maps, record linkage (which is still widely used), plays a central role in collating data recorded in various ways. It isn’t possible to collate heterogeneous data without creating a uniform set of records (record linkage) or by mapping the subjects of the original records together (topic maps).

The usual moniker, “big data” should really be: “big, homogeneous data (BHD). Which if that is what you have, works great. If that isn’t what you have, works less great. If at all.

BTW, groups like the Human Rights Data Analysis Group (HRDAG) would have far more credibility with me if their projects list didn’t read:

  • Africa
  • Asia
  • Europe
  • Middle East
  • Central America
  • South America

Do you notice anyone missing from that list?

I have always thought that “human rights” included cases of:

  • sexual abuse
  • chlid abuse
  • violence
  • discrimination
  • and any number of similar issues

I can think of another place where those conditions exist in epidemic proportions.

Can’t you?

My Intro to Multiple Classification…

Saturday, December 29th, 2012

My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis

From the post:

After the work I did for my last post, I wanted to practice doing multiple classification. I first thought of using the famous iris dataset, but felt that was a little boring. Ideally, I wanted to look for a practice dataset where I could successfully classify data using both categorical and numeric predictors. Unfortunately it was tough for me to find such a dataset that was easy enough for me to understand.

The dataset I use in this post comes from a textbook called Analyzing Categorical Data by Jeffrey S Simonoff, and lends itself to basically the same kind of analysis done by blogger “Wingfeet” in his post predicting authorship of Wheel of Time books. In this case, the dataset contains counts of stop words (function words in English, such as “as”, “also, “even”, etc.) in chapters, or scenes, from books or plays written by Jane Austen, Jack London (I’m not sure if “London” in the dataset might actually refer to another author), John Milton, and William Shakespeare. Being a textbook example, you just know there’s something worth analyzing in it!! The following table describes the numerical breakdown of books and chapters from each author:

Introduction to authorship studies as they were known (may still be) in the academic circles of my youth.

I wonder if the same techniques are as viable today as on the Federalist Papers?

The Wheel of Time example demonstrates the technique remains viable for novel authors.

But what about authorship more broadly?

Can we reliably distinguish between news commentary from multiple sources?

Or between statements by elected officials?

How would your topic map represent purported authorship versus attributed authorship?

Or even a common authorship for multiple purported authors? (speech writers)

The Rewards of Ignoring Data

Monday, December 17th, 2012

The Rewards of Ignoring Data by Charles Parker.

From the post:

Can you make smarter decisions by ignoring data? It certainly runs counter to our mission, and sounds a little like an Orwellean dystopia. But as we’re going to see, ignoring some of your data some of the time can be a very useful thing to do.

Charlie does an excellent job of introducing the use of multiple models of data and includes deeper material:

There are fairly deep mathematical reasons for this, and ML scientist par excellence Robert Shapire lays out one of the most important arguments in the landmark paper “The Strength of Weak Learnability” in which he proves that a machine learning algorithm that performs only slightly better than randomly can be “boosted” into a classifier that is able to learn to an arbitrary degree of accuracy. For this incredible contribution (and for the later paper that gave us the Adaboost algorithm), he and his colleague Yoav Freund earned the Gödel Prize for computer science theory, the only time the award has been given for a machine learning paper.

Not being satisfied, Charles demonstrates how you can create a random decision forest from your data.

Which is possible without reading the deeper material.

Random Forest Methodology – Bioinformatics

Friday, October 19th, 2012

Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics by Anne-Laure Boulesteix, Silke Janitza, Jochen Kruppa, Inke R. König

(Boulesteix, A.-L., Janitza, S., Kruppa, J. and König, I. R. (2012), Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. WIREs Data Mining Knowl Discov, 2: 493–507. doi: 10.1002/widm.1072)


The random forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables and return measures of variable importance. This paper synthesizes 10 years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is paid to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research.

Something to expand your horizons a bit.

And a new way to say “curse of dimensionality,” to-wit,

‘n ≪ p curse’

New to me anyway.

I was amused to read at the Wikipedia article on random forests that its disadvantages include:

Unlike decision trees, the classifications made by Random Forests are difficult for humans to interpret.

Turn about is fair play since many classifications made by humans are difficult for computers to interpret. 😉


Monday, December 27th, 2010


From the website:

Open source data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting. Components for machine learning. Extensions for bioinformatics and text mining. Packed with features for data analytics.

I had to look at the merge data widget.

Which is said to: Merges two data sets based on the values of selected attributes.

According to the documentation:

Merge Data widget is used to horizontally merge two data sets based on the values of selected attributes. On input, two data sets are required, A and B. The widget allows for selection of an attribute from each domain which will be used to perform the merging. When selected, the widget produces two outputs, A+B and B+A. The first output (A+B) corresponds to instances from input data A which are appended attributes from B, and the second output (B+A) to instances from B which are appended attributes from A.

The merging is done by the values of the selected (merging) attributes. For example, instances from from A+B are constructed in the following way. First, the value of the merging attribute from A is taken and instances from B are searched with matching values of the merging attributes. If more than a single instance from B is found, the first one is taken and horizontally merged with the instance from A. If no instance from B match the criterium, the unknown values are assigned to the appended attributes. Similarly, B+A is constructed.

Which illustrates the problem that topic maps solves rather neatly:

  1. How does a subsequent researcher reliably duplicate such a merger?
  2. How does a subsequent researcher reliably merge that data with other data?
  3. How do other researchers reliably merge that data with their own data?

Answer is: They can’t. Not enough information.

Question: How would you change the outcome for those three questions? In detail. (5-7 pages, citations)


Sunday, December 26th, 2010

Waffles Authors: Mike Gashler

From the website:

Waffles is a collection of command-line tools for performing machine learning tasks. These tools are divided into 4 script-friendly apps:

waffles_learn contains tools for supervised learning.
waffles_transform contains tools for manipulating data.
waffles_plot contains tools for visualizing data.
waffles_generate contains tools to generate certain types of data.

For people who prefer not to have to remember commands, waffles also includes a graphical tool called


which guides the user to generate a command that will perform the desired task.

While exploring the site I looked at the demo applications and:

At some point, it seems, almost every scholar has an idea for starting a new journal that operates in some a-typical manner. This demo is a framework for the back-end of an on-line journal, to help get you started.

with the “…operates in some a-typical manner” was close enough to the truth that I just has to laugh out loud.

Care to nominate your favorite software project that “…operates in some a-typical manner?”

Update: Almost a year later I revisited the site to find:

Michael S. Gashler. Waffles: A machine learning toolkit. Journal of Machine Learning Research, MLOSS 12:2383-2387, July 2011. ISSN 1532-4435.


Random Forests

Sunday, December 26th, 2010

Random Forests Authors: Leo Breiman, Adele Cutler

The home site for Random Forest classification algorithm, with resources from its inventors, including the following philosophical note:

RF is an example of a tool that is useful in doing analyses of scientific data.

But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem.

Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.

I rather like that.

It is applicable to all the inferencing, machine learning, classification, and other tools you will see mentioned in this blog.

parf: Parallel Random Forest Algorithm

Saturday, December 25th, 2010

parf: Parallel Random Forest Algorithm

From the website:

The Random Forests algorithm is one of the best among the known classification algorithms, able to classify big quantities of data with great accuracy. Also, this algorithm is inherently parallelisable.

Originally, the algorithm was written in the programming language Fortran 77, which is obsolete and does not provide many of the capabilities of modern programming languages; also, the original code is not an example of “clear” programming, so it is very hard to employ in education. Within this project the program is adapted to Fortran 90. In contrast to Fortran 77, Fortran 90 is a structured programming language, legible — to researchers as well as to students.

The creator of the algorithm, Berkeley professor emeritus Leo Breiman, expressed a big interest in this idea in our correspondence. He has confirmed that no one has yet worked on a parallel implementation of his algorithm, and promised his support and help. Leo Breiman is one of the pioneers in the fields of machine learning and data mining, and a co-author of the first significant programs (CART – Classification and Regression Trees) in that field.

Well, while I was at I decided to look around for any resources that might interest topic mappers in the new year. This one caught my eye.

Not much apparent activity so this might be one where a volunteer or two could make a real difference.