Archive for the ‘Random Forests’ Category

My Intro to Multiple Classification…

Saturday, December 29th, 2012

My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis

From the post:

After the work I did for my last post, I wanted to practice doing multiple classification. I first thought of using the famous iris dataset, but felt that was a little boring. Ideally, I wanted to look for a practice dataset where I could successfully classify data using both categorical and numeric predictors. Unfortunately it was tough for me to find such a dataset that was easy enough for me to understand.

The dataset I use in this post comes from a textbook called Analyzing Categorical Data by Jeffrey S Simonoff, and lends itself to basically the same kind of analysis done by blogger “Wingfeet” in his post predicting authorship of Wheel of Time books. In this case, the dataset contains counts of stop words (function words in English, such as “as”, “also, “even”, etc.) in chapters, or scenes, from books or plays written by Jane Austen, Jack London (I’m not sure if “London” in the dataset might actually refer to another author), John Milton, and William Shakespeare. Being a textbook example, you just know there’s something worth analyzing in it!! The following table describes the numerical breakdown of books and chapters from each author:

Introduction to authorship studies as they were known (may still be) in the academic circles of my youth.

I wonder if the same techniques are as viable today as on the Federalist Papers?

The Wheel of Time example demonstrates the technique remains viable for novel authors.

But what about authorship more broadly?

Can we reliably distinguish between news commentary from multiple sources?

Or between statements by elected officials?

How would your topic map represent purported authorship versus attributed authorship?

Or even a common authorship for multiple purported authors? (speech writers)

The Rewards of Ignoring Data

Monday, December 17th, 2012

The Rewards of Ignoring Data by Charles Parker.

From the post:

Can you make smarter decisions by ignoring data? It certainly runs counter to our mission, and sounds a little like an Orwellean dystopia. But as we’re going to see, ignoring some of your data some of the time can be a very useful thing to do.

Charlie does an excellent job of introducing the use of multiple models of data and includes deeper material:

There are fairly deep mathematical reasons for this, and ML scientist par excellence Robert Shapire lays out one of the most important arguments in the landmark paper “The Strength of Weak Learnability” in which he proves that a machine learning algorithm that performs only slightly better than randomly can be “boosted” into a classifier that is able to learn to an arbitrary degree of accuracy. For this incredible contribution (and for the later paper that gave us the Adaboost algorithm), he and his colleague Yoav Freund earned the Gödel Prize for computer science theory, the only time the award has been given for a machine learning paper.

Not being satisfied, Charles demonstrates how you can create a random decision forest from your data.

Which is possible without reading the deeper material.

Random Forest Methodology – Bioinformatics

Friday, October 19th, 2012

Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics by Anne-Laure Boulesteix, Silke Janitza, Jochen Kruppa, Inke R. König

(Boulesteix, A.-L., Janitza, S., Kruppa, J. and König, I. R. (2012), Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. WIREs Data Mining Knowl Discov, 2: 493–507. doi: 10.1002/widm.1072)

Abstract:

The random forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables and return measures of variable importance. This paper synthesizes 10 years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is paid to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research.

Something to expand your horizons a bit.

And a new way to say “curse of dimensionality,” to-wit,

‘n ≪ p curse’

New to me anyway.

I was amused to read at the Wikipedia article on random forests that its disadvantages include:

Unlike decision trees, the classifications made by Random Forests are difficult for humans to interpret.

Turn about is fair play since many classifications made by humans are difficult for computers to interpret. ;-)

Orange

Monday, December 27th, 2010

Orange

From the website:

Open source data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting. Components for machine learning. Extensions for bioinformatics and text mining. Packed with features for data analytics.

I had to look at the merge data widget.

Which is said to: Merges two data sets based on the values of selected attributes.

According to the documentation:

Merge Data widget is used to horizontally merge two data sets based on the values of selected attributes. On input, two data sets are required, A and B. The widget allows for selection of an attribute from each domain which will be used to perform the merging. When selected, the widget produces two outputs, A+B and B+A. The first output (A+B) corresponds to instances from input data A which are appended attributes from B, and the second output (B+A) to instances from B which are appended attributes from A.

The merging is done by the values of the selected (merging) attributes. For example, instances from from A+B are constructed in the following way. First, the value of the merging attribute from A is taken and instances from B are searched with matching values of the merging attributes. If more than a single instance from B is found, the first one is taken and horizontally merged with the instance from A. If no instance from B match the criterium, the unknown values are assigned to the appended attributes. Similarly, B+A is constructed.

Which illustrates the problem that topic maps solves rather neatly:

  1. How does a subsequent researcher reliably duplicate such a merger?
  2. How does a subsequent researcher reliably merge that data with other data?
  3. How do other researchers reliably merge that data with their own data?

Answer is: They can’t. Not enough information.

Question: How would you change the outcome for those three questions? In detail. (5-7 pages, citations)

Waffles

Sunday, December 26th, 2010

Waffles Authors: Mike Gashler

From the website:

Waffles is a collection of command-line tools for performing machine learning tasks. These tools are divided into 4 script-friendly apps:

waffles_learn contains tools for supervised learning.
waffles_transform contains tools for manipulating data.
waffles_plot contains tools for visualizing data.
waffles_generate contains tools to generate certain types of data.

For people who prefer not to have to remember commands, waffles also includes a graphical tool called

waffles_wizard

which guides the user to generate a command that will perform the desired task.

While exploring the site I looked at the demo applications and:

At some point, it seems, almost every scholar has an idea for starting a new journal that operates in some a-typical manner. This demo is a framework for the back-end of an on-line journal, to help get you started.

with the “…operates in some a-typical manner” was close enough to the truth that I just has to laugh out loud.

Care to nominate your favorite software project that “…operates in some a-typical manner?”


Update: Almost a year later I revisited the site to find:

Michael S. Gashler. Waffles: A machine learning toolkit. Journal of Machine Learning Research, MLOSS 12:2383-2387, July 2011. ISSN 1532-4435.

Enjoy!

Random Forests

Sunday, December 26th, 2010

Random Forests Authors: Leo Breiman, Adele Cutler

The home site for Random Forest classification algorithm, with resources from its inventors, including the following philosophical note:

RF is an example of a tool that is useful in doing analyses of scientific data.

But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem.

Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.

I rather like that.

It is applicable to all the inferencing, machine learning, classification, and other tools you will see mentioned in this blog.

parf: Parallel Random Forest Algorithm

Saturday, December 25th, 2010

parf: Parallel Random Forest Algorithm

From the website:

The Random Forests algorithm is one of the best among the known classification algorithms, able to classify big quantities of data with great accuracy. Also, this algorithm is inherently parallelisable.

Originally, the algorithm was written in the programming language Fortran 77, which is obsolete and does not provide many of the capabilities of modern programming languages; also, the original code is not an example of “clear” programming, so it is very hard to employ in education. Within this project the program is adapted to Fortran 90. In contrast to Fortran 77, Fortran 90 is a structured programming language, legible — to researchers as well as to students.

The creator of the algorithm, Berkeley professor emeritus Leo Breiman, expressed a big interest in this idea in our correspondence. He has confirmed that no one has yet worked on a parallel implementation of his algorithm, and promised his support and help. Leo Breiman is one of the pioneers in the fields of machine learning and data mining, and a co-author of the first significant programs (CART – Classification and Regression Trees) in that field.

Well, while I was at code.google.com I decided to look around for any resources that might interest topic mappers in the new year. This one caught my eye.

Not much apparent activity so this might be one where a volunteer or two could make a real difference.