Machine learning for cancer classification – part 1 – preparing the data sets by Obi Griffith.
From the post:
I am planning a series of tutorials illustrating basic concepts and techniques for machine learning. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper I wrote. The gcrma step may require you to have as much as ~8 GB of RAM. I ran this tutorial on a Mac Pro (Snow Leopard) with R 3.0.2 installed. It should also work on Linux or Windows, but package installation might differ slightly. The first step is to prepare the data sets. We will use GSE2034 as a training data set and GSE2990 as a test data set. Both data sets use the Affymetrix U133A platform (GPL96). First, let’s download and pre-process the training data.
Assuming you are ready to move beyond Iris data sets for practicing machine learning, this would be a good place to start.
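To give a feel for what "download and pre-process" involves, here is a minimal sketch in R. It is not the code from the tutorial. The helper name `prepare_dataset`, the `cel` subdirectory, and the `<accession>_RAW.tar` file name are assumptions made for illustration, and the sketch presumes the Bioconductor packages GEOquery, affy, gcrma, and Biobase are installed and that the archive's CEL files can be read directly by `ReadAffy()`.

```r
# Accessions named in the post; both use the Affymetrix U133A platform (GPL96).
datasets <- c(train = "GSE2034", test = "GSE2990")

# Hypothetical helper (not from the original tutorial): fetch the raw CEL
# archive for one GEO series and return gcrma-processed expression values.
# Assumes GEOquery, affy, gcrma, and Biobase are installed from Bioconductor.
prepare_dataset <- function(accession, base_dir = getwd()) {
  # getGEOSuppFiles() saves supplementary files under base_dir/<accession>/
  GEOquery::getGEOSuppFiles(accession, baseDir = base_dir)

  # Assumption: the raw data arrive as <accession>_RAW.tar
  raw_dir  <- file.path(base_dir, accession)
  tar_file <- file.path(raw_dir, paste0(accession, "_RAW.tar"))
  cel_dir  <- file.path(raw_dir, "cel")
  utils::untar(tar_file, exdir = cel_dir)

  # Read every CEL file found in cel_dir into an AffyBatch object
  raw <- affy::ReadAffy(celfile.path = cel_dir)

  # gcrma() does background correction, normalization, and summarization;
  # this is the memory-hungry step the post warns about
  eset <- gcrma::gcrma(raw)

  # Matrix with probe sets in rows and samples in columns
  Biobase::exprs(eset)
}
```

Calling `prepare_dataset(datasets["train"])` would then produce the training matrix, and the same call with `datasets["test"]` the test matrix. Nothing is downloaded until the function is called, so the definitions above are safe to source first.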