A Software Framework for Building Biomedical Machine Learning Classifiers through Grid Computing Resources by Raul Pollán, Miguel Angel Guevara Lopez and Eugénio da Costa Oliveira.
Abstract:
This paper describes the BiomedTK software framework, created to perform massive explorations of machine learning classifiers configurations for biomedical data analysis over distributed Grid computing resources. BiomedTK integrates ROC analysis throughout the complete classifier construction process and enables explorations of large parameter sweeps for training third party classifiers such as artificial neural networks and support vector machines, offering the capability to harness the vast amount of computing power serviced by Grid infrastructures. In addition, it includes classifiers modified by the authors for ROC optimization and functionality to build ensemble classifiers and manipulate datasets (import/export, extract and transform data, etc.). BiomedTK was experimentally validated by training thousands of classifier configurations for representative biomedical UCI datasets reaching in little time classification levels comparable to those reported in existing literature. The comprehensive method herewith presented represents an improvement to biomedical data analysis in both methodology and potential reach of machine learning based experimentation.
I recommend a close reading of the article, but the concluding lines caught my eye:
…tuning classifier parameters is mostly a heuristic task, not existing rules providing knowledge about what parameters to choose when training a classifier. Through BiomedTK we are gathering data about performance of many classifiers, trained each one with different parameters, ANNs, SVM, etc. This by itself constitutes a dataset that can be data mined to understand what set of parameters yield better classifiers for given situations or even generally. Therefore, we intend to use BiomedTK on this bulk of classifier data to gain insight on classifier parameter tuning.
The dataset about training classifiers may be as important as, if not more important than, the framework's use in harnessing Grid computing resources for biomedical analysis. I look forward to reports on that dataset.
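To make the idea of a minable "classifier dataset" concrete, here is a minimal sketch of my own, not BiomedTK's API. It sweeps a small parameter grid, scores each configuration by ROC area, and keeps one record per trained configuration. Everything in it is an assumption for illustration: the toy nearest-neighbour scorer stands in for the ANNs and SVMs the authors train, the synthetic data stands in for a UCI dataset, and the names `sweep`, `knn_score`, `make_data` and `weight_noise` are hypothetical.

```python
import itertools
import random
from statistics import mean


def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_data(n, rng):
    """Synthetic two-class data (an assumption, not a UCI dataset):
    feature 0 is informative, feature 1 is pure noise."""
    rows = []
    for i in range(n):
        y = i % 2  # alternate labels so both classes are always present
        rows.append(((rng.gauss(y, 1.0), rng.gauss(0.0, 1.0)), y))
    return rows


def knn_score(train, x, k, weight_noise):
    """Toy scorer: share of positive labels among the k nearest
    training points, with a tunable weight on the noise feature."""
    def dist(a, b):
        return (a[0] - b[0]) ** 2 + weight_noise * (a[1] - b[1]) ** 2
    nearest = sorted(train, key=lambda row: dist(row[0], x))[:k]
    return sum(y for _, y in nearest) / k


def sweep(grid, train, test):
    """Evaluate every parameter combination; return one record each."""
    names = list(grid)
    labels = [y for _, y in test]
    records = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [knn_score(train, x, **params) for x, _ in test]
        records.append({**params, "auc": auc(scores, labels)})
    return records


rng = random.Random(0)
train, test = make_data(200, rng), make_data(100, rng)
grid = {"k": [1, 5, 15], "weight_noise": [0.0, 1.0, 4.0]}
results = sweep(grid, train, test)

# "Mine" the sweep records: average AUC for each value of each parameter.
for name in grid:
    for value in grid[name]:
        avg = mean(r["auc"] for r in results if r[name] == value)
        print(f"{name}={value}: mean AUC {avg:.3f}")
```

The `results` list is the interesting by-product: each row pairs a parameter setting with its ROC performance, which is the kind of record the authors propose to data mine. On a Grid, each iteration of the loop in `sweep` would presumably run as an independent job, since no configuration depends on another.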