GPM: A Graph Pattern Matching Kernel with Diffusion for Chemical Compound Classification
Authors: Aaron Smalter, Jun Huan and Gerald Lushington
Abstract:
Classifying chemical compounds is an active topic in drug design and other cheminformatics applications. Graphs are general tools for organizing information from heterogeneous sources and have been applied in modeling many kinds of biological data. With the fast accumulation of chemical structure data, building highly accurate predictive models for chemical graphs emerges as a new challenge.
In this paper, we demonstrate a novel technique called Graph Pattern Matching kernel (GPM). Our idea is to leverage existing ? frequent pattern discovery methods and explore their application to kernel classifiers (e.g. support vector machine) for graph classification. In our method, we first identify all frequent patterns from a graph database. We then map subgraphs to graphs in the database and use a diffusion process to label nodes in the graphs. Finally the kernel is computed using a set matching algorithm. We performed experiments on 16 chemical structure data sets and have compared our methods to other major graph kernels. The experimental results demonstrate excellent performance of our method.
The authors also note:
Publicly-available large-scale chemical compound databases have offered tremendous opportunities for creating highly efficient in silico drug design methods. Many machine learning and data mining algorithms have been applied to study the structure-activity relationship of chemicals with the goal of building classifiers for graph-structured data.
In other words, with a desktop machine, public data and a little imagination, you can make a fundamental contribution to drug design methods. (FWI, the pharmaceuticals are making money hand over fist.)
Integrating your contribution or its results into existing information, such as with topic maps, will only increase its value.