RMol: A Toolset for Transforming SD/Molfile structure information into R Objects by Martin Grabner, Kurt Varmuza and Matthias Dehmer.
Abstract:
Background
The graph-theoretical analysis of molecular networks has a long tradition in chemoinformatics. As demonstrated frequently, a well designed format to encode chemical structures and structure-related information of organic compounds is the Molfile format. But when it comes to use modern programming languages for statistical data analysis in Bio- and Chemoinformatics, R as one of the most powerful free languages lacks tools to process R Molfile data collections and import molecular network data into R.
Results
We design an R object which allows a lossless information mapping of structural information from Molfiles into R objects. This provides the basis to use the RMol object as an anchor for connecting Molfile data collections with R libraries for analyzing graphs. Associated with the RMol objects, a set of R functions completes the toolset to organize, describe and manipulate the converted data sets. Further, we bypass R-typical limits for manipulating large data sets by storing R objects in bz-compressed serialized files instead of employing RData files.
Conclusions
By design, RMol is a R tool set without dependencies to other libraries or programming languages. It is useful to integrate into pipelines for serialized batch analysis by using network data and, therefore, helps to process sdf-data sets in R efficiently. It is freely available under the BSD licence. The script source can be downloaded from http://sourceforge.net/p/rmol-toolset.
Important work, not least because of the explosion of interest in bio/cheminformatics.
If I understand the rationale for the software, it:
- enables use of existing R tools for graph/network analysis
- fits well into workflows with serialized pipelines
- reduces dependencies by extracting SD-File information
- avoids repetitive transformations by storing chemical and molecular network information in R objects
All of which is true, but I have a nagging concern about the need for transformation.
Knowing the structure of Molfiles and the requirements of R tools for graph/network analysis, how are the results of transformation different from R tools viewing Molfiles “as if” they were composed of R objects?
The mapping is already well known because that is what RMol uses to create the results of transformation. Moreover, for any particular use, more data may be transformed than is required for the analysis at hand.
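To make the "as if" idea concrete, here is a minimal sketch of a view over a Molfile, written in Python rather than R purely for illustration. `MolfileView` and its methods are hypothetical names of mine, not part of RMol, and the sketch assumes a well-formed V2000 Molfile with the usual fixed-width columns. Only the counts line is read up front; the atom and bond blocks are interpreted when a caller asks for them, so a graph analysis that needs only connectivity never touches coordinates or element symbols.

```python
class MolfileView:
    """Read-only view over a V2000 Molfile (hypothetical sketch, not RMol)."""

    def __init__(self, text):
        self._lines = text.splitlines()
        counts = self._lines[3]            # counts line is the 4th line
        self.n_atoms = int(counts[0:3])    # columns 1-3: number of atoms
        self.n_bonds = int(counts[3:6])    # columns 4-6: number of bonds

    def atoms(self):
        """Yield element symbols from the atom block, one line at a time."""
        for line in self._lines[4:4 + self.n_atoms]:
            yield line[31:34].strip()      # columns 32-34: atom symbol

    def bonds(self):
        """Yield (first atom, second atom, bond type) from the bond block."""
        start = 4 + self.n_atoms
        for line in self._lines[start:start + self.n_bonds]:
            yield int(line[0:3]), int(line[3:6]), int(line[6:9])

    def adjacency(self):
        """Build an undirected adjacency map using the bond block only."""
        adj = {i: set() for i in range(1, self.n_atoms + 1)}
        for a, b, _bond_type in self.bonds():
            adj[a].add(b)
            adj[b].add(a)
        return adj


# Illustrative data: heavy atoms of ethanol only, hydrogens omitted.
ETHANOL = "\n".join([
    "ethanol",
    "  hypothetical-sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2000    1.2000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])

view = MolfileView(ETHANOL)
print(view.adjacency())      # connectivity, derived without reading atom lines
print(list(view.atoms()))    # symbols, read only because we asked for them
```

Nothing here is stored in a second format: the Molfile stays the only copy of the data, and the mapping is applied at the moment of use.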
Not to take anything away from very useful work but the days of transformation of data are numbered. As data sets grow in size, there will be fewer and fewer places to store a “transformed” data set.
BTW, pay particular attention to the bibliography in this paper. Numerous references to follow if you are interested in this area.