Molpher: a software framework for systematic chemical space exploration by David Hoksza, Petr Škoda, Milan Voršilák and Daniel Svozil.
Abstract:
Background
Chemical space is virtual space occupied by all chemically meaningful organic compounds. It is an important concept in contemporary chemoinformatics research, and its systematic exploration is vital to the discovery of either novel drugs or new tools for chemical biology.
Results
In this paper, we describe Molpher, an open-source framework for the systematic exploration of chemical space. Through a process we term ‘molecular morphing’, Molpher produces a path of structurally-related compounds. This path is generated by the iterative application of so-called ‘morphing operators’ that represent simple structural changes, such as the addition or removal of an atom or a bond. Molpher incorporates an optimized parallel exploration algorithm, compound logging and a two-dimensional visualization of the exploration process. Its feature set can be easily extended by implementing additional morphing operators, chemical fingerprints, similarity measures and visualization methods. Molpher not only offers an intuitive graphical user interface, but also can be run in batch mode. This enables users to easily incorporate molecular morphing into their existing drug discovery pipelines.
Conclusions
Molpher is an open-source software framework for the design of virtual chemical libraries focused on a particular mechanistic class of compounds. These libraries, represented by a morphing path and its surroundings, provide valuable starting data for future in silico and in vitro experiments. Molpher is highly extensible and can be easily incorporated into any existing computational drug design pipeline.
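To make the idea of a "morphing path" concrete, here is a toy analogue in Python. It is not Molpher's algorithm and does not touch real chemistry: plain strings stand in for molecules, single-character edits stand in for morphing operators, and edit distance stands in for a structural distance measure. The function names (`edit_distance`, `morphs`, `morph_path`) are hypothetical, chosen only for this sketch.

```python
from typing import List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; a stand-in for a structural distance measure."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # remove ca
                            curr[j - 1] + 1,            # add cb
                            prev[j - 1] + (ca != cb)))  # replace ca with cb
        prev = curr
    return prev[-1]


def morphs(s: str, alphabet: str) -> Set[str]:
    """All strings one toy 'operator' away: add, remove, or replace one character."""
    out = set()
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])          # add a character
    for i in range(len(s)):
        out.add(s[:i] + s[i + 1:])              # remove a character
        for c in alphabet:
            out.add(s[:i] + c + s[i + 1:])      # replace a character
    out.discard(s)
    return out


def morph_path(source: str, target: str) -> List[str]:
    """Greedy path: at each step keep the morph closest to the target."""
    alphabet = "".join(sorted(set(source + target)))
    path = [source]
    while path[-1] != target:
        candidates = morphs(path[-1], alphabet)
        # Ties are broken alphabetically so the result is deterministic.
        path.append(min(candidates, key=lambda m: (edit_distance(m, target), m)))
    return path


if __name__ == "__main__":
    for step in morph_path("kitten", "sitting"):
        print(step)
```

With edit distance as the guide, some single edit always brings the current string one step closer to the target, so this greedy loop terminates. A real morphing run has no such guarantee, because the intermediate structures must also be chemically valid.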
Beyond its obvious importance for cheminformatics, this paper offers another example of “semantic impedance:”
While virtual chemical space is very large, only a small fraction of it has been reported in actual chemical databases so far. For example, PubChem contains data for 49.1 million chemical compounds [17] and Chemical Abstracts consists of over 84.3 million organic and inorganic substances [18] (numbers as of 12. 3. 2014). Thus, the navigation of chemical space is a very important area of chemoinformatics research [19,20]. Because chemical space is usually defined using various sets of descriptors [21], a major problem is the lack of invariance of chemical space [22,23]. Depending on the descriptors and distance measures used [24], different chemical spaces show different compound distributions. Unfortunately, no generally applicable representation of invariant chemical space has yet been reported [25].
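The lack of invariance is easy to see in miniature. The sketch below describes the same three compounds under two descriptor schemes and asks which compound is nearest to a query by Tanimoto similarity. The compounds, feature names, and both schemes are invented purely for illustration; they are not taken from the paper or from any real fingerprint.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def nearest(query: set, library: dict) -> str:
    """Name of the library compound most similar to the query."""
    return max(library, key=lambda name: tanimoto(query, library[name]))


# Hypothetical scheme 1: functional-group style features.
query_groups = {"hydroxyl", "aromatic_ring", "amine"}
library_groups = {
    "A": {"hydroxyl", "aromatic_ring"},
    "B": {"amine", "carboxyl", "ester"},
}

# Hypothetical scheme 2: binned size and shape features for the same compounds.
query_bins = {"mw_200_300", "rings_2", "hbd_2"}
library_bins = {
    "A": {"mw_100_200", "rings_1", "hbd_2"},
    "B": {"mw_200_300", "rings_2", "hbd_1"},
}

if __name__ == "__main__":
    print("nearest under scheme 1:", nearest(query_groups, library_groups))
    print("nearest under scheme 2:", nearest(query_bins, library_bins))
```

Under the first scheme compound A is the closest neighbour, under the second it is B. Nothing about the compounds changed, only the vocabulary used to describe them.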
OK, so how much further is there to go with these various descriptors?
The article describes estimates of the size of chemical space this way:
Chemical space is populated by all chemically meaningful and stable organic compounds [1-3]. It is an important concept in contemporary chemoinformatics research [4,5], and its exploration leads to the discovery of either novel drugs [2] or new tools for chemical biology [6,7]. It is agreed that chemical space is huge, but no accurate approximation of its size exists. Even if only drug-like molecules are taken into account, size estimates vary [8] between 10^23 [9] and 10^100 [10] compounds. However, smaller numbers have also been reported. For example, based on the growth of a number of organic compounds in chemical databases, Drew et al. [11] deduced the size of chemical space to be 3.4 × 10^9. By assigning all possible combinations of atomic species to the same three-dimensional geometry, Ogata et al. [12] estimated the size of chemical space to be between 10^8 and 10^19. Also, by analyzing known organic substituents, the size of accessible chemical space was assessed as between 10^20 and 10^24 [9].
Such estimates have been put into context by Reymond et al., who produced all molecules that can exist up to a certain number of heavy atoms in their Chemical Universe Databases: GDB-11 [13,14] (2.64 × 10^7 molecules with up to 11 heavy atoms); GDB-13 [15] (9.7 × 10^8 molecules with up to 13 heavy atoms); and GDB-17 [16] (1.7 × 10^11 compounds with up to 17 heavy atoms). The GDB-17 database was then used to approximate the number of possible drug-like molecules as 10^33 [8].
To give you an easy basis for comparison: possible drug-like molecules at 10^33, versus number of stars in galaxies in the observable universe at 10^24.
That’s an impressive number of possible drug-like molecules: 10^9 times more than the stars in the observable universe (est.).
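For anyone who wants to check the comparison, the arithmetic is a one-liner. The two inputs are the estimates quoted above (10^33 drug-like molecules, 10^24 stars); both are rough order-of-magnitude figures, so the result is too.

```python
drug_like_molecules = 10**33   # estimate quoted from the article [8]
stars_observable = 10**24      # rough star-count estimate used in this post

# Exact integer division: how many candidate molecules per star.
ratio = drug_like_molecules // stars_observable

if __name__ == "__main__":
    print(f"about 10^{len(str(ratio)) - 1} candidate molecules per star")
```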
I can’t imagine that having diverse descriptors is assisting in the search to complete the chemical space. And from the description, it doesn’t sound like semantic convergence is on the horizon.
Mapping between the existing systems would be a major undertaking, but the longer exploration goes on without such a mapping, the worse the problem will get.