Archive for the ‘Cheminformatics’ Category

Computational Data Analysis Workflow Systems

Friday, October 6th, 2017

Computational Data Analysis Workflow Systems

An incomplete list of existing workflow systems. As of today, approximately 17:00 EST, 173 systems in no particular order.

I first saw this mentioned in a tweet by Michael R. Crusoe.

One of the many resources found at: Common Workflow Language.

From the webpage:

The Common Workflow Language (CWL) is a specification for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. CWL is designed to meet the needs of data-intensive science, such as Bioinformatics, Medical Imaging, Astronomy, Physics, and Chemistry.

You should take a quick look at: Common Workflow Language User Guide to get a feel for CWL.

Try to avoid thinking of CWL as “documenting” your workflow if that is an impediment to using it. Documentation is a side effect, but its main purpose is to make you more effective.
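To get a feel for the shape of a CWL description, here is a minimal sketch of a CommandLineTool in the style of the User Guide examples. It simply wraps `wc -l`; the input and file names are illustrative:

```yaml
# Minimal CWL tool description (illustrative): wrap `wc -l` so the
# line count of an input file becomes a portable workflow step.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs:
  line_count:
    type: stdout
stdout: line_count.txt
```

Saved as, say, `count_lines.cwl`, it runs with any conformant runner, e.g. `cwl-runner count_lines.cwl --input_file data.txt`.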

720 Thousand Chemicals – Chemistry Dashboard

Tuesday, April 5th, 2016

Chemistry Dashboard – 720 Thousand Chemicals

Beta test of a Google-like search interface by the United States Environmental Protection Agency on chemical data.

Search results return “Intrinsic Properties,” “Structural Identifiers,” and a “Citation” for your search to the right of a molecular diagram of the object of your search.

A series of tabs run across the page offering, “Chemical Properties,” “External Links,” “Synonyms,” “PubChem Biological Activities,” “PubChem Articles,” “PubChem Patents,” and “Comments.”

An Advanced Search option is offered as well. (Think of it as identifying a subject by its properties.)

The about page has this description with additional links and a pointer to a feedback form for comments:

The interactive Chemical Safety for Sustainability Chemistry Dashboard (the iCSS chemistry dashboard) is a part of a suite of databases and web applications developed by the US Environmental Protection Agency’s Chemical Safety for Sustainability Research Program. These databases and apps support EPA’s computational toxicology research efforts to develop innovative methods to change how chemicals are currently evaluated for potential health risks. EPA researchers integrate advances in biology, biotechnology, chemistry, and computer science to identify important biological processes that may be disrupted by the chemicals. The combined information helps prioritize chemicals based on potential health risks. Using computational toxicology research methods, thousands of chemicals can be evaluated for potential risk at small cost in a very short amount of time.

The iCSS chemistry dashboard is the public chemistry resource for these computational toxicology research efforts and it supports improved predictive toxicology. It provides access to data associated with over 700,000 chemicals. A distinguishing feature of the chemistry dashboard is the mapping of curated physicochemical property data associated with chemical substances to their corresponding chemical structures. The chemical dashboard is searchable by various chemical identifiers including CAS Registry Numbers, systematic and common names, and InChIKeys. Millions of predicted physchem properties developed using machine-learning approaches modeling highly curated datasets are also mapped to chemicals within the dashboard.

The data in the dashboard are of varying quality with the highest quality data being assembled by the DSSTox Program. The majority of the chemical structures within the database have been compiled from public sources, such as PubChem, and have varying levels of reliability and accuracy. Links to over twenty external resources are provided. These include other dashboard apps developed by EPA and other agency, interagency and public databases containing data of interest to environmental chemists. It also integrates chemistry linkages across other EPA dashboards and chemistry resources such as ACToR, ToxCast, EDSP21 and CPCat. Expansion, curation and validation of the content is ongoing.

The iCSS Chemistry Dashboard also takes advantage of a number of Open Source widgets and tools. These include the developers of the JSMol 3D display widget and the PubChem widgets for Bioactivities, Articles and Patents, and we are grateful to these developers for their contributions. Should you use the iCSS Chemistry Dashboard to source information and data of value, please cite the app using the URL. For a particular chemical, the specific citation can be obtained on the page under the Citation tab.

An excellent example of how curating a data resource and linking it to other data resources is a general benefit to everyone.

I first saw this in a tweet by ChemConnector.

Digital Data Repositories in Chemistry…

Wednesday, July 1st, 2015

Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks by Matthew J. Harvey, Nicholas J. Mason, Henry S. Rzepa.


We discuss the concept of recasting the data-rich scientific journal article into two components, a narrative and separate data components, each of which is assigned a persistent digital object identifier. Doing so allows each of these components to exist in an environment optimized for purpose. We make use of a poorly-known feature of the handle system for assigning persistent identifiers that allows an individual data file from a larger file set to be retrieved according to its file name or its MIME type. The data objects allow facile visualization and retrieval for reuse of the data and facilitate other operations such as data mining. Examples from five recently published articles illustrate these concepts.

A very promising effort to integrate published content and electronic notebooks in chemistry. Encouraging that in addition to the technical and identity issues the authors also point out the lack of incentives for the extra work required to achieve useful integration.

Everyone agrees that deeper integration of resources in the sciences will be a game-changer, but renewing the realization that there is no such thing as a free lunch is an important step towards that goal.

This article easily repays a close read with interesting subject identity issues and the potential that topic maps would offer to such an effort.

ChemistryWorld Podcasts: Compounds (Phosgene)

Monday, June 29th, 2015

Chemistry in its elements: Compounds is a weekly podcast sponsored by ChemistryWorld, which features a chemical compound or group of compounds every week.

Matthew Gunter has a podcast entitled: Phosgene.

In case your recent history is a bit rusty, phosgene was one of the terror weapons of World War I. It accounted for 85% of the 100,000 deaths from chemical gas. Not as effective as, say, sarin, but no slouch.

Don’t run to the library, online guides or the FBI for recipes to make phosgene at home. Its use in industrial applications should give you a clue as to an alternative to home-made phosgene. Use of phosgene violates the laws of war, so being a thief as well should not trouble you.

No, I don’t have a list of locations that make or use phosgene, but then DHS probably doesn’t either. They are more concerned with terrorists using “nuclear weapons” or “gamma-ray bursts“. One is mechanically and technically difficult to do well and the other is impossible to control.

The idea of someone using a dual-wheel pickup and a plant pass to pick up and deliver phosgene gas is too simple to have occurred to them.

If you are pitching topic maps to a science/chemistry oriented audience, these podcasts make a nice starting point for expansion. To date there are two hundred and forty-two (242) of them.


Gathering, Extracting, Analyzing Chemistry Datasets

Wednesday, April 22nd, 2015

Activities at the Royal Society of Chemistry to gather, extract and analyze big datasets in chemistry by Antony Williams.

If you are looking for a quick summary of efforts to combine existing knowledge resources in chemistry, you can do far worse than Antony’s 118 slides on the subject (2015).

I want to call special attention to Slide 107 in his slide deck:


True enough, extraction is problematic, expensive, inaccurate, etc., all the things Antony describes. And I would strongly second all of what he implies is the better practice.

However, extraction isn’t just a necessity for today or for a few years; extraction is going to be necessary so long as we keep records about chemistry or any other subject.

Think about all the legacy materials on chemistry that exist in hard copy format just for the past two centuries, to say nothing of all the still older materials. It is more than unfortunate to abandon all that information simply because “modern” digital formats are easier to manipulate.

That wasn’t what Antony meant to imply, but even after all materials have been extracted and exist in some form of digital format, that doesn’t mean the era of “extraction” will have ended.

You may not remember when atomic chemistry used “punch cards” to record isotopes:


An isotope file on punched cards. George M. Murphy J. Chem. Educ., 1947, 24 (11), p 556 DOI: 10.1021/ed024p556 Publication Date: November 1947.

Today we would represent that record in…NoSQL?
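A sketch of how such a punched-card record might land in a document store as JSON. The field names below are my guesses at the card's contents, not Murphy's 1947 layout:

```python
import json

# Hypothetical document-store record for one isotope "card".
# Field names are illustrative, not the original card layout.
card = {
    "element": "H",
    "mass_number": 2,
    "atomic_number": 1,
    "abundance_percent": 0.0156,
    "stable": True,
    "source": "Murphy, J. Chem. Educ. 24 (11), 1947",
}

# Serialize to a JSON document and read it back: the record
# survives the round trip intact.
doc = json.dumps(card, sort_keys=True)
restored = json.loads(doc)
print(restored["mass_number"])  # → 2
```

Whether a JSON document will be any more readable in sixty-eight years than a punched card is today is, of course, exactly the question.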

Are you confident that in another sixty-eight (68) years we will still be using NoSQL?

We have to choose from the choices available to us today, but we should not deceive ourselves into thinking our solution will be seen as the “best” solution in the future. New data will be discovered, new processes invented, new requirements will emerge, all of which will be clamoring for a “new” solution.

Extraction will persist as long as we keep recording information in the face of changing formats and requirements. We can improve that process but I don’t think we will ever completely avoid it.

Chemical databases: curation or integration by user-defined equivalence?

Monday, March 16th, 2015

Chemical databases: curation or integration by user-defined equivalence? by Anne Hersey, Jon Chambers, Louisa Bellis, A. Patrícia Bento, Anna Gaulton, John P. Overington.


There is a wealth of valuable chemical information in publicly available databases for use by scientists undertaking drug discovery. However finite curation resource, limitations of chemical structure software and differences in individual database applications mean that exact chemical structure equivalence between databases is unlikely to ever be a reality. The ability to identify compound equivalence has been made significantly easier by the use of the International Chemical Identifier (InChI), a non-proprietary line-notation for describing a chemical structure. More importantly, advances in methods to identify compounds that are the same at various levels of similarity, such as those containing the same parent component or having the same connectivity, are now enabling related compounds to be linked between databases where the structure matches are not exact.

The authors identify a number of reasons why databases of chemical structures record different structures for the same chemicals. One problem is that there is no authoritative source for chemical structures, so upon publication authors report the aspects most relevant to their interests, or publish images rather than machine-readable representations of a chemical. To say nothing of the usual antics with simple names and their confusions. Software limitations, business rules and other factors are further sources of a multiplicity of chemical structures.

Suffice it to say that the authors make a strong case for why there are multiple structures for any given chemical now and why that is going to continue.
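One mechanical fact the abstract relies on: the first 14-character block of an InChIKey hashes only the connectivity (skeleton) layer, so entries sharing that block match at the “same connectivity” level even when stereochemistry differs. A minimal sketch of matching at that level; the database IDs are invented, while the InChIKeys are the published ones for L-alanine, D-alanine and glycine:

```python
from collections import defaultdict

# Group database entries by the first block of their InChIKey.
# That 14-character block hashes the connectivity layer, so
# entries sharing it have the same skeleton even when the
# stereo/charge layers (the second block) differ.
entries = [
    ("db1:alanine-L", "QNAYBMKLOCPYGJ-REOHCLBHSA-N"),
    ("db2:alanine-D", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"),
    ("db1:glycine",   "DHMQDGOQFOQNFH-UHFFFAOYSA-N"),
]

by_skeleton = defaultdict(list)
for entry_id, inchikey in entries:
    skeleton = inchikey.split("-")[0]  # first 14-character block
    by_skeleton[skeleton].append(entry_id)

# Entries that are the same compound at the connectivity level:
related = {k: v for k, v in by_skeleton.items() if len(v) > 1}
print(related)
```

The same grouping on the full InChIKey would keep the two alanines apart, which is the “various levels of similarity” the authors describe.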

The authors openly ask if it is time to enlist users’ assistance in mapping this diversity of structures:

Is it now time to accept that however diligent database providers are, there will always be differences in structure representations and indeed some errors in the structures that cannot be fixed with a realistic level of resource? Should we therefore turn our attention to encouraging the use and development of tools that enable the mapping together of related compounds rather than concentrate our efforts on ever more curation?

You know my answer to that question.

What’s yours?

I first saw this in a tweet by John P. Overington.

10 Chemistry Blogs You Should Read

Tuesday, January 20th, 2015

10 Chemistry Blogs You Should Read by Aaron Oneal.

If you are looking for reading in chemistry, Aaron has assembled ten very high quality blogs for you to follow. Each is listed with a short description so you can tune the reading to your taste.

Personally I recommend taking a sip from each one. It is rare that I read a really good blog and don’t find something of interest, often relevant to other projects, that I would not have seen otherwise.

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

Friday, October 10th, 2014

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining by Saber A. Akhondi, et al. (Published: September 30, 2014 DOI: 10.1371/journal.pone.0107477)


Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at
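Inter-annotator agreement for entity annotation of this kind is commonly scored as pairwise F1 over the annotated spans, treating one annotator as the reference. A generic sketch of that calculation; the spans are invented and this is not necessarily the paper's exact scoring method:

```python
# Pairwise F1 between two annotators' entity annotations, with
# each annotation a (start, end, label) span. A generic sketch of
# inter-annotator agreement scoring, not the paper's exact method.
def pairwise_f1(annots_a, annots_b):
    a, b = set(annots_a), set(annots_b)
    if not a and not b:
        return 1.0  # trivially agree on an empty document
    tp = len(a & b)  # spans both annotators marked identically
    precision = tp / len(b) if b else 0.0
    recall = tp / len(a) if a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ann1 = {(0, 7, "chemical"), (12, 20, "disease"), (33, 40, "target")}
ann2 = {(0, 7, "chemical"), (12, 20, "disease"), (50, 55, "target")}
print(round(pairwise_f1(ann1, ann2), 3))  # → 0.667
```

With three or more annotator groups, as for the harmonized subset of 47 patents, such pairwise scores are typically averaged over all pairs.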

Highly recommended, both as a “gold standard” for chemical patent text mining and as the state of the art in developing such a standard.

To say nothing of annotation as a means of automatic creation of topic maps where entities are imbued with subject identity properties.

I first saw this in a tweet by ChemConnector.

A document classifier for medicinal chemistry publications trained on the ChEMBL corpus

Tuesday, September 9th, 2014

A document classifier for medicinal chemistry publications trained on the ChEMBL corpus by George Papadatos, et al. (Journal of Cheminformatics 2014, 6:40)



The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology make the ChEMBL corpus a unique resource for text mining.


The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9) respectively. Both workflows and models are freely available at: webcite. These can be readily modified to include additional keyword constraints to further focus searches.


Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.

While the abstract mentions “the triage process,” it fails to capture the main goal of this paper:

…the main goal of our project diverges from the goal of the tools mentioned. We aim to meet the following criteria: ranking and prioritising the relevant literature using a fast and high performance algorithm, with a generic methodology applicable to other domains and not necessarily related to chemistry and drug discovery. In this regard, we present a method that builds upon the manually collated and curated ChEMBL document corpus, in order to train a Bag-of-Words (BoW) document classifier.

In more detail, we have employed two established classification methods, namely Naïve Bayesian (NB) and Random Forest (RF) approaches [12]-[14]. The resulting classification score, henceforth referred to as ‘ChEMBL-likeness’, is used to prioritise relevant documents for data extraction and curation during the triage process.

In other words, the focus of this paper is a classifier to help prioritize curation of papers. I take that as being different from classifiers used at other stages or for other purposes in the curation process.
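The Bag-of-Words Naïve Bayesian side of the approach can be sketched in a few lines. The training snippets, vocabulary and class names below are toy data standing in for the ChEMBL corpus, not the authors' trained model:

```python
import math
from collections import Counter

# Toy Bag-of-Words Naive Bayes sketch of "ChEMBL-likeness":
# score documents by how likely they are to be relevant for
# curation. Training data is invented, not the ChEMBL corpus.
train = [
    ("ic50 inhibitor binding assay potency", "chembl_like"),
    ("compound bioactivity dose response", "chembl_like"),
    ("volcano eruption lava geology", "other"),
    ("sediment rocks lava geology survey", "other"),
]

classes = {"chembl_like", "other"}
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in classes for w in word_counts[c]}

def log_posterior(text, label):
    # log P(label) + sum of log P(word | label), Laplace-smoothed
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for w in text.split():
        score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return score

def classify(text):
    return max(classes, key=lambda c: log_posterior(text, c))

print(classify("bioactivity assay for a new inhibitor"))
```

In the paper's setting the posterior score itself, not just the winning class, is what drives the ranking during triage.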

I first saw this in a tweet by ChemConnector.

InChI identifier

Monday, August 11th, 2014

How the InChI identifier is used to underpin our online chemistry databases at Royal Society of Chemistry by Antony Williams.


The Royal Society of Chemistry hosts a growing collection of online chemistry content. For much of our work the InChI identifier is an important component underpinning our projects. This enables the integration of chemical compounds with our archive of scientific publications, the delivery of a reaction database containing millions of reactions as well as a chemical validation and standardization platform developed to help improve the quality of structural representations on the internet. The InChI has been a fundamental part of each of our projects and has been pivotal in our support of international projects such as the Open PHACTS semantic web project integrating chemistry and biology data and the PharmaSea project focused on identifying novel chemical components from the ocean with the intention of identifying new antibiotics. This presentation will provide an overview of the importance of InChI in the development of many of our eScience platforms and how we have used it to provide integration across hundreds of websites and chemistry databases across the web. We will discuss how we are now expanding our efforts to develop a platform encompassing efforts in Open Source Drug Discovery and the support of data management for neglected diseases.

Although I have seen more than one of Antony’s slide decks, there is information herein that bears repeating, and some news as well.

InChI identifiers are chemical identifiers based on the chemical structure of a substance. They are not designed to replace current identifiers but rather to act as lynchpins that enable the mapping of other names together against a known chemical structure. (The IUPAC International Chemical Identifier (InChI))
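A sketch of that lynchpin role: records from two hypothetical databases that use different names resolve to the same structure when joined on the InChIKey. The database contents are invented; the keys are the published ones for aspirin and caffeine:

```python
# InChIKey as a lynchpin: join records from different databases
# so that different names for one structure come together.
# Hypothetical records; the InChIKeys are real.
db_a = [
    {"name": "acetylsalicylic acid",
     "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"name": "caffeine",
     "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"},
]
db_b = [
    {"name": "aspirin",
     "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
]

# Accumulate every name seen for a given structure.
synonyms = {}
for record in db_a + db_b:
    synonyms.setdefault(record["inchikey"], set()).add(record["name"])

print(synonyms["BSYNRYMUTXBXSQ-UHFFFAOYSA-N"])
```

Neither database had to change its preferred name; the structure-derived key does the mapping.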

Antony says at slide #31 that all 21st-century articles (100K) have been processed, and he is not shy about pointing out known problems in existing data.

I regret not seeing the presentation but the slides left me with a distinctly positive feeling about progress in this area.


Thursday, May 8th, 2014


Since I just urged you to read/study philosophy and the humanities, it’s only fair that I mention ProfessorDaveatYork and his chemistry videos.

David is an enthusiast, to say the least. Which is understandable given the following from his background:

We are exploring one of the most exciting frontiers of modern chemistry – the nanoworld. Nanotechnology, the development of systems between 1 and 100 nm in size, seems impossibly tiny to the average person. However, for chemists, used to manipulating bonds just 0.1 nm long, the nanoworld is a large space, which requires new synthetic strategies.

Nanochemistry – the synthesis and study of nanoscale architectures, is therefore a fundamental part of nanotechnology. Applications of nanotechnology are completely dependent on the new objects which chemists can generate. Our approach uses non-covalent interactions between molecules – ‘supramolecular chemistry’ – in order to allow simple molecular-scale building blocks to spontaneously self-assemble into nanostructures. Self-assembly is a simple and powerful approach to constructing the nanoworld which allows us to generate a wide variety of systems, with applications ranging from nanomaterials to nanomedicine.

Hard to not be excited when you are on the cutting edge!

Chemistry, or more precisely cheminformatics, is no stranger to the name/identifier problems found elsewhere in science, business, government, etc.

Enjoyable lectures that can refresh or build your chemistry basics.

Dave has recently passed 400,000 views on his channel. What say we help him on his way to 500,000? (Enjoyable lectures not being all that common, we should encourage them whenever possible.)

On InChI and evaluating the quality of cross-reference links

Saturday, April 19th, 2014

On InChI and evaluating the quality of cross-reference links by Jakub Galgonek and Jiří Vondrášek. (Journal of Cheminformatics 2014, 6:15 doi:10.1186/1758-2946-6-15)



There are many databases of small molecules focused on different aspects of research and its applications. Some tasks may require integration of information from various databases. However, determining which entries from different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in databases. This study employs well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones.


We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by indistinctness in interpretation of the structural data used. InChI identifiers were used successfully to find duplicate entries in databases. We found that the InChI inconsistencies in the manually curated links are very high (8.5% in the worst case). Even using a weaker definition of consistency, the measured values were very high in general. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links.


We observed several problems with the InChI tools and the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links if they are measured using InChI identifiers. However, inconsistency can be caused both by errors in manually curated links and the inherent limitations of the InChI method.

Another use case for topic maps, don’t you think?

Rather than a mapping keyed on recognition of a single identifier, have the mapping keyed to the recognition of several key/value pairs.

I don’t think there is an abstract answer as to the optimum number of key/value pairs that must match for identification. Experience would be a much better guide.
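A sketch of that idea: declare two entries the same subject when at least a threshold number of their shared key/value pairs agree. The entries, property names and threshold of 2 are all illustrative choices, with experience as the guide:

```python
# Subject identity by matching property bundles rather than a
# single identifier. Two entries count as the same subject when
# at least `threshold` shared key/value pairs agree. The value
# of `threshold` is a tunable, experience-driven choice.
def same_subject(props_a, props_b, threshold=2):
    shared = set(props_a) & set(props_b)
    matches = sum(1 for k in shared if props_a[k] == props_b[k])
    return matches >= threshold

# Illustrative entries; the InChIKey and CAS number are the
# published ones for aspirin.
entry1 = {"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
          "cas": "50-78-2", "name": "aspirin"}
entry2 = {"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
          "cas": "50-78-2", "name": "acetylsalicylic acid"}

print(same_subject(entry1, entry2))  # → True
```

The names differ, but two of the three shared properties agree, so the entries merge; a record sharing only a name would not.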


Tuesday, April 1st, 2014

Molpher: a software framework for systematic chemical space exploration by David Hoksza, Petr Škoda, Milan Voršilák and Daniel Svozil.



Chemical space is virtual space occupied by all chemically meaningful organic compounds. It is an important concept in contemporary chemoinformatics research, and its systematic exploration is vital to the discovery of either novel drugs or new tools for chemical biology.


In this paper, we describe Molpher, an open-source framework for the systematic exploration of chemical space. Through a process we term ‘molecular morphing’, Molpher produces a path of structurally-related compounds. This path is generated by the iterative application of so-called ‘morphing operators’ that represent simple structural changes, such as the addition or removal of an atom or a bond. Molpher incorporates an optimized parallel exploration algorithm, compound logging and a two-dimensional visualization of the exploration process. Its feature set can be easily extended by implementing additional morphing operators, chemical fingerprints, similarity measures and visualization methods. Molpher not only offers an intuitive graphical user interface, but also can be run in batch mode. This enables users to easily incorporate molecular morphing into their existing drug discovery pipelines.


Molpher is an open-source software framework for the design of virtual chemical libraries focused on a particular mechanistic class of compounds. These libraries, represented by a morphing path and its surroundings, provide valuable starting data for future in silico and in vitro experiments. Molpher is highly extensible and can be easily incorporated into any existing computational drug design pipeline.

Beyond its obvious importance for cheminformatics, this paper offers another example of “semantic impedance:”

While virtual chemical space is very large, only a small fraction of it has been reported in actual chemical databases so far. For example, PubChem contains data for 49.1 million chemical compounds [17] and Chemical Abstracts consists of over 84.3 million organic and inorganic substances [18] (numbers as of 12. 3. 2014). Thus, the navigation of chemical space is a very important area of chemoinformatics research [19,20]. Because chemical space is usually defined using various sets of descriptors [21], a major problem is the lack of invariance of chemical space [22,23]. Depending on the descriptors and distance measures used [24], different chemical spaces show different compound distributions. Unfortunately, no generally applicable representation of invariant chemical space has yet been reported [25].

OK, so how much further is there to go with these various descriptors?

The article describes estimates of the size of chemical space this way:

Chemical space is populated by all chemically meaningful and stable organic compounds [1-3]. It is an important concept in contemporary chemoinformatics research [4,5], and its exploration leads to the discovery of either novel drugs [2] or new tools for chemical biology [6,7]. It is agreed that chemical space is huge, but no accurate approximation of its size exists. Even if only drug-like molecules are taken into account, size estimates vary [8] between 10^23 [9] and 10^100 [10] compounds. However, smaller numbers have also been reported. For example, based on the growth of a number of organic compounds in chemical databases, Drew et al. [11] deduced the size of chemical space to be 3.4 × 10^9. By assigning all possible combinations of atomic species to the same three-dimensional geometry, Ogata et al. [12] estimated the size of chemical space to be between 10^8 and 10^19. Also, by analyzing known organic substituents, the size of accessible chemical space was assessed as between 10^20 and 10^24 [9].

Such estimates have been put into context by Reymond et al., who produced all molecules that can exist up to a certain number of heavy atoms in their Chemical Universe Databases: GDB-11 [13,14] (2.64 × 10^7 molecules with up to 11 heavy atoms); GDB-13 [15] (9.7 × 10^8 molecules with up to 13 heavy atoms); and GDB-17 [16] (1.7 × 10^11 compounds with up to 17 heavy atoms). The GDB-17 database was then used to approximate the number of possible drug-like molecules as 10^33 [8].

To give you an easy basis for comparison: possible drug-like molecules at 10^33, versus the number of stars in the galaxies of the observable universe at 10^24.

That’s an impressive number of possible drug-like molecules: 10^9 times more than the stars in the observable universe (est.).

I can’t imagine that having diverse descriptors is assisting in the search to complete the chemical space. And from the description, it doesn’t sound like semantic convergence is on the horizon.

Mapping between the existing systems would be a major undertaking, but the longer exploration goes on without such a mapping, the worse the problem will get.

Data enhancing the Royal Society of…

Sunday, March 23rd, 2014

Data enhancing the Royal Society of Chemistry publication archive by Antony Williams.


The Royal Society of Chemistry has an archive of hundreds of thousands of published articles containing various types of chemistry related data – compounds, reactions, property data, spectral data etc. RSC has a vision of extracting as much of these data as possible and providing access via ChemSpider and its related projects. To this end we have applied a combination of text-mining extraction, image conversion and chemical validation and standardization approaches. The outcome of this project will result in new chemistry related data being added to our chemical and reaction databases and in the ability to more tightly couple web-based versions of the articles with these extracted data. The ability to search across the archive will be enhanced as a result. This presentation will report on our progress in this data extraction project and discuss how we will ultimately use similar approaches in our publishing pipeline to enhance article markup for new publications.

The data mining Antony details on the Royal Society of Chemistry is impressive!

But as Antony notes at slide #30, it isn’t a long-term solution:

We should NOT be mining data out of future publications (emphasis added)

I would say the same thing for metadata/subject identities in data. For some data and some subjects, we can, after the fact, reconstruct properties to identify the subjects they represent.

Data/text mining would be more accurate and easier if subjects were identified at the time of authoring. Perhaps even automatically or at least subject to a user’s approval.

More accurate, that is, than researchers removed from an author by time, distance and even profession, trying to guess what subject the author may have meant.

Better semantic authoring support now, will reduce the cost and improve the accuracy of data mining in the future.

Ontology work at the Royal Society of Chemistry

Wednesday, March 19th, 2014

Ontology work at the Royal Society of Chemistry by Antony Williams.

From the description:

We provide an overview of the use we make of ontologies at the Royal Society of Chemistry. Our engagement with the ontology community began in 2006 with preparations for Project Prospect, which used ChEBI and other Open Biomedical Ontologies to mark up journal articles. Subsequently Project Prospect has evolved into DERA (Digitally Enhancing the RSC Archive) and we have developed further ontologies for text markup, covering analytical methods and name reactions. Most recently we have been contributing to CHEMINF, an open-source cheminformatics ontology, as part of our work on disseminating calculated physicochemical properties of molecules via the Open PHACTS. We show how we represent these properties and how it can serve as a template for disseminating different sorts of chemical information.

A bit wordy for my taste but it has numerous references and links to resources. Top stuff!

I had to laugh when I read slide #20:

Why a named reaction ontology?

Despite attempts to introduce systematic nomenclature for organic reactions, lots of chemists still prefer to attach human names.

Those nasty humans! Always wanting “human” names. Grrr! 😉

Afraid so. That is going to continue in a number of disciplines.

When I got to slides #29:

Ontologies as synonym sets for text-mining

it occurred to me that terms in an ontology are like base names in a topic map: names on topics that participate in associations with other topics, which also have base names.

The big difference being that ontologies are mono-views that don't include mapping instructions based on properties, either in the starting ontology or in any other ontology to which you could map.

That is, the ontologies I have seen can only report properties of their terms, not which properties must be matched for two terms to identify the same subject.

Nor do such ontologies report properties of the subjects that are their properties. Much less any mappings from bundles of properties to bundles of properties in other ontologies.

I know the usual argument about the combinatorial explosion of mappings, etc., which leaves ontologists with too few arms and legs to point in the various directions.

That argument fails to point out that to have an "uber" ontology, someone has to do the (undisclosed) mapping from the variants to the new master ontology. And they don't write that mapping down.

So the combinatorial explosion was present; it just didn't get written down. Can you guess who is empowered as an expert in the new master ontology with undocumented mappings?

The other fallacy in that argument is that topic maps, for example, are always partial world views. I can map as much or as little between ontologies, taxonomies, vocabularies, folksonomies, etc. as I care to do.

If I don’t want to bother mapping “thing” as the root of my topic map, I am free to omit it. All the superstructure clutter goes away and I can focus on immediate ROI concerns.
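The property-matching idea above can be sketched in a few lines. This is a toy example with invented records (the InChIKey values shown are the standard ones for ethanol and methanol, used here only as stand-in identity properties):

```python
# Toy illustration: two records identify the same subject only when an
# agreed bundle of identity properties matches, regardless of labels.
IDENTITY_PROPS = ("inchikey",)  # which properties establish sameness

def same_subject(a, b, props=IDENTITY_PROPS):
    """True if every identity property is present in both and equal."""
    return all(a.get(p) is not None and a.get(p) == b.get(p) for p in props)

rec_a = {"label": "ethanol",       "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"}
rec_b = {"label": "ethyl alcohol", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"}
rec_c = {"label": "methanol",      "inchikey": "OKKJLVBELUTLKV-UHFFFAOYSA-N"}

merged = same_subject(rec_a, rec_b)  # different labels, same subject
```

Nothing here requires a "thing" root or a full upper ontology: the mapping covers exactly the properties you choose to declare, and no more.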

Unless you want to pay for the superstructure clutter then by all means, I’m interested! 😉

If you have an ontology, by all means use it as a starting point for your topic map. Or if someone is willing to pay to create yet another ontology, do it. But if they need results before years of travel, debate and bad coffee, give topic maps a try!

PS: The travel, debate and bad coffee never go away for ontologies, even successful ones. Despite the desires of many, the world keeps changing and our views of it along with it. A static ontology is a dead ontology. Same is true for a topic map, save that agreement on its content is required only as far as it is used and no further.

The Gold Book

Friday, January 31st, 2014

IUPAC Compendium of Chemical Terminology (Gold Book)

From the webpage:

The Compendium is popularly referred to as the “Gold Book”, in recognition of the contribution of the late Victor Gold, who initiated work on the first edition. It is one of the series of IUPAC “Colour Books” on chemical nomenclature, terminology, symbols and units (see the list of source documents), and collects together terminology definitions from IUPAC recommendations already published in Pure and Applied Chemistry and in the other Colour Books.

Terminology definitions published by IUPAC are drafted by international committees of experts in the appropriate chemistry sub-disciplines, and ratified by IUPAC’s Interdivisional Committee on Terminology, Nomenclature and Symbols (ICTNS). In this edition of the Compendium these IUPAC-approved definitions are supplemented with some definitions from ISO and from the International Vocabulary of Basic and General Terms in Metrology; both these sources are recognised by IUPAC as authoritative. The result is a collection of nearly 7000 terms, with authoritative definitions, spanning the whole range of chemistry.

Some minor editorial changes were made to the originally published definitions, to harmonise the presentation and to clarify their applicability, if this is limited to a particular sub-discipline. Verbal definitions of terms from Quantities, Units and Symbols in Physical Chemistry (the IUPAC Green Book, in which definitions are generally given as mathematical expressions) were developed specially for this Compendium by the Physical Chemistry Division of IUPAC. Definitions of a few physicochemical terms not mentioned in the Green Book were added at the same time (referred to here as Physical Chemistry Division, unpublished).

The first reference given at the end of each definition is to the page of Pure Appl. Chem. or other source where the original definition appears; other references given designate other places where compatible definitions of the same term or additional information may be found, in other IUPAC documents. The complete reference citations are given in the appended list of source documents. Highlighted terms within individual definitions link to other entries where additional information is available.

If you are looking for authoritative chemistry terminology, you may not need to look any further!

IUPAC – International Union of Pure and Applied Chemistry.

The “color” books that were mentioned:

Chemical Terminology (Gold Book)

Quantities, Units and Symbols in Physical Chemistry (Green Book)

Nomenclature of Organic Chemistry (Blue Book)

Macromolecular Nomenclature (Purple Book)

Analytical Terminology (Orange Book)

Biochemical Nomenclature (White Book)

Nomenclature of Inorganic Chemistry (Red Book)

Some “lite” weekend reading. 😉


PubChemRDF

Friday, January 31st, 2014

PubChemRDF Release Notes

From the webpage:

Semantic Web technologies are emerging as an increasingly important approach to distribute and integrate scientific data. These technologies include the trio of the Resource Description Framework (RDF), Web Ontology Language (OWL), and SPARQL query language. The PubChemRDF project provides RDF formatted information for the PubChem Compound, Substance, and Bioassay databases.

This document provides detailed technical information (release notes) about the PubChemRDF project. Downloadable RDF data is available on the PubChemRDF FTP Site. Past presentations on the PubChemRDF project are available giving a PubChemRDF introduction and on the PubChemRDF details. The PubChem Blog may provide most recent updates on the PubChemRDF project. Please note that the PubChemRDF is evolving as a function of time. However, we intend for such enhancements to be backwards compatible by adding additional information and annotations.

A twitter post commented on there being 59 billion triples.

Nothing to sneeze at but I was more impressed with the types of connections at page 8 of

I am sure there are others but just on that slide:

  • sio:has_component
  • sio:is_stereoisomer_of
  • sio:is_isotopologue_of
  • sio:has_same_connectivity_as
  • sio:similar_to_by_PubChem_2D_similarity_algorithm
  • sio:similar_to_by_PubChem_3D_similarity_algorithm

Using such annotations, the user could decide on what basis to consider compounds “similar” or not.
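That choice can be made programmatically. A minimal sketch (triples as plain tuples, compound identifiers invented; a real application would query the RDF with SPARQL):

```python
# Illustrative triples (invented compound IDs): each predicate asserts a
# specific kind of relatedness, so the user picks which one counts as
# "similar" for their purpose.
triples = [
    ("compound:1", "sio:is_stereoisomer_of", "compound:2"),
    ("compound:1", "sio:similar_to_by_PubChem_2D_similarity_algorithm", "compound:3"),
    ("compound:1", "sio:has_same_connectivity_as", "compound:4"),
]

def related(triples, subject, predicate):
    """Objects related to `subject` by exactly the chosen predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

stereo = related(triples, "compound:1", "sio:is_stereoisomer_of")
```

Swapping the predicate swaps the notion of similarity, without touching the data.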

True, it is non-obvious how I would offer an alternative vocabulary for isotopologue but in this domain, that may not be a requirement.

That we can offer alternative vocabularies for any domain does not mean there is a requirement for alternative vocabularies in any particular domain.

A great source of data!

I first saw this in a tweet by Paul Groth.

The Mole

Thursday, January 23rd, 2014

The Mole

From the homepage:

The Mole is the Royal Society of Chemistry’s magazine for students, and anyone inspired to dig deeper into chemistry.

In the latest issue (as of today):

Find out how chemistry plays a central role in revealing how our ancestors once lived • Discover how lucrative markets are found in leftover lobster • Make your own battery from the contents of your fruit bowl • What did Egyptian mummies have for dinner? • How to control the weather so it rains where we need it to • Excavate the facts about a chemist working as an archaeologist • Discover how chemistry can reveal secrets hidden in art

Of course there is a wordsearch puzzle and a chemical acrostic on the final page.

Always interesting to learn new information and to experience “other” views of the world. May lessen your chances of answering a client before they finish outlining their problem.

I first learned of the Mole in a tweet by ChemistryWorld.

…electronic laboratory notebook records

Sunday, December 22nd, 2013

First steps towards semantic descriptions of electronic laboratory notebook records by Simon J Coles, Jeremy G Frey, Colin L Bird, Richard J Whitby and Aileen E Day.


In order to exploit the vast body of currently inaccessible chemical information held in Electronic Laboratory Notebooks (ELNs) it is necessary not only to make it available but also to develop protocols for discovery, access and ultimately automatic processing. An aim of the Dial-a-Molecule Grand Challenge Network is to be able to draw on the body of accumulated chemical knowledge in order to predict or optimize the outcome of reactions. Accordingly the Network drew up a working group comprising informaticians, software developers and stakeholders from industry and academia to develop protocols and mechanisms to access and process ELN records. The work presented here constitutes the first stage of this process by proposing a tiered metadata system of knowledge, information and processing where each in turn addresses a) discovery, indexing and citation b) context and access to additional information and c) content access and manipulation. A compact set of metadata terms, called the elnItemManifest, has been derived and caters for the knowledge layer of this model. The elnItemManifest has been encoded as an XML schema and some use cases are presented to demonstrate the potential of this approach.

And the current state of electronic laboratory notebooks:

It has been acknowledged at the highest level [15] that “research data are heterogeneous, often classified and cited with disparate schema, and housed in distributed and autonomous databases and repositories. Standards for descriptive and structural metadata will help establish a common framework for understanding data and data structures to address the heterogeneity of datasets.” This is equally the case with the data held in ELNs. (citing: 15. US National Science Board report, Digital Research Data Sharing and Management, Dec 2011 Appendix F Standards and interoperability enable data-intensive science., accessed 10/07/2013.)

It is trivially true that: “…a common framework for understanding data and data structures …[would] address the heterogeneity of datasets.”

Yes, yes a common framework for data and data structures would solve the heterogeneity issues with datasets.

What is surprising is that no one had that idea up until now. 😉

I won’t recite the history of failed attempts at common frameworks for data and data structures here. To the extent that communities do adopt common practices or standards, those do help. Unfortunately there have never been any universal ones.

Or should I say there have never been any proposals for universal frameworks that succeeded in becoming universal? That’s more accurate. We have not lacked for proposals for universal frameworks.

That isn't to say this is a bad proposal. But it will be only one of many proposals for integrating electronic laboratory notebook records, leaving the task of integration between those integration systems still to be done.
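To make the knowledge layer concrete, here is a sketch of emitting a manifest-style record with the standard library. The element names below are invented placeholders, not the actual elnItemManifest schema, which should be consulted directly:

```python
import xml.etree.ElementTree as ET

# Sketch only: element names are hypothetical stand-ins for the kind of
# discovery/indexing/citation metadata the knowledge layer describes.
manifest = ET.Element("elnItemManifest")
ET.SubElement(manifest, "title").text = "Suzuki coupling, run 3"
ET.SubElement(manifest, "creator").text = "A. Chemist"
ET.SubElement(manifest, "created").text = "2013-11-02"

xml_text = ET.tostring(manifest, encoding="unicode")
```

Even this trivial record illustrates the integration problem: a second ELN vendor's manifest with different element names would still need mapping to this one.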

BTW, if you are interested in further details, see the article and the XML schema at:

Patent database of 15 million chemical structures goes public

Thursday, December 12th, 2013

Patent database of 15 million chemical structures goes public by Richard Van Noorden.

From the post:

The internet’s wealth of free chemistry data just got significantly larger. Today, the European Bioinformatics Institute (EBI) has launched a website — — that allows anyone to search through 15 million chemical structures, extracted automatically by data-mining software from world patents.

The initiative makes public a 4-terabyte database that until now had been sold on a commercial basis by a software firm, SureChem, which is folding. SureChem has agreed to transfer its information over to the EBI — and to allow the institute to use its software to continue extracting data from patents.

“It is the first time a world patent chemistry collection has been made publicly available, marking a significant advance in open data for use in drug discovery,” says a statement from Digital Science — the company that owned SureChem, and which itself is owned by Macmillan Publishers, the parent company of Nature Publishing Group.

This is one of those Selling Data opportunities that Vincent Granville was talking about.

You can harvest data here, combine it (hopefully using a topic map) with other data and market the results. Not everyone who has need for the data has the time or skills required to re-package the data.

What seems problematic to me is how to reach potential buyers of information?

If you produce data and license it to one of the large data vendors, what’s the likelihood your data will get noticed?

On the other hand, direct sale of data seems like a low percentage deal.


International chemical identifier for reactions (RInChI)

Wednesday, October 30th, 2013

International chemical identifier for reactions (RInChI) by Guenter Grethe, Jonathan M Goodman and Chad HG Allen. (Journal of Cheminformatics 2013, 5:45 doi:10.1186/1758-2946-5-45)


The IUPAC International Chemical Identifier (InChI) provides a method to generate a unique text descriptor of molecular structures. Building on this work, we report a process to generate a unique text descriptor for reactions, RInChI. By carefully selecting the information that is included and by ordering the data carefully, different scientists studying the same reaction should produce the same RInChI. If differences arise, these are most likely the minor layers of the InChI, and so may be readily handled. RInChI provides a concise description of the key data in a chemical reaction, and will help enable the rapid searching and analysis of reaction databases.

The line from the abstract:

By carefully selecting the information that is included and by ordering the data carefully, different scientists studying the same reaction should produce the same RInChI.

sounds good in theory but doubtful in practice.

Although the authors did test a set of reactions from three different publishers, some 2,900 RInChIs, and were able to quickly eliminate duplicates.
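The deduplication step rests on one idea: derive a canonical key per reaction so the same reaction, however entered, collapses to one record. A stand-in sketch (this is not the real RInChI algorithm; sorted SMILES-like strings substitute for the carefully layered canonicalization the paper describes):

```python
# Stand-in canonicalization: order-independent key over reactants and
# products. RInChI itself layers InChI machinery on top of this idea.
def reaction_key(reactants, products):
    return (tuple(sorted(reactants)), tuple(sorted(products)))

entries = [
    (["CCO", "O=C=O"], ["product-A"]),
    (["O=C=O", "CCO"], ["product-A"]),   # same reaction, different order
    (["CCN"], ["product-B"]),
]

unique = {reaction_key(r, p) for r, p in entries}  # two distinct reactions
```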

A project to watch as larger data sets are tested with the goal of encoding the same reactions the same way.

The RInChI Project

Similarity maps…

Thursday, September 26th, 2013

Similarity maps – a visualization strategy for molecular fingerprints and machine-learning methods by Sereina Riniker and Gregory A Landrum.


Fingerprint similarity is a common method for comparing chemical structures. Similarity is an appealing approach because, with many fingerprint types, it provides intuitive results: a chemist looking at two molecules can understand why they have been determined to be similar. This transparency is partially lost with the fuzzier similarity methods that are often used for scaffold hopping and tends to vanish completely when molecular fingerprints are used as inputs to machine-learning (ML) models. Here we present similarity maps, a straightforward and general strategy to visualize the atomic contributions to the similarity between two molecules or the predicted probability of a ML model. We show the application of similarity maps to a set of dopamine D3 receptor ligands using atom-pair and circular fingerprints as well as two popular ML methods: random forests and naïve Bayes. An open-source implementation of the method is provided.

If you are doing topic maps in areas where molecular fingerprints are relevant, this could be quite useful.

Despite my usual warnings that semantics are continuous versus the discrete structures treated here, this may also be useful in “fuzzier” areas where topic maps are found.
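For readers new to fingerprint similarity, the underlying comparison is usually the Tanimoto coefficient. A minimal sketch over sets of "on" bit positions (real atom-pair or circular fingerprints would come from a cheminformatics toolkit; the bit sets here are invented):

```python
# Tanimoto coefficient: shared bits divided by total distinct bits.
def tanimoto(fp_a, fp_b):
    """Similarity of two fingerprints given as sets of on-bit positions."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

score = tanimoto({1, 4, 7, 9}, {1, 4, 8})  # 2 shared of 5 distinct bits
```

Similarity maps go a step further than this single number, attributing the score back to individual atoms.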

The Blue Obelisk Data Repository’s 10th release

Tuesday, August 27th, 2013

The Blue Obelisk Data Repository’s 10th release by Egon Willighagen.

From the post:

The Blue Obelisk Data Repository (BODR) is not so high profile as other Blue Obelisk projects, but equally important. Well, maybe a tid bit more important: it’s a collection of core chemical and physical data, supporting computation chemistry and cheminformatics resources. For example, it is used by at least the CDK, Kalzium, and Bioclipse, but possibly more. Also, it’s packages for major Linux distributions, such as Debian (btw, congrats to their 20th birthday!) and Ubuntu.

It doesn’t change so often, but just has seen its 10th release. Actually, it was the first release in more than three years. But, fortunately, core chemical facts do not change often, nor much. So, this release has a number of data fixes, a few recent experimental isotope measurements, and also includes the new official names of the livermorium and flerovium elements. There is a full overview of changes.


If this is one of the lesser known Blue Obelisk projects, I have to take a look at the other ones!

JChemInf volumes as single PDFs

Saturday, July 13th, 2013

JChemInf volumes as single PDFs by Egon Willighagen.

From the post:

One of the advantages of a print journal is that you are effectively forced to look at papers which may not have received your attention in the first place. Online journals do not provide such functionality, and you’re stuck with the table of contents, and never see that cool figure from that paper with the boring title.

Of course, the problem is artificial. We have pdftk and we can make PDF of issues, or in the present example, of complete volumes. Handy, I’d say. It saves you from many, many downloads and forces you to scan through all pages. Anyway, I wanted to scan the full JChemInf volumes, and rather have one PDF per volume. So, I created them. And you can get them too. The journal is Open Access after all (CC-BY).


Egon has links to the Journal of Cheminformatics (as complete volumes), vols. 1 – 4.

He also has a good point about print journals increasing the potential for a chance encounter with unexpected information.

Personalization of search results is a step away from serendipity.

Thoughts on how to step back towards serendipity?

From data to analysis:… [Data Integration For a Purpose]

Friday, May 24th, 2013

From data to analysis: linking NWChem and Avogadro with the syntax and semantics of Chemical Markup Language by Wibe A de Jong, Andrew M Walker and Marcus D Hanwell. (Journal of Cheminformatics 2013, 5:25 doi:10.1186/1758-2946-5-25)



Multidisciplinary integrated research requires the ability to couple the diverse sets of data obtained from a range of complex experiments and computer simulations. Integrating data requires semantically rich information. In this paper an end-to-end use of semantically rich data in computational chemistry is demonstrated utilizing the Chemical Markup Language (CML) framework. Semantically rich data is generated by the NWChem computational chemistry software with the FoX library and utilized by the Avogadro molecular editor for analysis and visualization.


The NWChem computational chemistry software has been modified and coupled to the FoX library to write CML compliant XML data files. The FoX library was expanded to represent the lexical input files and molecular orbitals used by the computational chemistry software. Draft dictionary entries and a format for molecular orbitals within CML CompChem were developed. The Avogadro application was extended to read in CML data, and display molecular geometry and electronic structure in the GUI allowing for an end-to-end solution where Avogadro can create input structures, generate input files, NWChem can run the calculation and Avogadro can then read in and analyse the CML output produced. The developments outlined in this paper will be made available in future releases of NWChem, FoX, and Avogadro.


The production of CML compliant XML files for computational chemistry software such as NWChem can be accomplished relatively easily using the FoX library. The CML data can be read in by a newly developed reader in Avogadro and analysed or visualized in various ways. A community-based effort is needed to further develop the CML CompChem convention and dictionary. This will enable the long-term goal of allowing a researcher to run simple “Google-style” searches of chemistry and physics and have the results of computational calculations returned in a comprehensible form alongside articles from the published literature.

Aside from its obvious importance for cheminformatics, I think there is another lesson in this article.

Integration of data required “…semantically rich information…,” but just as importantly, integration was not a goal in and of itself.

Integration was only part of a workflow that had other goals.

No doubt some topic maps are useful as end products of integrated data, but what of cases where integration is part of a workflow?

Think of the non-reusable data integration mappings that are offered by many enterprise integration packages.

JSME: a free molecule editor in JavaScript

Tuesday, May 21st, 2013

JSME: a free molecule editor in JavaScript by Bruno Bienfait and Peter Ertl. (Journal of Cheminformatics 2013, 5:24 doi:10.1186/1758-2946-5-24)



A molecule editor, i.e. a program facilitating graphical input and interactive editing of molecules, is an indispensable part of every cheminformatics or molecular processing system. Today, when a web browser has become the universal scientific user interface, a tool to edit molecules directly within the web browser is essential. One of the most popular tools for molecular structure input on the web is the JME applet. Since its release nearly 15 years ago, however the web environment has changed and Java applets are facing increasing implementation hurdles due to their maintenance and support requirements, as well as security issues. This prompted us to update the JME editor and port it to a modern Internet programming language – JavaScript.


The actual molecule editing Java code of the JME editor was translated into JavaScript with help of the Google Web Toolkit compiler and a custom library that emulates a subset of the GUI features of the Java runtime environment. In this process, the editor was enhanced by additional functionalities including a substituent menu, copy/paste, drag and drop and undo/redo capabilities and an integrated help. In addition to desktop computers, the editor supports molecule editing on touch devices, including iPhone, iPad and Android phones and tablets. In analogy to JME the new editor is named JSME. This new molecule editor is compact, easy to use and easy to incorporate into web pages.


A free molecule editor written in JavaScript was developed and is released under the terms of permissive BSD license. The editor is compatible with JME, has practically the same user interface as well as the web application programming interface. The JSME editor is available for download from the project web page

Just in case you were having any doubts about using JavaScript to power an annotation editor.

Better now?

The ChEMBL database as linked open data

Thursday, May 9th, 2013

The ChEMBL database as linked open data by Egon L Willighagen, Andra Waagmeester, Ola Spjuth, Peter Ansell, Antony J Williams, Valery Tkachenko, Janna Hastings, Bin Chen and David J Wild. (Journal of Cheminformatics 2013, 5:23 doi:10.1186/1758-2946-5-23).


Background Making data available as Linked Data using Resource Description Framework (RDF) promotes integration with other web resources. RDF documents can natively link to related data, and others can link back using Uniform Resource Identifiers (URIs). RDF makes the data machine-readable and uses extensible vocabularies for additional information, making it easier to scale up inference and data analysis.

Results This paper describes recent developments in an ongoing project converting data from the ChEMBL database into RDF triples. Relative to earlier versions, this updated version of ChEMBL-RDF uses recently introduced ontologies, including CHEMINF and CiTO; exposes more information from the database; and is now available as dereferencable, linked data. To demonstrate these new features, we present novel use cases showing further integration with other web resources, including Bio2RDF, Chem2Bio2RDF, and ChemSpider, and showing the use of standard ontologies for querying.

Conclusions We have illustrated the advantages of using open standards and ontologies to link the ChEMBL database to other databases. Using those links and the knowledge encoded in standards and ontologies, the ChEMBL-RDF resource creates a foundation for integrated semantic web cheminformatics applications, such as the presented decision support.

You already know about the fragility of ontologies so no need to repeat that rant here.

Having material encoded with an ontology, on the other hand, after vetting, can be a source that you wrap with a topic map.

So all that effort isn’t lost.

Extracting and connecting chemical structures…

Saturday, April 27th, 2013

Extracting and connecting chemical structures from text sources, by Christopher Southan and Andras Stracz.



Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents) the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.


Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and/or merged extractions.


This work demonstrates the utility of for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.
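The set-intersection step mentioned in the results is worth making concrete. A toy version (the InChIKey values shown are the standard keys for ethanol, methanol and acetone, standing in for structures extracted from two documents):

```python
# Compounds-in-common between two documents, once structures have been
# normalized to InChIKeys, is a plain set intersection.
doc_a = {"LFQSCWFLJHTTHZ-UHFFFAOYSA-N",   # ethanol
         "OKKJLVBELUTLKV-UHFFFAOYSA-N"}   # methanol
doc_b = {"LFQSCWFLJHTTHZ-UHFFFAOYSA-N",   # ethanol
         "CSCPPACGZOOCGX-UHFFFAOYSA-N"}   # acetone

compounds_in_common = doc_a & doc_b
```

The hard part, of course, is everything upstream: getting from names, images and SMILES in heterogeneous documents to those canonical identifiers.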

A great example of building a resource to address identity issues in a specific domain.

The result speaks for itself.

PS: The results were not delayed awaiting a reformation of chemistry to use a common identifier.

Crowdsourcing Chemistry for the Community…

Friday, April 5th, 2013

Crowdsourcing Chemistry for the Community — 5 Years of Experiences by Antony Williams.

From the description:

ChemSpider is one of the internet’s primary resources for chemists. ChemSpider is a structure-centric platform and hosts over 26 million unique chemical entities sourced from over 400 different data sources and delivers information including commercial availability, associated publications, patents, analytical data, experimental and predicted properties. ChemSpider serves a rather unique role to the community in that any chemist has the ability to deposit, curate and annotate data. In this manner they can contribute their skills, and data, to any chemist using the system. A number of parallel projects have been developed from the initial platform including ChemSpider SyntheticPages, a community generated database of reaction syntheses, and the Learn Chemistry wiki, an educational wiki for secondary school students.

This presentation will provide an overview of the project in terms of our success in engaging scientists to contribute to crowdsourcing chemistry. We will also discuss some of our plans to encourage future participation and engagement in this and related projects.

Perhaps not encouraging in terms of the rate of participation but certainly encouraging in terms of the impact of those who do participate.

I suspect the ratio of contributors to users isn’t that far off from those observed in open source projects.

On the whole, I take this as a plus sign for crowd-sourced curation projects, including topic maps.

I first saw this in a tweet by ChemConnector.