Archive for the ‘Chemistry’ Category

Computational Data Analysis Workflow Systems

Friday, October 6th, 2017

Computational Data Analysis Workflow Systems

An incomplete list of existing workflow systems. As of today, approximately 17:00 EST, 173 systems in no particular order.

I first saw this mentioned in a tweet by Michael R. Crusoe.

One of the many resources found at: Common Workflow Language.

From the webpage:

The Common Workflow Language (CWL) is a specification for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. CWL is designed to meet the needs of data-intensive science, such as Bioinformatics, Medical Imaging, Astronomy, Physics, and Chemistry.

You should take a quick look at: Common Workflow Language User Guide to get a feel for CWL.

Try to avoid thinking of CWL as “documenting” your workflow if that is an impediment to using it. That’s a side effect but its main purpose is to make your more effective.

720 Thousand Chemicals – Chemistry Dashboard

Tuesday, April 5th, 2016

Chemistry Dashboard – 720 Thousand Chemicals

Beta test of a Google-like search interface by the United States Environmental Protection Agency on chemical data.

Search results return “Intrisic Properties,” “Structural Identifiers,” and a “Citation” for your search to the right of a molecular diagram of the object of your search.

A series of tabs run across the page offering, “Chemical Properties,” “External Links,” “Synonyms,” “PubChem Biological Activities,” “PubChem Articles,” “PubChem Patents,” and “Comments.”

And Advanced Search option is offered as well. (Think of it as identifying a subject by its properties.)

The about page has this description with additional links and a pointer to a feedback form for comments:

The interactive Chemical Safety for Sustainability Chemistry Dashboard (the iCSS chemistry dashboard) is a part of a suite of databases and web applications developed by the US Environmental Protection Agency’s Chemical Safety for Sustainability Research Program. These databases and apps support EPA’s computational toxicology research efforts to develop innovative methods to change how chemicals are currently evaluated for potential health risks. EPA researchers integrate advances in biology, biotechnology, chemistry, and computer science to identify important biological processes that may be disrupted by the chemicals. The combined information helps prioritize chemicals based on potential health risks. Using computational toxicology research methods, thousands of chemicals can be evaluated for potential risk at small cost in a very short amount of time.

The iCSS chemistry dashboard is the public chemistry resource for these computational toxicology research efforts and it supports improved predictive toxicology. It provides access to data associated with over 700,000 chemicals. A distinguishing feature of the chemistry dashboard is the mapping of curated physicochemical property data associated with chemical substances to their corresponding chemical structures. The chemical dashboard is searchable by various chemical identifiers including CAS Registry Numbers, systematic and common names, and InChIKeys. Millions of predicted physchem properties developed using machine-learning approaches modeling highly curated datasets are also mapped to chemicals within the dashboard.

The data in the dashboard are of varying quality with the highest quality data being assembled by the DSSTox Program. The majority of the chemical structures within the database have been compiled from public sources, such as PubChem, and have varying levels of reliability and accuracy. Links to over twenty external resources are provided. These include other dashboard apps developed by EPA and other agency, interagency and public databases containing data of interest to environmental chemists. It also integrates chemistry linkages across other EPA dashboards and chemistry resources such as ACToR, ToxCast, EDSP21 and CPCat. Expansion, curation and validation of the content is ongoing.

The iCSS Chemistry Dashboard also takes advantage of a number of Open Source widgets and tools. These include the developers of the JSMol 3D display widget and the PubChem widgets for Bioactivities, Articles and Patents, and we are grateful to these developers for their contributions. Should you use the iCSS Chemistry Dashboard to source information and data of value please cite the app using the URL For a particular chemical, the specific citation can be obtained on the page under the Citation tab.

An excellent example of how curation of a data resources and linking it to other data resources is a general benefit to everyone.

I first saw this in a tweet by ChemConnector.

Another Victory For Peer Review – NOT! Cowardly Science

Wednesday, January 27th, 2016

Pressure on controversial nanoparticle paper builds by Anthony King.

From the post:

The journal Science has posted an expression of concern over a controversial 2004 paper on the synthesis of palladium nanoparticles, highlighting serious problems with the work. This follows an investigation by the US funding body the National Science Foundation (NSF), which decided that the authors had falsified research data in the paper, which reported that crystalline palladium nanoparticle growth could be mediated by RNA.1 The NSF’s 2013 report on the issue, and a letter of reprimand from May last year, were recently brought into the open by a newspaper article.

The chief operating officer of the NSF identified ‘an absence of care, if not sloppiness, and most certainly a departure from accepted practices’. Recommended actions included sending letters of reprimand, requiring the subjects contact the journal to make a correction and barring the two chemists from serving as a peer reviewer, adviser or consultant for the NSF for three years.

Science notes that, though the ‘NSF did not find that the authors’ actions constituted misconduct, it nonetheless concluded that there “were significant departures from research practice”.’ The NSF report noted it would no longer fund the paper’s senior authors chemists Daniel Feldheim and Bruce Eaton at the University of Colorado, Boulder, who ‘recklessly falsified research data’, unless they ‘take specific actions to address issues’ in the 2004 paper. Science said it is working with the two authors ‘to understand their response to the NSF final ruling’.

Feldheim and Eaton have been under scrutiny since 2008, when an investigation by their former employer North Carolina State University, US, concluded the 2004 paper contained falsified data. According to Retraction Watch, Science said it would retract the paper as soon as possible.

I’m not a subscriber to Science, unfortunately, but if you are, can you write to Marcia McNutt, Editor-in-Chief to ask why findings of “recklessly falsified research data,” merits an expression of concern?

What’s with that? Concern?

In many parts of the United States, you can be murdered with impunity for DWB, Driving While Black, but you can falsify research data and only merit an expression of “concern” from Science?

Not to mention that the NSF doesn’t think that falsifying research evidence is “misconduct.”

The NSF needs to document what it thinks “misconduct” means. I don’t think it means what they think it means.

Every profession has bad apples but what is amazing in this case is the public kid glove handling of known falsifiers of evidence.

What is required for a swift and effective response against scientific misconduct?

Vivisection of human babies?

Or would that only count if they failed to have a petty cash account and to reconcile it on a monthly basis?

Digital Data Repositories in Chemistry…

Wednesday, July 1st, 2015

Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks by Matthew J. Harvey, Nicholas J. Mason, Henry S. Rzepa.


We discuss the concept of recasting the data-rich scientific journal article into two components, a narrative and separate data components, each of which is assigned a persistent digital object identifier. Doing so allows each of these components to exist in an environment optimized for purpose. We make use of a poorly-known feature of the handle system for assigning persistent identifiers that allows an individual data file from a larger file set to be retrieved according to its file name or its MIME type. The data objects allow facile visualization and retrieval for reuse of the data and facilitates other operations such as data mining. Examples from five recently published articles illustrate these concepts.

A very promising effort to integrate published content and electronic notebooks in chemistry. Encouraging that in addition to the technical and identity issues the authors also point out the lack of incentives for the extra work required to achieve useful integration.

Everyone agrees that deeper integration of resources in the sciences will be a game-changer but renewing the realization that there is no such thing as a free lunch, is an important step towards that goal.

This article easily repays a close read with interesting subject identity issues and the potential that topic maps would offer to such an effort.

ChemistryWorld Podcasts: Compounds (Phosgene)

Monday, June 29th, 2015

Chemistry in its elements: Compounds is a weekly podcast sponsored by ChemistryWorld, which features a chemical compound or group of compounds every week.

Matthew Gunter has a podcast entitled: Phosgene.

In case your recent history is a bit rusty, phosgene was one of the terror weapons of World War I. It accounted for 85% of the 100,000 deaths from chemical gas. Not as effective as say sarin but no slouch.

Don’t run to the library, online guides or the FBI for recipes to make phosgene at home. Its use in industrial applications should give you a clue as to an alternative to home-made phosgene. Use of phosgene violates the laws of war, so being a thief as well should not trouble you.

No, I don’t have a list of locations that make or use phosgene, but then DHS probably doesn’t either. They are more concerned with terrorists using “nuclear weapons” or “gamma-ray bursts“. One is mechanically and technically difficult to do well and the other is impossible to control.

The idea of someone using a dual-wheel pickup and a plant pass to pickup and deliver phosgene gas is too simple to have occurred to them.

If you are pitching topic maps to a science/chemistry oriented audience, these podcasts make a nice starting point for expansion. To date there are two hundred and forty-two (242) of them.


Gathering, Extracting, Analyzing Chemistry Datasets

Wednesday, April 22nd, 2015

Activities at the Royal Society of Chemistry to gather, extract and analyze big datasets in chemistry by Antony Williams.

If you are looking for a quick summary of efforts to combine existing knowledge resources in chemistry, you can do far worse than Antony’s 118 slides on the subject (2015).

I want to call special attention to Slide 107 in his slide deck:


True enough, extraction is problematic, expensive, inaccurate, etc., all the things Antony describes. And I would strongly second all of what he implies is the better practice.

However, extraction isn’t just a necessity for today or for a few years, extraction is going to be necessary so long as we keep records about chemistry or any other subject.

Think about all the legacy materials on chemistry that exist in hard copy format just for the past two centuries. To say nothing of all of still older materials. It is more than unfortunate to abandon all that information simply because “modern” digital formats are easier to manipulate.

That was’t what Antony meant to imply but even after all materials have been extracted and exist in some form of digital format, that doesn’t mean the era of “extraction” will have ended.

You may not remember when atomic chemistry used “punch cards” to record isotopes:


An isotope file on punched cards. George M. Murphy J. Chem. Educ., 1947, 24 (11), p 556 DOI: 10.1021/ed024p556 Publication Date: November 1947.

Today we would represent that record in…NoSQL?

Are you confident that in another sixty-eight (68) years we will still be using NoSQL?

We have to choose from the choices available to us today, but we should not deceive ourselves into thinking our solution will be seen as the “best” solution in the future. New data will be discovered, new processes invented, new requirements will emerge, all of which will be clamoring for a “new” solution.

Extraction will persist as long as we keep recording information in the face of changing formats and requirements. We can improve that process but I don’t think we will ever completely avoid it.

Chemical databases: curation or integration by user-defined equivalence?

Monday, March 16th, 2015

Chemical databases: curation or integration by user-defined equivalence? by Anne Hersey, Jon Chambers, Louisa Bellis, A. Patrícia Bento, Anna Gaulton, John P. Overington.


There is a wealth of valuable chemical information in publicly available databases for use by scientists undertaking drug discovery. However finite curation resource, limitations of chemical structure software and differences in individual database applications mean that exact chemical structure equivalence between databases is unlikely to ever be a reality. The ability to identify compound equivalence has been made significantly easier by the use of the International Chemical Identifier (InChI), a non-proprietary line-notation for describing a chemical structure. More importantly, advances in methods to identify compounds that are the same at various levels of similarity, such as those containing the same parent component or having the same connectivity, are now enabling related compounds to be linked between databases where the structure matches are not exact.

The authors identify a number of reasons why databases of chemical identifications have different structures recorded for the same chemicals. One problem is that there is no authoritative source for chemical structures so upon publication, authors publish those aspects most relevant to their interest. Or publish images and not machine readable representations of a chemical. To say nothing of the usual antics with simple names and their confusions. But there are software limitations, business rules and other sources of a multiplicity of chemical structures.

Suffice it to say that the authors make a strong case for why there are multiple structures for any given chemical now and why that is going to continue.

The author’s openly ask if it is time to ask users for their assistance in mapping this diversity of structures:

Is it now time to accept that however diligent database providers are, there will always be differences in structure representations and indeed some errors in the structures that cannot be fixed with a realistic level of resource? Should we therefore turn our attention to encouraging the use and development of tools that enable the mapping together of related compounds rather than concentrate our efforts on ever more curation?

You know my answer to that question.

What’s yours?

I first saw this in a tweet by John P. Overington.

ChEMBL 20 incorporates the Pistoia Alliance’s HELM annotation

Wednesday, February 4th, 2015

ChEMBL 20 incorporates the Pistoia Alliance’s HELM annotation by Richard Holland.

From the post:

The European Bioinformatics Institute (EMBL-EBI) has released version 20 of ChEMBL, the database of compound bioactivity data and drug targets. ChEMBL now incorporates the Hierarchical Editing Language for Macromolecules (HELM), the macromolecular representation standard recently released by the Pistoia Alliance.

HELM can be used to represent simple macromolecules (e.g. oligonucleotides, peptides and antibodies) complex entities (e.g. those with unnatural amino acids) or conjugated species (e.g. antibody-drug conjugates). Including the HELM notation for ChEMBL’s peptide-derived drugs and compounds will, in future, enable researchers to query that content in new ways, for example in sequence- and chemistry-based searches.

Initially created at Pfizer, HELM was released as an open standard with an accompanying toolkit through a Pistoia Alliance initiative, funded and supported by its member organisations. EMBL-EBI joins the growing list of HELM adopters and contributors, which include Biovia, ACD Labs, Arxspan, Biomax, BMS, ChemAxon, eMolecules, GSK, Lundbeck, Merck, NextMove, Novartis, Pfizer, Roche, and Scilligence. All of these organisations have either built HELM-based infrastructure, enabled HELM import/export in their tools, initiated projects for the incorporation of HELM into their workflows, published content in HELM format, or supplied funding or in-kind contributions to the HELM project.

More details:

The European Bioinformatics Institute

HELM project (open source, download, improve)

Pistoia Alliance

Another set of subjects ripe for annotation with topic maps!

10 Chemistry Blogs You Should Read

Tuesday, January 20th, 2015

10 Chemistry Blogs You Should Read by Aaron Oneal.

If you are looking for reading in chemistry, Aaron has assembled ten very high quality blogs for you to follow. Each is listed with a short description so you can tune the reading to your taste.

Personally I recommend taking a sip from each one. It is rare that I read a really good blog and don’t find something of interest and many times relevant to other projects that I would not have seen otherwise.

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

Friday, October 10th, 2014

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining by Saber A. Akhondi, et al. (Published: September 30, 2014 DOI: 10.1371/journal.pone.0107477)


Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at

Highly recommended both as a “gold standard” for chemical patent text mining but also as the state of the art in developing such a standard.

To say nothing of annotation as a means of automatic creation of topic maps where entities are imbued with subject identity properties.

I first saw this in a tweet by ChemConnector.

A document classifier for medicinal chemistry publications trained on the ChEMBL corpus

Tuesday, September 9th, 2014

A document classifier for medicinal chemistry publications trained on the ChEMBL corpus by George Papadatos, et al. (Journal of Cheminformatics 2014, 6:40)



The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology make the ChEMBL corpus a unique resource for text mining.


The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9) respectively. Both workflows and models are freely available at: webcite. These can be readily modified to include additional keyword constraints to further focus searches.


Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.

While the abstract mentions “the triage process,” it fails to capture the main goal of this paper:

…the main goal of our project diverges from the goal of the tools mentioned. We aim to meet the following criteria: ranking and prioritising the relevant literature using a fast and high performance algorithm, with a generic methodology applicable to other domains and not necessarily related to chemistry and drug discovery. In this regard, we present a method that builds upon the manually collated and curated ChEMBL document corpus, in order to train a Bag-of-Words (BoW) document classifier.

In more detail, we have employed two established classification methods, namely Naïve Bayesian (NB) and Random Forest (RF) approaches [12]-[14]. The resulting classification score, henceforth referred to as ‘ChEMBL-likeness’, is used to prioritise relevant documents for data extraction and curation during the triage process.

In other words, the focus of this paper is a classifier to help prioritize curation of papers. I take that as being different from classifiers used at other stages or for other purposes in the curation process.

I first saw this in a tweet by ChemConnector.

Bringing chemical synthesis to the masses

Monday, September 8th, 2014

Bringing chemical synthesis to the masses by Michael Gross.

From the post:

You too can create thousands of new compounds and screen them for a desired activity. That is the promise of a novel approach to building chemical libraries, which only requires simple building blocks in water, without any additional reagents or sample preparation.1

Jeffrey Bode from ETH Zurich and his co-worker Yi-Lin Huang took inspiration both from nature’s non-ribosomal peptide synthesis and from click chemistry. Nature uses specialised non-ribosomal enzymes to create a number of unusual peptides outside the normal paths of protein biosynthesis including, for instance, pharmaceutically relevant peptides like the antibiotic vancomycin. Bode and Huang have now produced these sorts of compounds without cells or enzymes, simply relying on the right chemistry.

Given the simplicity of the process and the absence of toxic reagents and by-products, Bode anticipates that it could even be widely used by non-chemists. ‘Our idea is to provide a quick way to make bioactive molecules just by mixing the components in water,’ Bode explains. ‘We would like to use this as a platform for chemistry that anyone can do, including scientists in other fields, high school students and farmers. Anyone could prepare libraries in a few hours with a micropipette, explore different combinations of building blocks and culture conditions along with simple assays to find novel molecules.’

Bode either wasn’t a humanities major or he missed the class on keeping lay people away from routine tasks. Everyone knows that routine tasks, like reading manuscripts must be reserved for graduate students under the fiction that only an “expert” can read non-printed material.

To be fair, there are manuscript characters or usages that require an expert opinion but those can be quickly isolated by statistical analysis of disagreement between different readers. Assuming effective transcription interfaces for manuscripts and a large enough body of readers.

That would reduce the number of personal fiefdoms built on access to particular manuscripts but that prospect finds me untroubled.

You can imagine the naming issues that will ensue from wide spread chemical synthesis by the masses. But, there is too much to be discovered to be miserly with means of discovery or dissemination of those results.