Archive for the ‘Cheminformatics’ Category

JSME: a free molecule editor in JavaScript

Tuesday, May 21st, 2013

JSME: a free molecule editor in JavaScript by Bruno Bienfait and Peter Ertl. (Journal of Cheminformatics 2013, 5:24 doi:10.1186/1758-2946-5-24)

Abstract:

Background

A molecule editor, i.e. a program facilitating graphical input and interactive editing of molecules, is an indispensable part of every cheminformatics or molecular processing system. Today, when a web browser has become the universal scientific user interface, a tool to edit molecules directly within the web browser is essential. One of the most popular tools for molecular structure input on the web is the JME applet. Since its release nearly 15 years ago, however the web environment has changed and Java applets are facing increasing implementation hurdles due to their maintenance and support requirements, as well as security issues. This prompted us to update the JME editor and port it to a modern Internet programming language – JavaScript.

Summary

The actual molecule editing Java code of the JME editor was translated into JavaScript with help of the Google Web Toolkit compiler and a custom library that emulates a subset of the GUI features of the Java runtime environment. In this process, the editor was enhanced by additional functionalities including a substituent menu, copy/paste, drag and drop and undo/redo capabilities and an integrated help. In addition to desktop computers, the editor supports molecule editing on touch devices, including iPhone, iPad and Android phones and tablets. In analogy to JME the new editor is named JSME. This new molecule editor is compact, easy to use and easy to incorporate into web pages.

Conclusions

A free molecule editor written in JavaScript was developed and is released under the terms of permissive BSD license. The editor is compatible with JME, has practically the same user interface as well as the web application programming interface. The JSME editor is available for download from the project web page http://peter-ertl.com/jsme/

Just in case you were having any doubts about using JavaScript to power an annotation editor.

Better now?

The ChEMBL database as linked open data

Thursday, May 9th, 2013

The ChEMBL database as linked open data by Egon L Willighagen, Andra Waagmeester, Ola Spjuth, Peter Ansell, Antony J Williams, Valery Tkachenko, Janna Hastings, Bin Chen and David J Wild. (Journal of Cheminformatics 2013, 5:23 doi:10.1186/1758-2946-5-23).

Abstract:

Background Making data available as Linked Data using Resource Description Framework (RDF) promotes integration with other web resources. RDF documents can natively link to related data, and others can link back using Uniform Resource Identifiers (URIs). RDF makes the data machine-readable and uses extensible vocabularies for additional information, making it easier to scale up inference and data analysis.

Results This paper describes recent developments in an ongoing project converting data from the ChEMBL database into RDF triples. Relative to earlier versions, this updated version of ChEMBL-RDF uses recently introduced ontologies, including CHEMINF and CiTO; exposes more information from the database; and is now available as dereferencable, linked data. To demonstrate these new features, we present novel use cases showing further integration with other web resources, including Bio2RDF, Chem2Bio2RDF, and ChemSpider, and showing the use of standard ontologies for querying.

Conclusions We have illustrated the advantages of using open standards and ontologies to link the ChEMBL database to other databases. Using those links and the knowledge encoded in standards and ontologies, the ChEMBL-RDF resource creates a foundation for integrated semantic web cheminformatics applications, such as the presented decision support.

You already know about the fragility of ontologies so no need to repeat that rant here.

On the other hand, material encoded with an ontology can, after vetting, be a source that you wrap with a topic map.

So all that effort isn’t lost.
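For anyone who has not worked with RDF, the abstract's point about linking by URI can be pictured with a few lines of Python. This is only a sketch: the triples, the predicate names and every URI below are placeholders under example.org that I made up, not real ChEMBL-RDF identifiers, and a real project would use a triple store and SPARQL rather than a Python set.

```python
# Triples as (subject, predicate, object) tuples. All URIs are placeholders
# under example.org, not real ChEMBL-RDF identifiers.
triples = {
    ("http://example.org/chembl/mol/1", "http://example.org/vocab/label", "compound one"),
    ("http://example.org/chembl/mol/1", "http://example.org/vocab/sameAs", "http://example.org/other/cpd/77"),
    ("http://example.org/other/cpd/77", "http://example.org/vocab/source", "http://example.org/other"),
}


def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def follow(store, start, predicate):
    """Follow one link type from a URI to the resources it points at."""
    return [o for _, _, o in match(store, s=start, p=predicate)]


linked = follow(triples, "http://example.org/chembl/mol/1", "http://example.org/vocab/sameAs")
print(linked)
print(match(triples, s=linked[0]))
```

The object of one triple is the subject of another, which is all "linking" means here: the second lookup continues from wherever the first one landed.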

Extracting and connecting chemical structures…

Saturday, April 27th, 2013

Extracting and connecting chemical structures from text sources using chemicalize.org by Christopher Southan and Andras Stracz.

Abstract:

Background

Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents) the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.

Results

Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and/or merged extractions.

Conclusion

This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.

A great example of building a resource to address identity issues in a specific domain.

The result speaks for itself.

PS: The results were not delayed awaiting a reformation of chemistry to use a common identifier.
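The set-intersection step mentioned in the abstract is easy to picture. A minimal sketch in Python, assuming each extraction has already been reduced to a set of InChIKeys. The source names and the placeholder keys are made up for illustration and are not taken from the paper.

```python
# Hypothetical extraction results: one set of InChIKeys per source document.
# The keys are placeholders, not real compounds.
extractions = {
    "patent": {"KEY-A", "KEY-B", "KEY-C"},
    "paper": {"KEY-B", "KEY-C", "KEY-D"},
    "abstracts": {"KEY-C", "KEY-E"},
}


def compounds_in_common(sources, names):
    """Return the keys present in every named source."""
    sets = [sources[name] for name in names]
    return set.intersection(*sets)


shared_pair = compounds_in_common(extractions, ["patent", "paper"])
shared_all = compounds_in_common(extractions, ["patent", "paper", "abstracts"])
print(sorted(shared_pair))
print(sorted(shared_all))
```

The hard part, of course, is producing comparable identifiers in the first place. Once every source speaks the same identifier, finding compounds-in-common is a one-liner.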

Crowdsourcing Chemistry for the Community…

Friday, April 5th, 2013

Crowdsourcing Chemistry for the Community — 5 Years of Experiences by Antony Williams.

From the description:

ChemSpider is one of the internet’s primary resources for chemists. ChemSpider is a structure-centric platform and hosts over 26 million unique chemical entities sourced from over 400 different data sources and delivers information including commercial availability, associated publications, patents, analytical data, experimental and predicted properties. ChemSpider serves a rather unique role to the community in that any chemist has the ability to deposit, curate and annotate data. In this manner they can contribute their skills, and data, to any chemist using the system. A number of parallel projects have been developed from the initial platform including ChemSpider SyntheticPages, a community generated database of reaction syntheses, and the Learn Chemistry wiki, an educational wiki for secondary school students.

This presentation will provide an overview of the project in terms of our success in engaging scientists to contribute to crowdsouring chemistry. We will also discuss some of our plans to encourage future participation and engagement in this and related projects.

Perhaps not encouraging in terms of the rate of participation but certainly encouraging in terms of the impact of those who do participate.

I suspect the ratio of contributors to users isn’t that far off from the ratios observed in open source projects.

On the whole, I take this as a plus sign for crowd-sourced curation projects, including topic maps.

I first saw this in a tweet by ChemConnector.

Cheminformatics Supplements

Thursday, March 28th, 2013

Cheminformatics Supplements

I ran across a pointer today to abstracts for the 8th German Conference on Chemoinformatics: 26 CIC-Workshop from Chemistry Central.

I will pull several of the abstracts for fuller treatment but whatever I choose, I will miss the very abstract of interest to you.

Moreover, the link at the top of this post takes you to all the “supplements” from Chemistry Central.

I am sure you will find a wealth of information.

Introducing tags to Journal of Cheminformatics

Sunday, March 24th, 2013

Introducing tags to Journal of Cheminformatics by Bailey Fallon.

From the post:

Journal of Cheminformatics will now be “tagging” its publications, allowing articles related by common themes to be linked together.

Where an article has been tagged, readers will be able access all other articles that share the same tag via a link at the right hand side of the HTML, making it easier to find related content within the journal.

This functionality has been launched for three resources that appear frequently in Journal of Cheminformatics and we will continue to add tags when relevant.

  • Open Babel: Open Babel is an open source chemical toolbox that interconverts over 110 chemical data formats. The first paper describing the features and implementation of open Babel appeared in Journal of Cheminformatics in 2011, and this tag links it with a number of other papers that use the toolkit
  • PubChem: PubChem is an open archive for the biological activities of small molecules, which provides search and analysis tools to assist users in locating desired information. This tag amalgamates the papers published in the PubChem3D thematic series with other papers reporting applications and developments of PubChem
  • InChI: The InChI is as a textual identifier for chemical substances, which provides a standard way of representing chemical information. It is machine readable, making it a valuable tool for cheminformaticians, and this tag links a number of papers in Journal of Cheminformatics that rely on its use

It’s not sophisticated authoring of associations but, carefully done, tagging can collate information resources for users.

On export to a topic map application, implied roles could be made explicit, assuming the original tagging was consistent.

Curating Inorganics? No. (ChEMBL)

Monday, March 18th, 2013

The results are in – inorganics are out!

From the ChEMBL-og blog, which “covers the activities of the Computational Chemical Biology Group at the EMBL-EBI in Hinxton.”

From the post:

A few weeks ago we ran a small poll on how we should deal with inorganic molecules – not just simple sodium salts, but things like organoplatinums, and other compounds with dative bonds, unusual electronic states, etc. The results from you were clear, there was little interest in having a lot of our curation time spent on these. We will continue to collect structures from the source journals, and they will be in the full database, but we won’t try and curate the structures, or display them in the interface. They will be appropriately flagged, and nothing will get lost. So there it is, democracy in action.

So for ChEMBL 16 expect fewer issues when you try and load our structures in your own pipelines and systems.

Just an FYI that inorganic compounds are not being curated at ChEMBL.

If you decide to undertake such work, contacting ChEMBL to coordinate collection, etc., would be a good first step.

Crowdsourced Chemistry… [Documents vs. Data]

Monday, March 18th, 2013

Crowdsourced Chemistry: Why Online Chemistry Data Needs Your Help by Antony Williams. (video)

From the description:

This is the Ignite talk that I gave at ScienceOnline2010 #sci010 in the Research Triangle Park in North Carolina on January 16th 2010. This was supposed to be a 5 minute talk highlighting the quality of chemistry data on the internet. Ok, it was a little tongue in cheek because it was an after dinner talk and late at night but the data are real, the problem is real and the need for data curation of chemistry data online is real. On ChemSpider we have provided a platform to deposit and curate data. Other videos will show that in the future.

Great demonstration of the need for curation in chemistry.

And of the impact that re-usable information can have on the quality of information.

The errors in chemical descriptions you see in this video could be corrected:

  • In an article.
  • In a monograph.
  • In a webpage.
  • In an online resource that can be incorporated by reference.

Which one do you think would propagate the corrected information more quickly?

Documents are a great way to convey information to a reader.

They are an incredibly poor way to store/transmit information.

Every reader has to extract the information in a document for themselves.

Not to mention that the data in a document is fixed, unless the document incorporates information by reference.

Funny, isn’t it? We are still storing data as we did when clay tablets were the medium of choice.

Isn’t it time we separated presentation (documents) from storage/transmission (data)?

Using molecular networks to assess molecular similarity

Friday, February 15th, 2013

Systems chemistry: Using molecular networks to assess molecular similarity by Bailey Fallon.

From the post:

In new research published in Journal of Systems Chemistry, Sijbren Otto and colleagues have provided the first experimental approach towards molecular networks that can predict bioactivity based on an assessment of molecular similarity.

Molecular similarity is an important concept in drug discovery. Molecules that share certain features such as shape, structure or hydrogen bond donor/acceptor groups may have similar properties that make them common to a particular target. Assessment of molecular similarity has so far relied almost exclusively on computational approaches, but Dr Otto reasoned that a measure of similarity might be obtained by interrogating the molecules in solution experimentally.

Important work for drug discovery but there are semantic lessons here as well:

Tests for similarity/sameness are domain specific.

Which means there are no universal tests for similarity/sameness.

Lacking universal tests for similarity/sameness, we should focus on developing documented and domain specific tests for similarity/sameness.

Domain specific tests provide quicker ROI than less useful and doomed universal solutions.

Documented domain specific tests may, no guarantees, enable us to find commonalities between domain measures of similarity/sameness.

But our conclusions will be based on domain experience and not on projection from our domain onto other, less well-known domains.

InChI in the wild: An Assessment of InChIKey searching in Google

Thursday, February 14th, 2013

InChI in the wild: An Assessment of InChIKey searching in Google by Christopher Southan. (Journal of Cheminformatics 2013, 5:10 doi:10.1186/1758-2946-5-10)

Abstract:

While chemical databases can be queried using the InChI string and InChIKey (IK) the latter was designed for open-web searching. It is becoming increasingly effective for this since more sources enhance crawling of their websites by the Googlebot and consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK as explored in this review. As an example, the IK for atorvastatin retrieves ~200 low-redundancy links from a Google search in 0.3 of a second. These include most major databases and a very low false-positive rate. Results encompass less familiar but potentially useful sources and can be extended to isomer capture by using just the skeleton layer of the IK. Google Advanced Search can be used to filter large result sets and image searching with the IK is also effective and complementary to open-web queries. Results can be particularly useful for less-common structures as exemplified by a major metabolite of atorvastatin giving only three hits. Testing also demonstrated document-to-document and document-to-database joins via structure matching. The necessary generation of an IK from chemical names can be accomplished using open tools and resources for patents, papers, abstracts or other text sources. Active global sharing of local IK-linked information can be accomplished via surfacing in open laboratory notebooks, blogs, Twitter, figshare and other routes. While information-rich chemistry (e.g. approved drugs) can exhibit swamping and redundancy effects, the much smaller IK result sets for link-poor structures become a transformative first-pass option. The IK indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records. 
The simplicity, specificity and speed of matching make it a useful option for biologists or others less familiar with chemical searching. However, compared to rigorously maintained major databases, users need to be circumspect about the consistency of Google results and provenance of retrieved links. In addition, community engagement may be necessary to ameliorate possible future degradation of utility.

An interesting use of an identifier, not as a key to a database, as a recent comment suggested, but as the basis for enhanced search results.

How else would you use identifiers “in the wild?”
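One answer is hinted at in the abstract's remark about "isomer capture by using just the skeleton layer of the IK." A standard InChIKey is three hyphen-separated blocks, and the first 14-character block is a hash of the connectivity, so keys that share it are candidates for stereoisomers and other close relatives. A rough sketch of grouping keys that way follows. The keys are shape-correct placeholders, not real compounds, and the function names are my own.

```python
from collections import defaultdict


def skeleton(inchikey):
    """Return the first (14-character) block of a standard InChIKey."""
    first, _, _ = inchikey.partition("-")
    if len(first) != 14:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return first


def group_by_skeleton(inchikeys):
    """Group full keys that share the same first block."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[skeleton(key)].append(key)
    return dict(groups)


# Placeholder keys with the right shape; they are not real compounds.
keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "DDDDDDDDDDDDDD-EEEEEEEESA-N",
]
groups = group_by_skeleton(keys)
print(groups)
```

Searching on the first block alone trades precision for recall, which is the same bargain you strike any time you loosen a test for identity.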

Chemical datuments as scientific enablers

Friday, January 25th, 2013

Chemical datuments as scientific enablers by Henry S Rzepa. (Journal of Cheminformatics 2013, 5:6 doi:10.1186/1758-2946-5-6)

Abstract:

This article is an attempt to construct a chemical datument as a means of presenting insights into chemical phenomena in a scientific journal. An exploration of the interactions present in a small fragment of duplex Z-DNA and the nature of the catalytic centre of a carbon-dioxide/alkene epoxide alternating co-polymerisation is presented in this datument, with examples of the use of three software tools, one based on Java, the other two using Javascript and HTML5 technologies. The implications for the evolution of scientific journals are discussed.

From the background:

Chemical sciences are often considered to stand at the crossroads of paths to many disciplines, including molecular and life sciences, materials and polymer sciences, physics, mathematical and computer sciences. As a research discipline, chemistry has itself evolved over the last few decades to focus its metaphorical microscope on both far larger and more complex molecular systems than previously attempted, as well as uncovering a far more subtle understanding of the quantum mechanical underpinnings of even the smallest of molecules. Both these extremes, and everything in between, rely heavily on data. Data in turn is often presented in the form of visual or temporal models that are constructed to illustrate molecular behaviour and the scientific semantics. In the present article, I argue that the mechanisms for sharing both the underlying data, and the (semantic) models between scientists need to evolve in parallel with the increasing complexity of these models. Put simply, the main exchange mechanism, the scientific journal, is accepted [1] as seriously lagging behind in its fitness for purpose. It is in urgent need of reinvention; one experiment in such was presented as a data-rich chemical exploratorium [2]. My case here in this article will be based on my recent research experiences in two specific areas. The first involves a detailed analysis of the inner kernel of the Z-DNA duplex using modern techniques for interpreting the electronic properties of a molecule. The second recounts the experiences learnt from modelling the catalysed alternating co-polymerisation of an alkene epoxide and carbon dioxide.

Effective sharing of data, in scientific journals or no, requires either a common semantic (we know that’s uncommon) or a mapping between semantics (how many times must we repeat the same mappings, separately?).

Embedding notions of subject identity and mapping between identifications in chemical datuments could increase the reuse of data, as well as its longevity.

UniChem…[How Much Precision Can You Afford?]

Thursday, January 17th, 2013

UniChem: a unified chemical structure cross-referencing and identifier tracking system by Jon Chambers, Mark Davies, Anna Gaulton, Anne Hersey, Sameer Velankar, Robert Petryszak, Janna Hastings, Louisa Bellis, Shaun McGlinchey and John P Overington. (Journal of Cheminformatics 2013, 5:3 doi:10.1186/1758-2946-5-3)

Abstract:

UniChem is a freely available compound identifier mapping service on the internet, designed to optimize the efficiency with which structure-based hyperlinks may be built and maintained between chemistry-based resources. In the past, the creation and maintenance of such links at EMBL-EBI, where several chemistry-based resources exist, has required independent efforts by each of the separate teams. These efforts were complicated by the different data models, release schedules, and differing business rules for compound normalization and identifier nomenclature that exist across the organization. UniChem, a large-scale, non-redundant database of Standard InChIs with pointers between these structures and chemical identifiers from all the separate chemistry resources, was developed as a means of efficiently sharing the maintenance overhead of creating these links. Thus, for each source represented in UniChem, all links to and from all other sources are automatically calculated and immediately available for all to use. Updated mappings are immediately available upon loading of new data releases from the sources. Web services in UniChem provide users with a single simple automatable mechanism for maintaining all links from their resource to all other sources represented in UniChem. In addition, functionality to track changes in identifier usage allows users to monitor which identifiers are current, and which are obsolete. Lastly, UniChem has been deliberately designed to allow additional resources to be included with minimal effort. Indeed, the recent inclusion of data sources external to EMBL-EBI has provided a simple means of providing users with an even wider selection of resources with which to link to, all at no extra cost, while at the same time providing a simple mechanism for external resources to link to all EMBL-EBI chemistry resources.

From the background section:

Since these resources are continually developing in response to largely distinct active user communities, a full integration solution, or even the imposition of a requirement to adopt a common unifying chemical identifier, was considered unnecessarily complex, and would inhibit the freedom of each of the resources to successfully evolve in future. In addition, it was recognized that in the future more small molecule-containing databases might reside at EMBL-EBI, either because existing databases may begin to annotate their data with chemical information, or because entirely new resources are developed or adopted. This would make a full integration solution even more difficult to sustain. A need was therefore identified for a flexible integration solution, which would create, maintain and manage links between the resources, with minimal maintenance costs to the participant resources, whilst easily allowing the inclusion of additional sources in the future. Also, since the solution should allow different resources to maintain their own identifier systems, it was recognized as important for the system to have some simple means of tracking identifier usage, at least in the sense of being able to archive obsolete identifiers and assignments, and indicate when obsolete assignments were last in use.

The UniChem project highlights an important aspect of mapping identifiers: How much mapping can you afford?

Or perhaps even better: What is the cost/benefit ratio for a complete mapping?

The mapping in question isn’t an academic exercise in elegance and completeness.

Its users have an immediate need for the mapping data and, if it is not quite right, human users are in the best position to notice errors and suggest corrections.

Not to mention that new identifiers are likely to arrive before the old ones are completely mapped.

Suggestive that evolving mappings may be an appropriate paradigm for topic maps.
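To make the idea concrete, here is a toy version of structure-keyed cross-referencing as the abstract describes it: every source keeps its own identifiers, and links fall out of a shared Standard InChI. This is a sketch under my own assumptions, not UniChem's data model or web service. The source names, identifiers and InChI strings are all placeholders, and I assume one identifier per source per structure.

```python
from collections import defaultdict

# (source, identifier, standard_inchi) records; all values are made up.
records = [
    ("source_a", "A-001", "InChI=1S/placeholder-1"),
    ("source_b", "B-17", "InChI=1S/placeholder-1"),
    ("source_c", "C-9", "InChI=1S/placeholder-1"),
    ("source_a", "A-002", "InChI=1S/placeholder-2"),
    ("source_b", "B-42", "InChI=1S/placeholder-2"),
]


def build_index(rows):
    """Index identifiers by structure: inchi -> {source: identifier}."""
    index = defaultdict(dict)
    for source, identifier, inchi in rows:
        index[inchi][source] = identifier
    return index


def cross_references(index, source, identifier):
    """Find identifiers in other sources for the same structure."""
    for by_source in index.values():
        if by_source.get(source) == identifier:
            return {s: i for s, i in by_source.items() if s != source}
    return {}


index = build_index(records)
links = cross_references(index, "source_a", "A-001")
print(links)
```

Adding a new source means adding rows, not renegotiating anyone's identifiers, which is why the mapping can keep evolving.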

chemf: A purely functional chemistry toolkit

Thursday, January 17th, 2013

chemf: A purely functional chemistry toolkit by Stefan Höck and Rainer Riedl. (Journal of Cheminformatics 2012, 4:38 doi:10.1186/1758-2946-4-38)

Abstract:

Background

Although programming in a type-safe and referentially transparent style offers several advantages over working with mutable data structures and side effects, this style of programming has not seen much use in chemistry-related software. Since functional programming languages were designed with referential transparency in mind, these languages offer a lot of support when writing immutable data structures and side-effects free code. We therefore started implementing our own toolkit based on the above programming paradigms in a modern, versatile programming language.

Results

We present our initial results with functional programming in chemistry by first describing an immutable data structure for molecular graphs together with a couple of simple algorithms to calculate basic molecular properties before writing a complete SMILES parser in accordance with the OpenSMILES specification. Along the way we show how to deal with input validation, error handling, bulk operations, and parallelization in a purely functional way. At the end we also analyze and improve our algorithms and data structures in terms of performance and compare it to existing toolkits both object-oriented and purely functional. All code was written in Scala, a modern multi-paradigm programming language with a strong support for functional programming and a highly sophisticated type system.

Conclusions

We have successfully made the first important steps towards a purely functional chemistry toolkit. The data structures and algorithms presented in this article perform well while at the same time they can be safely used in parallelized applications, such as computer aided drug design experiments, without further adjustments. This stands in contrast to existing object-oriented toolkits where thread safety of data structures and algorithms is a deliberate design decision that can be hard to implement. Finally, the level of type-safety achieved by Scala highly increased the reliability of our code as well as the productivity of the programmers involved in this project.

Another vote in favor of functional programming as a path to parallel processing.

Can the next step, identity transparency*, be far behind?

*Identity transparency: where any identification of an entity can be replaced with another identification of the same entity.
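If "immutable data structure for molecular graphs" sounds abstract, the flavor is easy to show. The authors wrote chemf in Scala; what follows is my own rough Python analogue using frozen dataclasses, not their design. The class names, the methods and the tiny mass table are assumptions made for the example.

```python
from dataclasses import dataclass

# Rounded standard atomic masses, enough for the example.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


@dataclass(frozen=True)
class Bond:
    a: int  # index of the first atom
    b: int  # index of the second atom
    order: int = 1


@dataclass(frozen=True)
class Molecule:
    atoms: tuple = ()  # element symbols, e.g. ("C", "O")
    bonds: tuple = ()  # Bond instances

    def with_atom(self, symbol):
        """Return a new molecule with one more atom; self is unchanged."""
        return Molecule(self.atoms + (symbol,), self.bonds)

    def with_bond(self, a, b, order=1):
        """Return a new molecule with one more bond; self is unchanged."""
        return Molecule(self.atoms, self.bonds + (Bond(a, b, order),))

    def mass(self):
        """Sum of atomic masses of the listed atoms."""
        return sum(ATOMIC_MASS[s] for s in self.atoms)


empty = Molecule()
# Heavy atoms of ethanol (C-C-O); hydrogens are omitted for brevity.
heavy = empty.with_atom("C").with_atom("C").with_atom("O")
heavy = heavy.with_bond(0, 1).with_bond(1, 2)
print(heavy.atoms, len(heavy.bonds), round(heavy.mass(), 3))
```

Because every "change" returns a new value and nothing is modified in place, the same molecule can be handed to several threads or processes without locks, which is the parallelism argument the abstract makes.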

PubChem3D: conformer ensemble accuracy

Sunday, January 13th, 2013

PubChem3D: conformer ensemble accuracy by Sunghwan Kim, Evan E Bolton and Stephen H Bryant. (Journal of Cheminformatics 2013, 5:1 doi:10.1186/1758-2946-5-1)

Abstract:

Background

PubChem is a free and publicly available resource containing substance descriptions and their associated biological activity information. PubChem3D is an extension to PubChem containing computationally-derived three-dimensional (3-D) structures of small molecules. All the tools and services that are a part of PubChem3D rely upon the quality of the 3-D conformer models. Construction of the conformer models currently available in PubChem3D involves a clustering stage to sample the conformational space spanned by the molecule. While this stage allows one to downsize the conformer models to more manageable size, it may result in a loss of the ability to reproduce experimentally determined “bioactive” conformations, for example, found for PDB ligands. This study examines the extent of this accuracy loss and considers its effect on the 3-D similarity analysis of molecules.

Results

The conformer models consisting of up to 100,000 conformers per compound were generated for 47,123 small molecules whose structures were experimentally determined, and the conformers in each conformer model were clustered to reduce the size of the conformer model to a maximum of 500 conformers per molecule. The accuracy of the conformer models before and after clustering was evaluated using five different measures: root-mean-square distance (RMSD), shape-optimized shape-Tanimoto (STST-opt) and combo-Tanimoto (ComboTST-opt), and color-optimized color-Tanimoto (CTCT-opt) and combo-Tanimoto (ComboTCT-opt). On average, the effect of clustering decreased the conformer model accuracy, increasing the conformer ensemble’s RMSD to the bioactive conformer (by 0.18 ± 0.12 Å), and decreasing the STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt scores (by 0.04 ± 0.03, 0.16 ± 0.09, 0.09 ± 0.05, and 0.15 ± 0.09, respectively).

Conclusion

This study shows the RMSD accuracy performance of the PubChem3D conformer models is operating as designed. In addition, the effect of PubChem3D sampling on 3-D similarity measures shows that there is a linear degradation of average accuracy with respect to molecular size and flexibility. Generally speaking, one can likely expect the worst-case minimum accuracy of 90% or more of the PubChem3D ensembles to be 0.75, 1.09, 0.43, and 1.13, in terms of STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt, respectively. This expected accuracy improves linearly as the molecule becomes smaller or less flexible.

If I were to say, potential shapes of a subject, would that make the importance of this work clearer?

Wikipedia has this two-liner that may also help:

A macromolecule is usually flexible and dynamic. It can change its shape in response to changes in its environment or other factors; each possible shape is called a conformation, and a transition between them is called a conformational change. A macromolecular conformational change may be induced by many factors such as a change in temperature, pH, voltage, ion concentration, phosphorylation, or the binding of a ligand.

Subjects and the manner of their identification are a very deep and rewarding field of study.

An identification method in isolation is no better or worse than any other identification method.

Only your requirements (which are also subjects) can help with the process of choosing one or more identification methods over others.
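For readers who want a feel for the measures named in the abstract, here is a bare-bones sketch of two of them. It is not the PubChem3D pipeline. The RMSD below assumes the two conformers are already superimposed and list their atoms in the same order, and the Tanimoto function is the usual overlap form, V_AB / (V_AA + V_BB - V_AB), applied to numbers I invented.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two conformers given as matched lists of (x, y, z).

    No superposition is performed here; the conformers are assumed to be
    already aligned and to list their atoms in the same order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def tanimoto(overlap_ab, self_a, self_b):
    """Overlap-based Tanimoto: V_AB / (V_AA + V_BB - V_AB)."""
    return overlap_ab / (self_a + self_b - overlap_ab)


# Toy three-atom conformers in angstroms; the numbers are invented.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
candidate = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.3)]
print(rmsd(reference, candidate))
print(tanimoto(8.0, 10.0, 10.0))
```

Two measures, two different answers to "how close is this shape to that one?" Which one you want depends, again, on your requirements.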

Automated compound classification using a chemical ontology

Sunday, January 13th, 2013

Automated compound classification using a chemical ontology by Claudia Bobach, Timo Böhme, Ulf Laube, Anett Püschel and Lutz Weber. (Journal of Cheminformatics 2012, 4:40 doi:10.1186/1758-2946-4-40)

Abstract:

Background

Classification of chemical compounds into compound classes by using structure derived descriptors is a well-established method to aid the evaluation and abstraction of compound properties in chemical compound databases. MeSH and recently ChEBI are examples of chemical ontologies that provide a hierarchical classification of compounds into general compound classes of biological interest based on their structural as well as property or use features. In these ontologies, compounds have been assigned manually to their respective classes. However, with the ever increasing possibilities to extract new compounds from text documents using name-to-structure tools and considering the large number of compounds deposited in databases, automated and comprehensive chemical classification methods are needed to avoid the error prone and time consuming manual classification of compounds.

Results

In the present work we implement principles and methods to construct a chemical ontology of classes that shall support the automated, high-quality compound classification in chemical databases or text documents. While SMARTS expressions have already been used to define chemical structure class concepts, in the present work we have extended the expressive power of such class definitions by expanding their structure based reasoning logic. Thus, to achieve the required precision and granularity of chemical class definitions, sets of SMARTS class definitions are connected by OR and NOT logical operators. In addition, AND logic has been implemented to allow the concomitant use of flexible atom lists and stereochemistry definitions. The resulting chemical ontology is a multi-hierarchical taxonomy of concept nodes connected by directed, transitive relationships.

Conclusions

A proposal for a rule based definition of chemical classes has been made that allows to define chemical compound classes more precisely than before. The proposed structure based reasoning logic allows to translate chemistry expert knowledge into a computer interpretable form, preventing erroneous compound assignments and allowing automatic compound classification. The automated assignment of compounds in databases, compound structure files or text documents to their related ontology classes is possible through the integration with a chemistry structure search engine. As an application example, the annotation of chemical structure files with a prototypic ontology is demonstrated.

While creating an ontology to assist with compound classification, the authors concede the literature contains much semantic diversity:

Chemists use a variety of expressions to create compound class terms from a specific compound name – for example “backbone”, “scaffold”, “derivative”, “compound class” are often used suffixes or “substituted” is a common prefix that generates a class term. Unfortunately, the meaning of different chemical class terms is often not defined precisely and their usage may differ significantly due to historic reasons and depending on the compound class. For example, 2-ethyl-imidazole 1 belongs without doubt to the class of compounds having a imidazole scaffold, backbone or being an imidazole derivative or substituted imidazole. In contrast, pregnane 2 illustrates a more complicated case – as in case of 2-ethyl-imidazole this compound could be considered a 17-ethyl-derivative of the androstane scaffold 3. However, this would suggest a wrong compound classification as pregnanes are not considered to be androstane derivatives – although 2 contains androstane 3 as a substructure (Figure 1). This particular, structurally illogical naming convention goes back to the fundamentally different biological activities of specific compounds with a pregnane or androstane backbone, resulting in the perception that androstanes and pregnanes do not show a parent–child relation but are rather sibling concepts at the same hierarchical level. Thus, any expert chemical ontology will appreciate this knowledge and the androstane compound class structural definition needs to contain a definition that any androstane shall NOT contain a carbon substitution at the C-17 position. (emphasis added)

Not that present day researchers would create a structurally illogical naming convention in the view of future researchers.
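The androstane rule in the quoted passage combines a substructure match with a NOT condition. Here is a minimal sketch of that kind of rule logic, assuming each SMARTS match has already been reduced to a named boolean feature. The feature names and the helper functions are hypothetical and are not the authors' implementation:

```python
from typing import Callable, Set

# A "compound" is modelled as the set of structural features that a SMARTS
# matcher would have reported for it. Feature names are hypothetical.
Compound = Set[str]
Rule = Callable[[Compound], bool]


def has(feature: str) -> Rule:
    """Rule that holds when the compound carries the given feature."""
    return lambda compound: feature in compound


def all_of(*rules: Rule) -> Rule:
    """Logical AND over rules."""
    return lambda compound: all(rule(compound) for rule in rules)


def any_of(*rules: Rule) -> Rule:
    """Logical OR over rules."""
    return lambda compound: any(rule(compound) for rule in rules)


def none_of(*rules: Rule) -> Rule:
    """Logical NOT: holds only when none of the rules match."""
    return lambda compound: not any(rule(compound) for rule in rules)


# Androstane class: the skeleton is present AND there is NOT a carbon
# substituent at C-17 (the exclusion described in the quoted passage).
is_androstane = all_of(
    has("androstane_skeleton"),
    none_of(has("c17_carbon_substituent")),
)

androstane = {"androstane_skeleton"}
pregnane = {"androstane_skeleton", "c17_carbon_substituent"}

print(is_androstane(androstane), is_androstane(pregnane))
```

The point of the NOT rule is that substructure containment alone would file pregnane under androstane; the exclusion keeps the two classes as siblings.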

The IUPAC International Chemical Identifier (InChI)….

Saturday, January 5th, 2013

The IUPAC International Chemical Identifier (InChI) and its influence on the domain of chemical information edited by Dr. Antony Williams.

From the webpage:

The International Chemical Identifier (InChI) has had a dramatic impact on providing a means by which to deduplicate, validate and link together chemical compounds and related information across databases. Its influence has been especially valuable as the internet has exploded in terms of the amount of chemistry related information available online. This thematic issue aggregates a number of contributions demonstrating the value of InChI as an enabling technology in the world of cheminformatics and its continuing value for linking chemistry data.

If you are interested in chemistry/cheminformatics or in the development and use of identifiers, this is an issue to not miss!

You will find:

InChIKey collision resistance: an experimental testing by Igor Pletnev, Andrey Erin, Alan McNaught, Kirill Blinov, Dmitrii Tchekhovskoi, Steve Heller.

Consistency of systematic chemical identifiers within and between small-molecule databases by Saber A Akhondi, Jan A Kors, Sorel Muresan.

InChI: a user’s perspective by Steven M Bachrach.

InChI: connecting and navigating chemistry by Antony J Williams.

I particularly enjoyed Steven Bachrach’s comment:

It is important to recognize that in no way does InChI replace or make outmoded any other chemical identifier. A company that has developed their own registry system or one that uses one of the many other identifiers, like a MOLfile [13], can continue to use their internal system. Adding the InChI to their system provides a means for connecting to external resources in a simple fashion, without exposing any of their own internal technologies.

Or to put it differently, InChI increased the value of existing chemical identifiers.

How’s that for a recipe for adoption?
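To make Bachrach's point concrete, here is a minimal sketch of an internal registry that keeps its own identifiers and only adds an InChIKey column for linking outward. The `REG-` identifiers and the external records are hypothetical; the InChIKeys are the standard keys for the named compounds:

```python
# Internal registry: private identifiers mapped to standard InChIKeys.
# The REG-... identifiers are hypothetical.
internal_registry = {
    "REG-0001": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",  # ethanol
    "REG-0002": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",  # water
}

# An external resource keyed by InChIKey (made-up records for illustration).
external_records = {
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": {"name": "ethanol"},
    "VNWKTOKETHGBQD-UHFFFAOYSA-N": {"name": "methane"},
}


def link_external(registry, external):
    """Return {internal_id: external_record} for keys present in both."""
    return {
        internal_id: external[key]
        for internal_id, key in registry.items()
        if key in external
    }


links = link_external(internal_registry, external_records)
print(links)
```

Nothing about the internal identifiers changes or leaves the building; the InChIKey is the only thing shared.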

Royal Society of Chemistry (RSC) – National Chemical Database Service

Friday, December 28th, 2012

From the homepage: (Goes live: 2nd January 2013)

National Chemical Database Service

The RSC will be operating the EPSRC National Chemical Database Service from 2013-2017

What is the RSC’s vision for the Service?

We intend to build the Service for the future – to develop a chemistry data repository for UK academia, and to build tools, models and services on this data store to increase the value and impact of researchers’ funded work. We will continue to develop this data store through the lifetime of the contract period and look forward to working with the community to make this a world-leading exemplar of the value of research data availability.

The Service will also offer access to a suite of commercial databases and services. While there will be some overlap with currently provided databases popular with the user community we will deliver new data and services and optimize the offering based on user feedback.

When will the Service be available?

The Service will start on 2nd January 2013, and will be available at cds.rsc.org.

The database services we are working to have available at launch are the Cambridge Structural Database, ACD/ILab and Accelrys’ Available Chemicals Directory. The Service will also include integrated access to the RSC’s award winning ChemSpider database. As ‘live’ dates for other services become clear, they will appear here.

See also: Initial Demonstrations of the Interactive Laboratory Service as part of the Chemical Database Service

and,

Initial Demonstration of the Integration to the Accelrys Available Chemicals Directory Web Service

I just looked at the demos but was particularly impressed with their handling of identifiers. Really impressed. There are lessons here for other information services.

BTW, I did have to hunt to discover that RSC = Royal Society of Chemistry. ;-)

The InChI and its influence on the domain of chemical information

Sunday, December 16th, 2012

The InChI and its influence on the domain of chemical information by Bailey Fallon.

From the post:

A thematic series on the IUPAC International Chemical Identifier (InChI) and Its Influence on the Domain of Chemical Information has just seen its first articles published in Journal of Cheminformatics.

The InChI is a textual identifier for chemical substances, which provides a standard way of representing chemical information. It is machine readable, allowing it to be used for structure searching in databases and on the web. This thematic issue, edited by Dr Antony Williams at the Royal Society of Chemistry, aggregates a number of contributions demonstrating the value of InChI as an enabling technology in the world of cheminformatics and its continuing value for linking chemistry data.

Certainly should command your attention if you are in cheminformatics.

But also if you want to duplicate its success.

Open Babel: One year on

Friday, December 7th, 2012

Open Babel: One year on by Bailey Fallon.

From the post:

Just over a year ago, Journal of Cheminformatics published a paper describing the open source chemical toolbox, Open Babel. Despite almost 10 years as an independent project, a description of the features and implementation of Open Babel had never been formally published. However, in the 14 months since publication, the Open Babel paper has quickly become one of the most influential articles in Journal of Cheminformatics. It is the second most cited article in the journal according to Thomson Reuters Web of Science, and is amongst the most widely read, with close to 10 000 accesses. The software itself has been downloaded over 40 000 times in the last year alone.

Open Babel attempts to solve a common problem in cheminformatics – the need to convert between different chemical structure formats. It offers a solution by allowing anyone to search, convert, analyze, or store data in over 110 formats covering molecular modeling, chemistry, solid-state materials, biochemistry, and related areas.

Introductory training guide to Open Babel (by Noel O’Boyle).

That’s impressive!

But you need to remember it wasn’t that many years ago when commercial conversion software offered more than 300 source and target formats.

Still, worth taking a deep look to see if there are useful lessons for topic maps.

RMol: …SD/Molfile structure information into R Objects

Saturday, November 17th, 2012

RMol: A Toolset for Transforming SD/Molfile structure information into R Objects by Martin Grabner, Kurt Varmuza and Matthias Dehmer.

Abstract:

Background

The graph-theoretical analysis of molecular networks has a long tradition in chemoinformatics. As demonstrated frequently, a well designed format to encode chemical structures and structure-related information of organic compounds is the Molfile format. But when it comes to use modern programming languages for statistical data analysis in Bio- and Chemoinformatics, R as one of the most powerful free languages lacks tools to process R Molfile data collections and import molecular network data into R.

Results

We design an R object which allows a lossless information mapping of structural information from Molfiles into R objects. This provides the basis to use the RMol object as an anchor for connecting Molfile data collections with R libraries for analyzing graphs. Associated with the RMol objects, a set of R functions completes the toolset to organize, describe and manipulate the converted data sets. Further, we bypass R-typical limits for manipulating large data sets by storing R objects in bz-compressed serialized files instead of employing RData files.

Conclusions

By design, RMol is a R tool set without dependencies to other libraries or programming languages. It is useful to integrate into pipelines for serialized batch analysis by using network data and, therefore, helps to process sdf-data sets in R efficiently. It is freely available under the BSD licence. The script source can be downloaded from http://sourceforge.net/p/rmol-toolset.

Important work, not the least because of the explosion of interest in bio/cheminformatics.

If I understand the rationale for the software, it:

  1. enables use of existing R tools for graph/network analysis
  2. fits well into workflows with serialized pipelines
  3. reduces dependencies by extracting SD-File information
  4. avoids repetitive transformations by storing chemical and molecular network information in R objects

All of which are true but I have a nagging concern about the need for transformation.

Knowing the structure of Molfiles and the requirements of R tools for graph/network analysis, how are the results of transformation different from R tools viewing Molfiles “as if” they were composed of R objects?

The mapping is already well known because that is what RMol uses to create the results of transformation. Moreover, for any particular use, more data may be transformed than is required for a particular analysis.

Not to take anything away from very useful work but the days of transformation of data are numbered. As data sets grow in size, there will be fewer and fewer places to store a “transformed” data set.
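To illustrate what viewing a Molfile “as if” it were a graph object might look like, here is a minimal sketch in Python (not R, and not RMol). It assumes a V2000 Molfile with its fixed-width columns; the `MolfileView` class and its method names are hypothetical:

```python
from typing import Iterator, List, Tuple

# A tiny V2000 Molfile for ethanol (heavy atoms only), written line by line
# so the fixed-width columns stay visible.
MOLFILE_LINES: List[str] = [
    "ethanol",
    "  sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "    1.5000    0.0000    0.0000 C   0  0",
    "    2.2500    1.3000    0.0000 O   0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
]


class MolfileView:
    """Answers graph questions by reading Molfile lines on demand.

    No separate graph object is built; the Molfile text stays the only copy.
    """

    HEADER_LINES = 3  # name, program/timestamp, comment

    def __init__(self, lines: List[str]) -> None:
        self._lines = lines
        counts = lines[self.HEADER_LINES]
        self.n_atoms = int(counts[0:3])  # columns 1-3 of the counts line
        self.n_bonds = int(counts[3:6])  # columns 4-6 of the counts line

    def symbol(self, atom: int) -> str:
        """Element symbol for a 1-based atom index (columns 32-34)."""
        line = self._lines[self.HEADER_LINES + atom]
        return line[31:34].strip()

    def bonds(self) -> Iterator[Tuple[int, int, int]]:
        """Yield (atom1, atom2, bond_order) straight from the bond block."""
        start = self.HEADER_LINES + 1 + self.n_atoms
        for line in self._lines[start:start + self.n_bonds]:
            yield int(line[0:3]), int(line[3:6]), int(line[6:9])

    def neighbors(self, atom: int) -> Iterator[int]:
        """Yield atoms bonded to the given 1-based atom index."""
        for first, second, _order in self.bonds():
            if first == atom:
                yield second
            elif second == atom:
                yield first


view = MolfileView(MOLFILE_LINES)
print(view.symbol(2), [view.symbol(n) for n in view.neighbors(2)])
```

Only the parts of the file a question touches are ever read, which is the trade being suggested: pay a little per query instead of storing a second, transformed copy.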

BTW, pay particular attention to the bibliography in this paper. Numerous references to follow if you are interested in this area.

Cheminformatics

Wednesday, November 14th, 2012

Cheminformatics by Joerg Kurt Wegner, Aaron Sterling, Rajarshi Guha, Andreas Bender, Jean-Loup Faulon, Janna Hastings, Noel O’Boyle, John Overington, Herman Van Vlijmen, Egon Willighagen.

Key Insights:

  • Molecules with similar physical structure tend to have similar chemical properties.
  • Open-source chemistry programs and open-access molecular databases allow interdisciplinary research opportunities that did not previously exist.
  • Cheminformatics combines biological, chemical, pharmaceutical, and drug patient information to address large-scale data mining, curation and visualization challenges.

Semantic impedance issues abound in cheminformatics.

This article is a very good overview of cheminformatics and an introduction to its many challenges.

8th German Conference on Chemoinformatics [GCC 2012]

Thursday, October 25th, 2012

8th German Conference on Chemoinformatics [GCC 2012]

From the post:

The 8th German Conference on Chemoinformatics takes place in Goslar, Germany next month, and we are pleased to announce that once again, Journal of Cheminformatics will be the official publishing partner and poster session sponsor.

The conference runs from November 11th–13th and covers a wide range of topics around cheminformatics and chemical information including: Chemoinformatics and Drug Discovery; Molecular Modelling; Chemical Information, Patents and Databases; and Computational Material Science and Nanotechnology.

This will be the fourth year that Journal of Cheminformatics has been involved with the conference, and abstracts from the previous three meetings are freely available via the journal website.

The prior meeting abstracts are a very rich source of materials that merit your attention.

mol2chemfig, a tool for rendering chemical structures… [Hard Copy Delivery of Topic Map Content]

Saturday, October 6th, 2012

mol2chemfig, a tool for rendering chemical structures from molfile or SMILES format to LaTeX code by Eric K Brefo-Mensah and Michael Palmer.

Abstract:

Displaying chemical structures in LaTeX documents currently requires either hand-coding of the structures using one of several LaTeX packages, or the inclusion of finished graphics files produced with an external drawing program. There is currently no software tool available to render the large number of structures available in molfile or SMILES format to LaTeX source code. We here present mol2chemfig, a Python program that provides this capability. Its output is written in the syntax defined by the chemfig TeX package, which allows for the flexible and concise description of chemical structures and reaction mechanisms. The program is freely available both through a web interface and for local installation on the user’s computer. The code and accompanying documentation can be found at http://chimpsky.uwaterloo.ca/mol2chemfig.

Is there a presumption that topic map delivery systems are limited to computers?

Or that components in topic map interfaces have to snap, crackle or pop with every mouse-over?

While computers enable scalable topic map processing, processing should not be confused with delivery of topic map content.

If you are delivering information about chemical structures from a topic map into hard copy, you are likely to find this a useful tool.

Towards a Universal SMILES representation…

Wednesday, September 19th, 2012

Towards a Universal SMILES representation – A standard method to generate canonical SMILES based on the InChI by Noel M O’Boyle. Journal of Cheminformatics 2012, 4:22 doi:10.1186/1758-2946-4-22

Abstract:

Background

There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string.

Results

I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 m compounds in the ChEMBL database, and a 1 m compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database was canonicalised successfully and 99.77% of the PubChem subset.

Conclusions

The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain — such as the development of a standard aromatic model for SMILES — the ability to create the same SMILEs using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.

Noel notes much work remains to be done but being able to reliably compare the output of different toolkits sounds like a step in the right direction.

Mining Chemical Libraries with “Screening Assistant 2”

Sunday, September 9th, 2012

Mining Chemical Libraries with “Screening Assistant 2” by Vincent Le Guilloux, Alban Arrault, Lionel Colliandre, Stéphane Bourg, Philippe Vayer and Luc Morin-Allory (Journal of Cheminformatics 2012, 4:20 doi:10.1186/1758-2946-4-20)

Abstract:

Background

High-throughput screening assays have become the starting point of many drug discovery programs for large pharmaceutical companies as well as academic organisations. Despite the increasing throughput of screening technologies, the almost infinite chemical space remains out of reach, calling for tools dedicated to the analysis and selection of the compound collections intended to be screened.

Results

We present Screening Assistant 2 (SA2), an open-source JAVA software dedicated to the storage and analysis of small to very large chemical libraries. SA2 stores unique molecules in a MySQL database, and encapsulates several chemoinformatics methods, among which: providers management, interactive visualisation, scaffold analysis, diverse subset creation, descriptors calculation, sub-structure / SMART search, similarity search and filtering. We illustrate the use of SA2 by analysing the composition of a database of 15 million compounds collected from 73 providers, in terms of scaffolds, frameworks, and undesired properties as defined by recently proposed HTS SMARTS filters. We also show how the software can be used to create diverse libraries based on existing ones.

Conclusions

Screening Assistant 2 is a user-friendly, open-source software that can be used to manage collections of compounds and perform simple to advanced chemoinformatics analyses. Its modular design and growing documentation facilitate the addition of new functionalities, calling for contributions from the community. The software can be downloaded at http://sa2.sourceforge.net/.

And you thought you had “big data:”

Exploring biology through the activity of small molecules is an established paradigm used in drug research for several decades now [1, 2]. Today, a state of the art drug discovery program often begins with screening campaigns aiming at the identification of novel biologically active molecules. In the recent years, the rise of High Throughput Screening (HTS), combinatorial chemistry and the availability of large compound collections has led to a dramatic increase in the size of screening libraries, for both private companies and public organisations [3, 4]. Yet, despite these constantly increasing capabilities, various authors have stressed the need to design better instead of larger screening libraries [5–9]. Chemical space is indeed known to be almost infinite, and selecting the appropriate regions to explore for the problem at hand remains a challenging task. (emphasis added)

I would paraphrase the highlighted part to read:

Semantic space is known to be infinite, and selecting appropriate regions for the problem at hand remains a challenging task.

Or to put it differently, if I have a mapping strategy that works for clinical information systems or (U.S.) DoD contracting records or some other domain, that’s great!

I don’t need to look for a universal mapping solution.

Not to mention that I can produce results (and get paid) more quickly than waiting for a universal mapping solution.

Giant Network Links All Known Compounds and Reactions

Friday, August 24th, 2012

Let’s start with the “popular” version: Scientists Create Chemical ‘Brain’: Giant Network Links All Known Compounds and Reactions

From the post:

Northwestern University scientists have connected 250 years of organic chemical knowledge into one giant computer network — a chemical Google on steroids. This “immortal chemist” will never retire and take away its knowledge but instead will continue to learn, grow and share.

A decade in the making, the software optimizes syntheses of drug molecules and other important compounds, combines long (and expensive) syntheses of compounds into shorter and more economical routes and identifies suspicious chemical recipes that could lead to chemical weapons.

“I realized that if we could link all the known chemical compounds and reactions between them into one giant network, we could create not only a new repository of chemical methods but an entirely new knowledge platform where each chemical reaction ever performed and each compound ever made would give rise to a collective ‘chemical brain,’” said Bartosz A. Grzybowski, who led the work. “The brain then could be searched and analyzed with algorithms akin to those used in Google or telecom networks.”

Called Chematica, the network comprises some seven million chemicals connected by a similar number of reactions. A family of algorithms that searches and analyzes the network allows the chemist at his or her computer to easily tap into this vast compendium of chemical knowledge. And the system learns from experience, as more data and algorithms are added to its knowledge base.

Details and demonstrations of the system are published in three back-to-back papers in the Aug. 6 issue of the journal Angewandte Chemie.

Well, true enough, except for the “share” part. Chematica is in the process of being commercialized.

If you are interested in the non-“popular” version:

Rewiring Chemistry: Algorithmic Discovery and Experimental Validation of One-Pot Reactions in the Network of Organic Chemistry (pages 7922–7927) by Dr. Chris M. Gothard, Dr. Siowling Soh, Nosheen A. Gothard, Dr. Bartlomiej Kowalczyk, Dr. Yanhu Wei, Dr. Bilge Baytekin and Prof. Bartosz A. Grzybowski. Article first published online: 13 JUL 2012 | DOI: 10.1002/anie.201202155.

Abstract:

Computational algorithms are used to identify sequences of reactions that can be performed in one pot. These predictions are based on over 86 000 chemical criteria by which the putative sequences are evaluated. The “raw” algorithmic output is then validated experimentally by performing multiple two-, three-, and even four-step sequences. These sequences “rewire” synthetic pathways around popular and/or important small molecules.

Parallel Optimization of Synthetic Pathways within the Network of Organic Chemistry (pages 7928–7932) by Dr. Mikołaj Kowalik, Dr. Chris M. Gothard, Aaron M. Drews, Nosheen A. Gothard, Alex Weckiewicz, Patrick E. Fuller, Prof. Bartosz A. Grzybowski and Prof. Kyle J. M. Bishop. Article first published online: 13 JUL 2012 | DOI: 10.1002/anie.201202209.

Abstract:

Finding a needle in a haystack: The number of possible synthetic pathways leading to the desired target of a synthesis can be astronomical (10^19 within five synthetic steps). Algorithms are described that navigate through the entire known chemical-synthetic knowledge to identify optimal synthetic pathways. Examples are provided to illustrate single-target optimization and parallel optimization of syntheses leading to multiple targets.

Chemical Network Algorithms for the Risk Assessment and Management of Chemical Threats (pages 7933–7937) by Patrick E. Fuller, Dr. Chris M. Gothard, Nosheen A. Gothard, Alex Weckiewicz and Prof. Bartosz A. Grzybowski. Article first published online: 13 JUL 2012 | DOI: 10.1002/anie.201202210.

Abstract:

A network of chemical threats: Current regulatory protocols are insufficient to monitor and block many short-route syntheses of chemical weapons, including those that start from household products. Network searches combined with game-theory algorithms provide an effective means of identifying and eliminating chemical threats. (Picture: an algorithm-detected pathway that yields sarin (bright red node) in three steps from unregulated substances.)
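The abstracts describe searching a network of compounds and reactions for pathways. As a rough illustration only, here is a breadth-first search for the fewest-step route through a toy network. The compound names are placeholders and each step is simplified to a single reactant and a single product, which real reactions are not; Chematica's actual algorithms are not described in enough detail here to reproduce:

```python
from collections import deque
from typing import Dict, List, Optional

# Toy network: each compound maps to the compounds reachable from it in one
# reaction step. Names are placeholders, not real chemistry.
REACTIONS: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}


def shortest_route(
    network: Dict[str, List[str]], start: str, target: str
) -> Optional[List[str]]:
    """Breadth-first search for a route with the fewest reaction steps."""
    previous: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        compound = queue.popleft()
        if compound == target:
            # Walk the predecessor links back to the start.
            route: List[str] = []
            step: Optional[str] = compound
            while step is not None:
                route.append(step)
                step = previous[step]
            return route[::-1]
        for product in network.get(compound, []):
            if product not in previous:
                previous[product] = compound
                queue.append(product)
    return None  # target not reachable from start


route = shortest_route(REACTIONS, "A", "F")
print(route)
```

Even this toy version shows where semantics enter: every edge encodes a claim about what a reaction is and what it produces.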

Do you see any potential semantic issues in such a network? Arising as our understanding of reactions changes?

Recalling that semantics isn’t simply a question of yesterday, today and tomorrow but also of tomorrows, 10, 50, or 100 or more years from now.

We may fancy our present understanding as definitive, but it is just a fancy.

The Semantics of Chemical Markup Language (CML) for Computational Chemistry : CompChem

Sunday, August 12th, 2012

The Semantics of Chemical Markup Language (CML) for Computational Chemistry : CompChem by Weerapong Phadungsukanan, Markus Kraft, Joe A Townsend and Peter Murray-Rust (Journal of Cheminformatics 2012, 4:15 doi:10.1186/1758-2946-4-15)

Abstract (provisional):

This paper introduces a subdomain chemistry format for storing computational chemistry data called CompChem. It has been developed based on the design, concepts and methodologies of Chemical Markup Language (CML) by adding computational chemistry semantics on top of the CML Schema. The format allows a wide range of ab initio quantum chemistry calculations of individual molecules to be stored. These calculations include, for example, single point energy calculation, molecular geometry optimization, and vibrational frequency analysis. The paper also describes the supporting infrastructure, such as processing software, dictionaries, validation tools and database repository. In addition, some of the challenges and difficulties in developing common computational chemistry dictionaries are being discussed. The uses of CompChem are illustrated on two practical applications.

Important contribution if you are working with computational chemistry semantics.

Also important for its demonstration of the value of dictionaries and not trying to be all inclusive.

Integrate the data you have at hand and make allowance for the yet to be known.

Besides, there is always the next topic map that may consume the first with new merging rules.

CompChem Convention http://www.xml-cml.org/convention/compchem

CompChem dictionary http://www.xml-cml.org/dictionary/compchem/

CompChem validation stylesheet https://bitbucket.org/wwmm/cml-specs

CMLValidator http://bitbucket.org/cml/cmllite-validator-code

Chemical Markup Language (CML) http://www.xml-cml.org

Molecules from scratch without the fiendish physics

Sunday, February 12th, 2012

Molecules from scratch without the fiendish physics by Lisa Grossman.

From the post:

But because the equation increases in complexity as more electrons and protons are introduced, exact solutions only exist for the simplest systems: the hydrogen atom, composed of one electron and one proton, and the hydrogen molecule, which has two electrons and two protons.

This complexity rules out the possibility of exactly predicting the properties of large molecules that might be useful for engineering or medicine. “It’s out of the question to solve the Schrödinger equation to arbitrary precision for, say, aspirin,” says von Lilienfeld.

So he and his colleagues bypassed the fiendish equation entirely and turned instead to a computer-science technique.

Machine learning is already widely used to find patterns in large data sets with complicated underlying rules, including stock market analysis, ecology and Amazon’s personalised book recommendations. An algorithm is fed examples (other shoppers who bought the book you’re looking at, for instance) and the computer uses them to predict an outcome (other books you might like). “In the same way, we learn from molecules and use them as previous examples to predict properties of new molecules,” says von Lilienfeld.

His team focused on a basic property: the energy tied up in all the bonds holding a molecule together, the atomisation energy. The team built a database of 7165 molecules with known atomisation energies and structures. The computer used 1000 of these to identify structural features that could predict the atomisation energies.

When the researchers tested the resulting algorithm on the remaining 6165 molecules, it produced atomisation energies within 1 per cent of the true value. That is comparable to the accuracy of mathematical approximations of the Schrödinger equation, which work but take longer to calculate as molecules get bigger (Physical Review Letters, DOI: 10.1103/PhysRevLett.108.058301). (emphasis added)
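The workflow in the quoted passage (learn from 1000 examples, test on the remaining 6165) can be sketched as follows. This is only an illustration of the train/test pattern using 1-nearest-neighbour regression on made-up numbers; it is not the method used in the paper, and the single "descriptor" and the linear toy property are assumptions for the sake of a runnable example:

```python
import random
from typing import List, Tuple

Sample = Tuple[float, float]  # (descriptor, property); both are made up


def make_toy_data(n: int, seed: int = 0) -> List[Sample]:
    """Generate n synthetic samples; the property is a simple function of x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(1.0, 10.0)
        data.append((x, 3.0 * x + 2.0))
    return data


def predict(train: List[Sample], x: float) -> float:
    """1-nearest-neighbour: return the property of the closest known example."""
    nearest = min(train, key=lambda sample: abs(sample[0] - x))
    return nearest[1]


def mean_relative_error(train: List[Sample], test: List[Sample]) -> float:
    """Average of |predicted - true| / |true| over the test set."""
    errors = [abs(predict(train, x) - y) / abs(y) for x, y in test]
    return sum(errors) / len(errors)


data = make_toy_data(7165)
train, test = data[:1000], data[1000:]  # same split sizes as in the article
error = mean_relative_error(train, test)
print(len(train), len(test), error)
```

The toy error figure says nothing about real molecules; the sketch only shows the shape of the approach, which is to predict new cases from stored examples rather than from first principles.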

One way to look at this research is to say we have three avenues to discovering the properties of molecules:

  1. Formal logic – but would require far more knowledge than we have at the moment
  2. Schrödinger equation – but that may be intractable for some molecules
  3. Knowledge-based approach – May be less precise than 1 & 2 but works now.

A knowledge-based approach allows us to make progress now. Topic maps can be annotated with other methods, such as math or research results, up to and including formal logic.

The biggest difference with topic maps is that the information you wish to record or act upon is not restricted ahead of time.

The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework

Thursday, February 9th, 2012

The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework by Davy Suvee.

From the post:

Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. Part 3 finally, illustrates the use of the new MongoDB Aggregation Framework, which boosts performance beyond the capabilities of the map-reduce implementation.

In part 1 of this article, I described the use of MongoDB to solve a specific Chemoinformatics problem, namely the computation of molecular similarities through Tanimoto coefficients. When employing a low target Tanimoto coefficient however, the number of returned compounds increases exponentially, resulting in a noticeable data transfer overhead. To circumvent this problem, part 2 of this article describes the use of MongoDB’s built-in map-reduce functionality to perform the Tanimoto coefficient calculation local to where the compound data is stored. Unfortunately, the execution of these map-reduce algorithms through Javascript is rather slow and a performance improvement can only be achieved when multiple shards are employed within the same MongoDB cluster.

Recently, MongoDB introduced its new Aggregation Framework. This framework provides a more simple solution to calculating aggregate values instead of relying upon the powerful map-reduce constructs. With just a few simple primitives, it allows you to compute, group, reshape and project documents that are contained within a certain MongoDB collection. The remainder of this article describes the refactoring of the map-reduce algorithm to make optimal use of the new MongoDB Aggregation Framework. The complete source code can be found on the Datablend public GitHub repository.
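For readers unfamiliar with the Tanimoto coefficient mentioned in the quoted post, here is a minimal sketch in plain Python. It is not the Datablend code: it assumes each fingerprint is represented as a set of "on" bit positions, and the compound names, fingerprints and the helper `similar_compounds` are hypothetical. The coefficient is the number of shared bits divided by the number of bits set in either fingerprint.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 0.0  # convention assumed here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similar_compounds(query, compounds, threshold):
    """Return (name, score) pairs scoring at or above threshold, best first."""
    scored = ((name, tanimoto(query, fp)) for name, fp in compounds.items())
    return sorted(
        ((name, score) for name, score in scored if score >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical fingerprints: each integer stands for one 'on' bit.
compounds = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 4, 5},
    "C": {7, 8, 9},
}
query = {1, 2, 3, 4}

print(similar_compounds(query, compounds, threshold=0.5))
print(similar_compounds(query, compounds, threshold=0.0))
```

Lowering the threshold lets more compounds through, which is the data transfer problem the post describes when the filtering is done on the client rather than next to the data.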

Does it occur to you that aggregation results in one or more aggregates? And if we are presented with one or more aggregates, we could persist those aggregates and add properties to them. Or have relationships between aggregates. Or point to occurrences of aggregates.

Kristina Chodorow demonstrated the use of aggregation in MongoDB in Hacking Chess with the MongoDB Pipeline, an analysis of chess games. Rather than summing the number of games in which the move “e4” is the first move for White, links to all 696 games could be treated as occurrences of that subject. That would support discovery of the players of White as well as Black.
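The difference between summing and keeping occurrences can be sketched in plain Python. This is an illustration only, not the code from the chess post: the game records, URLs, player names and field names are all hypothetical. In MongoDB terms it is roughly the difference between a `$group` stage that accumulates with `$sum` and one that accumulates with `$push`.

```python
from collections import defaultdict

# Hypothetical game records; URLs, player names and field names are made up.
GAMES = [
    {"url": "http://example.org/game/1", "white": "Alice", "black": "Bob", "first_move": "e4"},
    {"url": "http://example.org/game/2", "white": "Carol", "black": "Alice", "first_move": "d4"},
    {"url": "http://example.org/game/3", "white": "Bob", "black": "Carol", "first_move": "e4"},
]


def count_by_first_move(games):
    """Summing: one number per first move; the links to the games are lost."""
    counts = defaultdict(int)
    for game in games:
        counts[game["first_move"]] += 1
    return dict(counts)


def occurrences_by_first_move(games):
    """Collecting: each first move keeps the URLs of the games it occurs in."""
    occurrences = defaultdict(list)
    for game in games:
        occurrences[game["first_move"]].append(game["url"])
    return dict(occurrences)


def players(move, games, side):
    """Follow the occurrences of a move back to the players on one side."""
    by_url = {game["url"]: game for game in games}
    urls = occurrences_by_first_move(games).get(move, [])
    return sorted({by_url[url][side] for url in urls})


print(count_by_first_move(GAMES))
print(occurrences_by_first_move(GAMES))
print(players("e4", GAMES, "white"))
print(players("e4", GAMES, "black"))
```

With only the counts, the question "who played White when the first move was e4?" cannot be answered; with the occurrences it is a lookup.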

Think of aggregation as a flexible means for merging information about subjects and their relationships. (Blind interchange requires more but this is a step in the right direction.)

OSCAR4

Monday, December 19th, 2011

OSCAR4 Launch

From the webpage:

OSCAR (Open Source Chemistry Analysis Routines) is an open source extensible system for the automated annotation of chemistry in scientific articles. It can be used to identify chemical names, reaction names, ontology terms, enzymes and chemical prefixes and adjectives. In addition, where possible, any chemical names detected will be annotated with structures derived either by lookup, or name-to-structure parsing using OPSIN[1] or with identifiers from the ChEBI (‘Chemical Entities of Biological Interest’) ontology.

The current version of OSCAR, OSCAR4, focuses on providing a core library that facilitates integration with other tools. Its simple-to-use API is modularised to promote extension into other domains and allows for its use within workflow systems like Taverna[2] and U-Compare[3].

We will be hosting a launch on the 13th of April to discuss the new architecture as well as demonstrate some applications that use OSCAR. Tutorial sessions on how to use the new API will also be provided.

Archived videos from the launch are now online: http://sms.cam.ac.uk/collection/1130934

Just to put this into a topic map context, imagine that the annotation in question was placement in an association with mappings to other data, data that was held by your employer and leased to researchers.